Difference between revisions of "SAEC notes"

(R cheat sheet)
(R cheat sheet)
Line 53: Line 53:
 
Apply function summary to column val where those values are grouped according by the column label
 
Apply function summary to column val where those values are grouped according by the column label
 
  tapply(x$val, x$label, summary)
 
  tapply(x$val, x$label, summary)
 +
 +
get headers from a vector:
 +
attributes(vector)
 +
 +
Line plot
 +
plot(x, type="l")
  
 
== bash cheat sheet ==
 
== bash cheat sheet ==

Revision as of 11:53, 30 October 2007

test file collection | Comments on geWorkbench | SAEC notes | SAEC protocol | SAEC executive summary | Other


SAEC

Questions

- what does ADR stand for?

- summary statistics: please define!

- where do we get the chromosomal position from? 5.4.4.

- what are "four hapmap samples" 5.4.7

In the Illumina 1 M chip there are 890 probes with the same name there are 12 probes with the same position and chromosome and they have different names

1M chip contains rs*, SNP*, cnv*, and Mito* snp names

Classes

cases 71 controls 141 6 not matched to cases => 135 matched to case 7 failed to match the 5 year of age criteria => 128 controls

female 68% male 32%

Caucasian 80% Asian 4% black 8% american hispanic 5%

close relatives: ten families 9 of which with one pair of related subjects (1st and 2nd degree relatives) 1 with five related subjects

history of drug allergy 42% (31) cases 13% (18) controls

use of medication within 12 weeks controls: 77% cases: all

combination of trimethoprim and sulfamethoxazole: 21 cases

phenytoin 10 cases

R cheat sheet

imiss <- read.table("z:////", header=TRUE)
imiss['N_MISS']
bmp(filename = "Rplot%03d.bmp", width = 480, height = 480,
        pointsize = 12, bg = "white", res = NA)
hist(imiss$N_MISS, main="main header", sub="sub header", xlab="x-achsis", ylab="y-achis", labels=T)
dev.off()

from a frame get rows where the column lable has a specific value:

subset(x x$label == "xxx")$column

Apply function summary to column val where those values are grouped according by the column label

tapply(x$val, x$label, summary)

get headers from a vector:

attributes(vector)

Line plot

plot(x, type="l")

bash cheat sheet

strategy summary

four groups of data: + all + whites only + whites SJS + whites TEN

per subject call rates

<PLINK> --missing To generate a list genotyping/missingness rate statistics: plink --file data --missing

This option creates two files:

 => plink.imiss <=
    plink.lmiss 

which detail missingness by individual and by SNP. For individuals, the format is:

    FID                Family ID
    IID                Individual ID
    MISS_PHENO         Missing phenotype? (Y/N)
 => N_MISS             Number of missing SNPs      <=
    F_MISS             Proportion of missing SNPs


=> summary stats, histogram

=> EXCLUDE subjects with call rates below 90% --mind 0.10

The initial step in all data analysis is to exclude individuals with too much missing genotype data. This option is set as follows: plink --file mydata --mind 0.1

which means exclude with more than 10% missing genotypes (this is the defalt value). A line in the terminal output will appear, indicating how many individuals were removed due to low genotyping. If any individuals were removed, a file called

    plink.irem     

will be created, listing the Family and Individual IDs of these removed individuals. Any subsequent analysis also specifeid on the same command line will be performed without these individuals.

One might instead wish to create a new PED file with these individuals permanently removed, simply add an option to generate a new fileset: for example, plink --file data --mind 0.1 --recode --out cleaned

will generate files

    cleaned-recode.ped
    cleaned-recode.map

with the high-missing-rate individuals removed; alternatively, to create a binary fileset with these individuals removed: plink --file data --mind 0.1 --make-bed --out cleaned

which results in the files

    cleaned.bed
    cleaned.bim
    cleaned.fam


sample relatedness

<PLINK> --genome

plink --file mydata --genome

which creates the file

    plink.genome	

which has the following fields:

    FID1      Family ID for first individual
    IID1      Individual ID for first individual
    FID2      Family ID for second individual
    IID2      Individual ID for second individual
    Z0        P(IBD=0)
    Z1        P(IBD=1)
    Z2        P(IBD=2)

=> PI_HAT P(IBD=2)+0.5*P(IBD=1) ( proportion IBD )

    IBS0      Number of IBS 0 nonmissing loci
    IBS1      Number of IBS 1 nonmissing loci
    IBS2      Number of IBS 2 nonmissing loci
    DST       IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs )
    P         IBS binomial test
    HOMHOM    Number of IBS 0 SNP pairs used in test
    HETHET    Number of IBS 2 het/het SNP pairs in test
    RATIO     R in ratio of 1:2 for IBS 0 : HETHET 

This file will have as many rows as there are unique pairs of individuals in the sample -- for large samples with thousands of individuals, this file can be very large (and take considerable time to generate).

HINT To reduce the file size, use the --minX option to only output to plink.genome pairs where PI_HAT is greater than X. That is, plink --file mydata --genome --min 0.05


" When excluding sample genotypes due to duplication/relatedness, preference be given to 1) the sample with higher call rates and 2) ADR cases. Relatedness exclusions should be carried out to remove any first or second degree relatives (PI-HAT ˜ 0.5 and 0.25, respectively, i.e. all PI-HAT > 0.2)."

=> --min 0.2 => (manually) identify samples that should be removed by comparing/listing call rates (step 1) and ?ADR? cases => manually remove sample from original file and repeat steps to get here


concordance with reported sex

<PLINK> --check-sex

This option uses X chromosome data to determine sex (i.e. based on heterozygosity rates) and flags individuals for whom the reported sex in the PED file does not match the estimated sex (given genomic data). To run this analysis, use the flag: plink --bfile data --check-sex

which generates a file

    plink.sexcheck

which contains the fields

    FID     Family ID
    IID     Individual ID
    PEDSEX  Sex as determined in pedigree file (1=male, 2=female)
    SNPSEX  Sex as determined by X chromosome
    STATUS  Displays "PROBLEM" or "OK" for each individual
    F       The actual X chromosome inbreeding (homozygosity) estimate

A PROBLEM arises if the two sexes do not match, or if the SNP data or pedigree data are ambiguous with regard to sex. A male call is made if F is more than 0.8; a femle call is made if F is less than 0.2.

=> Follow up with GSK!!!


Hardy Weinberg equilibrium

Apply to + all white + case white + control white

(this is done automatically, if case/control set correctly)

<PLINK> --hardy

plink --file data --hardy

which creates a file:

    plink.hwe

This file has the following format

    SNP             SNP identifier
    TEST            Code indicating sample
    GENO            Genotype counts: 11/12/22 
    O(HET)          Observed heterozygosity
    E(HET)          Expected heterozygosity
    P_HWD           H-W p-value

For case/control samples, each SNP will have three entries (rows) in this file, with TEST being either ALL, AFF (cases only) or UNAFF (controls only). For quantitative traits, only a single row will appear for each SNP, labelled ALL(QT).

Only founders are considered for the Hardy-Weinberg calculations -- ie. for family data, any offspring are ignored.

WARNING By default, this procedure only considers founders, so no HW results would be given for sibling-only datasets (i.e. if no parents exist). To perform a rough, somewhat biased test, use the --nonfounders option which means that all individuals will be included. Alternatively, manually extract one person per family for this calculation and recode these individuals as founders (see the --keep option to facilitate this).

The default test is an exact one, described and implemented by Wigginton et al (see reference below), which is more accurate for rare genotypes. You can still perform the standard asymptotic test with the --hardy2 option.

=> summary statistics => histograms / density plots => QQ plot of distributions of P_HWD (use qqplot in Matlab) => i.e. qq plot between all three data sets? => P_HWD vs chromosomal positions => chromosomal positions has to come from somewhere... Since there is a large number of chrom. positions maybe that should be done through a database query???

Concordance

Concordance is defined as the presence or absence of the same genotype for the same SNP measured in the same individual from two sources. Concordance is typically expressed as the fraction of SNPs genotyped that are concordant for each individual or as the fraction of individuals that are concordant for each SNP. The following concordance checks are proposed: • Concordance between Illumina 1M genotyped duplicates • Concordance between Illumina 1M and Affymetrix 500K for all overlapping SNPs • Optional: Concordance between Illumina 1M calls and those calls predicted by imputation; method to be determined


=> We need information about duplicate SNPs within Illumina 1M => do we have Affymetrix 500K data? => optional ????


Patterns of missingness

<PLINK> --cluser-missing <PLINK> --test-missing

cluser=missing

Systematic batch effects that induce missingness in parts of the sample will induce correlation between the patterns of missing data that different individuals display. One approach to detecting correlation in these patterns, that might possibly idenity such biases, is to cluster individuals based on their identity-by-missingness (IBM). This approach use exactly the same procedure as the IBS clustering for population stratification, except the distance between two individuals is based not on which (non-missing) allele they have at each site, but rather the proportion of sites for which two individuals are both missing the same genotype.

To use this option: plink --file data --cluster-missing

which creates the files:

    plink.matrix.missing
    plink.cluster3.missing

which have similar formats to the corresponding IBS clustering files. Specifically, the plink.mdist.missing file can be subjected to a visualisation technique such as multidimensinoal scaling to reveal any strong systematic patterns of missingness.

Note The values in the .mdist file are distances rather than similarities, unlike for standard IBS clustering. That is, a value of 0 means that two individuals have the same profile of missing genotypes. The exact value represents the proportion of all SNPs that are discordantly missing (i.e. where one member of the pair is missing that SNP but the other individual is not).

The other constraints (significance test, phenotype, cluster size and external matching criteria) are not used during IBM clustering. Also, by default, all individuals and all SNPs are included in an IBM clustering analysis, unlike IBS clustering, i.e. even individuals or SNPs with very low genotyping, or monomorphic alleles. By explicitly specifying --mind or --geno or --maf certain individuals or SNPs can be excluded (although the default is probably what is usually required for quality control procedures).

The *.cluster3 file is in the same format as cluster2 (one line per individual) but contains all solutions (i.e. every step of the clustering from moving from N clusters each of 1 individual (leftmost column after family and individual ID) to 1 cluster (labelled 0) containing all N individuals (the final, rightmost column): also shown is the dendrogram this represents: e.g.

    A 1 0 0 0 0
    B 1 1 1 1 0
    C 1 2 1 1 0
    D 1 3 2 0 0



NOTE If any constraints have been placed upon the clustering, then solutions represented in the *.cluster3 file may not go as far as 1 cluster with all N individuals: in this case, the file *.cluster2 will contain the final solution (i.e. as far as the clustering could go before running up against constraints, e.g. based on maximum cluster size, etc).

HINT! In large samples, cluster analyses can be very slow. Often the most time consuming step is calculating the pairwise IBS metrics: these only need to be calculated once however, even if you want to run the cluster analysis multiple times (e.g. with different constraints). This is achieved with the --read-genome option, assuming you have previously run the --genome command. It is a good idea to not impose a threshold of the --genome output in this case. For example: plink --bfile mydata --genome --out mydata

followed by multiple clustering commands (see below for descriptions of the cluster constraint parameters used here) plink --bfile mydata --read-genome mydata.genome --cluster --ppc 0.01

and plink --bfile mydata --read-genome mydata.genome --cluster --mc 2 --ibm 0.01

etc.

ADVANCED HINT! In very large samples, cluster analyses can be very, very slow. When calculating the plink.genome file (as described above), if you have access to a cluster of computers for parallel computing, you can use the following approach to greatly reduce the time this step takes.


test-missing

To obtain a missing chi-sq test (i.e. does, for each SNP, missingness differ between cases and controls?), use the option: plink --file mydata --test-missing

which generates a file

    plink.missing

which contains the fields

    CHR         Chromosome number
    SNP         SNP identifier
    F_MISS_A    Missing rate in cases
    F_MISS_U    Missing rate in controls
    CHISQ_MISS  Chi-square (1df) for missingness test
    P_MISS      Asymptotic p-value

The actual counts of missing genotypes are available in the plink.lmiss file, which is generated by the --missing option.

Note This test is only applicable to case/control data.


=> histogram => QQ plots

=> further investigation

?? => differential missing by plate ?? => differential missing by hybridization date


Genetic clustering

Illumina 1M genotype data