Difference between revisions of "SAEC notes"
|  (→sample relatedness) |  (→R cheat sheet) | ||
| Line 41: | Line 41: | ||
| == R cheat sheet == | == R cheat sheet == | ||
| + | free up memory | ||
| + |  rm(list=ls(all=TRUE)) | ||
| + | |||
|   imiss <- read.table("z:////", header=TRUE) |   imiss <- read.table("z:////", header=TRUE) | ||
|   imiss['N_MISS'] |   imiss['N_MISS'] | ||
Revision as of 11:17, 5 November 2007
| test file collection | Comments on geWorkbench | SAEC notes | SAEC protocol | SAEC executive summary | Other | 
Contents
SAEC
Questions
- what does ADR stand for?
- summary statistics: please define!
- where do we get the chromosomal position from? 5.4.4.
- what are "four hapmap samples" 5.4.7
In the Illumina 1 M chip there are 890 probes with the same name there are 12 probes with the same position and chromosome and they have different names
1M chip contains rs*, SNP*, cnv*, and Mito* snp names
Classes
cases 71 controls 141 6 not matched to cases => 135 matched to case 7 failed to match the 5 year of age criteria => 128 controls
female 68% male 32%
Caucasian 80% Asian 4% black 8% american hispanic 5%
close relatives: ten families 9 of which with one pair of related subjects (1st and 2nd degree relatives) 1 with five related subjects
history of drug allergy 42% (31) cases 13% (18) controls
use of medication within 12 weeks controls: 77% cases: all
combination of trimethoprim and sulfamethoxazole: 21 cases
phenytoin 10 cases
R cheat sheet
free up memory
rm(list=ls(all=TRUE))
imiss <- read.table("z:////", header=TRUE)
imiss['N_MISS']
bmp(filename = "Rplot%03d.bmp", width = 480, height = 480,
        pointsize = 12, bg = "white", res = NA)
hist(imiss$N_MISS, main="main header", sub="sub header", xlab="x-achsis", ylab="y-achis", labels=T)
dev.off()
getting information about variables:
typeof(x) attributes(x) Levels(drug) This effectively gives you the unique entries in the factor, each of which has been given a level As.numeric(drug) This gives you the integer storage code for each element of the factor table(drug) To create a frequency *table* of the occurrences of your factor use
from a frame get rows where the column lable has a specific value:
subset(x x$label == "xxx")$column
Apply function summary to column val where those values are grouped according by the column label
tapply(x$val, x$label, summary)
get headers from a vector:
attributes(vector)
Line plot
plot(x, type="l")
data frames
names(df) => headers
bash cheat sheet
other stuff
Convert an image format
convert Age-All.pdf -rotate 90 Age-All.png
bash:
for i in *.ps; do convert $i -rotate 90 $i.png ; done
for i in *.ps; do convert $i -rotate 90 ${i%.ps}.png ; done
strategy summary
four groups of data: + all + whites only + whites SJS + whites TEN
per subject call rates
<PLINK> --missing To generate a list genotyping/missingness rate statistics: plink --file data --missing
This option creates two files:
 => plink.imiss <=
    plink.lmiss 
which detail missingness by individual and by SNP. For individuals, the format is:
    FID                Family ID
    IID                Individual ID
    MISS_PHENO         Missing phenotype? (Y/N)
 => N_MISS             Number of missing SNPs      <=
    F_MISS             Proportion of missing SNPs
=> summary stats, histogram 
=> EXCLUDE subjects with call rates below 90% --mind 0.10
The initial step in all data analysis is to exclude individuals with too much missing genotype data. This option is set as follows: plink --file mydata --mind 0.1
which means exclude with more than 10% missing genotypes (this is the defalt value). A line in the terminal output will appear, indicating how many individuals were removed due to low genotyping. If any individuals were removed, a file called
plink.irem
will be created, listing the Family and Individual IDs of these removed individuals. Any subsequent analysis also specifeid on the same command line will be performed without these individuals.
One might instead wish to create a new PED file with these individuals permanently removed, simply add an option to generate a new fileset: for example, plink --file data --mind 0.1 --recode --out cleaned
will generate files
    cleaned-recode.ped
    cleaned-recode.map
with the high-missing-rate individuals removed; alternatively, to create a binary fileset with these individuals removed: plink --file data --mind 0.1 --make-bed --out cleaned
which results in the files
    cleaned.bed
    cleaned.bim
    cleaned.fam
<PLINK> --genome
plink --file mydata --genome
which creates the file
plink.genome
which has the following fields:
    FID1      Family ID for first individual
    IID1      Individual ID for first individual
    FID2      Family ID for second individual
    IID2      Individual ID for second individual
    Z0        P(IBD=0)
    Z1        P(IBD=1)
    Z2        P(IBD=2)
=>   PI_HAT    P(IBD=2)+0.5*P(IBD=1) ( proportion IBD )
    IBS0      Number of IBS 0 nonmissing loci
    IBS1      Number of IBS 1 nonmissing loci
    IBS2      Number of IBS 2 nonmissing loci
    DST       IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs )
    P         IBS binomial test
    HOMHOM    Number of IBS 0 SNP pairs used in test
    HETHET    Number of IBS 2 het/het SNP pairs in test
    RATIO     R in ratio of 1:2 for IBS 0 : HETHET 
This file will have as many rows as there are unique pairs of individuals in the sample -- for large samples with thousands of individuals, this file can be very large (and take considerable time to generate).
HINT To reduce the file size, use the --minX option to only output to plink.genome pairs where PI_HAT is greater than X. That is, plink --file mydata --genome --min 0.05
" When excluding sample genotypes due to duplication/relatedness, preference be given to 1) the sample with higher call rates and 2) ADR cases.  Relatedness exclusions should be carried out to remove any first or second degree relatives (PI-HAT ˜ 0.5 and 0.25, respectively, i.e. all PI-HAT > 0.2)."
=> --min 0.2 => (manually) identify samples that should be removed by comparing/listing call rates (step 1) and ?ADR? cases => manually remove sample from original file and repeat steps to get here
concordance with reported sex
<PLINK> --check-sex
This option uses X chromosome data to determine sex (i.e. based on heterozygosity rates) and flags individuals for whom the reported sex in the PED file does not match the estimated sex (given genomic data). To run this analysis, use the flag: plink --bfile data --check-sex
which generates a file
plink.sexcheck
which contains the fields
    FID     Family ID
    IID     Individual ID
    PEDSEX  Sex as determined in pedigree file (1=male, 2=female)
    SNPSEX  Sex as determined by X chromosome
    STATUS  Displays "PROBLEM" or "OK" for each individual
    F       The actual X chromosome inbreeding (homozygosity) estimate
A PROBLEM arises if the two sexes do not match, or if the SNP data or pedigree data are ambiguous with regard to sex. A male call is made if F is more than 0.8; a femle call is made if F is less than 0.2.
=> Follow up with GSK!!!
Hardy Weinberg equilibrium
Apply to + all white + case white + control white
(this is done automatically, if case/control set correctly)
<PLINK> --hardy
plink --file data --hardy
which creates a file:
plink.hwe
This file has the following format
    SNP             SNP identifier
    TEST            Code indicating sample
    GENO            Genotype counts: 11/12/22 
    O(HET)          Observed heterozygosity
    E(HET)          Expected heterozygosity
    P_HWD           H-W p-value
For case/control samples, each SNP will have three entries (rows) in this file, with TEST being either ALL, AFF (cases only) or UNAFF (controls only). For quantitative traits, only a single row will appear for each SNP, labelled ALL(QT).
Only founders are considered for the Hardy-Weinberg calculations -- ie. for family data, any offspring are ignored.
WARNING By default, this procedure only considers founders, so no HW results would be given for sibling-only datasets (i.e. if no parents exist). To perform a rough, somewhat biased test, use the --nonfounders option which means that all individuals will be included. Alternatively, manually extract one person per family for this calculation and recode these individuals as founders (see the --keep option to facilitate this).
The default test is an exact one, described and implemented by Wigginton et al (see reference below), which is more accurate for rare genotypes. You can still perform the standard asymptotic test with the --hardy2 option.
=> summary statistics => histograms / density plots => QQ plot of distributions of P_HWD (use qqplot in Matlab) => i.e. qq plot between all three data sets? => P_HWD vs chromosomal positions => chromosomal positions has to come from somewhere... Since there is a large number of chrom. positions maybe that should be done through a database query???
Concordance
Concordance is defined as the presence or absence of the same genotype for the same SNP measured in the same individual from two sources. Concordance is typically expressed as the fraction of SNPs genotyped that are concordant for each individual or as the fraction of individuals that are concordant for each SNP. The following concordance checks are proposed: • Concordance between Illumina 1M genotyped duplicates • Concordance between Illumina 1M and Affymetrix 500K for all overlapping SNPs • Optional: Concordance between Illumina 1M calls and those calls predicted by imputation; method to be determined
=> We need information about duplicate SNPs within Illumina 1M
=> do we have Affymetrix 500K data?
=> optional ????
Patterns of missingness
<PLINK> --cluser-missing <PLINK> --test-missing
cluser=missing
Systematic batch effects that induce missingness in parts of the sample will induce correlation between the patterns of missing data that different individuals display. One approach to detecting correlation in these patterns, that might possibly idenity such biases, is to cluster individuals based on their identity-by-missingness (IBM). This approach use exactly the same procedure as the IBS clustering for population stratification, except the distance between two individuals is based not on which (non-missing) allele they have at each site, but rather the proportion of sites for which two individuals are both missing the same genotype.
To use this option: plink --file data --cluster-missing
which creates the files:
    plink.matrix.missing
    plink.cluster3.missing
which have similar formats to the corresponding IBS clustering files. Specifically, the plink.mdist.missing file can be subjected to a visualisation technique such as multidimensinoal scaling to reveal any strong systematic patterns of missingness.
Note The values in the .mdist file are distances rather than similarities, unlike for standard IBS clustering. That is, a value of 0 means that two individuals have the same profile of missing genotypes. The exact value represents the proportion of all SNPs that are discordantly missing (i.e. where one member of the pair is missing that SNP but the other individual is not).
The other constraints (significance test, phenotype, cluster size and external matching criteria) are not used during IBM clustering. Also, by default, all individuals and all SNPs are included in an IBM clustering analysis, unlike IBS clustering, i.e. even individuals or SNPs with very low genotyping, or monomorphic alleles. By explicitly specifying --mind or --geno or --maf certain individuals or SNPs can be excluded (although the default is probably what is usually required for quality control procedures).
The *.cluster3 file is in the same format as cluster2 (one line per individual) but contains all solutions (i.e. every step of the clustering from moving from N clusters each of 1 individual (leftmost column after family and individual ID) to 1 cluster (labelled 0) containing all N individuals (the final, rightmost column): also shown is the dendrogram this represents: e.g.
    A 1 0 0 0 0
    B 1 1 1 1 0
    C 1 2 1 1 0
    D 1 3 2 0 0
  
NOTE If any constraints have been placed upon the clustering, then solutions represented in the *.cluster3 file may not go as far as 1 cluster with all N individuals: in this case, the file *.cluster2 will contain the final solution (i.e. as far as the clustering could go before running up against constraints, e.g. based on maximum cluster size, etc). 
HINT! In large samples, cluster analyses can be very slow. Often the most time consuming step is calculating the pairwise IBS metrics: these only need to be calculated once however, even if you want to run the cluster analysis multiple times (e.g. with different constraints). This is achieved with the --read-genome option, assuming you have previously run the --genome command. It is a good idea to not impose a threshold of the --genome output in this case. For example: plink --bfile mydata --genome --out mydata
followed by multiple clustering commands (see below for descriptions of the cluster constraint parameters used here) plink --bfile mydata --read-genome mydata.genome --cluster --ppc 0.01
and plink --bfile mydata --read-genome mydata.genome --cluster --mc 2 --ibm 0.01
etc.
ADVANCED HINT! In very large samples, cluster analyses can be very, very slow. When calculating the plink.genome file (as described above), if you have access to a cluster of computers for parallel computing, you can use the following approach to greatly reduce the time this step takes.
test-missing
To obtain a missing chi-sq test (i.e. does, for each SNP, missingness differ between cases and controls?), use the option: plink --file mydata --test-missing
which generates a file
plink.missing
which contains the fields
    CHR         Chromosome number
    SNP         SNP identifier
    F_MISS_A    Missing rate in cases
    F_MISS_U    Missing rate in controls
    CHISQ_MISS  Chi-square (1df) for missingness test
    P_MISS      Asymptotic p-value
The actual counts of missing genotypes are available in the plink.lmiss file, which is generated by the --missing option.
Note This test is only applicable to case/control data.
=> histogram
=> QQ plots 
=> further investigation
?? => differential missing by plate ?? => differential missing by hybridization date
Genetic clustering
Illumina 1M genotype data
