Difference between revisions of "SAEC notes"
(→sample relatedness) |
(→data frames) |
||
(3 intermediate revisions by the same user not shown) | |||
Line 14: | Line 14: | ||
1M chip contains rs*, SNP*, cnv*, and Mito* snp names | 1M chip contains rs*, SNP*, cnv*, and Mito* snp names | ||
+ | |||
+ | identity-by-missingness (IBM). | ||
== Classes == | == Classes == | ||
Line 41: | Line 43: | ||
== R cheat sheet == | == R cheat sheet == | ||
+ | |||
+ | order data frames: | ||
+ | see http://wiki.r-project.org/rwiki/doku.php?id=tips:data-frames:sort | ||
+ | |||
+ | free up memory | ||
+ | rm(list=ls(all=TRUE)) | ||
+ | |||
imiss <- read.table("z:////", header=TRUE) | imiss <- read.table("z:////", header=TRUE) | ||
imiss['N_MISS'] | imiss['N_MISS'] | ||
Line 69: | Line 78: | ||
==== data frames ==== | ==== data frames ==== | ||
names(df) => headers | names(df) => headers | ||
+ | myDF <- data.frame(x = rnorm(10), y = rnorm(10)) # Create a data frame | ||
+ | myDF # View its content | ||
+ | myDF[1, 2] # Extract first element of second column | ||
+ | myDf[, 2] # Extract ALL elements from second column | ||
+ | myDF[[2]] # A list extraction also works on data frames | ||
+ | myDF[, "y"] # Idem | ||
+ | myDF$y # Same again | ||
+ | myDF[myDF$x > 0, ] # A conditional selection | ||
== bash cheat sheet == | == bash cheat sheet == |
Latest revision as of 22:36, 30 November 2007
test file collection | Comments on geWorkbench | SAEC notes | SAEC protocol | SAEC executive summary | Other |
Contents
SAEC
Questions
- what does ADR stand for?
- summary statistics: please define!
- where do we get the chromosomal position from? 5.4.4.
- what are "four hapmap samples" 5.4.7
In the Illumina 1 M chip there are 890 probes with the same name there are 12 probes with the same position and chromosome and they have different names
1M chip contains rs*, SNP*, cnv*, and Mito* snp names
identity-by-missingness (IBM).
Classes
cases 71 controls 141 6 not matched to cases => 135 matched to case 7 failed to match the 5 year of age criteria => 128 controls
female 68% male 32%
Caucasian 80% Asian 4% black 8% american hispanic 5%
close relatives: ten families 9 of which with one pair of related subjects (1st and 2nd degree relatives) 1 with five related subjects
history of drug allergy 42% (31) cases 13% (18) controls
use of medication within 12 weeks controls: 77% cases: all
combination of trimethoprim and sulfamethoxazole: 21 cases
phenytoin 10 cases
R cheat sheet
order data frames:
see http://wiki.r-project.org/rwiki/doku.php?id=tips:data-frames:sort
free up memory
rm(list=ls(all=TRUE))
imiss <- read.table("z:////", header=TRUE) imiss['N_MISS'] bmp(filename = "Rplot%03d.bmp", width = 480, height = 480, pointsize = 12, bg = "white", res = NA) hist(imiss$N_MISS, main="main header", sub="sub header", xlab="x-achsis", ylab="y-achis", labels=T) dev.off()
getting information about variables:
typeof(x) attributes(x) Levels(drug) This effectively gives you the unique entries in the factor, each of which has been given a level As.numeric(drug) This gives you the integer storage code for each element of the factor table(drug) To create a frequency *table* of the occurrences of your factor use
from a frame get rows where the column lable has a specific value:
subset(x x$label == "xxx")$column
Apply function summary to column val where those values are grouped according by the column label
tapply(x$val, x$label, summary)
get headers from a vector:
attributes(vector)
Line plot
plot(x, type="l")
data frames
names(df) => headers myDF <- data.frame(x = rnorm(10), y = rnorm(10)) # Create a data frame myDF # View its content myDF[1, 2] # Extract first element of second column myDf[, 2] # Extract ALL elements from second column myDF2 # A list extraction also works on data frames myDF[, "y"] # Idem myDF$y # Same again myDF[myDF$x > 0, ] # A conditional selection
bash cheat sheet
other stuff
Convert an image format
convert Age-All.pdf -rotate 90 Age-All.png bash: for i in *.ps; do convert $i -rotate 90 $i.png ; done for i in *.ps; do convert $i -rotate 90 ${i%.ps}.png ; done
strategy summary
four groups of data: + all + whites only + whites SJS + whites TEN
per subject call rates
<PLINK> --missing To generate a list genotyping/missingness rate statistics: plink --file data --missing
This option creates two files:
=> plink.imiss <= plink.lmiss
which detail missingness by individual and by SNP. For individuals, the format is:
FID Family ID IID Individual ID MISS_PHENO Missing phenotype? (Y/N) => N_MISS Number of missing SNPs <= F_MISS Proportion of missing SNPs
=> summary stats, histogram
=> EXCLUDE subjects with call rates below 90% --mind 0.10
The initial step in all data analysis is to exclude individuals with too much missing genotype data. This option is set as follows: plink --file mydata --mind 0.1
which means exclude with more than 10% missing genotypes (this is the defalt value). A line in the terminal output will appear, indicating how many individuals were removed due to low genotyping. If any individuals were removed, a file called
plink.irem
will be created, listing the Family and Individual IDs of these removed individuals. Any subsequent analysis also specifeid on the same command line will be performed without these individuals.
One might instead wish to create a new PED file with these individuals permanently removed, simply add an option to generate a new fileset: for example, plink --file data --mind 0.1 --recode --out cleaned
will generate files
cleaned-recode.ped cleaned-recode.map
with the high-missing-rate individuals removed; alternatively, to create a binary fileset with these individuals removed: plink --file data --mind 0.1 --make-bed --out cleaned
which results in the files
cleaned.bed cleaned.bim cleaned.fam
<PLINK> --genome
plink --file mydata --genome
which creates the file
plink.genome
which has the following fields:
FID1 Family ID for first individual IID1 Individual ID for first individual FID2 Family ID for second individual IID2 Individual ID for second individual Z0 P(IBD=0) Z1 P(IBD=1) Z2 P(IBD=2) => PI_HAT P(IBD=2)+0.5*P(IBD=1) ( proportion IBD ) IBS0 Number of IBS 0 nonmissing loci IBS1 Number of IBS 1 nonmissing loci IBS2 Number of IBS 2 nonmissing loci DST IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs ) P IBS binomial test HOMHOM Number of IBS 0 SNP pairs used in test HETHET Number of IBS 2 het/het SNP pairs in test RATIO R in ratio of 1:2 for IBS 0 : HETHET
This file will have as many rows as there are unique pairs of individuals in the sample -- for large samples with thousands of individuals, this file can be very large (and take considerable time to generate).
HINT To reduce the file size, use the --minX option to only output to plink.genome pairs where PI_HAT is greater than X. That is, plink --file mydata --genome --min 0.05
" When excluding sample genotypes due to duplication/relatedness, preference be given to 1) the sample with higher call rates and 2) ADR cases. Relatedness exclusions should be carried out to remove any first or second degree relatives (PI-HAT ˜ 0.5 and 0.25, respectively, i.e. all PI-HAT > 0.2)."
=> --min 0.2 => (manually) identify samples that should be removed by comparing/listing call rates (step 1) and ?ADR? cases => manually remove sample from original file and repeat steps to get here
concordance with reported sex
<PLINK> --check-sex
This option uses X chromosome data to determine sex (i.e. based on heterozygosity rates) and flags individuals for whom the reported sex in the PED file does not match the estimated sex (given genomic data). To run this analysis, use the flag: plink --bfile data --check-sex
which generates a file
plink.sexcheck
which contains the fields
FID Family ID IID Individual ID PEDSEX Sex as determined in pedigree file (1=male, 2=female) SNPSEX Sex as determined by X chromosome STATUS Displays "PROBLEM" or "OK" for each individual F The actual X chromosome inbreeding (homozygosity) estimate
A PROBLEM arises if the two sexes do not match, or if the SNP data or pedigree data are ambiguous with regard to sex. A male call is made if F is more than 0.8; a femle call is made if F is less than 0.2.
=> Follow up with GSK!!!
Hardy Weinberg equilibrium
Apply to + all white + case white + control white
(this is done automatically, if case/control set correctly)
<PLINK> --hardy
plink --file data --hardy
which creates a file:
plink.hwe
This file has the following format
SNP SNP identifier TEST Code indicating sample GENO Genotype counts: 11/12/22 O(HET) Observed heterozygosity E(HET) Expected heterozygosity P_HWD H-W p-value
For case/control samples, each SNP will have three entries (rows) in this file, with TEST being either ALL, AFF (cases only) or UNAFF (controls only). For quantitative traits, only a single row will appear for each SNP, labelled ALL(QT).
Only founders are considered for the Hardy-Weinberg calculations -- ie. for family data, any offspring are ignored.
WARNING By default, this procedure only considers founders, so no HW results would be given for sibling-only datasets (i.e. if no parents exist). To perform a rough, somewhat biased test, use the --nonfounders option which means that all individuals will be included. Alternatively, manually extract one person per family for this calculation and recode these individuals as founders (see the --keep option to facilitate this).
The default test is an exact one, described and implemented by Wigginton et al (see reference below), which is more accurate for rare genotypes. You can still perform the standard asymptotic test with the --hardy2 option.
=> summary statistics => histograms / density plots => QQ plot of distributions of P_HWD (use qqplot in Matlab) => i.e. qq plot between all three data sets? => P_HWD vs chromosomal positions => chromosomal positions has to come from somewhere... Since there is a large number of chrom. positions maybe that should be done through a database query???
Concordance
Concordance is defined as the presence or absence of the same genotype for the same SNP measured in the same individual from two sources. Concordance is typically expressed as the fraction of SNPs genotyped that are concordant for each individual or as the fraction of individuals that are concordant for each SNP. The following concordance checks are proposed: • Concordance between Illumina 1M genotyped duplicates • Concordance between Illumina 1M and Affymetrix 500K for all overlapping SNPs • Optional: Concordance between Illumina 1M calls and those calls predicted by imputation; method to be determined
=> We need information about duplicate SNPs within Illumina 1M
=> do we have Affymetrix 500K data?
=> optional ????
Patterns of missingness
<PLINK> --cluser-missing <PLINK> --test-missing
cluser=missing
Systematic batch effects that induce missingness in parts of the sample will induce correlation between the patterns of missing data that different individuals display. One approach to detecting correlation in these patterns, that might possibly idenity such biases, is to cluster individuals based on their identity-by-missingness (IBM). This approach use exactly the same procedure as the IBS clustering for population stratification, except the distance between two individuals is based not on which (non-missing) allele they have at each site, but rather the proportion of sites for which two individuals are both missing the same genotype.
To use this option: plink --file data --cluster-missing
which creates the files:
plink.matrix.missing plink.cluster3.missing
which have similar formats to the corresponding IBS clustering files. Specifically, the plink.mdist.missing file can be subjected to a visualisation technique such as multidimensinoal scaling to reveal any strong systematic patterns of missingness.
Note The values in the .mdist file are distances rather than similarities, unlike for standard IBS clustering. That is, a value of 0 means that two individuals have the same profile of missing genotypes. The exact value represents the proportion of all SNPs that are discordantly missing (i.e. where one member of the pair is missing that SNP but the other individual is not).
The other constraints (significance test, phenotype, cluster size and external matching criteria) are not used during IBM clustering. Also, by default, all individuals and all SNPs are included in an IBM clustering analysis, unlike IBS clustering, i.e. even individuals or SNPs with very low genotyping, or monomorphic alleles. By explicitly specifying --mind or --geno or --maf certain individuals or SNPs can be excluded (although the default is probably what is usually required for quality control procedures).
The *.cluster3 file is in the same format as cluster2 (one line per individual) but contains all solutions (i.e. every step of the clustering from moving from N clusters each of 1 individual (leftmost column after family and individual ID) to 1 cluster (labelled 0) containing all N individuals (the final, rightmost column): also shown is the dendrogram this represents: e.g.
A 1 0 0 0 0 B 1 1 1 1 0 C 1 2 1 1 0 D 1 3 2 0 0
NOTE If any constraints have been placed upon the clustering, then solutions represented in the *.cluster3 file may not go as far as 1 cluster with all N individuals: in this case, the file *.cluster2 will contain the final solution (i.e. as far as the clustering could go before running up against constraints, e.g. based on maximum cluster size, etc).
HINT! In large samples, cluster analyses can be very slow. Often the most time consuming step is calculating the pairwise IBS metrics: these only need to be calculated once however, even if you want to run the cluster analysis multiple times (e.g. with different constraints). This is achieved with the --read-genome option, assuming you have previously run the --genome command. It is a good idea to not impose a threshold of the --genome output in this case. For example: plink --bfile mydata --genome --out mydata
followed by multiple clustering commands (see below for descriptions of the cluster constraint parameters used here) plink --bfile mydata --read-genome mydata.genome --cluster --ppc 0.01
and plink --bfile mydata --read-genome mydata.genome --cluster --mc 2 --ibm 0.01
etc.
ADVANCED HINT! In very large samples, cluster analyses can be very, very slow. When calculating the plink.genome file (as described above), if you have access to a cluster of computers for parallel computing, you can use the following approach to greatly reduce the time this step takes.
test-missing
To obtain a missing chi-sq test (i.e. does, for each SNP, missingness differ between cases and controls?), use the option: plink --file mydata --test-missing
which generates a file
plink.missing
which contains the fields
CHR Chromosome number SNP SNP identifier F_MISS_A Missing rate in cases F_MISS_U Missing rate in controls CHISQ_MISS Chi-square (1df) for missingness test P_MISS Asymptotic p-value
The actual counts of missing genotypes are available in the plink.lmiss file, which is generated by the --missing option.
Note This test is only applicable to case/control data.
=> histogram
=> QQ plots
=> further investigation
?? => differential missing by plate ?? => differential missing by hybridization date
Genetic clustering
Illumina 1M genotype data