Difference between revisions of "SAEC protocol"

 

Latest revision as of 14:37, 18 October 2007



Protocol for managing and analyzing the SAEC data

Data reformatting for PLINK

1. Get the information about the mapping between Illumina names and GSK names

From "PGX40001_GSK_SJS_B137_29Aug2007_DNAReport.xls", take columns B and C (DNA name, Subject ID) and store them in "mapping_info.txt".
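
The spreadsheet step is done by hand. If the sheet is first exported as tab-delimited text, the same two columns can be pulled out on the command line. The following is only a sketch: the helper name make_mapping_info, the export file name, the single header row, and the tab-separated output are all assumptions, since the source does not say which layout makePED.pl expects.

```shell
# Hypothetical helper: print columns B and C (fields 2 and 3) of a
# tab-delimited export of the DNA report, skipping one header row.
# The header row and the tab-separated output layout are assumptions.
make_mapping_info() {
    awk -F '\t' '
        { sub(/\r$/, "") }                       # drop Windows line endings
        NR > 1 && $2 != "" { print $2 "\t" $3 }  # DNA name, Subject ID
    ' "$1"
}

# Example (file names are illustrative):
# make_mapping_info DNAReport.txt > mapping_info.txt
```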

2. Take the four CSV files provided by GSK and extract the data per individual with extractPatients.pl

 perl extractPatients.pl -outfile test.out -outdir ..\data -infile "..\SJS Delivery from GSK\PGX40001_Illumina1M\Extracted Genotypes\PGx40001_12278-DNA.csv"
 perl extractPatients.pl -outfile test.out -outdir ..\data -infile "..\SJS Delivery from GSK\PGX40001_Illumina1M\Extracted Genotypes\PGx40001_GSK_SJS_B137_28Aug2007_Genotype_Report_12914-DNA.csv"
 perl extractPatients.pl -outfile test.out -outdir ..\data -infile "..\SJS Delivery from GSK\PGX40001_Illumina1M\Extracted Genotypes\PGx40001_GSK_SJS_B137_28Aug2007_Genotype_Report_12277-DNA.csv"
 perl extractPatients.pl -outfile test.out -outdir ..\data -infile "..\SJS Delivery from GSK\PGX40001_Illumina1M\Extracted Genotypes\PGx40001_GSK_SJS_B137_28Aug2007_Genotype_Report_12276-DNA.csv"

3. Sanity check: count the distinct first-column values on all non-comment lines for a sample individual

$ gawk '!/^#/{print $1}' ../data/42.txt  |sort -u |wc
1069083 1069083 10938646
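
The check above looks at one individual. Assuming every individual was genotyped on the same SNP set, the same count can be compared across all extracted files. A minimal sketch; the function names count_unique_ids and check_all_individuals are made up for illustration, and plain awk is used (gawk behaves the same here).

```shell
# Count distinct first-column values on the non-comment lines of one file
# (the same quantity as the first number printed by the pipeline above).
count_unique_ids() {
    awk '!/^#/ { print $1 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Report every .txt file in a directory whose count differs from the
# expected one. Prints nothing when all files agree.
check_all_individuals() {
    dir=$1
    expected=$2
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue
        n=$(count_unique_ids "$f")
        [ "$n" -eq "$expected" ] || echo "$f: $n (expected $expected)"
    done
}

# Example: check_all_individuals ../data 1069083
```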

4. Generate the phenotype information

Copy the columns (SubjectID, SEX, SBTY) from C:\SAEC\SJS Delivery from GSK\PGX40001_Clinical\Page1_4_5_7a_8a_9a_10_11_13a.txt to the file phenotype.txt

If we need other phenotypes, we can create additional phenotype files that PLINK reads in separately, so the .ped files do not need to be generated again.
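
Copying the columns can also be scripted, which makes it easy to produce further phenotype files later. The sketch below assumes the clinical file is tab-delimited with the column names on its first line; neither is stated above. The helper name extract_columns is hypothetical, and column names containing spaces are not handled. The header row is kept in the output. Note that PLINK's --pheno option expects family and individual IDs in the first two columns, so a file meant for that option may need further reshaping.

```shell
# Hypothetical helper: print the named columns of a tab-delimited file,
# locating them by the header row (assumed to be the first line).
# Usage: extract_columns FILE COLNAME [COLNAME ...]
extract_columns() {
    file=$1
    shift
    awk -F '\t' -v OFS='\t' -v want="$*" '
        { sub(/\r$/, "") }                  # drop Windows line endings
        NR == 1 {
            n = split(want, names, " ")
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
                idx[j] = pos[names[j]]
            }
        }
        {
            # Runs for the header too, so the output keeps a header row.
            line = $(idx[1])
            for (j = 2; j <= n; j++) line = line OFS $(idx[j])
            print line
        }
    ' "$file"
}

# Example:
# extract_columns Page1_4_5_7a_8a_9a_10_11_13a.txt SubjectID SEX SBTY > phenotype.txt
```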

5. Generate the .map file for PLINK

In the Locus_Annotation_Files directory:
gawk '{print $2 "\t" $1 "\t" $4 "\t" $3}' Human1M_Physical_and_Genetic_Map_Coordinates.txt > illumina1M.map
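
A PLINK .map file has four columns: chromosome, SNP identifier, genetic distance, and base-pair position. The gawk command reorders the input columns into that order, which presumably means the input is laid out as SNP name, chromosome, position, genetic distance. If the input has a header row, it ends up in the .map file as well. A small check along these lines can catch that; check_map is a hypothetical name, and the test for an integer in the fourth column is our assumption about what a valid line looks like.

```shell
# Hypothetical check: report lines of a .map file that do not have exactly
# four tab-separated fields, or whose fourth field (base-pair position) is
# not an integer. A header row carried over from the input is flagged too.
# Returns non-zero if any line was reported.
check_map() {
    awk -F '\t' '
        { sub(/\r$/, "") }                  # drop Windows line endings
        NF != 4 || $4 !~ /^-?[0-9]+$/ {
            print FILENAME ":" NR ": " $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example: check_map illumina1M.map && echo "map file looks well-formed"
```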

6. Generate the .ped file for PLINK with makePED.pl

perl makePED.pl -mapfile mapping_info.txt -snpfile snp_ids.txt -dir ../data -phenofile phenotypes.txt -outfile allGSK-10-07.ped
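
Before handing the result to PLINK, it is worth checking that the .ped and .map files agree. In the standard PLINK layout each .ped line has six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two alleles per SNP, so every line should have 6 + 2 x (number of .map lines) fields. The sketch below assumes makePED.pl writes that standard layout, which is not confirmed here; check_ped_width is a hypothetical name. With about a million SNPs the lines are very long, so the check may be slow.

```shell
# Hypothetical check: every line of the .ped file should have
# 6 + 2 * (number of SNPs in the .map file) whitespace-separated fields.
# Returns non-zero if any line has a different width.
check_ped_width() {
    ped=$1
    map=$2
    # Number of SNPs = number of non-empty lines in the .map file.
    nsnp=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$map")
    awk -v want=$((6 + 2 * nsnp)) '
        NF != want {
            print "line " NR ": " NF " fields, expected " want
            bad = 1
        }
        END { exit bad + 0 }
    ' "$ped"
}

# Example: check_ped_width allGSK-10-07.ped illumina1M.map
```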

Running PLINK