BLAST

BLAST databases

BLAST Inventory

  • What labs are using the BLAST databases?
    They are available to all labs within GCG, as well as to some labs outside Columbia University.
    • Who is the main “database authority” for the BLAST databases?
      Pavel Morozov, Hans-Erik Aronson
    • What kinds of databases are these? (flat-file, relational, XML, etc)
      Flat-file - OS text files in FASTA format
    • Database Attributes? (Name, BioCategory, Description, Size, Filepath)

    NCBI BlastDB

      1. nt
        • nucleotide
        • All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences)
        • 1.6 million sequences
        • /databases/blastdb/db1/ncbi
      2. nr
        • peptide
        • All non-redundant GenBank CDS translations+PDB+Swissprot+PIR+PRF
        • 4.7 million sequences
        • /databases/blastdb/db1/ncbi
      3. swissprot
        • peptide
        • SWISS-PROT protein sequence database
        • 237,000 sequences
        • /databases/blastdb/db1/ncbi
      4. pataa
        • peptide
        • protein sequences derived from the Patent division of GenBank
        • 380,000 sequences
        • /databases/blastdb/db1/ncbi
      5. patnt
        • nucleotide
        • nucleotide sequences derived from the Patent division of GenBank
        • 3.7 million sequences
        • /databases/blastdb/db1/ncbi
      6. pdbaa
        • peptide
        • protein sequences derived from the 3-dimensional PDB
        • 29,318 sequences
        • /databases/blastdb/db1/ncbi
      7. pdbnt
        • nucleotide
        • nucleotide sequences derived from the 3-dimensional PDB
        • 7,051 sequences
        • /databases/blastdb/db1/ncbi
      8. est_human
        • nucleotide
        • Human subset of GenBank+EMBL+DDBJ sequences from EST div
        • ~ 8 million sequences
        • /databases/blastdb/db1/ncbi
      9. est_mouse
        • nucleotide
        • Mouse subset of GenBank+EMBL+DDBJ sequences from EST div
        • 4.8 million sequences
        • /databases/blastdb/db1/ncbi
      10. est_others
        • nucleotide
        • Non-redundant database of all other organisms GenBank+EMBL+DDBJ EST sequences
        • ~ 11.9 million sequences
        • /databases/blastdb/db1/ncbi
      11. gss
        • nucleotide
        • Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences
        • ~ 10.5 million sequences
        • /databases/blastdb/db1/ncbi
      12. htg
        • nucleotide
        • Unfinished High Throughput Genomic sequences: phases 0, 1, 2 (finished, phase 3 HTG sequences are in nt)
        • /databases/blastdb/db1/ncbi
      13. sts
        • nucleotide
        • Non-redundant database of GenBank+EMBL+DDBJ STS divisions
        • 922,406 sequences
        • /databases/blastdb/db1/ncbi
      14. month.aa
        • peptide
        • All new or revised GenBank CDS translations + PDB + SwissProt + PIR + PRF released in the last 30 days
        • 200,216 sequences
        • /databases/blastdb/db1/ncbi
      15. month.nt
        • nucleotide
        • All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days
        • 114,786 sequences
        • /databases/blastdb/db1/ncbi
      16. mito.aa
        • peptide
        • database of mitochondrial sequences
      17. mito.nt
        • nucleotide
        • database of mitochondrial sequences
        • <size>
        • <filepath>
      18. alu.a
        • peptide
        • translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences
        • <size>
        • <filepath>
      19. alu.n
        • nucleotide
        • select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences
        • <size>
        • <filepath>
      20. vector
        • Vector subset of GenBank (R), NCBI
      21. IgSeqProt
        • peptide
        • Kabat database of sequences of immunological interest
        • <size>
        • <filepath>
      22. IgSeqNt
        • nucleotide
        • Kabat database of sequences of immunological interest
        • <size>
        • <filepath>
      23. month.est_human
        • nucleotide
        • non-redundant database of Human GenBank+EMBL+DDBJ EST sequences
        • <size>
        • <filepath>
      24. month.est_mouse
        • nucleotide
        • non-redundant database of Mouse GenBank+EMBL+DDBJ EST sequences
        • <size>
        • <filepath>
      25. month.est_others
        • nucleotide
        • non-redundant database of all other organisms GenBank+EMBL+DDBJ EST sequences
        • <size>
        • <filepath>


      26. ipi.HUMAN
        •  
      27. ipi.MOUSE
        •  
      28. ipi.RAT
        •  

    EMBL

    1. IPI
      • International Protein Index
      • <Descript>
      • <size>
      • <filepath>

    TIGR

    1. ARG
      • complete genome
    2. BTM
      • complete genome
    3. GAF
      • complete genome
    4. GBB
      • complete genome
    5. GHI
      • complete genome
    6. GHP
      • complete genome
    7. GMG
      • complete genome
    8. GMT
      • complete genome
    9. GTP
      • complete genome
    10. estfa1
      • ESTs
      • <size>
      • <filepath>
    11. estfa2
      • ESTs
      • <size>
      • <filepath>
    12. estfa3
      • ESTs
      • <size>
      • <filepath>
    13. estfa4
      • ESTs
      • <size>
      • <filepath>
    14. estfa5
      • ESTs
      • <size>
      • <filepath>
    15. s_gordonii
      • genome
      • <size>
      • <filepath>
    16. westfal
      • genome
      • <size>
      • <filepath>

     

    GOLDEN-PATH GENOMES

    OTHER GENOMES

    [mhonig@adgate1 /]$ df -k /database
    Filesystem          1K-blocks       Used  Available Use% Mounted on
    adterap:/databases 1682162592 1071094536  525619128  68% /adtera/databases
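
    The `df -k` figures above are in 1K blocks, which makes them hard to read at a glance. The following is a minimal sketch that converts them to GiB and recomputes the use percentage. The function name is hypothetical, and the rounding rule (used / (used + available), rounded up) is an assumption about how `df` derives its `Use%` column, although it does reproduce the 68% shown.

```python
def summarize_df_row(blocks_1k, used_1k, available_1k):
    """Convert 1K-block counts from a `df -k` row into GiB and a use percentage."""
    kib_per_gib = 1024 ** 2
    usable = used_1k + available_1k
    # Ceiling division: round the percentage up to the next whole number.
    use_percent = -(-used_1k * 100 // usable)
    return {
        "total_gib": blocks_1k / kib_per_gib,
        "used_gib": used_1k / kib_per_gib,
        "available_gib": available_1k / kib_per_gib,
        "use_percent": use_percent,
    }


# Values copied from the df output above.
summary = summarize_df_row(1682162592, 1071094536, 525619128)
for key, value in summary.items():
    print(f"{key}: {value:.1f}" if isinstance(value, float) else f"{key}: {value}")
```

    Used plus available is smaller than the total block count, which probably reflects space reserved by the filesystem, so the percentage is taken against the usable space rather than the total.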

    • Anticipated yearly growth? (Megabytes/Gigabytes)

      NCBI: 20-30 GB

      EMBL: will probably double

      goldenPath: will probably double

    • Backup procedures? How often?
      • Database backups (Hot, Cold, Both) [and/or]
      • Operating system backup

      OS backup

    • What servers/operating systems are hosting them (IP addresses)

      ADTERA (IP address still to be obtained from Hans-Erik, or looked up with nslookup adtera)

      OS - Red Hat Enterprise Linux AS release 3 (Taroon Update 6)

      • BlastMachine
      • GeneMatcher2
    • Approximately how many *active* users?

      Not sure – perhaps all AMDeC users

    • How often is the database used? (Daily, Weekly, Monthly)

      - Daily

    • What platforms are being used? (Oracle, MySQL, PostgreSQL, etc)

      - Not applicable (N/A) for RDBMS

    • What applications are using these databases?
      • Web interface? – is dead, no current plans to restore it
      • Application (GUI)?
      • Command line interface (CUI)?
        • CUI
        • BLAST: 99% of time
        • EMBOSS, BioPerl, HMMER, SSAHA, MUMmer
    • Is it accessible from outside the firewall to public users?

      Yes – users connect to ADGATE via SSH

    • What is the primary purpose of the database? (What types of information does it contain?)

      Homology – to compare and find similar sequences

      It contains nucleotide and amino-acid sequences for numerous species

    • Are there any issues or problems with the database?
      • Specific error messages popping up?
      • Problems connecting from the application or web interface?
      • Performance issues (queries are slow, freezes at times, etc)
      • etc...
          • Frequent error message – “Segmentation fault”
          • Web-interface no longer works
          • Formatting the databases fails at times, freezes on occasion
          • Sometimes there are network problems between ADGATE, ADTERA, and the BLASTer machine

           

    • Would they like help in administering the database?

      Not at the moment - it is mostly automatic with some manual labor at times.

    • What additional features or changes would users like to see?
      • new tables or queries?
      • additional screens on application or web interface?
      • migrate to different database platform (e.g. MySQL to Oracle)?

      None

    • May I access the database and if so, what is the login info?

      Yes – obtain OS permissions from Hans-Erik
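
    The inventory above classifies each database as nucleotide or peptide, and that category determines which BLAST program can search it. The following is a minimal sketch of that pairing. It assumes the legacy NCBI `blastall` front end with its `-p`, `-d` and `-i` options; the page does not say which BLAST front end is installed. The `DB_CATEGORY` table is a hand-copied subset of the inventory, and the function name and query file name are hypothetical. The code only builds the command; it does not run it.

```python
import shlex

# (query type, database category) -> BLAST program.
# tblastx (translated query against translated database) is left out for brevity.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("peptide", "peptide"): "blastp",
    ("nucleotide", "peptide"): "blastx",
    ("peptide", "nucleotide"): "tblastn",
}

# Subset of the inventory above: database name -> BioCategory.
DB_CATEGORY = {
    "nt": "nucleotide",
    "nr": "peptide",
    "swissprot": "peptide",
    "est_human": "nucleotide",
}


def blastall_command(query_type, database, query_file):
    """Build a blastall argument list for the given query type and database."""
    if database not in DB_CATEGORY:
        raise KeyError(f"unknown database: {database}")
    program = PROGRAMS[(query_type, DB_CATEGORY[database])]
    return ["blastall", "-p", program, "-d", database, "-i", query_file]


# A nucleotide query against a peptide database needs a translated search.
print(shlex.join(blastall_command("nucleotide", "swissprot", "query.fa")))
```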

    Additional Information

    Current copies of many major FASTA-format sequence databases are maintained, and some that change frequently are updated on a weekly schedule (cron jobs). Formatted copies of each database are maintained for all three major platforms available in the Bioinformatics Core Facility:

  • GeneMatcher2
  • BlastMachine
  • the UNIX hosts (what are the UNIX hosts?)

    The BlastMachine and the GeneMatcher2 are the preferred platforms for sequence searches. Such searches should NOT normally be done directly on the UNIX hosts, but these versions of the databases are available there for other purposes (e.g. to retrieve sequences of interest using the command ‘fastacmd’).
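
    As an illustration of the retrieval use case, the following is a minimal sketch that builds a ‘fastacmd’ command line. It assumes the legacy NCBI `fastacmd` options `-d` (database) and `-s` (sequence identifier). The database path is taken from the inventory above; the accession is only a placeholder and the function name is hypothetical. The command is printed, not executed.

```python
import shlex


def fastacmd_command(database, identifier):
    """Build a fastacmd argument list that fetches one sequence by identifier."""
    return ["fastacmd", "-d", database, "-s", identifier]


# Placeholder accession; substitute the identifier of the sequence of interest.
cmd = fastacmd_command("/databases/blastdb/db1/ncbi/swissprot", "P01013")
print(shlex.join(cmd))
```

    The resulting list can be handed to `subprocess.run` on a host where `fastacmd` and the formatted database are available.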

    Directories for Frequently Updated Sequence Databases:

    Databases that are successfully downloaded are then formatted for the various platforms. The previous version of each database is retained until the next download. On the Linux fileserver (ADTERA?) and on the BlastMachine, there are two top-level directories specifically for frequently updated data. The two directories are updated in alternating fashion, typically at weekly intervals. On the Linux and UNIX hosts, environment variables are automatically set to point to the current version, both for these hosts directly and for the BlastMachine. This ensures that serial searches against a given database name will always use the same database files, even if an update occurs during the session.

    On both the local Linux/UNIX hosts and the BlastMachine, these two directories are called db1 ( /databases/blastdb/db1 ) and db2 ( /databases/blastdb/db2 ). db1 and db2 currently contain the ncbi and embl directories. In addition, both db1 and db2 contain links back up to the higher-level directories containing static data.
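
    The alternating update scheme can be sketched as follows. This is an illustration under stated assumptions, not the facility's actual scripts: the environment variable name `BLASTDB_SLOT` and both function names are hypothetical, since the page does not name the variables that are set at login. Only the directory names and the `/databases/blastdb` root come from the text above.

```python
import os

DB_ROOT = "/databases/blastdb"    # root directory named in the text above
SLOTS = ("db1", "db2")            # the two alternating top-level directories
CURRENT_SLOT_VAR = "BLASTDB_SLOT"  # hypothetical variable name


def next_update_slot(current_slot):
    """Return the directory that should receive the next download.

    The slot that is not current gets overwritten, so the current
    version stays intact until the update has finished.
    """
    if current_slot not in SLOTS:
        raise ValueError(f"unknown slot: {current_slot}")
    return SLOTS[1 - SLOTS.index(current_slot)]


def database_path(name, source="ncbi", env=None):
    """Resolve a database name against the slot recorded in the environment."""
    env = os.environ if env is None else env
    slot = env.get(CURRENT_SLOT_VAR, SLOTS[0])
    return f"{DB_ROOT}/{slot}/{source}/{name}"


# A session that logged in while db1 was current keeps using db1,
# even though the next update is written to db2.
session_env = {CURRENT_SLOT_VAR: "db1"}
print(database_path("nt", env=session_env))
print(next_update_slot(session_env[CURRENT_SLOT_VAR]))
```

    Because the variable is read from the session's own environment, switching the current slot for new logins does not change the paths used by a session that is already running.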

    Due to limited disk space on the GeneMatcher2 system, only a single copy of each database is kept. New runs will always be against the most current version of the database.

     

     
