NCBI BlastDB
- What labs are using the BLAST databases?
The databases are available to all labs within GCG, as well as to some labs outside Columbia University.
- Who is the main “database authority” for the BLAST databases?
Pavel Morozov (pm59<at>columbia.edu), Hans-Erik Aronson
- What kinds of databases are these? (flat-file, relational, XML, object-oriented, etc)
Flat-file - OS text files in FASTA format
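Because the databases are plain FASTA text files, they can be read with a few lines of code. The following is a minimal sketch, not a tool used on this system: the function name `read_fasta` and the sample records are made up for illustration. It relies only on the FASTA convention that each record starts with a `>` header line followed by one or more sequence lines.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input, not taken from the real databases.
sample = io.StringIO(">seq1 example\nACGT\nACGT\n>seq2 example\nGGCC\n")
for name, seq in read_fasta(sample):
    print(name, len(seq))
```

Counting the yielded records gives a quick sanity check against the sequence counts listed for each database.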
- Database Attributes? (Name, BioCategory, Description, Size, Filepath)
NCBI BlastDB
- nt
- nucleotide
- All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences)
- 1.6 million sequences
- /databases/blastdb/db1/ncbi
- nr
- peptide
- All non-redundant GenBank CDS translations+PDB+Swissprot+PIR+PRF
- 4.7 million sequences
- /databases/blastdb/db1/ncbi
- swissprot
- peptide
- SWISS-PROT protein sequence database
- 237,000 sequences
- /databases/blastdb/db1/ncbi
- pataa
- peptide
- protein sequences derived from the Patent division of GenBank
- 380,000 sequences
- /databases/blastdb/db1/ncbi
- patnt
- nucleotide
- nucleotide sequences derived from the Patent division of GenBank
- 3.7 million sequences
- /databases/blastdb/db1/ncbi
- pdbaa
- peptide
- protein sequences derived from the 3-dimensional PDB
- 29,318 sequences
- /databases/blastdb/db1/ncbi
- pdbnt
- nucleotide
- nucleotide sequences derived from the 3-dimensional PDB
- 7,051 sequences
- /databases/blastdb/db1/ncbi
- est_human
- nucleotide
- Human subset of GenBank+EMBL+DDBJ sequences from EST div
- ~ 8 million sequences
- /databases/blastdb/db1/ncbi
- est_mouse
- nucleotide
- Mouse subset of GenBank+EMBL+DDBJ sequences from EST div
- 4.8 million sequences
- /databases/blastdb/db1/ncbi
- est_others
- nucleotide
- Non-redundant database of all other organisms GenBank+EMBL+DDBJ EST sequences
- ~ 11.9 million sequences
- /databases/blastdb/db1/ncbi
- gss
- nucleotide
- Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences
- ~ 10.5 million sequences
- /databases/blastdb/db1/ncbi
- sts
- nucleotide
- Non-redundant database of GenBank+EMBL+DDBJ STS divisions
- 922,406 sequences
- /databases/blastdb/db1/ncbi
- month.aa
- peptide
- All new or revised GenBank CDS translations + PDB + SwissProt + PIR + PRF released in the last 30 days
- 200,216 sequences
- /databases/blastdb/db1/ncbi
- month.nt
- nucleotide
- All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days
- 114,786 sequences
- /databases/blastdb/db1/ncbi
- mito.aa
- peptide
- database of mitochondrial sequences
- 2,222 sequences
- /databases/blastdb/db1/ncbi
- mito.nt
- nucleotide
- database of mitochondrial sequences
- 129 sequences
- /databases/blastdb/db1/ncbi
- alu.a
- peptide
- translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences
- 1,962 sequences
- /databases/blastdb/db1/ncbi
- alu.n
- nucleotide
- select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences
- 327 sequences
- /databases/blastdb/db1/ncbi
- vector
- nucleotide
- Vector subset of GenBank (R), NCBI
- 911 sequences
- /databases/blastdb/db1/ncbi
- yeast.aa
- peptide
- Yeast amino-acid sequences
- 6,298 sequences
- /databases/blastdb/db1/ncbi
- month.est_human
- nucleotide
- non-redundant database of Human GenBank+EMBL+DDBJ EST sequences
- 61,643 sequences
- /databases/blastdb/db1/ncbi
- month.est_mouse
- nucleotide
- non-redundant database of Mouse GenBank+EMBL+DDBJ EST sequences
- 4,132 sequences
- /databases/blastdb/db1/ncbi
- month.est_others
- nucleotide
- non-redundant database of all other organisms GenBank+EMBL+DDBJ EST sequences
- 211,077 sequences
- /databases/blastdb/db1/ncbi
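The attributes listed above can be treated as a small catalog that maps each database name to its BioCategory and location. The sketch below is illustrative only: it covers a handful of the entries, the names `CATALOG`, `db_path`, and `pick_program` are hypothetical, and it assumes that a database is addressed as the shared directory plus the database name. The program table reflects the standard BLAST pairing of query type and database type.

```python
import posixpath

# Directory shared by every database in the list above.
BLASTDB_DIR = "/databases/blastdb/db1/ncbi"

# A few entries from the list above: name -> BioCategory.
CATALOG = {
    "nt": "nucleotide",
    "nr": "peptide",
    "swissprot": "peptide",
    "pdbaa": "peptide",
    "est_human": "nucleotide",
}

# Standard BLAST program for each (query type, database type) pair.
# tblastx (translated query vs. translated database) is a further
# option for the nucleotide/nucleotide case and is left out here.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("peptide", "peptide"): "blastp",
    ("nucleotide", "peptide"): "blastx",
    ("peptide", "nucleotide"): "tblastn",
}


def db_path(name):
    """Return the assumed path of a database (directory + name)."""
    if name not in CATALOG:
        raise KeyError(f"unknown database: {name}")
    return posixpath.join(BLASTDB_DIR, name)


def pick_program(query_type, db_name):
    """Choose a BLAST program for a query type and a catalog database."""
    return PROGRAMS[(query_type, CATALOG[db_name])]


print(db_path("nt"), pick_program("peptide", "nt"))
```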
- Anticipated yearly growth? (Megabytes/Gigabytes)?
- Backup procedures? How often?
- Database backups (Hot, Cold, Both) [and/or]
- Operating system backup
OS backup
- What servers/operating systems are hosting them (IP addresses)
adtera.cu-genome.edu
- Approximately how many *active* users?
Not sure - perhaps all AMDeC users
- How often is the database used? (Daily, Weekly, Monthly)
Daily
- What platforms are being used? (Oracle, MySQL, PostgreSQL, etc)
Not applicable (N/A) for RDBMS
- What applications are using these databases?
- Web interface?
- Application (GUI)?
- Command user interface (CUI)?
CUI
BLAST: 99% of the time
EMBOSS, BioPerl, HMMER, SSAHA, MUMmer
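Since nearly all access is through BLAST on the command line, a typical invocation can be sketched as follows. This is an assumption-laden example: the page does not say which BLAST release is installed, so the sketch assumes the legacy NCBI `blastall` program and its `-p`, `-d`, `-i`, `-o`, and `-e` options (newer BLAST+ releases use different program names and options). The file names `query.fa` and `hits.txt` are hypothetical, and the database path assumes the directory given in the attribute list plus the database name. The code only builds the command and does not run it.

```python
import shlex


def blastall_command(program, database, query_file, output_file, evalue=None):
    """Build an argument list for the legacy `blastall` program."""
    args = [
        "blastall",
        "-p", program,      # BLAST program, e.g. blastn or blastp
        "-d", database,     # database name or path
        "-i", query_file,   # FASTA file holding the query sequence(s)
        "-o", output_file,  # file that receives the report
    ]
    if evalue is not None:
        args += ["-e", str(evalue)]  # expectation value cutoff
    return args


cmd = blastall_command(
    "blastn",
    "/databases/blastdb/db1/ncbi/nt",  # assumed: directory + database name
    "query.fa",                        # hypothetical query file
    "hits.txt",                        # hypothetical output file
    evalue=1e-5,
)
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess.run`.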
- Is it accessible from outside the firewall to public users?
YES - users connect to ADGATE via SSH
- What is the primary purpose of the database? (What types of information does it contain?)
Homology searching - comparing sequences to find similar ones.
The databases contain nucleotide and amino-acid sequences for numerous species.
- Are there any issues or problems with the database?
- Specific error messages popping up?
- Problems connecting from the application or web interface?
- Performance issues (queries are slow, freezes at times, etc)
- etc...
Frequent error message: "Segmentation fault"
The web interface no longer works
Formatting the databases sometimes fails and occasionally freezes
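When a job is launched from a script, a segmentation fault can be told apart from an ordinary failure by inspecting the exit status. The sketch below is a hypothetical helper, not part of the existing setup. It relies on the documented POSIX behavior of Python's `subprocess` module, where a negative `returncode` of `-N` means the child was terminated by signal `N`.

```python
import signal


def describe_exit(returncode):
    """Turn a subprocess return code into a short diagnostic string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # Negative values mean the process was terminated by a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"failed with exit code {returncode}"


# A segmentation fault shows up as termination by SIGSEGV.
print(describe_exit(-signal.SIGSEGV))
```

In practice the argument would be the `returncode` attribute of the object returned by `subprocess.run`, which makes it possible to log crashes separately from runs that simply returned an error.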
- Would they like help in administering the database?
Not at the moment - it is mostly automatic with some manual labor at times.
- What additional features or changes would users like to see?
- new tables or queries?
- additional screens on application or web interface?
- migrate to different database platform (i.e. MySQL to Oracle)?
none
- May I access the database and if so, what is the login info?
Yes - obtain OS permissions from Hans-Erik