
It has been a while since I last installed a local copy of the nr and taxonomy databases. This week I need to do it again on a different server, so I think it is worthwhile to write a brief note recording the whole process for my future reference.

If the ncbi-blast+ software has not been installed on the system, we first need to download and install it before setting up the two databases. In this example my installed ncbi-blast+ version is 2.2.31, but you can change it to whichever version you like from this list: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/

Open a terminal, type:

$ BLAST_VERSION="2.2.31"
$ wget "ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/${BLAST_VERSION}/ncbi-blast-${BLAST_VERSION}+-x64-linux.tar.gz"
$ tar -zxf ncbi-blast-${BLAST_VERSION}+-x64-linux.tar.gz
$ cd ncbi-blast-${BLAST_VERSION}+/bin
$ pwd

Add the returned full path to the $PATH environment variable in the ~/.bash_profile or ~/.bashrc file.
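
Editing the startup file by hand works fine, but the step can also be scripted. Below is a minimal sketch, assuming a bash-like shell. The helper name add_to_path is my own invention and not part of ncbi-blast+. It appends an export line to a startup file (~/.bashrc by default) only if that exact line is not there yet, so running it twice does no harm.

```shell
# Append a PATH entry to a startup file, but only if it is not already there.
# Usage: add_to_path DIR [STARTUP_FILE]
# (add_to_path is a hypothetical helper name, not part of ncbi-blast+)
add_to_path() {
    dir="$1"
    file="${2:-$HOME/.bashrc}"
    # The \$PATH stays literal so that it is expanded at login, not now.
    line="export PATH=\"${dir}:\$PATH\""
    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

From inside the bin directory, this could be called as add_to_path "$(pwd)", followed by source ~/.bashrc to pick up the change in the current shell.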

Now that ncbi-blast+ is installed, we need to create an environment variable, $BLASTDB, that points to the directory where we want to store the NCBI nr database and the taxonomy database. Say we want to set $BLASTDB to “/home/users/DB”; we can do this by typing:

$ echo "export BLASTDB=\"/home/users/DB\"" >> ~/.bashrc

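Now we can use the update_blastdb.pl script (shipped with ncbi-blast+) to download and set up the nr database. The script saves the archives in the current directory, so reload ~/.bashrc and change into $BLASTDB first. Because the bin directory is on $PATH, the script can be called by name from any directory:

$ source ~/.bashrc
$ mkdir -p "$BLASTDB" && cd "$BLASTDB"
$ update_blastdb.pl --passive --timeout 300 --force --verbose nr
$ ls *.gz | xargs -n1 tar -xzvf
$ rm *.gz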
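
One thing to watch with the last two commands: rm *.gz deletes every archive even if one of the extractions failed, and the nr volumes take a long time to download again. Here is a slightly more careful sketch that removes each archive only after tar reports success. The function name extract_all is hypothetical, and I assume the archives all end in .tar.gz.

```shell
# Extract every .tar.gz archive in a directory, deleting each archive only
# after tar reports success. extract_all is a hypothetical helper name.
# Usage: extract_all [DIR]   (defaults to the current directory)
extract_all() {
    dir="${1:-.}"
    for archive in "$dir"/*.tar.gz; do
        # If the glob matched nothing, the pattern itself comes back; skip it.
        [ -e "$archive" ] || continue
        if tar -xzf "$archive" -C "$dir"; then
            rm -- "$archive"
        else
            echo "extraction failed, keeping $archive" >&2
            return 1
        fi
    done
}
```

Running extract_all "$BLASTDB" would then replace the ls | xargs and rm steps, and stop at the first archive that cannot be unpacked.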

We will install the taxonomy database in the same way, again from inside the $BLASTDB directory:

$ update_blastdb.pl --passive --timeout 300 --force --verbose taxdb
$ tar -xzvf taxdb.tar.gz
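
Before moving on, it may be worth checking that the taxonomy files really ended up in place, because a missing taxdb only shows up later as N/A values in the taxonomy columns. The sketch below rests on my assumption that taxdb.tar.gz unpacks into two files named taxdb.btd and taxdb.bti. The function name check_taxdb is hypothetical.

```shell
# Report whether the two taxonomy files are present and non-empty.
# check_taxdb is a hypothetical helper; the file names taxdb.btd and
# taxdb.bti are an assumption about what taxdb.tar.gz contains.
# Usage: check_taxdb [DIR]   (defaults to $BLASTDB)
check_taxdb() {
    dir="${1:-$BLASTDB}"
    status=0
    for f in taxdb.btd taxdb.bti; do
        # -s is true only for a file that exists and has a size above zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, check_taxdb && echo "taxdb looks fine" prints the message only when both files are found in $BLASTDB.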

Now you should be able to run a local BLAST search against the nr database, where $query holds the path to your query FASTA file, by running:

$ blastp -query $query -num_threads 4 -evalue 1E-6 -db nr > out.blastp.txt

To also retrieve the taxonomy information for each hit, you need to set a customized output format, such as:

$ blastp -query $query -num_threads 4 -evalue 1E-6 -db nr -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames' >output.blastp.with_taxid.out

In this way, the taxonomy information of the BLAST hits will be reported in the last two columns (i.e. staxids and sscinames) of the output file. Here “staxids” means the Subject Taxonomy ID(s) and “sscinames” means the Subject Scientific Name(s). Below is the complete list of supported output fields for defining a customized output (these only work for output formats 6, 7, and 10).
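
Once the taxonomy columns are in the report, a quick summary of which organisms the hits come from is often the first thing I want. Here is a minimal sketch with awk, assuming the report was produced with exactly the 14 fields of the -outfmt string used above, so that staxids is column 13 and sscinames is column 14. The function name count_hits_by_taxon is hypothetical.

```shell
# Count BLAST hits per taxon in a tab-separated report that uses the
# 14-field layout above (staxids in column 13, sscinames in column 14).
# count_hits_by_taxon is a hypothetical helper name.
# Usage: count_hits_by_taxon REPORT_FILE
count_hits_by_taxon() {
    awk -F'\t' '
        /^#/ { next }                      # skip the comment lines of format 7
        NF >= 14 { hits[$13 "\t" $14]++ }  # key on taxid plus scientific name
        END { for (k in hits) printf "%d\t%s\n", hits[k], k }
    ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Calling count_hits_by_taxon output.blastp.with_taxid.out prints one line per taxon, with the hit count first and the most frequent taxon on top. A hit that carries several IDs (for example two taxids joined by a ‘;’) is counted under the combined key rather than split up.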

qseqid means Query Seq-id
qgi means Query GI
qacc means Query accession
qaccver means Query accession.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ‘;’
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a ‘/’
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), separated by a ‘;’ (in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ‘;’
scomnames means unique Subject Common Name(s), separated by a ‘;’
sblastnames means unique Subject Blast Name(s), separated by a ‘;’ (in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ‘;’ (in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a ‘<>’
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)