Using Anaconda to manage the local Python environment



It happens that different bioinformatic tools that we want to install require specific python library versions. What makes it worse is that sometimes the python libraries they require are incompatible with each other, leaving our local python environment a mess to manage. Anaconda is a great tool for managing the local python environment and its libraries. With Anaconda, we can easily create an isolated virtual environment for a specific python version:

# download the latest Anaconda distribution for Linux
$ wget
# use conda to create a python 3.5 environment named "py35"
$ conda create -n py35 python=3.5 anaconda
# activate the py35 environment
$ source activate py35
# to verify the current active python version
$ python --version
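When scripting around such environments, it can help to check programmatically which interpreter the shell will actually run. A generic sketch of that check (conda itself is not required here; `python3`/`python` lookup order is my own illustration):

```shell
# Resolve the interpreter the shell would run, then report its version string.
py=$(command -v python3 || command -v python)
"$py" --version 2>&1 | head -n 1
```

The same `command -v` lookup is a quick way to confirm that activating an environment really changed which `python` is first on $PATH.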


Local installation of ncbi-blast+ together with the nr and taxonomy database



It has been a while since I last installed my local nr and taxonomy databases. This week, I need to do this again for a different server, so I think it is worthwhile to write a brief note recording the whole process for my future reference.

If ncbi-blast+ has not been installed on our system, we need to download and install it before setting up the two databases. For this example, my installed ncbi-blast+ version is 2.2.31, but you can change it to whichever version you like from NCBI's release list:

Open a terminal, type:

$ BLAST_VERSION="2.2.31"
$ wget "ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/${BLAST_VERSION}/ncbi-blast-${BLAST_VERSION}+-x64-linux.tar.gz"
$ tar -zxf ncbi-blast-${BLAST_VERSION}+-x64-linux.tar.gz
$ cd ncbi-blast-${BLAST_VERSION}+/bin
$ pwd

Add the returned full path to the $PATH environment variable in the ~/.bash_profile or ~/.bashrc file.
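The effect of that $PATH edit can be sketched with a throwaway directory and a fake executable (all names here are illustrative, not part of the actual BLAST install):

```shell
# Toy illustration: put a fake "blastp" on PATH and confirm the shell finds it.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho blast-ok\n' > "$demo_dir/blastp"
chmod +x "$demo_dir/blastp"
export PATH="$demo_dir:$PATH"   # same effect as the line added to ~/.bashrc
command -v blastp               # shows the resolved path inside $demo_dir
blastp                          # prints: blast-ok
```

Prepending (rather than appending) the directory ensures your freshly installed binaries win over any older system-wide copies.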

Now that we have installed ncbi-blast+, we need to create an environment variable $BLASTDB pointing to the directory where we want to store the NCBI nr and taxonomy databases. Say we want to set $BLASTDB to “/home/users/DB”; we can do this by typing:

$ echo "export BLASTDB=\"/home/users/DB\"" >> ~/.bashrc
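The append-and-source step can be sanity-checked with a throwaway file standing in for the real ~/.bashrc (a sketch only; the path is the one from the text):

```shell
# Append the export line to a throwaway rc file, source it, and verify the variable.
rc=$(mktemp)
echo "export BLASTDB=\"/home/users/DB\"" >> "$rc"
. "$rc"
echo "$BLASTDB"    # prints: /home/users/DB
```

Remember that the real ~/.bashrc is only read by new shells, so either `source ~/.bashrc` or open a fresh login session after editing it.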

Now we can use the update_blastdb.pl script (shipped with ncbi-blast+) to download and set up the nr database:

$ perl update_blastdb.pl --passive --timeout 300 --force --verbose nr
$ ls *.gz |xargs -n1 tar -xzvf
$ rm *.gz

We will install the taxonomy database in a similar way:

$ perl update_blastdb.pl --passive --timeout 300 --force --verbose taxdb
$ tar xzvf taxdb.tar.gz

Now you should be able to run local blast against the nr database by running

$ blastp -query $query -num_threads 4 -evalue 1E-6 -db nr > out.blastp.txt

To further retrieve the taxonomy information of each hit, you need to specify a customized output format, such as:

$ blastp -query $query -num_threads 4 -evalue 1E-6 -db nr -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames' >output.blastp.with_taxid.out

In this way, the taxonomy information of the BLAST hits will be reported in the last two columns (i.e. staxids and sscinames) of the output file, where “staxids” means Subject Taxonomy ID(s) and “sscinames” means Subject Scientific Name(s). Below is a complete list of supported output fields for defining a customized output (these only work with output formats 6, 7, and 10).

qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ‘;’
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a ‘/’
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), separated by a ‘;’
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ‘;’
scomnames means unique Subject Common Name(s), separated by a ‘;’
sblastnames means unique Subject Blast Name(s), separated by a ‘;’
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ‘;’
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a ‘<>’
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)
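Since formats 6, 7, and 10 are plain tab-separated text, the taxonomy columns can be pulled out with standard Unix tools. A sketch using a two-line mock of the 14-column output requested above (all hit values are made up for illustration):

```shell
# Mock two hits in the 14-column custom format, then list unique taxon labels.
dir=$(mktemp -d)
printf 'q1\ts1\t98.0\t120\t2\t0\t1\t120\t5\t124\t1e-50\t230\t9606\tHomo sapiens\n' > "$dir/hits.tsv"
printf 'q1\ts2\t85.0\t118\t17\t1\t1\t118\t3\t119\t1e-30\t140\t10090\tMus musculus\n' >> "$dir/hits.tsv"
# staxids is column 13 and sscinames is column 14 in this field order.
cut -f13,14 "$dir/hits.tsv" | sort -u
```

The same `cut`/`sort` pattern works on the real output.blastp.with_taxid.out once format-7 comment lines (starting with "#") are filtered out with `grep -v '^#'`.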






Installing Augustus with manual bamtools installation



Augustus is a very popular tool for gene annotation; however, its installation can be a bit tricky. For example, if we just download and compile Augustus as below, chances are it will not work.

tar xvzf augustus-3.2.3.tar.gz
cd augustus-3.2.3
make

You will probably get these errors:

cd auxprogs && make
make[1]: Entering directory '/home/jxyue/Tools/augustus-3.2.3/auxprogs'
cd bam2hints; make;
make[2]: Entering directory '/home/jxyue/Tools/augustus-3.2.3/auxprogs/bam2hints'
g++ -Wall -O2    -c -o bam2hints.o -I/usr/include/bamtools bam2hints.cc
bam2hints.cc: fatal error: api/BamReader.h: No such file or directory
 #include <api/BamReader.h>
compilation terminated.
Makefile:29: recipe for target 'bam2hints.o' failed
make[2]: *** [bam2hints.o] Error 1
make[2]: Leaving directory '/home/jxyue/Tools/augustus-3.2.3/auxprogs/bam2hints'
Makefile:7: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/home/jxyue/Tools/augustus-3.2.3/auxprogs'
Makefile:7: recipe for target 'all' failed
make: *** [all] Error 2

This is because the auxiliary tools bam2hints and filterBam depend on a prior installation of bamtools with its library paths properly configured.

Therefore, let’s install bamtools first. It is much easier if we can do a system-wide installation like:

sudo apt-get install bamtools libbamtools-dev

However, when we are working on a shared computing server without root permission, we will most likely need to install bamtools under a local directory as below.

git clone https://github.com/pezmaster31/bamtools.git
cd bamtools
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/your/path/to/bamtools ..
make install

Now bamtools should be correctly installed. Next, we need to modify the Makefiles of bam2hints and filterBam to adapt them to our manually installed bamtools.

First, go to the “augustus-3.2.3/auxprogs/bam2hints” directory and make the following changes to the Makefile:

# original:
INCLUDES = /usr/include/bamtools
# change to:
INCLUDES = $(BAMTOOLS)/include/bamtools

# original:
LIBS = -lbamtools -lz
# change to:
LIBS = $(BAMTOOLS)/lib64/libbamtools.a -lz

Then, go to the “augustus-3.2.3/auxprogs/filterBam/src” directory and make the following changes to the Makefile:

# original:
BAMTOOLS = /usr/include/bamtools
# change to (comment it out, so the value passed on the make command line is used):
# BAMTOOLS = /usr/include/bamtools

# original:
INCLUDES = -I$(BAMTOOLS) -Iheaders -I./bamtools
# change to:
INCLUDES = -I$(BAMTOOLS)/include/bamtools -Iheaders -I./bamtools

# original:
LIBS = -lbamtools -lz
# change to:
LIBS = $(BAMTOOLS)/lib64/libbamtools.a -lz

Now, we are finally ready to compile Augustus. Get back to the “augustus-3.2.3” directory and type “make BAMTOOLS=/your/path/to/bamtools”. Voilà!
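The BAMTOOLS=... trick works because variables set on the make command line override assignments inside the Makefile. A toy sketch of the mechanism (throwaway Makefile, made-up paths):

```shell
# A throwaway Makefile whose BAMTOOLS default is overridden from the command line.
dir=$(mktemp -d)
printf 'BAMTOOLS ?= /usr/include/bamtools\nshow:\n\t@echo $(BAMTOOLS)\n' > "$dir/Makefile"
( cd "$dir" && make show )                          # prints the built-in default
( cd "$dir" && make show BAMTOOLS=/opt/bamtools )   # prints /opt/bamtools
```

This is why the edits above can leave $(BAMTOOLS) in the Makefiles: the actual path is injected once, at build time.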


Custom installation of PacBio’s GenomicConsensus (Quiver)



It has always been a tricky experience for me to install individual PacBio’s SMRTanalysis tools on isolated machines that do not have the SMRTanalysis package pre-installed. Below is a summary of my recent custom installation of the GenomicConsensus (Quiver) tool, which is used for “polishing” (error-correction) PacBio assemblies.

First of all, I installed Conda to set up an independent python environment isolated from the one pre-configured on the server. To stay compatible with the SMRTanalysis tools, I chose python 2.7.9 for my conda environment.

conda install python=2.7.9

Then log out and log back in.

Next, install the required Python libraries NumPy and h5py:

conda install numpy=1.12.1
conda install h5py=2.7.0

Now we are ready to install the required PacBio libraries.

pip install git+
git clone https://github.com/PacificBiosciences/ConsensusCore.git
cd ConsensusCore
python setup.py install --swig=$SWIG --boost=$BOOST

Here you should replace $SWIG with the path to your swig executable and $BOOST with the path to your boost installation (the top-level directory). (Note that if swig is in your $PATH and boost is in /usr/local or /usr/include, you do not need to specify these flags on the command line; setup.py will find them automatically.)

Now, we can install GenomicConsensus (Quiver). Given the significant revisions made to GenomicConsensus recently, I install an older version (v1.1.0) here to make sure the installation succeeds. (I also tried the most recent version, but it didn’t work.)

git clone https://github.com/PacificBiosciences/GenomicConsensus.git
cd GenomicConsensus
git reset --hard 654d0276d4a03f269cd1a14ddd7dfd0f54bede45
python setup.py install

Now, GenomicConsensus (Quiver) should be successfully installed. Let’s test it!

quiver -h

usage: --referenceFilename REFERENCEFILENAME -o
                        OUTPUTFILENAMES [-j NUMWORKERS]
                        [--minConfidence MINCONFIDENCE]
                        [--minCoverage MINCOVERAGE]
                        [--noEvidenceConsensusCall {nocall,reference,lowercasereference}]
                        [--coverage COVERAGE] [--minMapQV MINMAPQV]
                        [--referenceWindow REFERENCEWINDOWSASSTRING]
                        [--referenceWindowsFile REFERENCEWINDOWSASSTRING]
                        [--barcode _BARCODE] [--readStratum READSTRATUM]
                        [--algorithm ALGORITHM]
                        [--parametersFile PARAMETERSFILE]


Compiling 64-bit MUMmer


I encountered the following error when running nucmer (part of the MUMmer package) for a large genome assembly:

mummer: suffix tree construction failed: textlen=727301215 larger than maximal textlen=536870908

It appears that we can work around this problem by compiling a 64-bit version of MUMmer. There are two ways to do this:

  1. Edit the MUMmer3.23/src/kurtz/libbasedir/types.h file: add the line “#define SIXTYFOURBITS” just before the line “#ifdef SIXTYFOURBITS”, then run make install to compile the package.
  2. When compiling the package, define SIXTYFOURBITS on the make command line instead (the flags here are my reconstruction of the usual 64-bit MUMmer recipe):
    make clean
    make CPPFLAGS="-O3 -DSIXTYFOURBITS"
Either way should work.
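Option 1 can also be scripted with sed instead of hand-editing. A sketch against a toy copy of types.h (the real file lives at MUMmer3.23/src/kurtz/libbasedir/types.h; the typedef content here is made up):

```shell
# Insert "#define SIXTYFOURBITS" immediately before the "#ifdef SIXTYFOURBITS" line.
dir=$(mktemp -d)
printf '#ifdef SIXTYFOURBITS\ntypedef unsigned long Uint;\n#endif\n' > "$dir/types.h"
sed -i '/#ifdef SIXTYFOURBITS/i #define SIXTYFOURBITS' "$dir/types.h"
head -n 1 "$dir/types.h"    # prints: #define SIXTYFOURBITS
```

Scripting the edit this way makes the fix reproducible if you ever rebuild MUMmer from a fresh tarball.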


Installing tRNAscan-SE and snoscan



For tRNAscan:

tar -zxvf tRNAscan-SE.tar.gz
ln -s tRNAscan-SE-1.3.1 tRNAscan-SE
cd tRNAscan-SE 

We might need to edit the Makefile if we want to customize the installation directory; here I used emacs for this:

emacs Makefile

I modified the following lines from:

BINDIR  = $(HOME)/bin
LIBDIR  = $(HOME)/lib/tRNAscan-SE
MANDIR  = $(HOME)/man


BINDIR  = $(HOME)/local/bin
LIBDIR  = $(HOME)/local/lib/tRNAscan-SE
MANDIR  = $(HOME)/local/share/man

Then, we are ready to compile the program.

make install
chmod 755 ~/local/bin/tRNAscan-SE

When running tRNAscan-SE, I got the following error message:
“Can’t locate tRNAscanSE/ in @INC (@INC contains:…”

So we need to patch ~/local/bin/tRNAscan-SE at line 28 with the following line, so that tRNAscan-SE can find the modules it needs.

use lib "/home/<my_userid>/local/bin";


For snoscan:

tar xvzf snoscan-0.9b.tar.Z
ln -s snoscan-0.9b snoscan
cd snoscan

According to the README file, we need to compile squid-1.5j first, so

cd squid-1.5j

Again, we can customize the installation paths in its Makefile like we did for tRNAscan-SE:

SQUIDHOME  = $(HOME)/local/lib/squid
BINDIR     = $(HOME)/local/bin
SCRIPTDIR  = $(HOME)/local/scripts
MANDIR     = $(HOME)/local/share/man

There is a function-name conflict between getline in the original squid-1.5j code and the getline declared in modern glibc headers, so we need to rename it first:

sed -i 's/getline/getLine/g' sqio.c

then we can compile squid-1.5j:

make install
Switch back to the snoscan directory, and modify its Makefile to set the installation path:

cd ..
emacs Makefile

BINDIR  = $(HOME)/local/bin
MANDIR  = $(HOME)/local/share/man

make install




Wrapping R functions for use in Perl scripts



It sometimes comes in handy to be able to call R functions from our Perl scripts. So how can we do that?
Here is an example.

Say we want to wrap the R functions phyper(), dhyper(), ppois() and atan() as a Perl module named “rmath”.
First we need to create a SWIG interface file (here named “rmath.i”, matching the swig command below) with the following content:

%module rmath
       double phyper (double x, double NR, double NB, double n, int lower_tail, int log_p) ;
       double dhyper (double x, double NR, double NB, double n, int log_p) ;
       double ppois(double x, double lambda, int lower_tail, int log_p) ;
       double atan(double x) ;

We will use swig (http://www.swig.org) to wrap it. On Mac OS X, we can simply install swig using Homebrew (https://brew.sh):

# install swig with homebrew
brew install swig

We need to download a temporary copy of R for compiling the standalone Rmath library.

# download the R source (R-3.2.1) from CRAN
wget https://cran.r-project.org/src/base/R-3/R-3.2.1.tar.gz
tar xvzf R-3.2.1.tar.gz
ln -s R-3.2.1 R
cd R
./configure --prefix=/<Path to this temporary R>/R
cd ./src/nmath/standalone
make install

Now we use swig to wrap the R functions that we need into Perl modules.

swig -perl rmath.i 

rm *.o
for f in *.c ; do
  gcc -I. -I../../../src/include -I./.. -I/usr/local/include -DHAVE_CONFIG_H -DMATHLIB_STANDALONE -arch x86_64 -g -O2 -std=gnu99 -c $f -o ${f/.c/.o} `perl -MExtUtils::Embed -e ccopts`
  ls -l ${f/.c/.o}
done
rm sunif.o

gcc -bundle -flat_namespace -undefined dynamic_lookup -o rmath.bundle *.o
sudo cp rmath.bundle  /usr/local/ActivePerl-5.18/site/lib/ 
sudo cp rmath.pm /usr/local/ActivePerl-5.18/site/lib/

Note: I am using ActivePerl-5.18 here since ActivePerl-5.22 doesn’t work for this for some reason.
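Once rmath.bundle and rmath.pm are on @INC, the wrapped functions are called as, e.g., `perl -Mrmath -e 'print rmath::atan(1)'` (names as defined in the interface file above). As a quick reference value for checking the wrapper, atan(1) = π/4 ≈ 0.7854, which awk can confirm without R:

```shell
# atan(1) via awk's built-in atan2, as a reference value for rmath::atan(1).
awk 'BEGIN { printf "%.4f\n", atan2(1, 1) }'    # prints: 0.7854
```

Comparing a wrapped function against an independent implementation like this is a cheap smoke test after any rebuild of the bundle.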


Installing blasr on the CentOS/RHEL system



I am working heavily with PacBio data recently, so it is nice to have PacBio’s blasr (https://github.com/PacificBiosciences/blasr) installed on our local server.

First, make sure the glibc, glibc-devel and glibc-static libraries have been pre-installed on the system. I got errors when compiling blasr without the glibc-devel and glibc-static libraries.

Second, we need to install the hdf5 library, and this part is a bit tricky. I first installed the latest version, hdf5-1.8.15-patch1, but got errors when compiling blasr against it. A bit of research online showed this is due to an API change in the hdf5 library that happened in version hdf5-1.8.12. So I removed hdf5-1.8.15-patch1 and installed hdf5-1.8.11 instead. Below are the commands that I used for installing the hdf5-1.8.11 library and blasr.


# download and install hdf5-1.8.11
tar -xvzf hdf5-1.8.11.tar.gz
cd hdf5-1.8.11
./configure --enable-cxx --prefix=/home/yjx/local/lib/libhdf5 --with-zlib=/home/yjx/local/lib/libhdf5
# I don't have the root permission so I installed this library locally under my home directory (/home/yjx). 
# You can use "--prefix=/usr/local/lib/libhdf5 --with-zlib=/usr/local/lib/libhdf5" instead if you have root permission.

make check > check.out
# Please check the content of the check.out file generated here to make sure everything is OK
make install > install.log
cd ..

# download and install blasr
tar -xvzf smrtanalysis-2.2.tar.gz
cd blasr-smrtanalysis-2.2
# edit the file to set the path to your hdf library like this:
HDF5INCLUDEDIR ?= /home/yjx/local/lib/libhdf5/include
HDF5LIBDIR     ?= /home/yjx/local/lib/libhdf5/lib

# Now we are ready to compile blasr. Just type "make":
make
# We probably want to put the executable files that we compiled into a dedicated directory:
make install ASSEMBLY_HOME=/home/yjx/Programs/smrtanalysis/analysis
# Now you should see all the executable files copied under the ASSEMBLY_HOME directory given above.
# Then, add that bin path to your .bashrc or .bash_profile file:
export PATH=$PATH:/home/yjx/Programs/smrtanalysis/analysis/bin
# also add the following line:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/yjx/local/lib/libhdf5/lib
# now restart a login session and type "blasr -h" to test. You should see the help information for 
# blasr correctly displayed.

 blasr -h
   Options for blasr 
   Basic usage: 'blasr reads.{fasta,bas.h5} genome.fasta [-options] 
 option    Description (default_value).

 Input Files.
   reads.fasta is a multi-fasta file of reads.  While any fasta file is valid input, 
               it is preferable to use pls.h5 or bas.h5 files because they contain
               more rich quality value information.

   reads.bas.h5|reads.pls.h5 Is the native output format in Hierarchical Data Format of 
               SMRT reads. This is the preferred input to blasr because rich quality
               value (insertion,deletion, and substitution quality values) information is 
               maintained.  The extra quality information improves variant detection and mapping



Install genewise on the CentOS/RHEL system



I am trying to deploy a gene annotation pipeline for eukaryote genomes recently. The genewise (or “Wise2”) package is one of several packages that I need for this purpose. The version that I am trying to install here is the latest one (v2.4.1). From the installation instructions, it seemed all I needed to do was simply type “make all” in the “./src” directory, but in reality a few more steps were needed before it compiled successfully. So I am recording these steps here for my future reference, in the hope that they can help anyone who encounters the same problem.


# install glib2-devel package if it has not been installed on the system
sudo yum install -y glib2-devel

# download and decompress the package
wget
tar xvzf wise2.4.1.tar.gz

# make some necessary changes in the makefile and source files.
cd wise2.4.1/src
# Here we need to modify makefile to replace "CC = cc" with "CC = gcc" in Line 25 and save the change
find ./ -name makefile |xargs sed -i 's/glib-config/pkg-config --libs glib-2.0/'
cd ./HMMer2/
sed 's/getline/getline_new/' sqio.c  > a &&  mv a sqio.c
cd ../models/
# Change phasemodel.c to replace "if( !isnumber(line[0]) ) {" with "if( !isdigit(line[0]) ) {" in Line 23
cd ..
make all

Now genewise should be correctly compiled, and you can find the genewise executable in the “wise2.4.1/src/bin” directory.


Using LiftOver to convert genome assembly coordinates



Suppose we have two different assemblies of the same organism, G1.fasta and G2.fasta, and we want to know which regions in G2 correspond to our regions of interest in G1. What should we do? UCSC LiftOver is a great tool for this kind of problem, and here is how:


# split G2 into individual chromosomes/scaffolds (G2.$i.fa) and save them into the ./G2_chr directory
# I did this step with a perl script that I wrote, but you could also use the UCSC faSplit tool.
# Say there are n chromosomes/scaffolds in the G2 assembly, so we will have G2.1.fa, G2.2.fa, ..., G2.n.fa.

# Split the G2 chromosomes/scaffolds into 3K chunks and make lift files
mkdir lift
mkdir split
for i in {1..n}
do
  faSplit size ./G2_chr/G2.$i.fa  3000 ./split/G2.$i.split  -lift=./lift/G2.$i.lft -oneFile
done

# run blat
mkdir psl
for i in {1..n}
do
  blat G1.fasta ./split/G2.$i.split.fa -t=dna -q=dna -tileSize=12 -fastMap -minIdentity=95 -noHead -minScore=100  ./psl/G2.$i.psl
done

# Change coordinates of .psl files to parent coordinate system
mkdir liftup
for i in {1..n}
do
  liftUp -pslQ ./liftup/G2.$i.liftup.psl ./lift/G2.$i.lft warn ./psl/G2.$i.psl
done

# Make chain files
mkdir chain_raw
for i in {1..n}
do
  axtChain -linearGap=medium  -faQ -faT -psl ./liftup/G2.$i.liftup.psl ./G1.fasta ./G2.fasta ./chain_raw/$i.chain
done

# Merge and sort chain files
chainMergeSort ./chain_raw/*.chain | chainSplit chain_split stdin

faSize G1.fasta  -detailed >G1.chr_length.txt
faSize G2.fasta  -detailed >G2.chr_length.txt

# Make alignment nets from chain files
mkdir net
for i in ./chain_split/*.chain
do
  tag=$(basename $i .chain)
  chainNet $i ./G1.chr_length.txt ./G2.chr_length.txt ./net/$tag.net /dev/null
done

# Create liftOver chain file
mkdir over
for i in ./chain_split/*.chain
do
  tag=$(basename $i .chain)
  netChainSubset ./net/$tag.net $i ./over/$tag.chain
done

cat ./over/*.chain >G1_G2.over.chain

# Make a bed file to report the converted coordinates. We give the coordinates of our query regions
# (based on the G1 assembly) in the input.bed file, and liftOver reports the converted coordinates in conversion.bed.
liftOver input.bed  ./G1_G2.over.chain conversion.bed unMapped
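liftOver expects standard BED input, i.e. 0-based, half-open intervals of at least three tab-separated columns. A minimal input.bed sketch with a quick awk sanity check that every interval end exceeds its start (the coordinates and region names are made up):

```shell
# Two example query regions in G1 coordinates, then validate the interval bounds.
dir=$(mktemp -d)
printf 'chr1\t1000\t2000\tregionA\nchr2\t500\t800\tregionB\n' > "$dir/input.bed"
awk -F'\t' '$3 > $2 { n++ } END { print n, "valid intervals" }' "$dir/input.bed"
```

Malformed intervals (end <= start, or spaces instead of tabs) are a common reason for regions landing in the unMapped file, so this check is worth running before the liftOver step.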