Suppose we have two different assemblies of the same organism, G1.fasta and G2.fasta, and we want to know the which regions in G2 correspond to our interested regions in G1, what should we do? UCSC LiftOver is a great tool to solve this kind of problems and here is how>
#!/usr/bin/sh # split G2 into individual chromosome/scaffold (G2.$i.fa) and save them into ./G2_chr direcotry # I did this step by using a perl script which I wrote, but you can also use UCSC faSplit tool I think. # Say there are n chromosomes/scaffolds in the G2 assembly, so we will have G2.1.fa, G2.2.fa, ..., G2.n.fa. # Split the G2 chromosomes/scaffolds into 3K chunks and make lift files mkdir lift mkdir split for i in {1..n} do faSplit size ./G2_chr/G2.$i.fa 3000 ./split/G2.$i.split -lift=./lift/G2.$i.lft -oneFile done # run blat mkdir psl for i in {1..n} do blat G1.fasta ./split/G2.$i.split.fa -t=dna -q=dna -tileSize=12 -fastMap -minIdentity=95 -noHead -minScore=100 ./psl/G2.$i.psl done # Change coordinates of .psl files to parent coordinate system mkdir liftup for i in {1..n} do liftUp -pslQ ./liftup/G2.$i.liftup.psl ./lift/G2.$i.lft warn ./psl/G2.$i.psl done # Make chain files mkdir chain_raw for i in {1..n} do axtChain -linearGap=medium -faQ -faT -psl ./liftup/G2.$i.liftup.psl ./G1.fasta ./G2.fasta ./chain_raw/$i.chain done # Merge and sort chain files chainMergeSort ./chain_raw/*.chain | chainSplit chain_split stdin faSize G1.fasta -detailed >G1.chr_length.txt faSize G2.fasta -detailed >G2.chr_length.txt # Make alignment nets from chain files mkdir net for i in ./chain_split/*.chain do tag=${i/\.\/chain_split\//} chainNet $i ./G1.chr_length.txt ./G2.chr_length.txt ./net/$tag.net /dev/null done # Create liftOver chain file mkdir over for i in ./chain_split/*.chain do tag=${i/\.\/chain_split\//} netChainSubset ./net/$tag.net $i ./over/$tag.chain done cat ./over/*.chain >G1_G2.over.chain # Make bed file to report converted coordinates. We can give the coordinates of our query regions (based on G1 assembly) in the input.bed file and liftOver will report the converted coordinates in conversion.bed file. liftOver input.bed ./G1_G2.over.chain conversion.bed unMapped
References
http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/doc/liftOver.txt
http://genomewiki.ucsc.edu/index.php/Minimal_Steps_For_LiftOver
http://genomewiki.ucsc.edu/index.php/LiftOver_Howto
Hi Jia-Xing,
Hope you are doing well. I was trying to map coordinates of a denovo assembly(human) I Have to human hg19 reference genome.I am quite new to this thing and trying to learn some basics.
I had a doubt in this step
# run blat
mkdir psl
for i in {1..n}
do
blat G1.fasta ./split/G2.$i.split.fa -t=dna -q=dna -tileSize=12 -fastMap -minIdentity=95 -noHead -minScore=100 ./psl/G2.$i.psl
done
Wouldn’t it be better if I split the G1.fasta into their respective chromosmes and than run blat chromosome by chromosome wise.
Here in the orignal command; split sequences from one chromosomes are mapping to a different chromosome . ( which should not happen as per my understanding).
I would be quite gratefull if you could give some input on this.
Also since I am working with a laorge genome ( Human ) is the above procedure applicable?
Thanking You,
Regards,
Harsh
Hi Harsh,
Thanks for the message!
The blat command usually take two arguments, the first is the database while the second is the query sequence. So in general, you would want to use the larger genome (or more continuous one) as the database and use the other as the query to make the homologous search more efficient. In general, we would like to search against the whole genome rather than a specific chromosome because 1) usually you might not necessarily know which chromosome your query sequence correspond to and 2) we want to account for potential interchromosomal rearrangements such as translocations. So in general, searching against the whole genome is a safer bet. I’ve used this method for converting genome coordinates between two metazoan genomes (genome size = ~500 Mb), so running it for human should not be a problem. For sure it will take longer computational time and consume more RAM and storage space though. :-)
Hope this helps!
Best,
Jia-Xing
Hi Jia-Xing,
Thank you so much for the quick response. I think I understand the reasons and have a much better idea as to how to proceed further.
I tried understanding the liftOver.txt form Kent but I couldn’t make head or tail out of it. This post has been such a big rescue :-)
I would try running the pipeline and see where it leads me.
Thanks for the guidance.
Regards,
Harsh
Great! Glad it helps! :-)