21 February 2011

Tabix: Fast retrieval of sequence features from generic TAB-delimited files

This post is about Tabix, whose paper was recently published in Bioinformatics:
Bioinformatics. 2011 Jan 5.
Tabix: Fast retrieval of sequence features from generic TAB-delimited files.
Li H.
Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA.


Download & Install

  • Download tabix from
  • Extract & compile:
    > bunzip2 tabix-0.2.3.tar.bz2
    > tar xf tabix-0.2.3.tar
    > cd tabix-0.2.3
    > make

Testing with UCSC knownGene

I downloaded the knownGene table from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz
Tabix can only index files compressed with bgzip, so the file was gunzipped and recompressed with tabix-0.2.3/bgzip:
gunzip knownGene.txt.gz
tabix-0.2.3/bgzip knownGene.txt

knownGene was then indexed with tabix. In this file, the chromosome is column 2, txStart is column 4 and txEnd is column 5:
tabix-0.2.3/tabix -s 2 -b 4 -e 5 knownGene.txt.gz
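
The three options tell tabix which columns hold the sequence name, the start and the end of each feature. As a minimal sketch of that mapping, the following Java extracts the same three columns from one line. The class `KnownGeneColumns` and its method are hypothetical (they are not part of tabix), and the sample line is truncated to its first five TAB-separated fields:

```java
import java.util.Arrays;

/** Hypothetical helper: shows which fields of a knownGene line tabix was told to index. */
final class KnownGeneColumns {
    // 1-based column numbers, matching the -s, -b and -e options used above.
    static final int SEQ_COL = 2;
    static final int BEGIN_COL = 4;
    static final int END_COL = 5;

    /** Returns {chromosome, txStart, txEnd} as strings for one TAB-delimited line. */
    static String[] indexedFields(String line) {
        String[] tokens = line.split("\t");
        if (tokens.length < END_COL) {
            throw new IllegalArgumentException("expected at least " + END_COL + " columns");
        }
        return new String[] { tokens[SEQ_COL - 1], tokens[BEGIN_COL - 1], tokens[END_COL - 1] };
    }

    public static void main(String[] args) {
        // First five fields of one knownGene record (assumed TAB-separated).
        String line = "uc002qwb.2\tchr2\t-\t229562\t232178";
        System.out.println(Arrays.toString(indexedFields(line)));
    }
}
```

If the column numbers passed to tabix do not match the file layout, the index is built on the wrong fields, so it is worth checking them against a real line first.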

Listing the chromosomes:

tabix-0.2.3/tabix knownGene.txt.gz -l
chr1
chr10
chr11
chr12
chr13
chr13_random
(...)
Dumping by region(s):
tabix-0.2.3/tabix knownGene.txt.gz chr2:200000-250000
uc002qvu.1 chr2 - 208154 239852 208154 208154 8 208154,214863,219965,221022,223100,224159,237537,239730, 209001,214920,220044,221191,223229,224272,237602,239852, uc002qvu.1
uc002qvv.1 chr2 - 208154 246690 208810 232800 12 208154,214863,219965,221022,223100,224159,232797,233502,237537,239730,243004,246206, 209001,214920,220044,221191,223229,224272,232871,233562,237602,239844,243115,246690, Q96HL8-3 uc002qvv.1
uc002qvw.1 chr2 - 208154 250702 208154 208154 11 208154,219965,220546,223100,224159,232797,233502,237537,239730,243004,250084, 209001,220044,221191,223229,224272,232871,233562,237602,239844,243115,250702, uc002qvw.1
uc002qvx.1 chr2 - 208154 254024 208810 253984 10 208154,214863,219965,221022,223100,224159,237537,239730,243004,253983, 209001,214920,220044,221191,223229,224272,237602,239844,243115,254024, Q96HL8 uc002qvx.1
uc002qvy.1 chr2 - 208154 254024 208810 253984 9 208154,219965,221022,223100,224159,237537,239730,243004,253983, 209001,220044,221191,223229,224272,237602,239844,243115,254024, Q96HL8-2 uc002qvy.1
uc002qvz.1 chr2 - 208154 254392 208154 208154 10 208154,214867,219965,221022,223100,224159,237537,239730,243004,254083, 209001,214920,220044,221191,223229,224272,237602,239844,243115,254392, uc002qvz.1
uc002qwa.1 chr2 - 208154 254743 208154 208154 12 208154,214863,219965,221022,223100,224159,237537,239730,243004,250084,252200,254702, 209001,214920,220044,221191,223229,224272,237602,239844,243115,251130,252786,254743, uc002qwa.1
uc010ewe.1 chr2 - 208154 254810 208810 232800 11 208154,219965,221022,223100,224159,232797,233502,237537,239730,243004,254781, 209001,220044,221191,223229,224272,232871,233562,237602,239844,243115,254810, Q96HL8-4 uc010ewe.1
uc002qwb.2 chr2 - 229562 232178 229562 229562 1 229562, 232178, uc002qwb.2
uc002qwc.1 chr2 - 233502 252786 233502 233502 6 233502,237537,239730,243004,250084,252630, 233562,237602,239844,243115,250702,252786, uc002qwc.1
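
A region query returns every record whose indexed interval overlaps the requested region, which is why several transcripts above are listed even though they end past position 250000. The overlap test itself can be sketched in a few lines of Java. The class `RegionOverlap` is hypothetical and is not tabix's actual implementation; for simplicity it treats both the region and the record as closed intervals and ignores the difference between 0-based and 1-based coordinates:

```java
/** Hypothetical illustration of an interval overlap test; not tabix's own code. */
final class RegionOverlap {
    final String chrom;
    final int begin; // inclusive (assumption)
    final int end;   // inclusive (assumption)

    RegionOverlap(String chrom, int begin, int end) {
        this.chrom = chrom;
        this.begin = begin;
        this.end = end;
    }

    /** Parses a region written as "chrom:begin-end". */
    static RegionOverlap parse(String region) {
        int colon = region.lastIndexOf(':');
        int dash = region.indexOf('-', colon + 1);
        if (colon < 1 || dash < 0) {
            throw new IllegalArgumentException("not chrom:begin-end: " + region);
        }
        return new RegionOverlap(
            region.substring(0, colon),
            Integer.parseInt(region.substring(colon + 1, dash)),
            Integer.parseInt(region.substring(dash + 1)));
    }

    /** True when the record lies on the same sequence and shares at least one position. */
    boolean overlaps(String recChrom, int recStart, int recEnd) {
        return chrom.equals(recChrom) && recStart <= end && recEnd >= begin;
    }

    public static void main(String[] args) {
        RegionOverlap r = parse("chr2:200000-250000");
        System.out.println(r.overlaps("chr2", 208154, 254024)); // starts inside, ends outside
        System.out.println(r.overlaps("chr2", 260000, 270000)); // entirely after the region
        System.out.println(r.overlaps("chr3", 208154, 254024)); // other chromosome
    }
}
```

The point of the tabix index is to avoid running such a test on every line: only the compressed blocks that may contain overlapping records are read.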

Using the Java API

The Java API for tabix is quite immature: there is only one Java file, no package has been declared, and some useful functions (chr2tid...) are private. The following Java program uses this API to find the knownGene records overlapping the SNPs in UCSC/snp130:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class Test {
    public static void main(String[] args) {
        try {
            // Stream the snp130 table directly from UCSC, decompressing on the fly.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(new URL(
                    "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz").openStream())));
            // Open the bgzipped, tabix-indexed knownGene file created above.
            TabixReader tabix = new TabixReader("knownGene.txt.gz");
            String line;
            while ((line = in.readLine()) != null) {
                // snp130 columns used here: [1]=chrom, [2]=chromStart, [3]=chromEnd, [4]=name
                String[] tokensSnp = line.split("[\t]");
                // UGLY! The region has to be assembled as a "chrom:start-end" string.
                TabixReader.Iterator iter = tabix.query(tokensSnp[1] + ":" + tokensSnp[2] + "-" + tokensSnp[3]);
                String lineKg;
                while (iter != null && (lineKg = iter.next()) != null) {
                    String[] tokensKg = lineKg.split("[\t]");
                    // Print: knownGene name, SNP name, chromosome, SNP start, SNP end.
                    System.out.println(tokensKg[0] + "\t" + tokensSnp[4] + "\t" + tokensSnp[1] + "\t" + tokensSnp[2] + "\t" + tokensSnp[3]);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Compilation:
javac -cp /path/to/sam/library/sam-1.36.jar TabixReader.java Test.java

Execution:
java -cp /path/to/sam/library/sam-1.36.jar:. Test

Result:
uc001aaa.2 rs2441671 chr1 1271 1272
uc009vip.1 rs2441671 chr1 1271 1272
uc001aaa.2 rs9803797 chr1 1271 1272
uc009vip.1 rs9803797 chr1 1271 1272
uc001aaa.2 rs2758124 chr1 1319 1320
uc009vip.1 rs2758124 chr1 1319 1320
uc001aaa.2 rs3950659 chr1 1332 1333
uc009vip.1 rs3950659 chr1 1332 1333
uc001aaa.2 rs3877545 chr1 1370 1371
uc009vip.1 rs3877545 chr1 1370 1371
(...)

... and for an unknown reason, the program crashed later...
(...)
uc009ybq.1 rs35852236 chr10 135341235 135341236
uc009ybq.1 rs36132849 chr10 135341235 135341236
uc009ybq.1 rs71238288 chr10 135341235 135341236
uc009ybq.1 rs71238289 chr10 135341255 135341256
uc009ybq.1 rs36155773 chr10 135341256 135341257
uc009ybq.1 rs71189011 chr10 135341267 135341268
uc009ybq.1 rs34231788 chr10 135341287 135341288
java.lang.ArrayIndexOutOfBoundsException: -1
at TabixReader.query(TabixReader.java:325)
at TabixReader.query(TabixReader.java:373)
at Test.main(Test.java:20)
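
The stack trace shows the exception being thrown inside `TabixReader.query` with an index of -1. One plausible explanation, not verified here, is that snp130 contains a sequence name that is absent from the knownGene index, so that looking up the name yields -1 and this value is then used as an array index. Under that assumption, a defensive caller could check each name against the list printed by `tabix -l` before querying. The class `SafeRegion` and its method are hypothetical and do not belong to the tabix API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical guard: only builds a region string for sequence names known to the index. */
final class SafeRegion {
    private final Set<String> knownChromosomes;

    SafeRegion(Set<String> knownChromosomes) {
        this.knownChromosomes = knownChromosomes;
    }

    /** Returns "chrom:start-end", or null when the query should be skipped. */
    String build(String chrom, String start, String end) {
        if (!knownChromosomes.contains(chrom)) {
            return null; // assumption: querying an unknown name is what triggers the crash
        }
        return chrom + ":" + start + "-" + end;
    }

    public static void main(String[] args) {
        // In practice these names would come from the output of "tabix -l".
        Set<String> names = new HashSet<>(Arrays.asList("chr1", "chr10", "chr13_random"));
        SafeRegion safe = new SafeRegion(names);
        System.out.println(safe.build("chr10", "135341287", "135341288"));
        System.out.println(safe.build("chrUnknown", "1", "2")); // skipped, prints null
    }
}
```

In the program above, the inner loop would then run only when `build` returns a non-null region. Whether this removes the exception depends on the assumption being correct.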


That's it,

Pierre
