23 September 2011

Joining genomic annotations files with the tabix API.

Tabix is a software that is part of the samtools package. After indexing a file, tabix is able to quickly retrieve data lines overlapping genomic regions (see also my previous post about tabix). Here, I wrote a tool named jointabix that joins the data of a (chrom/start/end) file with a file indexed with tabix. I've posted the code on github at: https://github.com/lindenb/samtools-utilities/blob/master/src/jointabix.c.

Usage


$ jointabix  -h

Usage: jointabix (options) {stdin|file|gzfiles}:

  -d   column delimiter. default: TAB
  -c   chromosome column (1).
  -s   start column (2).
  -e   end column (2).
  -i   ignore lines starting with ('#').
  -t   tabix file (required).
  +1 add 1 to the genomic coodinates.
  -1 remove 1 to the genomic coodinates.
 

Example:

In the following example, I'm going to join the SNPs from the 1000 genome project with the "cytoband" database of the UCSC.

##download and index UCSC-cytobands:
$ wget -O cytoBand.txt.gz "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz"
$ gunzip cytoBand.txt.gz
$ bgzip cytoBand.txt


$ curl -s  "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz" |\
   gunzip -c |\
   sed 's/^\([^#]\)/chr\1/' |\
   cut -d '   ' -f 1-5 |\
   jointabix -c 1 -s 2 -e 2 -1 -f cytoBand.txt.gz |\
   grep -v "##"
 
#CHROM	POS	ID	REF	ALT
chr1	10327	rs112750067	T	C	chr1	0	2300000	p36.33	gneg
chr1	10469	rs117577454	C	G	chr1	0	2300000	p36.33	gneg
chr1	10492	rs55998931	C	T	chr1	0	2300000	p36.33	gneg
chr1	10583	rs58108140	G	A	chr1	0	2300000	p36.33	gneg
chr1	11508	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	11565	.	G	T	chr1	0	2300000	p36.33	gneg
chr1	12783	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	13116	.	T	G	chr1	0	2300000	p36.33	gneg
chr1	13327	.	G	C	chr1	0	2300000	p36.33	gneg
chr1	13980	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	14699	.	C	G	chr1	0	2300000	p36.33	gneg
chr1	14930	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	14933	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	14948	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	15118	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	15211	.	T	G	chr1	0	2300000	p36.33	gneg
chr1	15274	.	A	T	chr1	0	2300000	p36.33	gneg
chr1	15820	.	G	T	chr1	0	2300000	p36.33	gneg
chr1	16206	.	T	A	chr1	0	2300000	p36.33	gneg
chr1	16257	.	G	C	chr1	0	2300000	p36.33	gneg
chr1	16280	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	16298	.	C	T	chr1	0	2300000	p36.33	gneg
chr1	16378	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	16495	.	G	C	chr1	0	2300000	p36.33	gneg
chr1	16534	.	C	T	chr1	0	2300000	p36.33	gneg
chr1	16841	.	G	T	chr1	0	2300000	p36.33	gneg
chr1	28376	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	28563	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	30860	.	G	C	chr1	0	2300000	p36.33	gneg
chr1	30885	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	30923	.	G	T	chr1	0	2300000	p36.33	gneg
chr1	31295	.	A	C	chr1	0	2300000	p36.33	gneg
chr1	31467	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	31487	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	40261	.	C	A	chr1	0	2300000	p36.33	gneg
chr1	46633	.	T	A	chr1	0	2300000	p36.33	gneg
chr1	48183	.	C	A	chr1	0	2300000	p36.33	gneg
chr1	48186	.	T	G	chr1	0	2300000	p36.33	gneg
chr1	49272	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	49298	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	49554	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	51479	rs116400033	T	A	chr1	0	2300000	p36.33	gneg
chr1	51673	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	51803	rs62637812	T	C	chr1	0	2300000	p36.33	gneg
chr1	51898	rs76402894	C	A	chr1	0	2300000	p36.33	gneg
chr1	52058	rs62637813	G	C	chr1	0	2300000	p36.33	gneg
chr1	52238	.	T	G	chr1	0	2300000	p36.33	gneg
chr1	52727	.	C	G	chr1	0	2300000	p36.33	gneg
chr1	54353	.	C	A	chr1	0	2300000	p36.33	gneg
(...)

That's it,

Pierre

No comments:

Post a Comment