
08 September 2014

It's a kind of magic(5)

The following question was recently asked on biostars.org: Tool that detects data types: "More specifically I am interested in detecting between SAM, BAM, FASTA, FASTQ (if possible discriminating between these types), BED, BED12, BED15, GFF, GFF2, GFF3. The detection should be performed without just looking at the extension of the file...".
I suggested creating a new magic file for the linux file command. As an example, a BAM file starts (at position 0) with the magic bytes BAM\1. A magic file definition 'bam' for BAM would be:

0 string BAM\1 BAM file v1.0
Compile the new magic file:
file -C -m bam
Now use this magic:
file -z -m bam.mgc file.bam
file.bam: BAM file v1.0 (data)
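The same check can be sketched outside of file(1). A BAM file is bgzf (gzip-conformant) compressed, so the magic bytes sit inside the compressed stream; that is why the -z option is needed above. A minimal Python sketch, illustrative only, not part of the magic(5) machinery:

```python
import gzip

def is_bam(path):
    """True if the decompressed stream starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as f:
            return f.read(4) == b"BAM\x01"
    except OSError:  # not even a gzip/bgzf stream
        return False
```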

I've now started a github repo containing some 'magic' patterns for bioinformatics (Fastq, blast, bam, bigwig, etc.).
(My current problem is prioritizing some results, e.g. differentiating a 'Nucleotide' from a 'Protein' Fasta sequence: http://unix.stackexchange.com/questions/154096/file1-and-magic5-prioritizing-a-result .)

That's it,
Pierre

26 September 2013

Presentation : File formats for Next Generation Sequencing

Here is my presentation for the course "File formats for Next Generation Sequencing", which I gave on Monday at the University of Nantes.

18 July 2013

Inside the Variation toolkit: annotating a VCF with the data of NCBI biosystems mapped to BED.

Let's annotate a VCF file with the data from the NCBI biosystem.

First, the 'NCBI biosystem' data are mapped to a BED file using the following script. It joins "ncbi:biosystem2gene", "ncbi:biosystem-label" and "biomart-ensembl:gene".


It produces a tabix-indexed BED mapping the data of 'NCBI biosystem':

$ gunzip -c ncbibiosystem.bed.gz | head
1 69091 70008 79501 106356 30 Signaling_by_GPCR
1 69091 70008 79501 106383 50 Olfactory_Signaling_Pathway
1 69091 70008 79501 119548 40 GPCR_downstream_signaling
1 69091 70008 79501 477114 30 Signal_Transduction
1 69091 70008 79501 498 40 Olfactory_transduction
1 69091 70008 79501 83087 60 Olfactory_transduction
1 367640 368634 26683 106356 30 Signaling_by_GPCR
1 367640 368634 26683 106383 50 Olfactory_Signaling_Pathway
1 367640 368634 26683 119548 40 GPCR_downstream_signaling
1 367640 368634 26683 477114 30 Signal_Transduction

I wrote a tool named VCFBed: it takes a VCF as input and annotates it with the data of a tabix-indexed BED. This tool is available on github at https://github.com/lindenb/jvarkit#vcfbed. Let's annotate a remote VCF with ncbibiosystem.bed.gz:
java -jar dist/vcfbed.jar \
   TABIXFILE=~/ncbibiosystem.bed.gz \
   TAG=NCBIBIOSYS \
   FMT='($4|$5|$6|$7)'|\
grep -E '(^#CHR|NCBI)'

##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1094PC0005 1094PC0009 1094PC0012 1094PC0013
1 69270 . A G 2694.18 . AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86 GT:AD:DP:GQ:PL ./. ./. 1/1:0,3:3:9.03:106,9,0 1/1:0,6:6:18.05:203,18,0
1 69511 . A G 77777.27 . AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261 GT:AD:DP:GQ:PL ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40
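The FMT='($4|$5|$6|$7)' option above rewrites selected BED columns into the INFO value. The substitution can be sketched in Python (the '$n' 1-based column convention follows the example above; the function is illustrative, not VCFBed's actual code):

```python
def apply_fmt(fmt, bed_fields):
    """Replace each '$n' (1-based column index) in fmt with the BED column."""
    # scan from the highest index down so '$12' is not clobbered by '$1'
    for i in range(len(bed_fields), 0, -1):
        fmt = fmt.replace("$" + str(i), bed_fields[i - 1])
    return fmt

bed = "1\t69091\t70008\t79501\t106356\t30\tSignaling_by_GPCR".split("\t")
print(apply_fmt("($4|$5|$6|$7)", bed))  # (79501|106356|30|Signaling_by_GPCR)
```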

That's it,

Pierre

06 September 2011

Parsing a BAM file with javascript, yes we can. (Node.js and V8)

Node.js is an event-driven I/O server-side JavaScript environment based on V8, Google's open source JavaScript engine. In this post I will describe how I've used Node/V8 to parse a BAM file. Here I've used node v0.5.5, and my code is hosted in a new git repository, bionode.

Designing a C++ native extension for Node.js wrapping the bgzf format

BAM files are stored using the bgzf format. We must first create a C++ extension wrapping the methods related to bgzf. This process is nicely described in "Writing Node.js Native Extensions". Here, a BGZF* pointer is wrapped into a class BGZFSupport that extends v8::ObjectWrap:
class BGZFSupport: public ObjectWrap
 {
 private:
   BGZF* file;
 public:
    (...)
   int close()
    {
    int ret=0;
    if(file!=NULL) ret=::bgzf_close(file);
    file=NULL;
    return ret;
    }
   ~BGZFSupport()
    {
   if(file!=NULL) ::bgzf_close(file);
    }
 (...)
The javascript constructor for BGZFSupport opens the bgzfile and is implemented on the C++ side as:
  static Handle<Value> New(const Arguments& args)
    {
    HandleScope scope;
    if (args.Length() < 2)
      {
      RETURN_THROW("Expected two parameters for bgfz");
      }
    if(!args[0]->IsString())
     {
     RETURN_THROW("1st argument is not a string");
     }
    if(!args[1]->IsString())
     {
     RETURN_THROW("2nd argument is not a string");
     }
    
    v8::String::Utf8Value filename(args[0]);
    v8::String::Utf8Value mode(args[1]);
    BGZF* file= ::bgzf_open(ToCString(filename),ToCString(mode));
    if(file==NULL)
     {
     RETURN_THROW("Cannot open \"" << ToCString(filename) <<  "\"");
     }
    BGZFSupport* instance = new BGZFSupport(file);
    instance->Wrap(args.This());
    return args.This();
    }
... and so on for the other functions...

Implementing the javascript-based BAM-Reader

Next, we can embed this BGZFSupport in a javascript file that will read a BAM file:
var bgzf=require("bgzf");
and we create a javascript class/function BamReader that opens the file as bgzf and reads the BAM header:
var bgzf=require("bgzf");
var Buffer = require('buffer').Buffer;


function BamReader(path)
 {
 this.fd= new bgzf.bgzf(path,"r");
 var b=new Buffer(4);
 var n = this.fd.read(b,0,4);
 if(n!=4) throw new Error("Cannot read 4 bytes");
 if(b[0]!=66)  throw new Error("Error MAGIC[0]");
 if(b[1]!=65)  throw new Error("Error MAGIC[1] got"+b[1]);
 if(b[2]!="M".charCodeAt(0))  throw new Error("Error MAGIC[2]");
 if(b[3]!="\1".charCodeAt(0))  throw new Error("Error MAGIC[3]");
 
 /* l_text */
 n = this.fd.read(b,0,4);
 if(n!=4) throw new Error("Cannot read 4 bytes");
 var l_text=b.readInt32LE(0);
 b=new Buffer(l_text);
 n = this.fd.read(b,0,l_text);
 if(n!=l_text) throw new Error("Cannot read "+l_text+" bytes (l_text)");
 this.text=b.toString('utf-8', 0, l_text);
 
 /* n_seq */
 b=new Buffer(4);
 n = this.fd.read(b,0,4);
 if(n!=4) throw new Error("Cannot read 4 bytes");
 var n_ref=b.readInt32LE(0);
 this.references=[];
 this.name2seq={};
 for(var i=0;i< n_ref;++i)
  {
  var refseq={};
  /* l_name */
  b=new Buffer(4);
  n = this.fd.read(b,0,4);
  if(n!=4) throw new Error("Cannot read 4 bytes");
  var l_name=b.readInt32LE(0);
  /* name */
  b=new Buffer(l_name);
  n = this.fd.read(b,0,l_name);
  if(n!=l_name) throw new Error("Cannot read "+l_name+" bytes (name)");
  refseq.name=b.toString('utf-8', 0,l_name-1);//\0 terminated
  /* l_ref */
  b=new Buffer(4);
  n = this.fd.read(b,0,4);
  if(n!=4) throw new Error("Cannot read 4 bytes");
  refseq.l_ref=b.readInt32LE(0);
  this.references.push(refseq);
  this.name2seq[refseq.name]=refseq;
  }
 //console.log(this.name2seq);
 }
Another function next() reads the next alignment or returns null ( see the code ).
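The header layout the constructor walks through (magic, l_text, SAM text, n_ref, then name/length pairs) can also be sketched in Python; since bgzf is gzip-conformant, the standard gzip module can decompress it. A minimal sketch, not the bionode code:

```python
import gzip
import struct

def read_bam_header(path):
    """Parse the BAM header: magic, SAM header text, and reference sequences."""
    with gzip.open(path, "rb") as f:
        if f.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file")
        (l_text,) = struct.unpack("<i", f.read(4))
        text = f.read(l_text).decode("utf-8")
        (n_ref,) = struct.unpack("<i", f.read(4))
        references = []
        for _ in range(n_ref):
            (l_name,) = struct.unpack("<i", f.read(4))
            name = f.read(l_name)[:-1].decode("utf-8")  # NUL-terminated
            (l_ref,) = struct.unpack("<i", f.read(4))
            references.append({"name": name, "l_ref": l_ref})
        return text, references
```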

Testing

$ export NODE_PATH=/path/to/bionode/build

The script reads a simple BAM file and prints the positions of the reads:
(...)

var r= new BamReader("/path/to/samtools-0.1.17/examples/toy.bam");
var align;
while((align=r.next())!=null)
 {
 console.log(
  r.references[align.refID].name+"\t"+
  align.read_name+"\t"+
  align.pos
  );
 }
r.close();

Result

$ node bgzf.js
ref r001 6
ref r002 8
ref r003 8
ref r004 15
ref r003 28
ref r001 36
ref2 x1 0
ref2 x2 1
ref2 x3 5
ref2 x4 9
ref2 x5 11
ref2 x6 13


Remaining questions:

At the moment, I don't know how to correctly package the C++ and javascript files for node.js, how to correctly include the files, how to group the different files under a common 'namespace', etc...

That's it,
Pierre

09 July 2011

Storing SNPs in a HDF5 file: my notebook

I discovered the benefits of using twitter in 2008, the day Deepak Singh replied to one of my tweets related to the storage of a large number of genotypes.





Since that day, I've tried to use the HDF5 library, without any success (there's a large, disheartening documentation/API on the HDF5 site, and the API seems to be used by only a small number of geeks). Nevertheless, HDF5 is a technology used by the IGV genome browser.

In this post, I'll describe a C program loading a set of SNPs defined by the following C structure:
typedef struct structSnp
{
char rsId[RS_LENGTH];//rs##
char chrom[CHROM_LENGTH];//chromosome name
int chromStart;//genomic start index
int chromEnd;//genomic end index
}Snp;
The SNPs will be read from stdin. Four columns are expected: rs-id, chrom, chromStart and chromEnd. Here is the command line I used to get a small input file:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz" |\
gunzip -c |\
cut -f 2-5 |\
head -n 10000 > sample.tsv
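For reference, reading those four tab-separated columns from stdin can be sketched in Python, mirroring the Snp struct above (a stdlib-only sketch, not the C program described below):

```python
def read_snps(stream):
    """Parse 4-column lines (rsId, chrom, chromStart, chromEnd) into dicts."""
    snps = []
    for line in stream:
        rs_id, chrom, start, end = line.rstrip("\n").split("\t")[:4]
        snps.append({"rsId": rs_id, "chrom": chrom,
                     "chromStart": int(start), "chromEnd": int(end)})
    return snps
```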

Source code


The C program is described below. I hope my comments will make the code readable.

The Makefile



Compile and run

gcc -Wall -o a.out -I /path/to/hdf5/include -L /path/to/hdf5/lib dbsnp2hdf5.c -lhdf5
./a.out < sample.tsv

Dump the data


At the end, the program creates a structured binary file containing our SNPs. The HDF5 utility h5dump can be used to display the data as text:
$ h5dump storage.h5

HDF5 "storage.h5" {
GROUP "/" {
   GROUP "variations" {
      DATASET "dbSNP" {
         DATATYPE  H5T_COMPOUND {
            H5T_STRING {
               STRSIZE 13;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            } "rs";
            H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            } "chrom";
            H5T_STD_I32LE "chromStart";
            H5T_STD_I32LE "chromEnd";
         }
         DATASPACE  SIMPLE { ( 10000 ) / ( H5S_UNLIMITED ) }
         DATA {
            (0): {
               "rs112750067",
               "chr1",
               10326,
               10327
            },
            (1): {
               "rs56289060",
               "chr1",
               10433,
               10433
            },
            (2): {
               "rs112155239",
               "chr1",
               10439,
               10440
            },
            (3): {
               "rs112766696",
               "chr1",
               10439,
               10440
            },
            (...)
               450425,
               450426
            }
         }
         ATTRIBUTE "rdfs:comment" {
            DATATYPE  H5T_STRING {
               STRSIZE 15;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
            DATA {
               (0): "My list of SNPs"
            }
         }
      }
   }
}
}


That's it !
Pierre

28 June 2011

The java library for BigBed and BigWig: my notebook

Jim Robinson and his team, from the Broad Institute/IGV, have recently released a java library parsing the BigBed and the BigWig formats. Here is my notebook for this API.

Download and compile the library

The sources are hosted at: http://bigwig.googlecode.com/.
$ svn checkout http://bigwig.googlecode.com/svn/trunk/ bigwig-read-only
$ cd bigwig-read-only
$ ant

Buildfile: build.xml

compile:
[mkdir] Created dir: /path/to/bigwig-read-only/build
[javac] Compiling 38 source files to /path/to/bigwig-read-only/build
[javac] Note: /path/to/bigwig-read-only/src/org/broad/igv/bbfile/BPTree.java uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

dist:
[mkdir] Created dir: /path/to/bigwig-read-only/dist
[jar] Building jar: /path/to/bigwig-read-only/dist/BigWig.jar

BUILD SUCCESSFUL
Total time: 3 seconds

Code

The following java code prints all the Bed or the Wig data in a given genomic region.

Compile

$ javac -cp /path/to/bigwig-read-only/dist/BigWig.jar:. BigCat.java

Test

List the data in a BigBed file:
java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat /path/to/bigwig-read-only/test/data/chr21.bb  | head
chr21 9434178 9434609
chr21 9434178 9434609
chr21 9508110 9508214
chr21 9516607 9516987
chr21 9903013 9903230
chr21 9903013 9903230
chr21 9905661 9906613
chr21 9907217 9907519
chr21 9907241 9907415
chr21 9907597 9908258

List the data in a BigBed file for the region 'chr21:9906000-9908000', allowing overlaps:
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat -p chr21:9906000-9908000 /path/to/bigwig-read-only/test/data/chr21.bb  | head
chr21 9905661 9906613
chr21 9907217 9907519
chr21 9907241 9907415
chr21 9907597 9908258

List the data in a BigBed file for the region 'chr21:9906000-9908000', not allowing overlaps:
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat -p chr21:9906000-9908000 -c /path/to/bigwig-read-only/test/data/chr21.bb  | head
chr21 9907217 9907519
chr21 9907241 9907415

List the data in a BigWig file:
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat  /path/to/bigwig-read-only/test/data/wigVarStepExample.bw  | head
chr21 9411190 9411195 50.0
chr21 9411195 9411200 40.0
chr21 9411200 9411205 60.0
chr21 9411205 9411210 20.0
chr21 9411210 9411215 20.0
chr21 9411215 9411220 20.0
chr21 9411220 9411225 40.0
chr21 9411225 9411230 60.0
chr21 9411230 9411235 40.0
chr21 9411235 9411240 40.0

List the data in a BigWig file for the region 'chr21:9906000-9908000':
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat -p chr21:9906000-9908000 /path/to/bigwig-read-only/test/data/wigVarStepExample.bw  | head
chr21 9906000 9906005 20.0
chr21 9906005 9906010 60.0
chr21 9906010 9906015 60.0
chr21 9906015 9906020 60.0
chr21 9906020 9906025 80.0
chr21 9906025 9906030 60.0
chr21 9906030 9906035 40.0
chr21 9906035 9906040 80.0
chr21 9906040 9906045 80.0
chr21 9906045 9906050 80.0

See also



That's it,

Pierre

19 May 2010

First rule of Bioinfo-Club

The first rule of Bioinfo-Club is:

If you don't like a
Standard File Format,
create a new one


and I don't like the VCF format. VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. A VCF file looks like this:
##fileformat=VCFv3.3
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##INFO=DP,1,Integer,"Total Depth"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=SB,1,Float,"Strand Bias"
##reference=chrN.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr.markdup.bam chr1.sorted.bam
chr1 10109 . A T 94.50 . AB=0.67;AC=2;AF=0.50;AN=4;DP=14;Dels=0.00;HRun=0;MQ=19.70;MQ0=4;QD=6.75;SB=-64.74 GT:DP:GL:GQ 0/1:4:-8.74,-2.21,-6.8
0:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chr1 10578 . G A 120.87 . AC=4;AF=1.00;AN=4;DP=4;Dels=0.00;HRun=0;MQ=50.00;MQ0=0;QD=30.22;SB=-87.10 GT:DP:GL:GQ 1/1:2:-8.92,-1.60,-1.00:6.02 1/1:2:-8.92,-1.60,-1.00:6.02
(...)


The problem is that I want to add some information about each variation in this file: it started to become messy when I wanted to annotate the mutations (position in the protein? in the cDNA? is it non-synonymous? which exon? which gene? what are the GO terms? etc.). So I created a tool named 'VCFToXML' (source available here) transforming the VCF format into XML:
cat file.vcf | java -jar vcf2xml.jar > vcf.xml
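For contrast, annotating in the plain-text world amounts to appending key=value pairs to the unstructured INFO column, which is exactly where it gets messy. A minimal sketch (illustrative only, not VCFToXML; the GENE key below is a made-up example):

```python
def add_info(vcf_line, key, value):
    """Append key=value to the INFO column (8th field) of a VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]
    fields[7] = key + "=" + value if info == "." else info + ";" + key + "=" + value
    return "\t".join(fields)
```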

The XML-VCF now looks like this:
Of course, it is less compact and less easy to process with the common unix tools (cut/sort/awk, etc.). But with the help of a second tool named 'vcfannot' (source available here), I can now add some structured annotations (here, some information about UCSC/knownGenes and UCSC/segDups):

A first XSLT stylesheet: back to VCF

Great! I can use XSLT to transform this XML into another file format. My first XSLT stylesheet, xvcf2vcf.xsl, transforms the XML-VCF back to the plain text format:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2vcf.xsl vcf.xml

##fileformat=VCFv3.3
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=DP,1,Integer,"Total Depth"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=SB,1,Float,"Strand Bias"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##(...)
##reference=chrN.fa
##source=UnifiedGenotyper
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chrN.markdup.bam chrN.sorted.bam
chrN 10109 . A T 94.5 . AC=2;HRun=0;AB=0.67;Dels=0.00;DP=14;QD=6.75;AF=0.50;MQ0=4;MQ=19.70;SB=-64.74;AN=4 GT:DP:GL:GQ 1/1:4:-8.74,-2.21,-6.80:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chrN 10578 . G A 120.87 . AC=4;HRun=0;Dels=0.00;DP=4;QD=30.22;AF=1.00;MQ0=0;MQ=50.00;SB=-87.10;AN=4 GT:DP:GL:GQ 0/1:2:-8.92,-1.60,-1.00:6.02 0/1:2:-8.92,-1.60,-1.00:6.02

Second XSLT stylesheet: insert into SQL


The second XSLT stylesheet, xvcf2sql.xsl, generates a MySQL script for inserting the VCF into a database:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml |\
mysql -D test


Then, for example, the following query scans the database and returns the genotypes/info for each sample:
select
V.chrom,
V.position,
V.ref,
V.alt,
S.name,
concat(left(T.descr,10),"...") as "Format",
F.content
from
vcf_variant as V,
vcf_sample as S,
vcf_call as C,
vcf_callfeature as F,
vcf_format as T
where
C.variant_id=V.id and
C.sample_id=S.id and
F.call_id=C.id and
T.id=F.format_id;

Result:


(xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml; cat query.sql )|\
mysql -D test

chrom position ref alt name Format content
chrN 10109 A T chr.markdup.bam Genotype... 1/1
chrN 10109 A T chr.markdup.bam Read Depth... 4
chrN 10109 A T chr.markdup.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chr.markdup.bam Genotype Q... 45.90
chrN 10109 A T chrN.sorted.bam Genotype... 0/1
chrN 10109 A T chrN.sorted.bam Read Depth... 4
chrN 10109 A T chrN.sorted.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chrN.sorted.bam Genotype Q... 45.90
chrN 10578 G A chr.markdup.bam Genotype... 0/1
chrN 10578 G A chr.markdup.bam Read Depth... 2
chrN 10578 G A chr.markdup.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chr.markdup.bam Genotype Q... 6.02
chrN 10578 G A chrN.sorted.bam Genotype... 0/1
chrN 10578 G A chrN.sorted.bam Read Depth... 2
chrN 10578 G A chrN.sorted.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chrN.sorted.bam Genotype Q... 6.02


That's it
Pierre

20 April 2010

Reading/Writing SAM/BAM files with the picard API for java.

I'm currently playing with the BAM/SAM Java API available in the picard package. This API can be used for creating new programs that read and write SAM files, the generic format for storing large nucleotide sequence alignments. Both SAM text format and SAM binary (BAM) format are supported. The source code comes with a simple example showing how to read and write a SAM/BAM file.

(...)
public void convertReadNamesToUpperCase(
        final File inputSamOrBamFile,
        final File outputSamOrBamFile)
    {
    final SAMFileReader inputSam = new SAMFileReader(inputSamOrBamFile);
    final SAMFileWriter outputSam = new SAMFileWriterFactory().makeSAMOrBAMWriter(
            inputSam.getFileHeader(), true, outputSamOrBamFile);

    for (final SAMRecord samRecord : inputSam)
        {
        // Convert read name to upper case.
        samRecord.setReadName(samRecord.getReadName().toUpperCase());
        outputSam.addAlignment(samRecord);
        }
    outputSam.close();
    inputSam.close();
    }
(...)
Today, I've played with this SAM API to build two simple java tools:

SbamGrep

SbamGrep is available at http://code.google.com/p/code915/wiki/SbamGrep. It filters a SAM/BAM file using the SAM flags.

Options

 -o (filename-out); default is stdout (SAM)
 -v invert the selection
 -f [flag] add a bam/sam flag for filtering; multiple flags are separated with a comma. One of:
READ_PAIRED or 1 or 0x1
PROPER_PAIR_ or 2 or 0x2
READ_UNMAPPED or 4 or 0x4
MATE_UNMAPPED or 8 or 0x8
READ_STRAND or 16 or 0x10
MATE_STRAND or 32 or 0x20
FIRST_OF_PAIR or 64 or 0x40
SECOND_OF_PAIR or 128 or 0x80
NOT_PRIMARY_ALIGNMENT or 256 or 0x100
READ_FAILS_VENDOR_QUALITY_CHECK or 512 or 0x200
DUPLICATE_READ or 1024 or 0x400

Example

The following command lists all the reads having a flag (READ_PAIRED and READ_UNMAPPED and MATE_UNMAPPED and FIRST_OF_PAIR):
java -cp sam-1.16.jar:dist/sbamgrep.jar fr.inserm.umr915.sbamtools.SbamGrep -f 0x4,0x1,0x8,0x40 file.bam
Result (masked):
@HD VN:1.0 SO:unsorted
@SQ SN:chrT LN:349250
IL_XXXX1 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (94**0-)*7=0688855555522@86;;;5;6:;63:4?-622647..-.5.%
IL_XXXX2 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*+*2396,@5+5:@@@;;5)50)6960684;58;86*5102)0*+8:*137;
IL_XXXX3 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/999-00328:88984@@=8??@@:-66,;8;5;6+;255,1;788883676'
IL_XXXX4 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (916928.82@@50854;33222224;@25?5522;5=;;858888555*0666
IL_XXXX5 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9*4*5-**32989+::@82;;853+39;80.53)-)79)..'55.8988*200
IL_XXXX6 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (*+**,14265;@@??@8?9@@@5@488?8666260.)-*9;;;88:8'05418
IL_XXXX7 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (9136242-2@@@;96.888@@@@80$585882623.':**+3*03137..--.
IL_XXXX8 77 * 0 0 * * 0 0 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX (/89255855@88?585557..())@@@;5552286526755@@5888..3;/$
(...)
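The flag test itself is a bitwise AND: a read matches when every requested bit is set in its FLAG column. A minimal sketch in Python (illustrative, not SbamGrep's code):

```python
def matches_flags(read_flag, required):
    """True if every bit in 'required' is set in the read's FLAG."""
    return (read_flag & required) == required

# 77 = 0x1|0x4|0x8|0x40 (paired, unmapped, mate unmapped, first of pair),
# the FLAG carried by the masked reads above
required = 0x4 | 0x1 | 0x8 | 0x40
print(matches_flags(77, required))         # True
print(matches_flags(0x1 | 0x2, required))  # False
```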

SbamStats

The second tool, SbamStats, is available at http://code.google.com/p/code915/wiki/SbamStats. It provides a simple report of all the SAM flag combinations used in one or more files.

Example

java -cp sam-1.16.jar:dist/sbamstats.jar fr.inserm.umr915.sbamtools.SBamStats file.bam
Output:
READ_PAIRED:0x1|READ_STRAND:0x10|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 10827
READ_PAIRED:0x1|READ_STRAND:0x10|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 10827
READ_PAIRED:0x1|FIRST_OF_PAIR:0x40 13951
READ_PAIRED:0x1|SECOND_OF_PAIR:0x80 13951
READ_PAIRED:0x1|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 23846
READ_PAIRED:0x1|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 23846
READ_PAIRED:0x1|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 33606
READ_PAIRED:0x1|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 33606
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 143090
READ_PAIRED:0x1|READ_UNMAPPED:0x4|READ_STRAND:0x10|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 143090
READ_PAIRED:0x1|READ_UNMAPPED:0x4|SECOND_OF_PAIR:0x80 161781
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|FIRST_OF_PAIR:0x40 161781
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|SECOND_OF_PAIR:0x80 174429
READ_PAIRED:0x1|READ_UNMAPPED:0x4|FIRST_OF_PAIR:0x40 174429
READ_PAIRED:0x1|MATE_UNMAPPED:0x8|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 176219
READ_PAIRED:0x1|READ_UNMAPPED:0x4|READ_STRAND:0x10|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 176219
READ_PAIRED:0x1|PROPER_PAIR_:0x2|MATE_STRAND:0x20|FIRST_OF_PAIR:0x40 915161
READ_PAIRED:0x1|PROPER_PAIR_:0x2|READ_STRAND:0x10|SECOND_OF_PAIR:0x80 915161
READ_PAIRED:0x1|PROPER_PAIR_:0x2|MATE_STRAND:0x20|SECOND_OF_PAIR:0x80 2623345
READ_PAIRED:0x1|PROPER_PAIR_:0x2|READ_STRAND:0x10|FIRST_OF_PAIR:0x40 2623345
READ_PAIRED:0x1|READ_UNMAPPED:0x4|MATE_UNMAPPED:0x8|SECOND_OF_PAIR:0x80 26232123
READ_PAIRED:0x1|READ_UNMAPPED:0x4|MATE_UNMAPPED:0x8|FIRST_OF_PAIR:0x40 26232123



That's it !
Pierre