29 June 2015

A BLAST to SAM converter.

Some times ago, I've received a set of Ion-Torrent /mate-reads with a poor quality. I wasn't able to align much things using bwa. I've always wondered if I could get better alignments using NCBI-BLASTN (short answer: no) . That's why I asked guyduche, my intership student to write a C program to convert the output of blastn to SAM. His code is available on github at :

The input for blast2sam is
  • the XML output of NCBI blastn (or stdin)
  • The single or pair of fastq file(s)
  • The reference sequence indexed with picard
.

Example:

fastq2fasta in.R1.fq.gz in.R2.fq.gz |\
blastn -db REFERENCE   -outfmt 5 | \
blast2bam -o result.bam -W 40 -R '@RG   ID:foo  SM:sample' - REFERENCE.dict  in.R1.fq.gz in.R2.fq.gz

Output:
@SQ SN:gi|9629357|ref|NC_001802.1|  LN:9181 
@RG ID:foo  SM:sample
@PG ID:Blast2Bam    PN:Blast2Bam    VN:0.1  CL:../../bin/blast2bam -o results.sam -W 40 -R @RG  ID:foo  SM:sample - db.dict test_1.fastq.gz test_2.fastq.gz
(...)
ERR656485.2 83  gi|9629357|ref|NC_001802.1| 715 60  180S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X6=1S =   715 -119    CCTAGTGTTGCTTGCTTTTCTTCTTTTTTTTTTCAAGCAGAAGACGGCATACGAGATCCTCTATCGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCTAAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAN    (),.((((,(((((,((.((.-(>69>20E>6/=>5EC@9-52?BEE::2951.)74B64=B==FFAF=A??59:>FFFDF:55GGFGF?DFGGFE868>GGGFGGGGED;FGFFGGGGGGGGGGGEFFGE9GGGGFGGGGGGGGDGECGGFGGGGGGGGGGFGGGGGEGGFGGGGGGFFGGGGGFF?EGGFFFEGGGGGGGGFEGGGEGGGFEGGGGGGGGGGDGFFCEGFGGGGGGGGGGGFFECFGGGGFGGGGGGGGGGGFCGGGGGGGGGGGGGGGGGGFGGGGGGGGF@CCA8!    NM:i:13 RG:Z:foo    AS:i:80 XB:f:148.852    XE:Z:4.07e-39
ERR656485.2 163 gi|9629357|ref|NC_001802.1| 715 60  73S7=1X8=1X11=1X2=2X4=1X14=1X8=1X33=1X4=1X2=1X5=1X2=1X8=106S    =   715 119 NAGATAGAGGAAGAACAAAACAAATGTCAGCAAAGTCAGCAAAAGACACAGCAGGAAAAAGGGGCTGACGGGAAGGTCAGTCAAAATTATCCTATAGTGCAAAATCTCCAAGGGCAAATGGTACACCAGGCCATGTCACCTAGAACTTTAAATGCATGGGTAAAAGTAATAGAGGAAAAGGCCTTTAGCCCAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTCTCATTACAAAAAAAACATACACAATAAATGATATAAGCGGAATCAACAGCATGA    !8A@CGGEFGFGCDFGGGGGGGGGGGGGGFGGGGGFGFGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGFGGGGGGGGGEGFGFGGGFFGGGGGGGGFGGGGGGGGGGGGFFFFGGGGGG=FFGGFFDGGGGGGGG8FGFGGGGGGGGGFGGGGGGGGGGFDGGFGGFGGGFFFGFF8DFDFDFFFFFFFFFBCDB<@EAFB@ABAC@CDFF?4>EEFE<*>BDAFB@FFBFF>((6<5CC.;C;=D9106(.))).)-46<<))))))))))((,(-)))()((()))    NM:i:13 RG:Z:foo    AS:i:82 XB:f:152.546    XE:Z:3.15e-40
(...)
Now, I would be interested in finding another dataset where this tool could be successfully used.
That's it,
Pierre

18 June 2015

Playing with the #GA4GH schemas and #Avro : my notebook

After watching David Haussler's talk "Beacon Project and Data Sharing ApIs", I wanted to play with Avro and the models and APIs defined by the Global Alliance for Genomics and Health (ga4gh) coalition Here is my notebook.
(Wikipedia) Avro: "Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services."
First, we download the java tools and libraries for apache Avro
curl -L -o avro-tools-1.7.7.jar "http://www.eng.lsu.edu/mirrors/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar"
Next, we download the schemas defined by the ga4gh from github
curl -L -o schema.zip "https://github.com/ga4gh/schemas/archive/v0.5.1.zip"
unzip schema.zip
rm schema.zip

$ find -name "*.avdl"
./schemas-0.5.1/src/main/resources/avro/readmethods.avdl
./schemas-0.5.1/src/main/resources/avro/common.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadata.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadatamethods.avdl
./schemas-0.5.1/src/main/resources/avro/wip/variationReference.avdl
./schemas-0.5.1/src/main/resources/avro/variants.avdl
./schemas-0.5.1/src/main/resources/avro/variantmethods.avdl
./schemas-0.5.1/src/main/resources/avro/beacon.avdl
./schemas-0.5.1/src/main/resources/avro/references.avdl
./schemas-0.5.1/src/main/resources/avro/referencemethods.avdl
./schemas-0.5.1/src/main/resources/avro/reads.avdl
Those schema can be compiled to java using the avro-tools
$ java -jar avro-tools-1.7.7.jar compile protocol schemas-0.5.1/src/main/resources/avro/ ./generated
Input files to compile:
  schemas-0.5.1/src/main/resources/avro/variants.avpr
  
$ find generated/org/ -name "*.java"
generated/org/ga4gh/GAPosition.java
generated/org/ga4gh/GAVariantSetMetadata.java
generated/org/ga4gh/GACall.java
generated/org/ga4gh/GAException.java
generated/org/ga4gh/GACigarOperation.java
generated/org/ga4gh/GAVariantSet.java
generated/org/ga4gh/GAVariants.java
generated/org/ga4gh/GAVariant.java
generated/org/ga4gh/GACallSet.java
generated/org/ga4gh/GACigarUnit.java
As a test, the following java source uses the classes generated by avro to create nine variants and serialize them to Avro
package test;
import org.apache.avro.*;
import org.apache.avro.io.*;
import org.apache.avro.file.*;
import org.apache.avro.specific.*;
import org.ga4gh.*;
import java.util.*;
public class TestAvro
{
private void run() throws Exception
{
//Write data of a schema.
DatumWriter<GAVariant> variantWriter = new SpecificDatumWriter<GAVariant>( GAVariant.getClassSchema() );
DataFileWriter<GAVariant> dataWriter = new DataFileWriter<GAVariant>(variantWriter);
dataWriter.create(GAVariant.getClassSchema(), System.out);
for(int i=1;i< 10;++i)
{
GAVariant variant = GAVariant.newBuilder()
.setReferenceName("chr1")
.setStart(i)
.setEnd(i)
.setId("rs"+i)
.setVariantSetId("id"+i)
.setReferenceBases("A")
.setCreated(System.currentTimeMillis())
.setUpdated(System.currentTimeMillis())
.setAlternateBases(Arrays.asList("C","T"))
.build()
;
dataWriter.append(variant);
}
dataWriter.close();
}
public static void main(String args[])
{
try
{
TestAvro app = new TestAvro();
app.run();
}
catch(Exception err)
{
err.printStackTrace();
}
}
}
view raw TestAvro.java hosted with ❤ by GitHub


Compile, archive and execute:
#compile classes
javac -d generated -cp avro-tools-1.7.7.jar -sourcepath generated:src generated/org/ga4gh/*.java src/test/TestAvro.java
# archive
jar cvf generated/ga4gh.jar -C generated org -C generated test
# run
java -cp avro-tools-1.7.7.jar:generated/ga4gh.jar test.TestAvro > variant.avro
We use the avro-tools to convert the generated file variant.avro to json
java -jar avro-tools-1.7.7.jar tojson variant.avro

Output:
{"id":"rs1","variantSetId":"id1","names":[],"created":{"long":1434624257608},"updated":{"long":1434624257608},"referenceName":"chr1","start":1,"end":1,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
{"id":"rs2","variantSetId":"id2","names":[],"created":{"long":1434624257635},"updated":{"long":1434624257635},"referenceName":"chr1","start":2,"end":2,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
{"id":"rs3","variantSetId":"id3","names":[],"created":{"long":1434624257635},"updated":{"long":1434624257635},"referenceName":"chr1","start":3,"end":3,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
{"id":"rs4","variantSetId":"id4","names":[],"created":{"long":1434624257635},"updated":{"long":1434624257635},"referenceName":"chr1","start":4,"end":4,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
{"id":"rs5","variantSetId":"id5","names":[],"created":{"long":1434624257635},"updated":{"long":1434624257635},"referenceName":"chr1","start":5,"end":5,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
{"id":"rs6","variantSetId":"id6","names":[],"created":{"long":1434624257635},"updated":{"long":1434624257635},"referenceName":"chr1","start":6,"end":6,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
{"id":"rs7","variantSetId":"id7","names":[],"created":{"long":1434624257635},"updated":{"long":1434624257635},"referenceName":"chr1","start":7,"end":7,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
{"id":"rs8","variantSetId":"id8","names":[],"created":{"long":1434624257635},"updated":{"long":1434624257635},"referenceName":"chr1","start":8,"end":8,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
{"id":"rs9","variantSetId":"id9","names":[],"created":{"long":1434624257635},"updated":{"long":1434624257635},"referenceName":"chr1","start":9,"end":9,"referenceBases":"A","alternateBases":["C","T"],"info":{},"calls":[]}
view raw output.txt hosted with ❤ by GitHub

The complete Makefile

.PHONY:all
all: avro-tools-1.7.7.jar schemas-0.5.1/src/main/resources/avro/variants.avpr src/test/TestAvro.java
rm -rf generated
mkdir -p generated
#generate java classes
java -jar $< compile protocol schemas-0.5.1/src/main/resources/avro/ ./generated
#compile classes
javac -d generated -cp avro-tools-1.7.7.jar -sourcepath generated:src generated/org/ga4gh/*.java src/test/TestAvro.java
# archive
jar cvf generated/ga4gh.jar -C generated org -C generated test
# run
java -cp avro-tools-1.7.7.jar:generated/ga4gh.jar test.TestAvro > variant.avro
#decode
java -jar avro-tools-1.7.7.jar tojson variant.avro
avro-tools-1.7.7.jar :
curl -L -o $@ "http://www.eng.lsu.edu/mirrors/apache/avro/avro-1.7.7/java/$@"
schemas-0.5.1/src/main/resources/avro/variants.avpr: avro-tools-1.7.7.jar schemas-0.5.1/src/main/resources/avro/variants.avdl
java -jar $< idl $(filter %.avdl,$^) > $@
schemas-0.5.1/src/main/resources/avro/variants.avdl :
rm -rf schemas-0.5.1
curl -L -o schema.zip "https://github.com/ga4gh/schemas/archive/v0.5.1.zip"
unzip schema.zip
rm schema.zip
view raw Makefile hosted with ❤ by GitHub


That's it,
Pierre