12 December 2013

Inside Jvarkit: view BAM, cut, stats, head, tail, shuffle, downsample, group-by-gene VCFs...

Here are a few tools I recently wrote (and reinvented) for Jvarkit.

BamViewGui
a simple java-Swing-based BAM viewer.
VcfShuffle
Shuffle a VCF.
GroupByGene
Group VCF data by Gene
$ curl -s -k "https://raw.github.com/arq5x/gemini/master/test/test4.vep.snpeff.vcf" |\
java -jar dist/groupbygene.jar |\
head | column  -t

#chrom  min.POS    max.POS    gene.name  gene.type         samples.affected  count.variations  M10475  M10478  M10500  M128215
chr10   52004315   52004315   ASAH2      snpeff-gene-name  2                 1                 0       0       1       1
chr10   52004315   52004315   ASAH2      vep-gene-name     2                 1                 0       0       1       1
chr10   52497529   52497529   ASAH2B     snpeff-gene-name  2                 1                 0       1       1       0
chr10   52497529   52497529   ASAH2B     vep-gene-name     2                 1                 0       1       1       0
chr10   48003992   48003992   ASAH2C     snpeff-gene-name  3                 1                 1       1       1       0
chr10   48003992   48003992   ASAH2C     vep-gene-name     3                 1                 1       1       1       0
chr10   126678092  126678092  CTBP2      snpeff-gene-name  1                 1                 0       0       0       1
chr10   126678092  126678092  CTBP2      vep-gene-name     1                 1                 0       0       0       1
chr10   135336656  135369532  CYP2E1     snpeff-gene-name  3                 2                 0       2       1       1
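Because the raw output (before piping through `column -t`) is tab-delimited, it is easy to post-process with standard Unix tools. For example, a sketch keeping only the genes where at least 3 samples carry a variant (the file name 'groupbygene.out.tsv' is hypothetical, and the column positions are taken from the header shown above):

```shell
#!/bin/sh
# Keep the header line plus rows where column 6 (samples.affected) >= 3.
# Column positions follow the header above; adjust if your output differs.
awk -F '\t' 'NR==1 || $6 >= 3' groupbygene.out.tsv
```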
DownSampleVcf
Downsample a VCF.
VcfHead
Print the first variants of a VCF.
VcfTail
Print the last variants of a VCF
VcfCutSamples
Select/Exclude some samples from a VCF
VcfStats
Generate some statistics from a VCF. The output is an XML file that can be processed with XSLT.
$ curl  "https://raw.github.com/arq5x/gemini/master/test/test4.vep.snpeff.vcf" |\
  java -jar dist/vcfstats.jar |\
  xmllint --format -

<?xml version="1.0" encoding="UTF-8"?>
<vcf-statistics version="314bf88924a4003e6d6189ad3280d8b4df485aa1" input="stdin" date="Thu Dec 12 16:20:14 CET 2013">
  <section name="General">
    <statistics name="general" description="general">
      <counts name="general" description="General" keytype="string">
        <property key="num.dictionary.chromosomes">93</property>
         (...)
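Individual values can also be pulled out of that XML with any XPath-aware tool, for example xmllint (a sketch; the element and attribute names come from the excerpt above, and 'stats.xml' is an assumed file name for a saved vcfstats report):

```shell
#!/bin/sh
# Extract one counter from the vcfstats XML report with an XPath expression.
# 'stats.xml' is a hypothetical saved copy of the vcfstats output.
xmllint --xpath 'string(//property[@key="num.dictionary.chromosomes"])' stats.xml
```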

That's it,

Pierre

20 November 2013

Inside Jvarkit: Shrink your fastqs by 40% by aligning them to REF before compression.

BamToFastq is an implementation of https://twitter.com/DNAntonie/status/402909852277932032




Example: piping bwa mem


$ bwa mem -M  human_g1k_v37.fasta  Sample1_L001_R1_001.fastq.gz Sample1_L001_R2_001.fastq.gz |\
  java -jar dist/bam2fastq.jar  -F tmpR1.fastq.gz -R tmpR2.fastq.gz
before:
$ ls -lah Sample1_L001_R1_001.fastq.gz Sample1_L001_R2_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 181M Jun 14 15:20 Sample1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 190M Jun 14 15:20 Sample1_L001_R2_001.fastq.gz

after (these are Haloplex data, with a lot of duplicates):

$ ls -lah tmpR1.fastq.gz  tmpR2.fastq.gz
-rw-rw-r-- 1 lindenb lindenb  96M Nov 20 17:10 tmpR1.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 106M Nov 20 17:10 tmpR2.fastq.gz

using BZ2:

$  ls -lah *.bz2
-rw-rw-r-- 1 lindenb lindenb 77M Nov 20 17:55 tmpR1.fastq.bz2
-rw-rw-r-- 1 lindenb lindenb 87M Nov 20 17:55 tmpR2.fastq.bz2
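From the sizes listed above (rounded to MiB by ls), the saving is easy to quantify with an awk one-liner:

```shell
#!/bin/sh
# Percent reduction of the gzipped pair: (181+190) MiB before vs (96+106) MiB after.
awk 'BEGIN{printf "gzip : %.0f%%\n", 100*(1-(96+106)/(181+190))}'
# And after recompressing with bzip2: (77+87) MiB.
awk 'BEGIN{printf "bzip2: %.0f%%\n", 100*(1-(77+87)/(181+190))}'
```

So the realignment-before-compression trick shrinks this (duplicate-rich) dataset by roughly 46% with gzip and 56% with bzip2.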

That's it
Pierre

30 October 2013

GNU Make: saving the versions of the tools using 'order-only-prerequisites' : my notebook

Rule 3 of "Ten Simple Rules for Reproducible Computational Research" is:

Archive the Exact Versions of All External Programs Used.

I work with Makefile-based workflows: how can I save the version of each piece of software used whenever I invoke 'make', whatever the target is? A naive solution is to add a dependency to each target. For example, the following makefile takes a simple SAM file, converts it to BAM, sorts and indexes it. For each target, I've added a dependency named "dump_params" that appends the version of samtools to a file "config.txt".
.PHONY: dump_params all clean
.SHELL=/bin/bash
all: sorted.bam.bai dump_params
sorted.bam.bai: sorted.bam dump_params
	samtools index $<
sorted.bam: unsorted.bam dump_params
	samtools sort $< $(basename $@)
unsorted.bam : samtools-0.1.18/examples/toy.sam dump_params
	samtools view -Sb $< > $@
dump_params:
	date >> config.txt && \
	echo -n "Samtools " >> config.txt && \
	samtools 2>&1 | grep Version >> config.txt
clean:
	rm -f sorted.bam.bai sorted.bam unsorted.bam

But that solution doesn't work: 'dump_params' is a phony target, so it is always considered out-of-date, and make re-builds every target that depends on it even when the top target already exists.
$ make

date >> config.txt && \
 echo -n "Samtools " >> config.txt && \
 samtools  2>&1 | grep Version >> config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam > unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam


$ make

date >> config.txt && \
 echo -n "Samtools " >> config.txt && \
 samtools  2>&1 | grep Version >> config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam > unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam

The solution I got via Stackoverflow is to use an order-only prerequisite: "Order-only prerequisites can be specified by placing a pipe symbol (|) in the prerequisites list: any prerequisites to the left of the pipe symbol are normal; any prerequisites to the right are order-only... (...) Note that if you declare the same file to be both a normal and an order-only prerequisite, the normal prerequisite takes precedence (...)". The makefile with the order-only prerequisites is now:

.PHONY: dump_params all clean
.SHELL=/bin/bash
all: sorted.bam.bai dump_params
sorted.bam.bai: sorted.bam | dump_params
	samtools index $<
sorted.bam: unsorted.bam | dump_params
	samtools sort $< $(basename $@)
unsorted.bam : samtools-0.1.18/examples/toy.sam | dump_params
	samtools view -Sb $< > $@
dump_params:
	date >> config.txt && \
	echo -n "Samtools " >> config.txt && \
	samtools 2>&1 | grep Version >> config.txt
clean:
	rm -f sorted.bam.bai sorted.bam unsorted.bam



And that works! The final target is generated only once, but the file 'config.txt' is updated at each invocation.
$ make
date >> config.txt && \
 echo -n "Samtools " >> config.txt && \
 samtools  2>&1 | grep Version >> config.txt
samtools view -Sb samtools-0.1.18/examples/toy.sam > unsorted.bam
[samopen] SAM header is present: 2 sequences.
samtools sort unsorted.bam sorted
samtools index sorted.bam

$ make
date >> config.txt && \
 echo -n "Samtools " >> config.txt && \
 samtools  2>&1 | grep Version >> config.txt

$ make
date >> config.txt && \
 echo -n "Samtools " >> config.txt && \
 samtools  2>&1 | grep Version >> config.txt
That's it,
Pierre

Update: another solution

Citing MadScientist's answer on Stackoverflow: another option is to use an immediately expanded shell function, like:
__dummy := $(shell echo "Makefile was run." >> config.txt)
Since it's immediately expanded the shell script will be invoked once, as the makefile is read in. There's no need to define a dump_params target or include it as a prerequisite. This is more old-school, but has the advantage that it will run for every invocation of make, without having to go through and ensure every target has the proper order-only prerequisite defined.
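A throwaway demonstration of that behaviour (a sketch; 'Makefile.demo' and 'config.txt' are scratch files created for the test):

```shell
#!/bin/sh
# The $(shell ...) assignment is expanded while the makefile is parsed,
# so it runs exactly once per invocation of make, whatever the target.
rm -f config.txt
printf '__dummy := $(shell echo "Makefile was run." >> config.txt)\nall:\n\t@echo done\n' > Makefile.demo
make -f Makefile.demo    # prints "done"
make -f Makefile.demo    # prints "done" again; the shell line still ran
wc -l < config.txt       # two invocations, two lines appended
```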




25 October 2013

YES: "Choice of transcripts and software has a large effect on variant annotation"

This post was inspired by Aaron Quinlan's tweet:


Here is an example of a missense mutation found with VCFPredictions, a simple tool I wrote for variant effect prediction.

#CHROM POS ID REF ALT
1 23710805 rs605468 A G
My tool uses the UCSC knownGene track; here is the context of the variant in the UCSC genome browser. There is one exon for TCEA3 (uc021oig.1) on the reverse strand.


If the base at 23710805 is changed from 'A'→'G' on the forward strand, there will be a non-synonymous variation Y(TAT)→H(CAT) on the reverse strand.
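The strand arithmetic can be checked with a tiny shell helper. A sketch (the forward-strand triplet ATA below is simply the complement of the TAT codon, used for illustration):

```shell
#!/bin/sh
# revcomp: reverse-complement a DNA string (requires the 'rev' utility).
revcomp() { printf '%s' "$1" | rev | tr 'ACGT' 'TGCA'; }
echo "$(revcomp ATA)"   # the forward triplet ATA reads TAT (Tyr) on the reverse strand
echo "$(revcomp ATG)"   # after the forward A->G change, it reads CAT (His)
```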


At the NCBI, rs605468 is said to be "located in the intron region of NM_003196.1".

VEP cannot find this missense variation:

Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
rs605468 1:23710805 G - ENSR00001518296 Transcript regulatory_region_variant - - - - - rs605468 GMAF=A:0.1204
rs605468 1:23710805 G ENSG00000204219 ENST00000476978 Transcript intron_variant - - - - - rs605468 GMAF=A:0.1204
rs605468 1:23710805 G ENSG00000204219 ENST00000450454 Transcript intron_variant - - - - - rs605468 GMAF=A:0.1204

(Of course, my tool also misses some variations that VEP finds.)

That's it,
Pierre

24 October 2013

PubMed Commons & Bioinformatics: a call for action


NCBI PubMed Commons (@PubMedCommons) is a new system that enables researchers to share their opinions about scientific publications: researchers can comment on any publication indexed by PubMed and read the comments of others.
Now that we can add comments to the papers in PubMed, I suggest flagging articles to mark deprecated software, databases and hyperlinks, using a simple controlled syntax. Here are a few examples: the line starts with '!MOVED' or '!NOTFOUND' and is followed by a URL and/or a list of PMIDs and/or a quoted comment.

Examples

For http://www.ncbi.nlm.nih.gov/pubmed/8392714 (Rebase/1993), superseded by http://www.ncbi.nlm.nih.gov/pubmed/19846593 (Rebase/2010):
!MOVED PMID:19846593 "A more recent version"
In http://www.ncbi.nlm.nih.gov/pubmed/19919682, the URL has moved to http://biogps.org:
!MOVED <http://biogps.org>
I moved the sources of http://www.ncbi.nlm.nih.gov/pubmed/9682060 to github:
!MOVED <https://github.com/lindenb/cloneit/tree/master/c>
For http://www.ncbi.nlm.nih.gov/pubmed/9545460 (Biocat EXCEL template), the URL http://www.ebi.ac.uk/biocat/biocat.html returns a 404:
!NOTFOUND "The URL http://www.ebi.ac.uk/biocat/biocat.html was not found"
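Such a controlled syntax would be trivially machine-readable. A sketch of a scanner (the file 'comments.txt' is hypothetical, and the syntax itself is only the proposal above, not an existing standard):

```shell
#!/bin/sh
# List all deprecation flags found in a dump of PubMed Commons comments,
# turning '!FLAG payload' into 'FLAG: payload'.
grep -E '^!(MOVED|NOTFOUND)' comments.txt |
  sed -E 's/^!([A-Z]+)[: ]*/\1: /'
```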


That's it,

Pierre

23 October 2013

Inside the variation toolkit: Generating a structured document describing an Illumina directory.

I wrote a tool named "Illuminadir": it creates a structured (JSON or XML) representation of a directory containing Illumina FASTQs (I only tested it with HiSeq paired-end data with indexes).

Motivation

Illuminadir scans folders, searches for FASTQs and generates a structured summary of the files (XML or JSON).
It is currently only tested with HiSeq data having an index.

Compilation

See also Compilation

ant illuminadir

Options

Option        Description
IN=File       root directories. This option may be specified 1 or more times.
JSON=Boolean  JSON output. Default value: false. The JSON output can be processed with jsvelocity https://github.com/lindenb/jsvelocity

Example

$ java  -jar dist/illuminadir.jar \
	I=dir1 \
	I=dir2 | xsltproc xml2script.xslt - > script.bash
(...)

XML output

The XML output looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<illumina>
  <!--com.github.lindenb.jvarkit.tools.misc.IlluminaDirectory IN=[RUN62_XFC2DM8ACXX/data]    JSON=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false-->
  <directory path="RUN62_XFC2DM8ACXX/data">
    <samples>
      <sample name="SAMPLE1">
        <pair md5="cd4b436ce7aff4cf669d282c6d9a7899" lane="8" index="ATCACG" split="2">
          <fastq md5filename="3369c3457d6603f06379b654cb78e696" side="1" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_002.fastq.gz" file-size="359046311"/>
          <fastq md5filename="832039fa00b5f40108848e48eb437e0b" side="2" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_002.fastq.gz" file-size="359659451"/>
        </pair>
        <pair md5="b3050fa3307e63ab9790b0e263c5d240" lane="8" index="ATCACG" split="3">
          <fastq md5filename="091727bb6b300e463c3d708e157436ab" side="1" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_003.fastq.gz" file-size="206660736"/>
          <fastq md5filename="20235ef4ec8845515beb4e13da34b5d3" side="2" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_003.fastq.gz" file-size="206715143"/>
        </pair>
        <pair md5="9f7ee49e87d01610372c43ab928939f6" lane="8" index="ATCACG" split="1">
          <fastq md5filename="54cb2fd33edd5c2e787287ccf1595952" side="1" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_001.fastq.gz" file-size="354530831"/>
          <fastq md5filename="e937cbdf32020074e50d3332c67cf6b3" side="2" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_001.fastq.gz" file-size="356908963"/>
        </pair>
        <pair md5="0697846a504158eef523c0f4ede85288" lane="7" index="ATCACG" split="2">
          <fastq md5filename="6fb35d130efae4dcfa79260281504aa3" side="1" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L007_R1_002.fastq.gz" file-size="357120615"/>
(...)
      <pair md5="634cbb29ca64604174963a4fff09f37a" lane="7" split="1">
        <fastq md5filename="bc0b283a58946fd75a95b330e0aefdc8" side="1" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane7/lane7_Undetermined_L007_R1_001.fastq.gz" file-size="371063045"/>
        <fastq md5filename="9eab26c5b593d50d642399d172a11835" side="2" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane7/lane7_Undetermined_L007_R2_001.fastq.gz" file-size="372221753"/>
      </pair>
      <pair md5="bf31099075d6c3c7ea052b8038cb4a03" lane="8" split="2">
        <fastq md5filename="f229389da36a3efc20888bffdec09b80" side="1" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R1_002.fastq.gz" file-size="374331268"/>
        <fastq md5filename="417fd9f28d24f63ce0d0808d97543315" side="2" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R2_002.fastq.gz" file-size="372181102"/>
      </pair>
      <pair md5="95cab850b0608c53e8c83b25cfdb3b2b" lane="8" split="3">
        <fastq md5filename="23f5be8a962697f50e2a271394242e2f" side="1" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R1_003.fastq.gz" file-size="60303589"/>
        <fastq md5filename="3f39f212c36d0aa884b81649ad56630c" side="2" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R2_003.fastq.gz" file-size="59123627"/>
      </pair>
      <pair md5="ab108b1dda7df86f33f375367b86bfe4" lane="8" split="1">
        <fastq md5filename="14f8281cf7d1a53d29cd03cb53a45b4a" side="1" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R1_001.fastq.gz" file-size="371255111"/>
        <fastq md5filename="977fd388e1b3451dfcdbf9bdcbb89ed4" side="2" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R2_001.fastq.gz" file-size="370744530"/>
      </pair>
    </undetermined>
  </directory>
</illumina>

How to use that file? Here is an example of an XSLT stylesheet that generates a Makefile; the Makefile counts the reads per Lane/Sample/Index and builds a LaTeX report.

<?xml version='1.0'  encoding="ISO-8859-1"?>
<xsl:stylesheet
	xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
	version='1.0' 
	> 
<xsl:output method="text"/>


<xsl:template match="/">
.PHONY:all clean

all: report.pdf

report.pdf: report.tex 
	pdflatex $&lt;

report.tex : all.count
	echo 'T&lt;-read.table("$&lt;",head=TRUE,sep="\t");$(foreach FTYPE,Index Sample Lane, T2&lt;-tapply(T$$count,T$$${FTYPE},sum);png("${FTYPE}.png");barplot(T2,las=3);dev.off();)' | R --no-save
	echo "\documentclass{report}" &gt; $@
	echo "\usepackage{graphicx}" &gt;&gt; $@
	echo "\date{\today}" &gt;&gt; $@
	echo "\title{FastQ Report}" &gt;&gt; $@
	echo "\begin{document}" &gt;&gt; $@
	echo "\maketitle" &gt;&gt; $@
	$(foreach FTYPE,Index Sample Lane, echo "\section{By ${FTYPE}}#\begin{center}#\includegraphics{${FTYPE}.png}#\end{center}" | tr "#" "\n" &gt;&gt; $@ ; )
	echo "\end{document}" &gt;&gt; $@
	


all.count : $(addsuffix .count, <xsl:for-each select="//fastq" ><xsl:value-of select="@md5filename"/><xsl:text> </xsl:text></xsl:for-each>) 
	echo -e "Lane\tsplit\tside\tsize\tcount\tIndex\tSample"  &gt; $@ &amp;&amp; \
	cat $^ &gt;&gt; $@

<xsl:apply-templates select="//fastq" mode="count"/>

clean:
	rm -f all.count report.pdf report.tex $(addsuffix .count, <xsl:for-each select="//fastq" ><xsl:value-of select="@md5filename"/><xsl:text> </xsl:text></xsl:for-each>) 

</xsl:template>

<xsl:template match="fastq" mode="count">
$(addsuffix .count, <xsl:value-of select="@md5filename"/>): <xsl:value-of select="@path"/>
	gunzip -c $&lt; | awk '(NR%4==1)' | wc -l  | xargs  printf "<xsl:value-of select="../@lane"/>\t<xsl:value-of select="../@split"/>\t<xsl:value-of select="@side"/>\t<xsl:value-of select="@file-size"/>\t%s\t<xsl:choose><xsl:when test="../@index"><xsl:value-of select="../@index"/></xsl:when><xsl:otherwise>Undetermined</xsl:otherwise></xsl:choose>\t<xsl:choose><xsl:when test="../../@name"><xsl:value-of select="../../@name"/></xsl:when><xsl:otherwise>Undetermined</xsl:otherwise></xsl:choose>\n"   &gt; $@

</xsl:template>
</xsl:stylesheet>
$ xsltproc illumina2makefile.xsl illumina.xml > Makefile
output:
.PHONY:all clean

all: report.pdf

report.pdf: report.tex 
	pdflatex $<

report.tex : all.count
	echo 'T<-read.table("$<",head=TRUE,sep="\t");$(foreach FTYPE,Index Sample Lane, T2<-tapply(T$$count,T$$${FTYPE},sum);png("${FTYPE}.png");barplot(T2,las=3);dev.off();)' | R --no-save
	echo "\documentclass{report}" > $@
	echo "\usepackage{graphicx}" >> $@
	echo "\date{\today}" >> $@
	echo "\title{FastQ Report}" >> $@
	echo "\begin{document}" >> $@
	echo "\maketitle" >> $@
	$(foreach FTYPE,Index Sample Lane, echo "\section{By ${FTYPE}}#\begin{center}#\includegraphics{${FTYPE}.png}#\end{center}" | tr "#" "\n" >> $@ ; )
	echo "\end{document}" >> $@



all.count : $(addsuffix .count, 3369c3457d6603f06379b654cb78e696 832039fa00b5f40108848e48eb437e0b 091727bb6b300e463c3d708e157436ab 20235ef4ec88....)
	echo -e "Lane\tsplit\tside\tsize\tcount\tIndex\tSample"  > $@ && \
	cat $^ >> $@


$(addsuffix .count, 3369c3457d6603f06379b654cb78e696): RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_002.fastq.gz
	gunzip -c $< | awk '(NR%4==1)' | wc -l  | xargs  printf "8\t2\t1\t359046311\t%s\tATCACG\tSAMPLE1\n"   > $@


$(addsuffix .count, 832039fa00b5f40108848e48eb437e0b): RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_002.fastq.gz
	gunzip -c $< | awk '(NR%4==1)' | wc -l  | xargs  printf "8\t2\t2\t359659451\t%s\tATCACG\tSAMPLE1\n"   > $@
(....)

JSON output

The JSON output looks like this:

[{"directory":"RUN62_XFC2DM8ACXX/data","samples":[{"sample":"SAMPLE1","files":[{
"md5pair":"cd4b436ce7aff4cf669d282c6d9a7899","lane":8,"index":"ATCACG","split":2
,"forward":{"md5filename":"3369c3457d6603f06379b654cb78e696","path":"20131001_SN
L149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_002.fastq.g
z","side":1,"file-size":359046311},"reverse":{"md5filename":"832039fa00b5f401088
48e48eb437e0b","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/
SAMPLE1_ATCACG_L008_R2_002.fastq.gz","side":2,"file-size":359659451}},{"md5pair"
:"b3050fa3307e63ab9790b0e263c5d240","lane":8,"index":"ATCACG","split":3,"forward
":{"md5filename":"091727bb6b300e463c3d708e157436ab","path":"20131001_SNL149_0062
_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_003.fastq.gz","side"
:1,"file-size":206660736},"reverse":{"md5filename":"20235ef4ec8845515beb4e13da34
b5d3","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_A
TCACG_L008_R2_003.fastq.gz","side":2,"file-size":206715143}},{"md5pair":"9f7ee49
e87d01610372c43ab928939f6","lane":8,"index":"ATCACG","split":1,"forward":{"md5fi
lename":"54cb2fd33edd5c2e787287ccf1595952","path":"20131001_SNL149_0062_XFC2DM8A
CXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_001.fastq.gz","side":1,"file-
size":354530831},"reverse":{"md5filename":"e937cbdf32020074e50d3332c67cf6b3","pa
th":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L00
8_R2_001.fastq.gz","side":2,"file-size":356908963}},{"md5pair":"0697846a504158ee
f523c0f4ede85288","lane":7,"index":"ATCACG","split":2,"forward":{"md5filename":"
(...)

It can be processed using a tool like jsvelocity to generate the same kind of Makefile:

The velocity template for jsvelocity


#macro(maketarget $fastq)

$(addsuffix .count, ${fastq.md5filename}): ${fastq.path}
	gunzip -c $< | awk '(NR%4==1)' | wc -l  | xargs  printf "${fastq.parentNode.lane}\t${fastq.parentNode.split}\t${fastq.side}\t${fastq['file-size']}\t%s\t#if(${fastq.parentNode.containsKey("index")})${fastq.parentNode.index}#{else}Undetermined#{end}\t#if(${fastq.parentNode.parentNode.containsKey("name")})${fastq.parentNode.parentNode.name}#{else}Undetermined#{end}\n"   > $@

#end

.PHONY:all clean

all: report.pdf

report.pdf: report.tex 
	pdflatex $<

report.tex : all.count
	echo 'T<-read.table("$<",head=TRUE,sep="\t");$(foreach FTYPE,Index Sample Lane, T2<-tapply(T$$count,T$$${FTYPE},sum);png("${FTYPE}.png");barplot(T2,las=3);dev.off();)' | R --no-save
	echo "\documentclass{report}" > $@
	echo "\usepackage{graphicx}" >> $@
	echo "\date{\today}" >> $@
	echo "\title{FastQ Report}" >> $@
	echo "\begin{document}" >> $@
	echo "\maketitle" >> $@
	$(foreach FTYPE,Index Sample Lane, echo "\section{By ${FTYPE}}#\begin{center}#\includegraphics{${FTYPE}.png}#\end{center}" | tr "#" "\n" >> $@ ; )
	echo "\end{document}" >> $@

all.count : $(addsuffix .count, #foreach($dir in $all) #foreach($sample in ${dir.samples})#foreach($pair in ${sample.files}) ${pair.forward.md5filename}  ${pair.reverse.md5filename} #end #end #foreach($pair in   ${dir.undetermined}) ${pair.forward.md5filename}  ${pair.reverse.md5filename} #end  #end )



#foreach($dir in $all)
#foreach($sample in ${dir.samples})
#foreach($pair in ${sample.files})
#maketarget($pair.forward)
#maketarget($pair.reverse)
#end
#end
#foreach($pair in   ${dir.undetermined})
#maketarget($pair.forward)
#maketarget($pair.reverse)
#end 
#end


clean:
	rm -f all.count  $(addsuffix .count,  #foreach($dir in $all)
#foreach($sample in ${dir.samples})
#foreach($pair in ${sample.files}) ${pair.forward.md5filename}  ${pair.reverse.md5filename}  #end #end
#foreach($pair in   ${dir.undetermined}) ${pair.forward.md5filename}  ${pair.reverse.md5filename}  #end  #end )

transform using jsvelocity:


java -jar dist/jsvelocity.jar \
     -d all illumina.json \
      illumina.vm > Makefile

Output: same as above.

That's it,

Pierre

PS: This post was generated using the XSLT stylesheet "github2html.xsl" and https://github.com/lindenb/jvarkit/wiki/Illuminadir.

17 October 2013

Rapid prototyping of a read-only Lims using JSON and Apache Velocity.

In a previous post, I showed how to use the Apache Velocity template engine to format JSON data.



Since that post, I've moved my application to a github repository: https://github.com/lindenb/jsvelocity. The project contains a java-based standalone tool to process the JSON data.
Here is an example. The JSON data:

{
"individuals":[
    {
    "name": "Riri",
    "age": 8,
    "duck": true
    },
    {
    "name": "Fifi",
    "age": 9,
    "duck": true
    },
    {
    "name": "Loulou",
    "age": 10,
    "duck": true
    }
    ]
}
.... and the velocity template:
#foreach($indi in ${all.individuals})
<h1>${indi['name']}</h1>
Age:${indi.age}<br/>${indi.duck}
#end
... with the following command line ...
$ java -jar dist/jsvelocity.jar \
    -f all test.json \
    test.vm
... produces the following output ...
<h1>Riri</h1>
Age:8<br/>true
<h1>Fifi</h1>
Age:9<br/>true
<h1>Loulou</h1>
Age:10<br/>true


Today I wrote a web version of the tool using the jetty server. I wanted to quickly write a web interface to display various summaries for our NGS experiments.
My JSON input looks like this:
{
"sequencer":[
 {
 "name":"HiSeq"
 },
 {
 "name":"MiSeq"
 }
 ],
"run":[ {
 "sequencer":"HiSeq",
 "flowcell":"C2AVTACXX",
 "name":"131010_C2AVTACXX_61",
 "date":"2013-10-10",
 "comment":"A comment",
 "samples":[
  {
  "name":"CD0001",
  "meancov": 10
  },
  {
  "name":"CD0002",
  "meancov": 20.0
  }
  ,
  {
  "name":"CD0003",
  "meancov": 30.0
  }
  ]
 },
 {
 "sequencer":"MiSeq",
 "flowcell":"C3VATACYY",
 "name":"131011_C3VATACYY_62",
 "date":"2013-10-11",
 "comment":"Another comment",
 "samples":[
  {
  "name":"CD0001",
  "meancov": 11
  },
  {
  "name":"CD0006",
  "meancov": 21.0
  }
  ,
  {
  "name":"CD0008",
  "meancov": null
  }
  ]
 },
 {
 "sequencer":"MiSeq",
 "flowcell":"C4VATACYZ",
 "name":"131012_C4VATACYZ_63",
 "date":"2013-10-12",
 "comment":"Another comment",
 "samples":[
  {
  "name":"CD0010",
  "meancov":1,
  "comment":"Failed, please, re-sequence"
  }
  ]
 }
 
 
 ],
"samples":[ 
 { "name":"CD0001" },
 { "name":"CD0002" },
 { "name":"CD0003" }, 
 { "name":"CD0004" }, 
 { "name":"CD0005" }, 
 { "name":"CD0006" }, 
 { "name":"CD0007" }, 
 { "name":"CD0008" },
 { "name":"CD0009" },
 { "name":"CD0010" }
 ],
"projects":[
 {
 "name":"Disease1",
 "description": "sequencing Project 1",
 "samples":["CD0001","CD0002","CD0006","CD0009"]
 },
 {
 "name":"Disease2",
 "description": "sequencing Project 2",
 "samples":["CD0002","CD0003","CD0008","CD0009"]
 }
 ]

}
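Before reaching for a template engine, such a file can also be queried directly, for example with jq (a sketch, assuming jq is installed; it prints the mean coverage per run, counting the null value above as 0):

```shell
#!/bin/sh
# Mean of the per-sample "meancov" values for each run in lims.json.
jq -r '.run[] | "\(.name)\t\([.samples[].meancov // 0] | add / length)"' lims.json
```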
One velocity template is used to browse this 'database': https://github.com/lindenb/jsvelocity/blob/master/src/test/resources/velocity/lims.vm.
The server is started like:
java -jar dist/webjsvelocity.jar  \
    -F lims src/test/resources/json/lims.json \
    src/test/resources/velocity/lims.vm

2013-10-17 12:43:35.566:INFO:oejs.Server:main: jetty-9.1.0.M0
2013-10-17 12:43:35.602:INFO:oejs.ServerConnector:main: Started ServerConnector@72dcb6{HTTP/1.1}{0.0.0.0:8080}
(...)
And here is a screenshot of the result:






That's it,

Pierre

15 October 2013

GNU parallel for #bioinformatics : my notebook

This notebook about GNU parallel follows Ole Tange's GNU parallel tutorial ( http://www.gnu.org/software/parallel/parallel_tutorial.html ), but I tried to use some bioinformatics-related examples (aligning with BWA, samtools, etc.).



That's it,
Pierre

07 October 2013

Software Environment Management with Modules: my notebook


The following question was recently asked on Biostar: "Bioinformatics: how to version control small scripts located all over the server". I suggested putting the scripts in a central repository (under git or whatever) and using symbolic links in the workspaces to manage the files. On the other hand, Alex Reynolds suggested using a Module to deploy versions of a given package.

http://modules.sourceforge.net/: " The Environment Modules package provides for the dynamic modification of a user's environment via modulefiles. Each modulefile contains the information needed to configure the shell for an application. Once the Modules package is initialized, the environment can be modified on a per-module basis using the module command which interprets modulefiles. Typically modulefiles instruct the module command to alter or set shell environment variables such as PATH, MANPATH, etc. modulefiles may be shared by many users on a system and users may have their own collection to supplement or replace the shared modulefiles."




This tool was also suggested by Lex Nederbragt on Twitter:



I've just played with Modules; here is my (short) experience.

I've got two versions of samtools on my server: 0.1.18 and 0.1.19. I want to easily switch from one version to another.

I created a hidden directory in my home:
mkdir ~/.modules
It contains a directory "samtools" that will contain the two 'module files' for the two versions of samtools:
mkdir ~/.modules/samtools

The directory '~/.modules' is added to the variable ${MODULEPATH} (the path that the module command searches when looking for modulefiles):
$ export MODULEPATH=${MODULEPATH}:${HOME}/.modules

Create the module file '${HOME}/.modules/samtools/0.1.18' for samtools 0.1.18. This module file adds the paths to samtools 0.1.18 and bcftools 0.1.18:
#%Module1.0

proc ModulesHelp { } {
global dotversion
puts stderr "\tSamtools"
}

module-whatis "Samtools 0.1.18" 
prepend-path PATH /commun/data/packages/samtools-0.1.18/
prepend-path PATH /commun/data/packages/samtools-0.1.18/bcftools

Create the module file '${HOME}/.modules/samtools/0.1.19' for samtools 0.1.19. This module file adds the paths to samtools 0.1.19 and bcftools 0.1.19:
#%Module1.0

proc ModulesHelp { } {
global dotversion
puts stderr "\tSamtools"
}

module-whatis "Samtools 0.1.19" 
prepend-path PATH /commun/data/packages/samtools-0.1.19/
prepend-path PATH /commun/data/packages/samtools-0.1.19/bcftools

We also create a file '${HOME}/.modules/samtools/.version' to define the default version of samtools:
#%Module1.0
set ModulesVersion "0.1.19"

On startup, the shell doesn't know about samtools or bcftools:
$ which samtools
/usr/bin/which: no samtools in (/commun/sge_root/bin/lx24-amd64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/lindenb/bin)

$ which bcftools
/usr/bin/which: no bcftools in (/commun/sge_root/bin/lx24-amd64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/lindenb/bin)
List the available modules:
$ module avail                                                                                                

------------------------------------------------ /usr/share/Modules/modulefiles ------------------------------------------------
dot           module-cvs    module-info   modules       mpich2-x86_64 null          use.own

---------------------------------------------------- /home/lindenb/.modules ----------------------------------------------------
samtools/0.1.18          samtools/0.1.19(default)
Let's load the default configuration for samtools:
$ module load samtools

Now the shell knows the PATH to samtools:

$ which samtools
/commun/data/packages/samtools-0.1.19/samtools
$ which bcftools
/commun/data/packages/samtools-0.1.19/bcftools/bcftools

Unload the module for samtools:
$ module unload samtools
$ which samtools
/usr/bin/which: no samtools in (/commun/sge_root/bin/lx24-amd64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/lindenb/bin)

$ which bcftools
/usr/bin/which: no bcftools in (/commun/sge_root/bin/lx24-amd64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/lindenb/bin)

Now load the (old) version 0.1.18 of samtools:
$ module load samtools/0.1.18

The old versions of samtools and bcftools are now in the $PATH:
$ which samtools
/commun/data/packages/samtools-0.1.18/samtools

$ which bcftools
/commun/data/packages/samtools-0.1.18/bcftools/bcftools
That's it,

Pierre

26 September 2013

Presentation : File formats for Next Generation Sequencing

Here is my presentation for the course "File formats for Next Generation Sequencing", which I gave on Monday at the University of Nantes.

Presentation: Advanced NCBI

Here is my presentation for the course "Advanced NCBI", which I gave yesterday at the University of Nantes.


10 September 2013

Building a simple & stupid LIMS with the Eclipse Modeling Framework (#EMF), my notebook.

I played with the Eclipse Modeling Framework (EMF) and created a simple interface to manage a list of sequenced samples.


I've uploaded my notes on slideshare:



That's it,

Pierre

26 July 2013

g1kv37 vs hg19

In order to create a class that translates chromosome names from one naming convention to another, I've compared the MD5 sums of the human genome versions g1k/v37 and ucsc/hg19. Here is the java program to create the MD5s:

import java.io.*;
import java.security.MessageDigest;

/** reads a FASTA stream on stdin; for each sequence, prints: name, MD5 of the upper-cased sequence, length */
public class FastaMD5
    {
    public static void main(String args[]) throws Exception
        {
        int len=0;
        byte[] buffer = new byte[1];
        MessageDigest complete = null;
        for(;;)
            {
            int c=System.in.read();
            switch(c)
                {
                case -1: case '>':
                    {
                    /* end of the previous sequence: print its MD5 and its length */
                    if(complete!=null)
                        {
                        for(byte b:complete.digest())
                            {
                            /* zero-padded lowercase hex of the byte */
                            System.out.print(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
                            }
                        System.out.println("\t"+len);
                        complete=null;
                        len=0;
                        }
                    if(c==-1) return;
                    /* print the sequence name, then start a new digest */
                    while((c=System.in.read())!=-1 && c!='\n') System.out.print((char)c);
                    System.out.print('\t');
                    complete=MessageDigest.getInstance("MD5");
                    len=0;
                    break;
                    }
                case '\n': case ' ': case '\r': break; /* ignore whitespace */
                default:
                    {
                    buffer[0]=(byte)Character.toUpperCase(c);
                    complete.update(buffer, 0, 1);
                    ++len;
                    break;
                    }
                }
            }
        }
    }
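The hex-encoding one-liner in the loop deserves a note: adding 0x100 before converting to base 16 yields a value in [0x100, 0x1ff], whose string is always three characters; dropping the first character leaves the zero-padded two-digit hex of the byte. A small standalone check (the class name Md5Hex is mine, for illustration only):

```java
import java.security.MessageDigest;

public class Md5Hex {
    // Same trick as in FastaMD5 above: (b & 0xff) + 0x100 guarantees a
    // 3-char base-16 string; substring(1) keeps the zero-padded hex byte.
    static String hex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(Integer.toString((b & 0xff) + 0x100, 16).substring(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // MD5 of the empty input is the well-known d41d8cd98f00b204e9800998ecf8427e
        MessageDigest md = MessageDigest.getInstance("MD5");
        System.out.println(hex(md.digest()));
    }
}
```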

The MD5 sums were extracted as follows:
$ curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz" | gunzip -c | java FastaMD5 > a.txt
$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz" | gunzip -c | tar Oxvf - 2> /dev/null | java FastaMD5 > b.txt
##join
$ join -t ' ' -1 2 -2 2 <(sort -t ' ' -k2,2 a.txt ) <(sort -t ' ' -k2,2 b.txt ) | cut -d ' ' -f 1,2,4 | sort -t ' ' -k3,3
#unjoinable
$ join -t ' ' -1 2 -2 2 -v 1 -v 2 <(sort -t ' ' -k2,2 a.txt ) <(sort -t ' ' -k2,2 b.txt ) | sort -t ' ' -k2,2
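The two join calls can be illustrated on toy data (the names and hashes below are made up): both files are sorted on the hash column, then joined on it; -v 1 -v 2 prints the lines of either file that found no partner:

```shell
printf 'chrA\thash1\nchrB\thash2\n' | sort -t $'\t' -k2,2 > a.txt
printf 'chr1\thash1\nchr3\thash3\n' | sort -t $'\t' -k2,2 > b.txt
# shared hashes: the join field comes first, then the names from each file
join -t $'\t' -1 2 -2 2 a.txt b.txt
# unjoinable lines from both files
join -t $'\t' -1 2 -2 2 -v 1 -v 2 a.txt b.txt
```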



Here are the common chromosomes, joined on the hash-sum:
1b22b98cdeb4a9304cb5d48026a85128 1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 chr1
988c28e000e84c26d552359af1ea2e1d 10 dna:chromosome chromosome:GRCh37:10:1:135534747:1 chr10
98c59049a2df285c76ffb1c6db8f8b96 11 dna:chromosome chromosome:GRCh37:11:1:135006516:1 chr11
06cbf126247d89664a4faebad130fe9c GL000202.1 dna:supercontig supercontig::GL000202.1:1:40103:1 chr11_gl000202_random
51851ac0e1a115847ad36449b0015864 12 dna:chromosome chromosome:GRCh37:12:1:133851895:1 chr12
283f8d7892baa81b510a015719ca7b0b 13 dna:chromosome chromosome:GRCh37:13:1:115169878:1 chr13
98f3cae32b2a2e9524bc19813927542e 14 dna:chromosome chromosome:GRCh37:14:1:107349540:1 chr14
e5645a794a8238215b2cd77acb95a078 15 dna:chromosome chromosome:GRCh37:15:1:102531392:1 chr15
fc9b1a7b42b97a864f56b348b06095e6 16 dna:chromosome chromosome:GRCh37:16:1:90354753:1 chr16
351f64d4f4f9ddd45b35336ad97aa6de 17 dna:chromosome chromosome:GRCh37:17:1:81195210:1 chr17
96358c325fe0e70bee73436e8bb14dbd GL000203.1 dna:supercontig supercontig::GL000203.1:1:37498:1 chr17_gl000203_random
efc49c871536fa8d79cb0a06fa739722 GL000204.1 dna:supercontig supercontig::GL000204.1:1:81310:1 chr17_gl000204_random
d22441398d99caf673e9afb9a1908ec5 GL000205.1 dna:supercontig supercontig::GL000205.1:1:174588:1 chr17_gl000205_random
43f69e423533e948bfae5ce1d45bd3f1 GL000206.1 dna:supercontig supercontig::GL000206.1:1:41001:1 chr17_gl000206_random
b15d4b2d29dde9d3e4f93d1d0f2cbc9c 18 dna:chromosome chromosome:GRCh37:18:1:78077248:1 chr18
f3814841f1939d3ca19072d9e89f3fd7 GL000207.1 dna:supercontig supercontig::GL000207.1:1:4262:1 chr18_gl000207_random
1aacd71f30db8e561810913e0b72636d 19 dna:chromosome chromosome:GRCh37:19:1:59128983:1 chr19
aa81be49bf3fe63a79bdc6a6f279abf6 GL000208.1 dna:supercontig supercontig::GL000208.1:1:92689:1 chr19_gl000208_random
f40598e2a5a6b26e84a3775e0d1e2c81 GL000209.1 dna:supercontig supercontig::GL000209.1:1:159169:1 chr19_gl000209_random
d75b436f50a8214ee9c2a51d30b2c2cc GL000191.1 dna:supercontig supercontig::GL000191.1:1:106433:1 chr1_gl000191_random
325ba9e808f669dfeee210fdd7b470ac GL000192.1 dna:supercontig supercontig::GL000192.1:1:547496:1 chr1_gl000192_random
a0d9851da00400dec1098a9255ac712e 2 dna:chromosome chromosome:GRCh37:2:1:243199373:1 chr2
0dec9660ec1efaaf33281c0d5ea2560f 20 dna:chromosome chromosome:GRCh37:20:1:63025520:1 chr20
2979a6085bfe28e3ad6f552f361ed74d 21 dna:chromosome chromosome:GRCh37:21:1:48129895:1 chr21
851106a74238044126131ce2a8e5847c GL000210.1 dna:supercontig supercontig::GL000210.1:1:27682:1 chr21_gl000210_random
a718acaa6135fdca8357d5bfe94211dd 22 dna:chromosome chromosome:GRCh37:22:1:51304566:1 chr22
23dccd106897542ad87d2765d28a19a1 4 dna:chromosome chromosome:GRCh37:4:1:191154276:1 chr4
dbb6e8ece0b5de29da56601613007c2a GL000193.1 dna:supercontig supercontig::GL000193.1:1:189789:1 chr4_gl000193_random
6ac8f815bf8e845bb3031b73f812c012 GL000194.1 dna:supercontig supercontig::GL000194.1:1:191469:1 chr4_gl000194_random
0740173db9ffd264d728f32784845cd7 5 dna:chromosome chromosome:GRCh37:5:1:180915260:1 chr5
1d3a93a248d92a729ee764823acbbc6b 6 dna:chromosome chromosome:GRCh37:6:1:171115067:1 chr6
618366e953d6aaad97dbe4777c29375e 7 dna:chromosome chromosome:GRCh37:7:1:159138663:1 chr7
5d9ec007868d517e73543b005ba48535 GL000195.1 dna:supercontig supercontig::GL000195.1:1:182896:1 chr7_gl000195_random
96f514a9929e410c6651697bded59aec 8 dna:chromosome chromosome:GRCh37:8:1:146364022:1 chr8
d92206d1bb4c3b4019c43c0875c06dc0 GL000196.1 dna:supercontig supercontig::GL000196.1:1:38914:1 chr8_gl000196_random
6f5efdd36643a9b8c8ccad6f2f1edc7b GL000197.1 dna:supercontig supercontig::GL000197.1:1:37175:1 chr8_gl000197_random
3e273117f15e0a400f01055d9f393768 9 dna:chromosome chromosome:GRCh37:9:1:141213431:1 chr9
868e7784040da90d900d2d1b667a1383 GL000198.1 dna:supercontig supercontig::GL000198.1:1:90085:1 chr9_gl000198_random
569af3b73522fab4b40995ae4944e78e GL000199.1 dna:supercontig supercontig::GL000199.1:1:169874:1 chr9_gl000199_random
75e4c8d17cd4addf3917d1703cacaf25 GL000200.1 dna:supercontig supercontig::GL000200.1:1:187035:1 chr9_gl000200_random
dfb7e7ec60ffdcb85cb359ea28454ee9 GL000201.1 dna:supercontig supercontig::GL000201.1:1:36148:1 chr9_gl000201_random
7daaa45c66b288847b9b32b964e623d3 GL000211.1 dna:supercontig supercontig::GL000211.1:1:166566:1 chrUn_gl000211
563531689f3dbd691331fd6c5730a88b GL000212.1 dna:supercontig supercontig::GL000212.1:1:186858:1 chrUn_gl000212
9d424fdcc98866650b58f004080a992a GL000213.1 dna:supercontig supercontig::GL000213.1:1:164239:1 chrUn_gl000213
46c2032c37f2ed899eb41c0473319a69 GL000214.1 dna:supercontig supercontig::GL000214.1:1:137718:1 chrUn_gl000214
5eb3b418480ae67a997957c909375a73 GL000215.1 dna:supercontig supercontig::GL000215.1:1:172545:1 chrUn_gl000215
642a232d91c486ac339263820aef7fe0 GL000216.1 dna:supercontig supercontig::GL000216.1:1:172294:1 chrUn_gl000216
6d243e18dea1945fb7f2517615b8f52e GL000217.1 dna:supercontig supercontig::GL000217.1:1:172149:1 chrUn_gl000217
1d708b54644c26c7e01c2dad5426d38c GL000218.1 dna:supercontig supercontig::GL000218.1:1:161147:1 chrUn_gl000218
f977edd13bac459cb2ed4a5457dba1b3 GL000219.1 dna:supercontig supercontig::GL000219.1:1:179198:1 chrUn_gl000219
fc35de963c57bf7648429e6454f1c9db GL000220.1 dna:supercontig supercontig::GL000220.1:1:161802:1 chrUn_gl000220
3238fb74ea87ae857f9c7508d315babb GL000221.1 dna:supercontig supercontig::GL000221.1:1:155397:1 chrUn_gl000221
6fe9abac455169f50470f5a6b01d0f59 GL000222.1 dna:supercontig supercontig::GL000222.1:1:186861:1 chrUn_gl000222
399dfa03bf32022ab52a846f7ca35b30 GL000223.1 dna:supercontig supercontig::GL000223.1:1:180455:1 chrUn_gl000223
d5b2fc04f6b41b212a4198a07f450e20 GL000224.1 dna:supercontig supercontig::GL000224.1:1:179693:1 chrUn_gl000224
63945c3e6962f28ffd469719a747e73c GL000225.1 dna:supercontig supercontig::GL000225.1:1:211173:1 chrUn_gl000225
1c1b2cd1fccbc0a99b6a447fa24d1504 GL000226.1 dna:supercontig supercontig::GL000226.1:1:15008:1 chrUn_gl000226
a4aead23f8053f2655e468bcc6ecdceb GL000227.1 dna:supercontig supercontig::GL000227.1:1:128374:1 chrUn_gl000227
c5a17c97e2c1a0b6a9cc5a6b064b714f GL000228.1 dna:supercontig supercontig::GL000228.1:1:129120:1 chrUn_gl000228
d0f40ec87de311d8e715b52e4c7062e1 GL000229.1 dna:supercontig supercontig::GL000229.1:1:19913:1 chrUn_gl000229
b4eb71ee878d3706246b7c1dbef69299 GL000230.1 dna:supercontig supercontig::GL000230.1:1:43691:1 chrUn_gl000230
ba8882ce3a1efa2080e5d29b956568a4 GL000231.1 dna:supercontig supercontig::GL000231.1:1:27386:1 chrUn_gl000231
3e06b6741061ad93a8587531307057d8 GL000232.1 dna:supercontig supercontig::GL000232.1:1:40652:1 chrUn_gl000232
7fed60298a8d62ff808b74b6ce820001 GL000233.1 dna:supercontig supercontig::GL000233.1:1:45941:1 chrUn_gl000233
93f998536b61a56fd0ff47322a911d4b GL000234.1 dna:supercontig supercontig::GL000234.1:1:40531:1 chrUn_gl000234
118a25ca210cfbcdfb6c2ebb249f9680 GL000235.1 dna:supercontig supercontig::GL000235.1:1:34474:1 chrUn_gl000235
fdcd739913efa1fdc64b6c0cd7016779 GL000236.1 dna:supercontig supercontig::GL000236.1:1:41934:1 chrUn_gl000236
e0c82e7751df73f4f6d0ed30cdc853c0 GL000237.1 dna:supercontig supercontig::GL000237.1:1:45867:1 chrUn_gl000237
131b1efc3270cc838686b54e7c34b17b GL000238.1 dna:supercontig supercontig::GL000238.1:1:39939:1 chrUn_gl000238
99795f15702caec4fa1c4e15f8a29c07 GL000239.1 dna:supercontig supercontig::GL000239.1:1:33824:1 chrUn_gl000239
445a86173da9f237d7bcf41c6cb8cc62 GL000240.1 dna:supercontig supercontig::GL000240.1:1:41933:1 chrUn_gl000240
ef4258cdc5a45c206cea8fc3e1d858cf GL000241.1 dna:supercontig supercontig::GL000241.1:1:42152:1 chrUn_gl000241
2f8694fc47576bc81b5fe9e7de0ba49e GL000242.1 dna:supercontig supercontig::GL000242.1:1:43523:1 chrUn_gl000242
cc34279a7e353136741c9fce79bc4396 GL000243.1 dna:supercontig supercontig::GL000243.1:1:43341:1 chrUn_gl000243
0996b4475f353ca98bacb756ac479140 GL000244.1 dna:supercontig supercontig::GL000244.1:1:39929:1 chrUn_gl000244
89bc61960f37d94abf0df2d481ada0ec GL000245.1 dna:supercontig supercontig::GL000245.1:1:36651:1 chrUn_gl000245
e4afcd31912af9d9c2546acf1cb23af2 GL000246.1 dna:supercontig supercontig::GL000246.1:1:38154:1 chrUn_gl000246
7de00226bb7df1c57276ca6baabafd15 GL000247.1 dna:supercontig supercontig::GL000247.1:1:36422:1 chrUn_gl000247
5a8e43bec9be36c7b49c84d585107776 GL000248.1 dna:supercontig supercontig::GL000248.1:1:39786:1 chrUn_gl000248
1d78abec37c15fe29a275eb08d5af236 GL000249.1 dna:supercontig supercontig::GL000249.1:1:38502:1 chrUn_gl000249
7e0e2e580297b7764e31dbc80c2540dd X dna:chromosome chromosome:GRCh37:X:1:155270560:1 chrX


And here are the unpairable data:
d89517b400226d3b56e753972a7cad67 chr17_ctg5_hap1 1680828
641e4338fa8d52a5b781bd2a2c08d3c3 chr3 198022430
fa24f81b680df26bcfb6d69b784fbe36 chr4_ctg9_hap1 590426
fe71bc63420d666884f37a3ad79f3317 chr6_apd_hap1 4622290
18c17e1641ef04873b15f40f6c8659a4 chr6_cox_hap2 4795371
2a3c677c426a10e137883ae1ffb8da3f chr6_dbb_hap3 4610396
9d51d4152174461cd6715c7ddc588dc8 chr6_mann_hap4 4683263
efed415dd8742349cb7aaca054675b9a chr6_mcf_hap5 4833398
094d037050cad692b57ea12c4fef790f chr6_qbl_hap6 4611984
3b6d666200e72bcc036bf88a4d7e0749 chr6_ssto_hap7 4928567
d2ed829b8a1628d16cbeee88e88e39eb chrM 16571
1e86411d73e6f00a10590f976be01623 chrY 59373566
fdfd811849cc2fadebc929bb925902e5 3 dna:chromosome chromosome:GRCh37:3:1:198022430:1 198022430
c68f52674c9fb33aef52dcf399755519 MT gi|251831106|ref|NC_012920.1| Homo sapiens mitochondrion, complete genome 16569
1fa3474750af0948bdf97d5a0ee52e51 Y dna:chromosome chromosome:GRCh37:Y:2649521:59034049:1 59373566


I knew about the problem for chrY (http://www.biostars.org/p/58143/) but not for chr3. What is the problem with this chromosome?

Edit: Here are the base counts for UCSC/chr3:
{T=58760485, G=38670110, A=58713343, C=38653197, N=3225295}
and for g1kv37:
{T=58760485, G=38670110, A=58713343, R=2, C=38653197, M=1, N=3225292}
So g1k/v37 contains three IUPAC ambiguity bases (two 'R' and one 'M') where hg19 has 'N': that explains the differing MD5 sums.
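For the record, a sketch of how such per-base counts can be computed (my own code, not necessarily how the numbers above were produced): skip the FASTA header lines and tally each upper-cased residue.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.TreeMap;

public class CountBases {
    // Tally each residue of a FASTA stream, ignoring header lines.
    static TreeMap<Character, Long> count(Reader in) throws Exception {
        TreeMap<Character, Long> counts = new TreeMap<Character, Long>();
        BufferedReader r = new BufferedReader(in);
        String line;
        while ((line = r.readLine()) != null) {
            if (line.startsWith(">")) continue; // skip the sequence name
            for (int i = 0; i < line.length(); i++) {
                char c = Character.toUpperCase(line.charAt(i));
                Long n = counts.get(c);
                counts.put(c, n == null ? 1L : n + 1L);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count(new StringReader(">x\nAACGNN\n"))); // {A=2, C=1, G=1, N=2}
    }
}
```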

That's it,



Pierre.

18 July 2013

Running a picard tool in the #KNIME workflow engine

http://www.knime.org/ is "a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting". In this post, I'll show how to invoke an external java program, and more precisely a tool from the picard library, from within KNIME. The workflow: load a list of BAM filenames, invoke SortSam and display the names of the sorted files.

Construct the following workflow:



Edit the FileReader node and load a list of paths to the BAMs


Edit the properties of the java snippet node: in the "Additional Libraries" tab, add picard's SortSam.jar



Edit the java snippet node, create a new column SORTED_BAM for the output.



and copy the following code:

// Your custom imports:
import net.sf.picard.sam.SortSam;
import java.io.File;
----------------------------------------------------------
// Enter your code here:


File input = new File(c_BAM);

// build the output filename: "x.bam" becomes "x_sorted.bam"
out_SORTED = input.getName();
if(!(out_SORTED.endsWith(".sam") || out_SORTED.endsWith(".bam")))
    {
    throw new Abort("not a SAM/BAM: " + c_BAM);
    }
int dot = out_SORTED.lastIndexOf('.');
out_SORTED = new File(input.getParentFile(), out_SORTED.substring(0, dot) + "_sorted.bam").getPath();

// create a new instance of SortSam
SortSam cmd = new SortSam();

// invoke the instance
int ret = cmd.instanceMain(new String[]{
    "I=" + input.getPath(),
    "O=" + out_SORTED,
    "SO=coordinate",
    "VALIDATION_STRINGENCY=LENIENT",
    "CREATE_INDEX=true",
    "MAX_RECORDS_IN_RAM=500000"
    });

if(ret != 0)
    {
    throw new Abort("SortSam failed with: " + c_BAM + " " + out_SORTED);
    }
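The filename-derivation part of the snippet can be isolated and checked outside KNIME (the class name SortedName is mine, for illustration only):

```java
import java.io.File;

public class SortedName {
    // Reproduces the renaming logic of the snippet: "dir/x.bam" -> "dir/x_sorted.bam"
    static String sortedName(String bam) {
        File input = new File(bam);
        String name = input.getName();
        if (!(name.endsWith(".sam") || name.endsWith(".bam"))) {
            throw new IllegalArgumentException("not a SAM/BAM: " + bam);
        }
        int dot = name.lastIndexOf('.');
        return new File(input.getParentFile(),
                name.substring(0, dot) + "_sorted.bam").getPath();
    }

    public static void main(String[] args) {
        System.out.println(sortedName("/tmp/sample1.bam")); // /tmp/sample1_sorted.bam
    }
}
```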
Execute the workflow: picard runs the jobs and the java node returns the names of the sorted BAMs:



Edit:

The workflow was uploaded on MyExperiment at http://www.myexperiment.org/workflows/3654.html.


That's it,

Pierre