YOKOFAKUN, by Pierre Lindenbaum (http://www.blogger.com/profile/13765837643388003852)

Creating a custom GATK Walker (GATK 3.6) : my notebook (2017-01-15)
This is my notebook for creating a custom engine in GATK.
Description
I want to read a VCF file and get a table of category/count. Something like this:

HAVE_ID  TYPE   COUNT
YES      SNP    123
NO       SNP    3
NO       INDEL  13
Class Category

I create a class Category describing each row in the table. It's just a wrapper around a List of Strings:

static class Category
    implements Comparable<Category>
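The excerpt above is truncated. As a self-contained sketch (field and method names are my own guesses, not the post's exact code), such a class can wrap an immutable List<String> and compare it element by element:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of the Category class: an immutable list of strings
// (e.g. ["YES","SNP"]) used as a key when counting variants.
// Names are my guesses; the original post's code is truncated above.
public class Category implements Comparable<Category> {
    private final List<String> labels;

    public Category(List<String> labels) {
        this.labels = Collections.unmodifiableList(labels);
    }

    @Override
    public int compareTo(Category other) {
        // compare element by element; on a common prefix, the shorter list sorts first
        int n = Math.min(this.labels.size(), other.labels.size());
        for (int i = 0; i < n; i++) {
            int d = this.labels.get(i).compareTo(other.labels.get(i));
            if (d != 0) return d;
        }
        return Integer.compare(this.labels.size(), other.labels.size());
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Category && this.labels.equals(((Category) o).labels);
    }

    @Override
    public int hashCode() { return labels.hashCode(); }

    @Override
    public String toString() { return String.join("\t", labels); }

    public static void main(String[] args) {
        Category a = new Category(Arrays.asList("YES", "SNP"));
        Category b = new Category(Arrays.asList("NO", "SNP"));
        System.out.println(a.compareTo(b) > 0); // "YES" > "NO" lexicographically
    }
}
```

Using equals/hashCode on the wrapped list lets a Map<Category,Long> accumulate the counts for the table above.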
Hello WDL ( Workflow Description Language ) (2016-10-27)

This is a quick note about my first WDL workflow (Workflow Description Language): https://software.broadinstitute.org/wdl/.
Expressed as a Makefile, my workflow would be the following:
NAME?=world
$(NAME)_sed.txt : $(NAME).txt
sed 's/Hello/Goodbye/' $< > $@
$(NAME).txt:
echo "Hello $(NAME)" > $@
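For comparison, here is a sketch of how the same two-step workflow might look in WDL (written from my reading of the draft-2 WDL spec and not validated with a runner; task and variable names are mine):

```wdl
# Sketch only: written from the WDL spec, not validated with Cromwell.
# Task and variable names are mine.
task hello {
  String name
  command {
    echo "Hello ${name}" > ${name}.txt
  }
  output {
    File out = "${name}.txt"
  }
}

task goodbye {
  File infile
  String name
  command {
    sed 's/Hello/Goodbye/' ${infile} > ${name}_sed.txt
  }
  output {
    File out = "${name}_sed.txt"
  }
}

workflow hello_wdl {
  String name = "world"
  call hello { input: name = name }
  call goodbye { input: infile = hello.out, name = name }
}
```

Note how the dependency between the two steps is expressed by wiring hello.out into goodbye, where the Makefile expresses it through file prerequisites.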
Executed as:

$ make NAME=WORLD
echo "Hello WORLD" > WORLD.txt
sed 's/Hello/Goodbye/' WORLD.txt > WORLD_sed.txt

Writing a Custom ReadFilter for the GATK, my notebook (2016-09-22)
The GATK contains a set of predefined read filters that "filter or transfer incoming SAM/BAM data files":

BadCigar
BadMate
CountingRead
DuplicateRead
FailsVendorQualityCheck
LibraryRead
MalformedRead
MappingQuality
MappingQualityUnavailable
(...)
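Each of these filters is essentially a predicate over a single read. Below is a minimal, GATK-free sketch of that idea (the Read class and threshold are mine; in the real GATK API the filter extends ReadFilter and decides on an htsjdk SAMRecord instead):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// GATK-free sketch: a "read" reduced to the two fields our filter needs.
// In the real GATK, the filter would receive a SAMRecord instead.
class Read {
    final String name;
    final int mappingQuality;
    Read(String name, int mappingQuality) {
        this.name = name;
        this.mappingQuality = mappingQuality;
    }
}

public class MyReadFilter {
    // GATK convention: return true when the read must be DISCARDED
    public boolean filterOut(Read read) {
        return read.mappingQuality < 30;
    }

    public static void main(String[] args) {
        MyReadFilter filter = new MyReadFilter();
        List<Read> reads = Arrays.asList(
            new Read("r1", 60), new Read("r2", 0), new Read("r3", 37));
        List<String> kept = reads.stream()
            .filter(r -> !filter.filterOut(r))
            .map(r -> r.name)
            .collect(Collectors.toList());
        System.out.println(kept); // [r1, r3]
    }
}
```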
With the help of the modular architecture of the GATK, it's possible to write a custom ReadFilter. In this post I'll write a ReadFilter that removes the (...)

Playing with #magicblast, the #NCBI Short read mapper. My notebook (2016-09-09)

NCBI MAGIC Blast was recently mentioned by BioMickWatson on Twitter:
"Looks pretty cool. Perhaps once again the answer to all bfx questions will be BLAST RE https://t.co/4D5e9QQnrb pic.twitter.com/bwW3y0yl2n" (Mick Watson, @BioMickWatson, September 9, 2016)
Here, I'll be playing with magicblast and I'll compare its output with bwa (Makefile below).
First, here is an extract of the manual for (...)

pubmed: extracting the 1st authors' gender and location who published in the Bioinformatics journal (2016-05-27)

In this post I'll get some statistics about the 1st authors in the "Bioinformatics" journal from pubmed. I'll extract their genders and locations.
I'll use some tools I already described some years ago, but I've re-written them.
Downloading the data

To download the papers published in Bioinformatics, the pubmed/entrez query is '"Bioinformatics"[jour]'.
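Behind the scenes, that query just needs percent-encoding before being passed to NCBI's esearch endpoint. A quick plain-Java check (the eutils base URL is the standard one; the pubmeddump tool used below handles the actual paging and fetching):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PubmedQuery {
    // build the esearch URL for an entrez query term
    static String esearchUrl(String query) {
        try {
            // percent-encode the query: quotes become %22, brackets %5B / %5D
            String encoded = URLEncoder.encode(query, "UTF-8");
            return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
                 + "?db=pubmed&term=" + encoded;
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(esearchUrl("\"Bioinformatics\"[jour]"));
    }
}
```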
I use pubmeddump to download all those (...)

Playing with the @ORCID_Org / @ncbi_pubmed graph. My notebook (2016-05-21)

"ORCID provides a persistent digital identifier that distinguishes you from every other researcher and, through integration in key research workflows such as manuscript and grant submission, supports automated linkages between you and your professional activities ensuring that your work is recognized."

I've recently discovered that pubmed now integrates ORCID identifiers.
"and so it begins ! :-D" (...)

finding new intron-exon junctions using the public Encode RNASeq data (2016-05-17)

I've been asked to look for some new / suspected / previously uncharacterized intron-exon junctions in public RNASeq data.
I've used the BAMs under http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/.
The following command is used to build the list of BAMs:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/" |\
tr ' <>"' " (...)

Now in picard: two javascript-based tools filtering BAM and VCF files (2016-03-04)
SamJS and VCFFilterJS are two tools I wrote for jvarkit. Both tools use the javascript engine embedded in java to filter BAM or VCF files.
To reach a broader audience, I've copied those functionalities to Picard in 'FilterSamReads' and 'FilterVcf'.
FilterSamReads

FilterSamReads filters a SAM or BAM file with a javascript expression, using the java javascript engine.
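The kind of expression such a tool evaluates is plain javascript over the current record. A sketch (I mock the record here as a plain object with field names of my own choosing; the real tool injects the actual SAM record, whose getters follow the htsjdk API):

```javascript
// Sketch: a filtering expression as a function over a mocked record.
// In the real tool the engine injects the current SAM record instead;
// the field names below are mine, for illustration only.
function accept(record) {
    // keep reads that are mapped with a decent mapping quality
    return !record.readUnmappedFlag && record.mappingQuality >= 30;
}

// mocked records
var reads = [
    { name: "r1", readUnmappedFlag: false, mappingQuality: 60 },
    { name: "r2", readUnmappedFlag: true,  mappingQuality: 0  },
    { name: "r3", readUnmappedFlag: false, mappingQuality: 10 }
];
var kept = reads.filter(accept).map(function (r) { return r.name; });
console.log(kept.join(",")); // r1
```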
The script puts the following (...)

Reading a VCF file faster with java 8, htsjdk and java.util.stream.Stream (2016-03-04)

Java 8 streams "support functional-style operations on streams of elements, such as map-reduce transformations on collections". In this post, I will show how I've implemented a java.util.stream.Stream of VCF variants that counts the number of items in dbsnp. This example uses the java htsjdk API for reading variants. When using parallel streams, the main idea is to implement a (...)

Registering a tool in the @ELIXIREurope registry using XML, XSLT, JSON and curl. My notebook (2016-02-24)

The Elixir Registry (pmid:26538599): "A portal to bioinformatics resources world-wide. With community support, the registry can become a standard for dissemination of information about bioinformatics resources: we welcome everyone to join us in this common endeavour. The registry is freely available at https://bio.tools." In this post, I will describe how I've used the bio.tools API to register (...)

Happy birthday my blog. You are now ten-year-old. (2015-12-05)

Happy birthday my blog. You are now 10-year-old.
GATK-UI : a java-swing interface for the Genome Analysis Toolkit (2015-12-03)

I've just pushed GATK-UI, a java swing interface for the Genome Analysis Toolkit (GATK), at https://github.com/lindenb/gatk-ui. This tool is also available as a WebStart/JNLP application.
Screenshot (image not shown)

Why did you create this tool? Some non-bioinformatician collaborators often want some coverage data for a defined set of BAMs, for a specific region...

Did you test every tool? NO.

How did you create an (...)

Playing with #Docker , my notebook (2015-07-13)

This post is my notebook about docker, written after we had a very nice introduction to docker by François Moreews (INRIA/IRISA, Rennes). I've used docker today for the first time; my aim was just to create an image containing https://github.com/lindenb/verticalize, a small tool I wrote to verticalize text files.
Install docker

You hate running this kind of command line, don't you?

$ wget -qO- (...)

A BLAST to SAM converter (2015-06-29)

Some time ago, I received a set of Ion-Torrent mate-reads with a poor quality. I wasn't able to align much using bwa. I've always wondered if I could get better alignments using NCBI-BLASTN (short answer: no). That's why I asked guyduche, my internship student, to write a C program to convert the output of blastn to SAM. His code is available on github at https://github.com/guyduche/ (...)

Playing with the #GA4GH schemas and #Avro : my notebook (2015-06-18)

After watching David Haussler's talk "Beacon Project and Data Sharing APIs", I wanted to play with Avro and the models and APIs defined by the Global Alliance for Genomics and Health (ga4gh) coalition. Here is my notebook.
(Wikipedia) Avro: "Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and (...)"

Monitoring a java application with mbeans. An example with samtools/htsjdk (2015-05-07)

"A MBean is a Java object that follows the JMX specification. A MBean can represent a device, an application, or any resource that needs to be managed. The JConsole graphical user interface is a monitoring tool that complies to the JMX specification." In this post I'll show how I've modified the sources of the htsjdk library to monitor a java program reading a VCF file from the Exac server.

Playing with hadoop/mapreduce and htsjdk/VCF : my notebook (2015-05-05)

The aim of this test is to get a count of each type of variant/genotype in a VCF file using Apache Hadoop and the java library for NGS htsjdk. My source code is available at: https://github.com/lindenb/hadoop-sandbox/blob/master/src/main/java/com/github/lindenb/hadoop/Test.java.
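Stripped of Hadoop, the map/reduce logic is just "extract a variant type, emit 1, sum per key". A plain-Java sketch of that counting step (the linked Test.java does this with real Hadoop mappers/reducers and htsjdk's VCF codec; my type detection below is deliberately naive):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CountVariantTypes {
    // naive classification: same-length REF/ALT -> SNP, otherwise INDEL
    static String variantType(String vcfLine) {
        String[] tokens = vcfLine.split("\t");
        String ref = tokens[3], alt = tokens[4];
        return (ref.length() == alt.length()) ? "SNP" : "INDEL";
    }

    // the "reduce" step: sum a 1 for each variant, keyed by type
    static Map<String, Integer> count(List<String> vcfLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : vcfLines) {
            if (line.startsWith("#")) continue; // skip VCF headers
            counts.merge(variantType(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "#CHROM\tPOS\tID\tREF\tALT",
            "1\t100\trs1\tA\tG",
            "1\t200\t.\tA\tAT",
            "1\t300\t.\tC\tT");
        System.out.println(count(lines)); // {INDEL=1, SNP=2}
    }
}
```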
First, and this was my main problem, I needed to create a class 'VcfRow' that would contain the whole data about a (...)

Integrating a java program in #usegalaxy (2015-02-28)

This is my notebook for the integration of java programs in https://usegalaxy.org/.
create a directory for your tools under ${galaxy-root}/tools
mkdir ${galaxy-root}/tools/jvarkit
put all the required jar files and the XML files describing your tools (see below) in this new directory:

$ ls ${galaxy-root}/tools/jvarkit/
commons-jexl-2.1.1.jar
groupbygene.jar
htsjdk-1.128.jar
vcffilterjs.jar
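The XML descriptor for one of those tools would look roughly like this: a hand-written sketch following the Galaxy tool-config layout, where the id, command line and parameter names are mine, not the actual jvarkit descriptor:

```xml
<!-- Sketch of a Galaxy tool descriptor; ids and parameters are mine -->
<tool id="vcffilterjs" name="VCF filter (javascript)" version="1.0">
  <description>Filter a VCF with a javascript expression</description>
  <command>java -jar $__tool_directory__/vcffilterjs.jar -e '$expression' $input &gt; $output</command>
  <inputs>
    <param name="input" type="data" format="vcf" label="VCF input"/>
    <param name="expression" type="text" label="javascript expression"/>
  </inputs>
  <outputs>
    <data name="output" format="vcf"/>
  </outputs>
</tool>
```

Galaxy reads this descriptor to build the form for the tool and to assemble the final command line from the user's parameters.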
Drawing a Manhattan plot in SVG using a GWAS+XML model (2015-02-22)

On Friday, I saw my colleague @b_l_k start writing SVG+XML code to draw a Manhattan plot. I told him that a better idea would be to describe the data using XML and to transform the XML into SVG using XSLT.
So, let's do this. I put the XSLT stylesheet on github at https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/manhattan.xsl . And the model of data would look like this (I (...)

Automatic code generation for @knime with XSLT: An example with two nodes: fasta reader and writer (2015-02-18)
KNIME is a java+eclipse-based graphical workflow-manager.
Biologists in my lab often use this tool to filter VCFs or other tabular data. A Software Development Kit (SDK) is provided to build new nodes. My main problem with this SDK is that you need to write a large number of similar files and you also have to interact with a graphical interface. I wanted to automate the generation of java (...)

Listing the 'Subject' Sequences in a BLAST database using the NCBI C++ toolbox. My notebook (2015-02-02)

In my previous post (http://plindenbaum.blogspot.com/2015/01/filtering-fasta-sequences-using-ncbi-c.html) I built an application filtering FASTA sequences using the NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). Here, I'm going to write a tool listing the 'subject' sequences in a BLAST database. This new application ListBlastDatabaseContent takes only one argument '-db', the (...)

Filtering Fasta Sequences using the #NCBI C++ API. My notebook (2015-01-30)

In my previous post (http://plindenbaum.blogspot.com/2015/01/compiling-c-hello-world-program-using.html) I built a simple "Hello World" application using the NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/).
Here, I'm going to extend the code in order to create a program filtering FASTA sequences on their sizes. This new application FastaFilterSize needs three new (...)

Compiling a C++ 'Hello world' program using the #NCBI C++ toolbox: my notebook (2015-01-29)

This post is my notebook for compiling a simple C++ application using the NCBI C++ toolbox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/). This application prints 'Hello world' and takes two arguments:

'-o' to specify the output filename (default is standard output)
'-n' to set the name to be printed (default: "Word !")
The code I used is the one contained in the distribution of (...)

Divide-and-conquer in a #Makefile : recursivity and #parallelism (2014-12-05)

This post is my notebook about implementing a divide-and-conquer strategy in GNU make. Say you have a list of 'N' VCF files. You want to create a list of:

common SNPs in vcf1 and vcf2
common SNPs in vcf3 and the previous list
common SNPs in vcf4 and the previous list
(...)
common SNPs in vcfN and the previous list
Yes, I know I can do this using:

grep -v '^#' f.vcf | cut -f 1,2,4,5 | sort | uniq
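The folding structure described above can be sketched as a Makefile (file names and the intersection recipe are mine and simplified to a fixed N=4; recipes must be indented with tabs):

```make
# Sketch: fold an intersection over vcf1.vcf .. vcf4.vcf, one file at a time.
# File names and recipes are mine (simplified); parallelism comes for free
# with `make -j` once independent targets exist.
SHELL := /bin/bash

# normalize a VCF to its sorted chrom/pos/ref/alt lines
%.sites : %.vcf
	grep -v '^#' $< | cut -f 1,2,4,5 | sort | uniq > $@

# fold the intersection left to right
common.2.txt : vcf1.sites vcf2.sites
	comm -12 $^ > $@
common.3.txt : common.2.txt vcf3.sites
	comm -12 $^ > $@
common.4.txt : common.3.txt vcf4.sites
	comm -12 $^ > $@
```

Each common.K.txt depends on the previous intersection and the next normalized VCF, which is exactly the chain of lists described above.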
XML+XSLT = #Makefile -based #workflows for #bioinformatics (2014-12-04)

I've recently read some conversations on Twitter about Makefile-based bioinformatics workflows. I've suggested on biostars.org (Standard simple format to describe a bioinformatics analysis pipeline) that an XML file could be used to describe a model of data and XSLT could transform this model into a Makefile-based workflow. I've already explored this idea in a previous post (Generating a pipeline of (...)