
04 March 2016

Reading a VCF file faster with java 8, htsjdk and java.util.stream.Stream

Java 8 streams "support functional-style operations on streams of elements, such as map-reduce transformations on collections". In this post, I will show how I've implemented a java.util.stream.Stream of VCF variants that counts the number of items in dbSNP.

This example uses the java htsjdk API for reading variants.

When using parallel streams, the main idea is to implement a java.util.Spliterator that splits the sequence dictionary (the genome) into at most N (here N=5) parts. Each part counts the number of variants in its 1/N of the genome in its own thread. As we're using a tribble-indexed VCF, it's easy to start counting at a given position of the genome.
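Before diving into the htsjdk code, the mechanism can be illustrated with a self-contained sketch using only the JDK: a toy Spliterator over a numeric range (a hypothetical class, not part of the post's source) whose trySplit() hands off the first half of the remaining range, consumed by a parallel stream:

```java
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

// Toy example: a Spliterator over the range [begin,end) whose trySplit()
// hands off the first half, the way VariantContextSpliterator hands off
// part of the genome.
class RangeSpliterator implements Spliterator<Long> {
    private long begin;
    private final long end;

    RangeSpliterator(long begin, long end) { this.begin = begin; this.end = end; }

    @Override
    public boolean tryAdvance(Consumer<? super Long> action) {
        if (begin >= end) return false; // no remaining element
        action.accept(begin++);
        return true;
    }

    @Override
    public Spliterator<Long> trySplit() {
        long remaining = end - begin;
        if (remaining < 2) return null; // too small to split
        long mid = begin + remaining / 2;
        Spliterator<Long> prefix = new RangeSpliterator(begin, mid);
        this.begin = mid; // this spliterator keeps the second half
        return prefix;
    }

    @Override public long estimateSize() { return end - begin; }
    @Override public int characteristics() { return ORDERED | SIZED | SUBSIZED | NONNULL | IMMUTABLE; }
}

public class SplitDemo {
    public static void main(String[] args) {
        // parallel count of even numbers in [0, 1_000_000)
        long n = StreamSupport.stream(new RangeSpliterator(0, 1_000_000), true)
                .filter(x -> x % 2 == 0)
                .count();
        System.out.println(n); // 500000
    }
}
```

The same division of labor happens below, except that "split the range in half" becomes "split the genome in half" and "advance by one number" becomes "read the next variant from the indexed VCF".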

ContigPos

The class ContigPos defines a chromosome and a position in the whole genome.

class ContigPos {
    /** contig/chromosome index in the dictionary */
    final int tid; 
    /** contig/chromosome name */
    final String contig;
    /** position in the chromosome */
    final int pos;
    (...)
    }

It contains a function that converts its position to an index in the whole genome using the genome dictionary (htsjdk.samtools.SAMSequenceDictionary).

long genomicIndex() {
    long n=0L;
    for(int i=0;i< this.tid;++i) {
        n += dict.getSequence(i).getSequenceLength();
        }
    return n + this.pos;
    }
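For example, with a toy stand-in for the dictionary (made-up chromosome lengths, not a real genome), the index of position 42 on the third chromosome is the sum of the two previous chromosome lengths plus 42:

```java
// Toy stand-in for SAMSequenceDictionary: chromosome lengths in dictionary
// order (illustrative numbers, not a real genome).
public class GenomicIndexDemo {
    public static void main(String[] args) {
        final long[] lengths = {1000L, 2000L, 500L};
        final int tid = 2;  // index of the chromosome in the dictionary
        final int pos = 42; // position on that chromosome
        long genomicIndex = 0L;
        for (int i = 0; i < tid; i++) genomicIndex += lengths[i]; // all previous chromosomes
        genomicIndex += pos;
        System.out.println(genomicIndex); // 1000 + 2000 + 42 = 3042
    }
}
```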

VariantContextSpliterator

VariantContextSpliterator is the main class. It splits the VCF file into parts and implements Spliterator<VariantContext> .

public class VariantContextSpliterator
    implements Closeable,Spliterator<VariantContext> {
(...)

VariantContextSpliterator contains the sequence dictionary and the path to the indexed VCF file:

/** the indexed VCF file */
private final File vcfFile;
/** genome dictionary */
private final SAMSequenceDictionary dict ;

Each VariantContextSpliterator has its own private VCFFileReader and CloseableIterator. Both should be closed when there is no more variant to be read.

/** current VCF File reader */
private VCFFileReader vcfFileReader = null;
/** current variant iterator */
private CloseableIterator<VariantContext> variantIter = null;

Each VariantContextSpliterator has a dedicated genomic region.

/** region start */
private ContigPos begin;
/** region end */
private ContigPos end ;

The very first VariantContextSpliterator will scan:

  • from begin = new ContigPos("chr1",0)
  • to end = new ContigPos("chrY",(size_of_chrY))

We don't want to open too many threads, so we track the number of opened iterators in an AtomicInteger:

AtomicInteger nsplits

VariantContextSpliterator.peek()

VariantContextSpliterator.peek() peeks at the next variant in the genomic interval.

If the VCFFileReader was never opened, we open it and increment the number of opened files.

/** VCF reader was never opened */
if( this.vcfFileReader == null ) {
    /** open the VCF reader */
    this.vcfFileReader = new VCFFileReader(this.vcfFile, true);
    /** open a first iterator on the first chromosome */
    this.variantIter = this.vcfFileReader.query(
            this.begin.contig,
            this.begin.pos,
            this.dict.getSequence(this.begin.tid).getSequenceLength() /* whole chromosome size */
            );
    /** increase the number of opened streams */
    this.nsplits.incrementAndGet();
    } 

While there is no more variant available on the current chromosome, open the next chromosome for reading:

while(!this.variantIter.hasNext()) {
    this.variantIter.close();
    this.variantIter = null;
    if(this.begin.tid == this.end.tid) /* this is the last chromosome */
        {
        close();
        return null;
        }
    else
        {
        this.begin = new ContigPos(this.begin.tid+1, 0);
        this.variantIter = this.vcfFileReader.query(
            this.begin.contig,
            this.begin.pos,
            this.dict.getSequence(this.begin.tid).getSequenceLength() /* whole chromosome size */
            );
        }
    }

Get the next variant and update 'begin' with its position. We close the VCFFileReader if we have reached the end of the genomic window.

/* get the next variant */
final VariantContext ctx = this.variantIter.next();
/* update 'begin' */
this.begin= new ContigPos(ctx.getContig(), ctx.getStart());

/** close if the end of the genomic location was reached */
if((this.begin.tid > this.end.tid) ||
   (this.begin.tid == this.end.tid && this.begin.pos >= this.end.pos) ) {
    close();
    return null;
    }
this._peeked = ctx;
return this._peeked;

VariantContextSpliterator.tryAdvance()

If a remaining variant exists, perform the given action on it and return true; otherwise return false.

@Override
public boolean tryAdvance(Consumer<? super VariantContext> action) {
    final VariantContext ctx = this.next();
    if(ctx==null) {
        close();
        return false;
        }
    action.accept(ctx);
    return true;
    }

VariantContextSpliterator.trySplit()

trySplit returns a VariantContextSpliterator covering elements that, upon return from this method, will no longer be covered by this VariantContextSpliterator. We can split if the remaining window size is greater than 1Mb and if the number of opened VCFFileReaders remains small (the code below allows at most 5).

public Spliterator<VariantContext> trySplit() {
    final VariantContext ctx = this.peek();
    /** no more variant to read, can't split */
    if(ctx==null) return null;
    /** too many opened VCFFile reader, can't split */
    if( this.nsplits.get()>5) return null;

    long start = this.begin.genomicIndex();
    long distance = this.end.genomicIndex() - start;

    /** distance between begin and end is greater than 1Mb */
    while(distance > 1E6 )
        {
        distance = distance/2;
        /** middle genomic index */
        final ContigPos mid = new ContigPos(start + distance);
        
        /** create a new VariantContextSpliterator starting from mid and ending at this.end */
        final VariantContextSpliterator next = new VariantContextSpliterator(this,mid,this.end);
        if(next.peek()==null) {
            next.close();
            continue;
            }
        /* update this.end to 'mid' */
        this.end= mid;
        //System.err.println("split successful:after split "+toString()+" and next="+next);
        return next;
        }

    return null;
    }

Testing

To get a stream, the static function java.util.stream.StreamSupport.stream is called.

From the javadoc: "Creates a new sequential or parallel Stream from a Spliterator. The spliterator is only traversed, split, or queried for estimated size after the terminal operation of the stream pipeline commences."

private Stream<VariantContext> stream(boolean parallel) {
    return StreamSupport.stream(new VariantContextSpliterator(this.vcfFile), parallel);
    }

We count the number of variants in dbSNP. We print the duration for stream(), parallelStream() and a standard iterator.

final File vcfFile = new File(args[0]);
StreameableVcfFile test = new StreameableVcfFile(vcfFile);

long start1 = System.currentTimeMillis();
System.out.println("count: " + test.parallelStream().count());
long end1 = System.currentTimeMillis();
System.out.println("parallelstream: " + ((end1 - start1) / 1000));

long start2 = System.currentTimeMillis();
System.out.println("count: " + test.stream().count());
long end2 = System.currentTimeMillis();
System.out.println("stream: " + ((end2 - start2) / 1000));

long start3 = System.currentTimeMillis();
CloseableIterator<VariantContext> r = new VCFFileReader(vcfFile).iterator();
int n = 0;
while (r.hasNext()) { r.next(); ++n; }
r.close();
long end3 = System.currentTimeMillis();
System.out.println("count: " + n);
System.out.println("iter: " + ((end3 - start3) / 1000));

Output:

count: 61045456 snps
parallelstream: 80 seconds

count: 61045456 snps
stream : 365 seconds

count: 61045456 snps
iter : 355 seconds

That's it,

Pierre

Source code

26 May 2013

Filtering a VCF with javascript

This is my answer to that question on Biostar. I wrote a Java program that filters a VCF with the Rhino javascript engine.
I put the code on github: see https://github.com/lindenb/jvarkit#-filtering-vcf-with-javascript-rhino-.

For each variation, the script binds the following variables:

  • variant : the current variation; a org.broadinstitute.variant.variantcontext.VariantContext ( http://sourceforge.net/p/picard/code/HEAD/tree/trunk/src/java/org/broadinstitute/variant/variantcontext/VariantContext.java )
  • header : the VCF header org.broadinstitute.variant.vcf.VCFHeader ( http://sourceforge.net/p/picard/code/HEAD/tree/trunk/src/java/org/broadinstitute/variant/vcf/VCFHeader.java).
and evaluates the user script. The user script should return '1' or true if the current variant should be printed.

For example, say you want to keep the variants where at least two samples have a depth (DP) greater than 200.
The script would be:
function myfilterFunction()
    {
    var samples=header.genotypeSamples;
    var countOkDp=0;
    for(var i=0; i< samples.size();++i)
        {
        var sampleName=samples.get(i);
        if(! variant.hasGenotype(sampleName)) continue;
        var genotype = variant.genotypes.get(sampleName);
        if( ! genotype.hasDP()) continue;
        var dp= genotype.getDP();
        if(dp > 200 ) countOkDp++;
        }
    return (countOkDp>=2);
    }
myfilterFunction();
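Stripped of the htsjdk/Rhino bindings, the counting logic of the filter boils down to this plain Java sketch (the depths are made up, standing in for genotype.getDP()):

```java
import java.util.Arrays;
import java.util.List;

public class DepthFilterDemo {
    public static void main(String[] args) {
        // Hypothetical per-sample depths, standing in for genotype.getDP():
        List<Integer> depths = Arrays.asList(31, 190, 233);
        // keep the variant when at least two samples have DP > 200
        long countOkDp = depths.stream().filter(dp -> dp > 200).count();
        boolean keep = countOkDp >= 2;
        System.out.println(keep); // false: only one sample (233) exceeds 200
    }
}
```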

Example:

curl -s "https://raw.github.com/jamescasbon/PyVCF/master/vcf/test/gatk.vcf" |\
java -jar  dist/vcffilterjs.jar  -f filter.js |\
grep -v "##"


#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BLANK NA12878 NA12891 NA12892 NA19238 NA19239 NA19240
chr22 42526449 . T A 151.47 . AC=1;AF=0.071;AN=14;BaseQRankSum=2.662;DP=1226;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=41.2083;MQ=240.47;MQ0=0;MQRankSum=0.578;QD=4.89;ReadPosRankSum=3.611 GT:AD:DP:GQ:PL 0/1:23,8:31:99:190,0,694 0/0:188,0:190:99:0,478,5376 0/0:187,0:187:99:0,493,5322 0/0:247,0:249:99:0,634,6728 0/0:185,0:185:99:0,487,5515 0/0:202,0:202:99:0,520,5857 0/0:181,1:182:99:0,440,5362
chr22 42526634 . T C 32.60 . AC=1;AF=0.071;AN=14;BaseQRankSum=1.147;DP=1225;DS;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=50.0151;MQ=240.65;MQ0=0;MQRankSum=1.151;QD=1.30;ReadPosRankSum=1.276 GT:AD:DP:GQ:PL 0/1:21,4:25:71:71,0,702 0/0:187,2:189:99:0,481,6080 0/0:233,0:233:99:0,667,7351 0/0:230,0:230:99:0,667,7394 0/0:174,1:175:99:0,446,5469 0/0:194,2:196:99:0,498,6239 0/0:174,0:175:99:0,511,5894
chr22 42527793 rs1080989 C T 3454.66 . AC=2;AF=0.167;AN=12;BaseQRankSum=-3.007;DB;DP=1074;DS;Dels=0.01;FS=0.000;HRun=1;HaplotypeScore=75.7865;MQ=209.00;MQ0=0;MQRankSum=3.014;QD=9.36;ReadPosRankSum=0.618 GT:AD:DP:GQ:PL ./. 0/1:72,90:162:99:1699,0,1767 0/1:103,96:202:99:1756,0,2532 0/0:188,0:188:99:0,526,5889 0/0:160,0:160:99:0,457,4983 0/0:197,0:198:99:0,544,6100 0/0:156,0:156:99:0,439,5041

That's it,

Pierre

27 September 2012

Playing with the #Ensembl REST API: filtering a VCF with javascript

The new Ensembl REST API was announced today: "We are pleased to announce the beta release of our programming language agnostic REST API, for Release 68 data, at http://beta.rest.ensembl. Our initial release provides access to:

  • Sequences (genomic, cDNA, CDS and protein)
  • VEP (Variant Effect Predictor)
  • Homologies
  • Gene Trees
  • Assembly and coordinate mapping."

In the current post I will filter a VCF file with this API and javascript.

The VCF

Our initial file is the following VCF:
##fileformat=VCFv4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10327 rs112750067 T C . PASS DP=65
1 69427 rs148502021 T A . PASS DP=1557
1 69452 rs142004627 A G . PASS DP=155
1 69475 rs148502021 C T . PASS DP=231
1 865583 rs148711625 A G . PASS DP=231
1 866460 rs148884928 A G . PASS DP=23
1 866461 . G A . PASS DP=24

The javascript

The VCF is read on standard input by the following script, run with the Rhino JS engine. The script reads the VCF and, for each substitution, calls the Ensembl REST API, parses the JSON response, and returns the transcript identifier if the mutation is a missense_variant or has polyphen_prediction="probably damaging":
importPackage(java.io);
importPackage(java.lang);



function sleep(milliseconds)
  {
  /* hacked from http://www.phpied.com/sleep-in-javascript/ */
  var start = new Date().getTime();
  for (var i = 0; i < 1e7; i++) {
    if ((new Date().getTime() - start) > milliseconds){
      break;
      }
    }
  }


function damagingTranscript(json)
 {
 for(var d in json.data)
  {
  var transcripts=json.data[d].transcripts;
 
  if(!transcripts) return null;
  for(var t in transcripts)
   {
   var transcript=transcripts[t];
   for(var a in transcript.alleles)
    {
    var allele=transcript.alleles[a];
    if(allele.polyphen_prediction=="probably damaging" ||
       allele.consequence_terms.indexOf("missense_variant")!=-1 )
        {
        return transcript.transcript_id;
        }
    }
   }
  }
 return null;
 }

var baseRegex=new RegExp(/^[ATGCatgc]$/);



var stdin = new java.io.BufferedReader( new java.io.InputStreamReader(java.lang.System['in']) );
var line;
while((line=stdin.readLine())!=null)
 {
 if(line.startsWith("#"))
  {
  print(line); continue;
  }
 var tokens=line.split("\t");
 var chrom=tokens[0];
 var pos= parseInt(tokens[1]);
 var ref= tokens[3];
 var alt= tokens[4];
 if(!baseRegex.test(ref)) continue;
 if(!baseRegex.test(alt)) continue;
 
 sleep(200);
 var url="http://beta.rest.ensembl.org/vep/human/"+
  chrom+":"+pos+"-"+pos+"/"+alt+
  "/consequences?content-type=application/json";
 var jsonStr=readUrl(url,"UTF-8");
 var json=eval("("+jsonStr+")");
 var transcript=damagingTranscript(json);
 if(transcript==null) continue;
 tokens[7]+=(";TRANSCRIPT="+transcript);
 print(tokens.join('\t'));
 }

As an example, here is the response from the ENSEMBL server for 1:866460 A/G:

http://beta.rest.ensembl.org/vep/human/1:866460-866460/G/consequences?content-type=application/json
{
    "data": [
        {
            "location": {
                "coord_system": "chromosome",
                "name": "1",
                "strand": 1,
                "end": "866460",
                "start": "866460"
            },
            "hgvs": {
                "G": "1:g.866460C>G"
            },
            "transcripts": [
                {
                    "translation_stable_id": "ENSP00000411579",
                    "intron_number": null,
                    "cdna_end": 385,
                    "translation_end": 99,
                    "exon_number": "4/7",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000420190",
                    "cdna_start": 385,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "translation_start": 99,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000411579.1:p.Ala99Gly",
                            "sift_score": 0.04,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000420190.1:c.296C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000393181",
                    "intron_number": null,
                    "cdna_end": 356,
                    "translation_end": 99,
                    "exon_number": "4/5",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000437963",
                    "cdna_start": 356,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "translation_start": 99,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000393181.1:p.Ala99Gly",
                            "sift_score": 0.03,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000437963.1:c.296C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000342313",
                    "intron_number": null,
                    "cdna_end": 379,
                    "translation_end": 99,
                    "exon_number": "4/14",
                    "is_canonical": 1,
                    "transcript_id": "ENST00000342066",
                    "cdna_start": 379,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000342313.3:p.Ala99Gly",
                            "sift_score": 0.01,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000342066.3:c.296C>G"
                        }
                    },
                    "translation_start": 99,
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "ccds": "CCDS2.2",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000349216",
                    "intron_number": null,
                    "cdna_end": 67,
                    "translation_end": 23,
                    "exon_number": "2/12",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000341065",
                    "cdna_start": 67,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 68,
                    "translation_start": 23,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.008,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000349216.4:p.Ala23Gly",
                            "sift_score": 0.01,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000341065.4:c.67C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 68,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                }
            ]
        }
    ]
}

Invoking RHINO, filtering the VCF

$ cat input.vcf | rhino -f restensembl.js
  
##fileformat=VCFv4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 69427 rs148502021 T A . PASS DP=1557;TRANSCRIPT=ENST00000335137
1 69452 rs142004627 A G . PASS DP=155;TRANSCRIPT=ENST00000335137
1 69475 rs148502021 C T . PASS DP=231;TRANSCRIPT=ENST00000335137
1 865583 rs148711625 A G . PASS DP=231;TRANSCRIPT=ENST00000420190
1 866460 rs148884928 A G . PASS DP=23;TRANSCRIPT=ENST00000420190

Limitations for the Ensembl REST API:




That's it,

Pierre

31 July 2011

Storing some SNPs using the leveldb library. My notebook.

In this post I'll describe how I've used the leveldb API, a C++ key-value database, to store some SNPs (key=rs,value=sequence). "LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.". A benchmark for this engine is available here: http://code.google.com/p/leveldb/source/browse/trunk/doc/benchmark.html.

Download & install

$ svn checkout http://leveldb.googlecode.com/svn/trunk/ leveldb-read-only
$ cd leveldb-read-only/
$ make

Open & close a leveldb database

#include "leveldb/db.h"
(...)
leveldb::DB* db = NULL;
leveldb::Options options;
options.comparator = &rs_comparator; /* custom comparator for ordering the keys, see below */
options.create_if_missing = true;
clog << "[LOG]opening database" << endl;
leveldb::Status status = leveldb::DB::Open(options, db_home, &db);
/* check status */
if (!status.ok())
    {
    cerr << "cannot open " << db_home << " : " << status.ToString() << endl;
    return EXIT_FAILURE;
    }
(...)
/* use the database */
(...)
delete db;

A custom comparator for ordering the keys

Here, the keys are the rs-ids ordered on their numerical value. Both keys and values have a type "leveldb::Slice".
class RsComparator : public leveldb::Comparator
    {
    private:
        /* parses the rsId: rs[0-9]+ */
        int rs(const leveldb::Slice& s) const
            {
            int n = 0;
            for (size_t i = 2; i < s.size(); ++i)
                {
                n = n*10 + (s[i] - '0');
                }
            return n;
            }
    public:
        virtual int Compare(const leveldb::Slice& a, const leveldb::Slice& b) const
            {
            return rs(a) - rs(b);
            }
        (...)
    };
(...)
RsComparator rs_comparator;
options.comparator = &rs_comparator;
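The same ordering idea, independent of leveldb, can be sketched with a JDK TreeMap and a Comparator (the keys are hypothetical and the sequences truncated): sort rs-ids by their numeric part rather than lexicographically, where "rs25" would otherwise sort after "rs1000".

```java
import java.util.Comparator;
import java.util.TreeMap;

public class RsOrderDemo {
    public static void main(String[] args) {
        // Order keys such as "rs25" or "rs8896" by their numeric part,
        // not lexicographically.
        Comparator<String> byRsNumber =
            Comparator.comparingLong(s -> Long.parseLong(s.substring(2)));
        TreeMap<String,String> db = new TreeMap<>(byRsNumber);
        db.put("rs8936", "ACTACG...");
        db.put("rs25",   "GGTGTT...");
        db.put("rs8896", "CATT...");
        System.out.println(db.keySet()); // [rs25, rs8896, rs8936]
    }
}
```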

Inserting a FASTA sequence


(...)
std::string name;
std::string sequence;
(...)
leveldb::Status status = db->Put(leveldb::WriteOptions(), name, sequence);
if (!status.ok())
    {
    cerr << "Cannot insert " << name << " "
         << status.ToString() << endl;
    return EXIT_FAILURE;
    }

Searching for a rs###

(...)
std::string seqname(argv[optind]);
std::string sequence;
(...)
leveldb::Status status = db->Get(leveldb::ReadOptions(), seqname, &sequence);
if (!status.ok())
    {
    cerr << "Cannot find " << seqname << " in " << db_home << endl;
    continue;
    }
printFasta(seqname, sequence);
(...)

Dumping all the SNPs using an iterator

(...)
leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next())
    {
    printFasta(it->key(), it->value());
    }
delete it;
(...)

Examples

Reading the SNPs:
$ curl -s "ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/rs_chMT.fas.gz" |\
./rsput -D snp.db

[LOG]opening database
[LOG] added rs8896
[LOG] added rs8936
[LOG] added rs9743
(...)
[LOG]closing database

$ du -h snp.db
336K snp.db

Dump the SNPs:

$ ./rsget  -D snp.db

[LOG]opening database
>rs8896
GGTGTTGGTTCTCTTAATCTTTAACTTAAAAGGTTAATGCTAAGTTAGCTTTACAGTGGG
CTCTAGAGGGGGTAGAGGGGGTGYTATAGGGTAAATACGGGCCCTATTTCAAAGATTTTT
AGGGGAATTAATTCTAGGACGATGGGCATGAAACTGTGGTTTGCTCCACAGATTTCAGAG
CATT
>rs8936
ACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCATTATTCCTAGAACCAGGCG
ACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCGATTGAAGCCCCCATTCGTA
TAATAATTACATCACAAGACGTCTTGCACTCATGAGCTGTCCCCACATTAGGCTTAAAAA
CAGATGCAATTCCCGGACGTHTAAACCAAACCACTTTCACCGCTACACGACCGGGGGTAT
ACTACGGTCAATGCTCTGAAATCTGTGGAGCAAACCACAGTTTCATGCCCATCGTCCTAG
AATTAATTCCCCTAAAAATCTTTGAAATAGGGCCCGTATTTACCCTATAGCACCCCCTCT
ACCCCCTCTAGAGCCCACTGTAAAGCTAACTTAGCATTAAC
>rs9743
CCATGTGATTTCACTTCCACTCCATAACGCTCCTCATACTAGGCCTACTAACCAACACAC
TAACCATATACCAATGATGNCGCGATGTAACACGAGAAAGCACATACCAAGGCCACCACA
CACCACCTGTCCAAAAAGGCCTTCGATACGGGATAATCCTATTTATTACCTCAGAANTTT
TTTTCTTCGCAGGATTTTTCTGAGCCTTTTACCACTCCAGCCTAGCCCCTACCCCCCAAN
(...)
[LOG]closing database

Search for some SNPs:
$ ./rsget -D snp.db rs78894381 rs72619361 rs25
[LOG]opening database
[LOG]searching for rs78894381
>rs78894381
CTACTAATCTCATCAACACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTA
ACCCCATACCCCGAACCAACCAAACCCCAAAGACACCCCCNCACAGTTTATGTAGCTTAC
CTCCTCAAAGCAATACACTGAAAATGTTTAGACGGGCTCACATCACCCCATAAACAAATA
GGTTTGGTCCTAGCCTTTCTA
[LOG]searching for rs72619361
>rs72619361
ATGCATTTGGTATTTTAATCTGGGGGGTGTGCACGCGATAGCATTGTGAAACGCTGGCCC
CAGAGCACCCTATGTCGCAGTGTCTGTCTTTGATTCCTGCCYCATCCCATTATTGATCAC
ACCTACATTCAATATCCCAGGCGAGCATACCTATCACAAGGTGTTAATTAATTAATGCTT
GTAGGACATAACAATCAGTAAAC
[LOG]searching for rs25
Cannot find rs25 in snp.db
[LOG]closing database

Source code






That's it,

Pierre

26 July 2011

A mysql full-text parser searching for some SNPs

A mysql full-text parser server plugin can be used to replace or modify the built-in full-text parser.
This post describes a full-text parser plugin named bioparser that extracts the rs#### ids from a text.

The source

The source of this plugin is available on github at:
https://github.com/lindenb/bioudf/blob/master/src/bioparser.c.

The workhorse of the plugin is a simple function, bioparser_parse, scanning the SQL query or the TEXT. Each time a word starting with "rs" and followed by one or more digits is found, it is added to the list of words to be indexed:
static int bioparser_parse(MYSQL_FTPARSER_PARAM *param)
    {
    char *curr = param->doc;
    const char *begin = param->doc;
    const char *end = begin + param->length;

    param->flags = MYSQL_FTFLAGS_NEED_COPY;
    while (curr + 2 < end)
        {
        if (tolower(*curr) == 'r' &&
            tolower(*(curr+1)) == 's' &&
            isdigit(*(curr+2)) &&
            (curr == begin || IS_DELIM(*(curr-1))))
            {
            char* p = curr + 2;
            while (p != end && isdigit(*p))
                {
                ++p;
                }
            if (p == end || IS_DELIM(*p))
                {
                my_add_word(param, curr, p - curr);
                }
            curr = p;
            }
        else
            {
            curr++;
            }
        }
    return 0;
    }
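For comparison, the same tokenization rule (a word starting with "rs" followed by one or more digits, delimited on both sides) can be sketched with a JDK regular expression; the sample text is made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RsScanDemo {
    public static void main(String[] args) {
        // "rs" followed by one or more digits, with a word boundary on each
        // side, like the manual scan in bioparser_parse.
        Pattern rsPattern = Pattern.compile("\\brs\\d+\\b", Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        Matcher m = rsPattern.matcher("the LIG4 SNP rs1805388 and XRCC4 rs6869366, but not Xrs123");
        while (m.find()) hits.add(m.group());
        System.out.println(hits); // [rs1805388, rs6869366]
    }
}
```

Note that "Xrs123" is skipped: there is no word boundary between 'X' and 'r', matching the plugin's check that the previous character is a delimiter or the start of the text.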

Install the Plugin


mysql> INSTALL PLUGIN bioparser SONAME 'bioparser.so';
Query OK, 0 rows affected (0.00 sec)

mysql> show plugins;
+-----------------------+----------+--------------------+--------------+---------+
| Name | Status | Type | Library | License |
+-----------------------+----------+--------------------+--------------+---------+
(...)
| partition | ACTIVE | STORAGE ENGINE | NULL | GPL |
| bioparser | ACTIVE | FTPARSER | bioparser.so | GPL |
+-----------------------+----------+--------------------+--------------+---------+
21 rows in set (0.00 sec)

Invoke Plugin


Create a table that will use the plugin:
mysql> create table pubmed(
abstract TEXT,
FULLTEXT (abstract) WITH PARSER bioparser
) ENGINE=MyISAM;

Insert some abstracts.
mysql> insert into pubmed(abstract) values("A predictive role in radiation pneumonitis (RP) development was observed for the LIG4 SNP rs1805388 (adjusted hazard ratio, 2.08; 95% confidence interval, 1.04-4.12; P = .037 for the CT/TT genotype vs the CC genotype). In addition, men with the TT genotype of the XRCC4 rs6869366 SNP and women with AG + AA genotypes of the XRCC5 rs3835 SNP also were at increased risk of developing severe RP.");
Query OK, 1 row affected (0.00 sec)

(...)

mysql> select abstract from pubmed;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| abstract |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| A predictive role in radiation pneumonitis (RP) development was observed for the LIG4 SNP rs1805388 (adjusted hazard ratio, 2.08; 95% confidence interval, 1.04-4.12; P = .037 for the CT/TT genotype vs the CC genotype). In addition, men with the TT genotype of the XRCC4 rs6869366 SNP and women with AG + AA genotypes of the XRCC5 rs3835 SNP also were at increased risk of developing severe RP. |
| Nonhomologous end joining (NHEJ) is a pathway that repairs DNA double-strand breaks (DSBs) to maintain genomic stability in response to irradiation. The authors hypothesized that single nucleotide polymorphisms (SNPs) in NHEJ repair genes may affect clinical outcomes in patients with nonsmall cell lung cancer (NSCLC) who receive definitive radio(chemo)therapy. |
| The authors genotyped 5 potentially functional SNPs-x-ray repair complementing defective repair in Chinese hamster cells 4 (XRCC4) reference SNP (rs) number rs6869366 (-1394 guanine to thymine [-1394G?T] change) and rs28360071 (intron 3, deletion/insertion), XRCC5 rs3835 (guanine to adenine [G?A] change at nucleotide 2408), XRCC6 rs2267437 (-1310 cytosine to guanine [C?G) change], and DNA ligase IV (LIG4) rs1805388 (threonine-to-isoleucine change at codon 9 [T9I])-and estimated their associations with severe radiation pneumonitis (RP) (grade ?3) in 195 patients with NSCLC. |
| The current results indicated that NHEJ genetic polymorphisms, particularly LIG4 rs1805388, may modulate the risk of RP in patients with NSCLC who receive definitive radio(chemo)therapy. Large studies will be needed to confirm these findings. |
| The repair of DNA double-strand breaks (DSBs) is the major mechanism to maintain genomic stability in response to irradiation. We hypothesized that genetic polymorphisms in DSB repair genes may affect clinical outcomes among non-small cell lung cancer (NSCLC) patients treated with definitive radio(chemo)therapy. |
| We also found that RAD51 -135G>C and XRCC2 R188H SNPs were independent prognostic factors for overall survival (adjusted HR?=?1.70, 95% CI, 1.14-2.62, P?=?0.009 for CG/CC vs. GG; and adjusted HR?=?1.70; 95% CI, 1.02-2.85, P?=?0.043 for AG vs. GG, respectively) and that the SNP-survival association was most pronounced in the presence of RP. |
| A total of 291 patients (145 male/146 female, mean age (± S.D.) 52.2 (± 13.1) years) with PsA were examined clinically, by standard laboratory tests and their DNA was genotyped for the SNP rs2476601 (PTPN22 +1858 C/T). Allelic frequencies were determined and compared with 725 controls. |
| this is a test rs2476601, rs1805388, rs3835 and rs25 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)

Now, test the plugin:

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("DNA double-strand") as SCORE
from pubmed group by 2 HAVING SCORE!=0;
Empty set (0.00 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs25") as SCORE
from pubmed group by 2 HAVING SCORE!=0;
+---------------------------------------------+--------------------+
| ABSTRACT                                    | SCORE              |
+---------------------------------------------+--------------------+
| this is a test rs2476601, rs1805388, rs3... | 1.8603347539901733 |
+---------------------------------------------+--------------------+
1 row in set (0.01 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs2476601 rs1805388 rs6869366") as SCORE
from pubmed group by 2 HAVING SCORE!=0 order by 2 desc;
+---------------------------------------------+--------------------+
| ABSTRACT                                    | SCORE              |
+---------------------------------------------+--------------------+
| A total of 291 patients (145 male/146 fe... | 1.086121916770935 |
| A predictive role in radiation pneumonit... | 1.0619741678237915 |
| this is a test rs2476601, rs1805388, rs3... | 1.0502985715866089 |
| The authors genotyped 5 potentially func... | 1.0388768911361694 |
+---------------------------------------------+--------------------+
4 rows in set (0.00 sec)

mysql> select
concat(left(abstract,40),"...") as ABSTRACT,
match(abstract) against("rs25,rs2476601,rs1805388,rs6869366") as SCORE
from pubmed group by 2 HAVING SCORE!=0 order by 2 desc;
+---------------------------------------------+--------------------+
| ABSTRACT                                    | SCORE              |
+---------------------------------------------+--------------------+
| this is a test rs2476601, rs1805388, rs3... | 2.9106333255767822 |
| A total of 291 patients (145 male/146 fe... | 1.086121916770935 |
| A predictive role in radiation pneumonit... | 1.0619741678237915 |
| The authors genotyped 5 potentially func... | 1.0388768911361694 |
+---------------------------------------------+--------------------+
4 rows in set (0.00 sec)

Uninstall the plugin:


mysql> UNINSTALL PLUGIN bioparser;
Query OK, 0 rows affected (0.00 sec)


That's it,

Pierre

09 July 2011

Storing SNPs in a HDF5 file: my notebook

I discovered the benefits of using twitter in 2008, the day Deepak Singh replied to one of my tweets related to the storage of a large number of genotypes.





Since that day, I had tried to use the HDF5 library without any success: the documentation/API on the HDF5 site is vast and disheartening, and the API seems to be used by only a small number of geeks. Nevertheless, HDF5 is the technology used by the IGV genome browser.

In this post, I'll describe a C program loading a set of SNPs defined by the following C structure:
#define RS_LENGTH 13    /* max size of a rs-id (see STRSIZE 13 in the dump below) */
#define CHROM_LENGTH 7  /* max size of a chromosome name (see STRSIZE 7) */

typedef struct structSnp
    {
    char rsId[RS_LENGTH];     /* rs## */
    char chrom[CHROM_LENGTH]; /* chromosome name */
    int chromStart;           /* genomic start index */
    int chromEnd;             /* genomic end index */
    } Snp;
The SNPs will be read from stdin. Four columns are expected: rs-id, chrom, chromStart and chromEnd. Here is the command line I used to get a small input file:
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz" |\
gunzip -c |\
cut -f 2-5 |\
head -n 10000 > sample.tsv

Source code


The C program is described below. I hope my comments will make the code readable.

The Makefile



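The Makefile was also embedded as a gist. A minimal replacement, wrapping the gcc command shown below (the HDF5 paths are placeholders to adjust for your installation):

```make
# path to your HDF5 installation (placeholder)
HDF5_DIR=/path/to/hdf5
CC=gcc
CFLAGS=-Wall -I$(HDF5_DIR)/include
LDFLAGS=-L$(HDF5_DIR)/lib -lhdf5

a.out: dbsnp2hdf5.c
	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)

test: a.out sample.tsv
	./a.out < sample.tsv

sample.tsv:
	curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz" |\
	gunzip -c | cut -f 2-5 | head -n 10000 > $@
```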
Compile and run

gcc -Wall -o a.out -I  /path/to/hdf5/include -L /path/to/hdf5/lib dbsnp2hdf5.c  -lhdf5
./a.out < sample.tsv

Dump the data


At the end, the program has created a structured binary file containing our SNPs. The HDF5 utility h5dump can be used to display the data as text:
$ h5dump storage.h5

HDF5 "storage.h5" {
GROUP "/" {
   GROUP "variations" {
      DATASET "dbSNP" {
         DATATYPE H5T_COMPOUND {
            H5T_STRING {
               STRSIZE 13;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            } "rs";
            H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            } "chrom";
            H5T_STD_I32LE "chromStart";
            H5T_STD_I32LE "chromEnd";
         }
         DATASPACE SIMPLE { ( 10000 ) / ( H5S_UNLIMITED ) }
         DATA {
            (0): {
               "rs112750067",
               "chr1",
               10326,
               10327
            },
            (1): {
               "rs56289060",
               "chr1",
               10433,
               10433
            },
            (2): {
               "rs112155239",
               "chr1",
               10439,
               10440
            },
            (3): {
               "rs112766696",
               "chr1",
               10439,
               10440
            },
            (...)
               450425,
               450426
            }
         }
         ATTRIBUTE "rdfs:comment" {
            DATATYPE H5T_STRING {
               STRSIZE 15;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE SCALAR
            DATA {
            (0): "My list of SNPs"
            }
         }
      }
   }
}
}


That's it !
Pierre