Showing posts with label gene. Show all posts

17 May 2016

finding new intron-exon junctions using the public Encode RNASeq data

I've been asked to look for some new / suspected / previously uncharacterized intron-exon junctions in public RNASeq data.
I've used the BAMs under http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/.

The following command is used to build the list of BAMs:

curl -s  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/" |\
tr ' <>"' "\n" | grep -F .bam | grep -v bai | sort | uniq | sed 's/.bam$//' | sed 's/$/ \\/' 

wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dAlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R1x75dSplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep1V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il200SplicesRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400AlignsRep2V2 \
wgEncodeCaltechRnaSeqGm12878R2x75Il400SplicesRep2V2 \
(...)

This list is inserted as a list named SAMPLES a Makefile.

For each BAM, we use samtools to retrieve the reads in the region(s)) of interest. The reads are then filtered with samjs (https://github.com/lindenb/jvarkit/wiki/SamJS) to only keep the reads carrying an intron-exon junction at the desired location(s). Basically, the javascript-based filter loops over the CIGAR string of the read, computes the genomic interval skipped when the cigar operator is a deletion or a skipped region/intron. The read is printed if it describes the new intron-exon junction.

All in one:

That's it,

Pierre

20 December 2012

RDF/Jena: a simple extension for XSLT/XALAN. Testing with NCBI-Gene

In a previous post, I've shown that the XALAN XSLT engine can be extended with custom function returning a DOM Document that will be used by the xslt-stylesheet. Here, I'll create an extension for XALAN getting some RDF statements from a Jena/RDF model. The RDF model will be loaded in memory but one can imagine to use a persistent model ( TDB or SDB). I'll download a record from NCBI-gene, transform it to html and use the disease-ontology database as RDF to annotate it.

A Gene record is downloaded as XML from NCBI gene:

curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=4853&retmode=xml" > notch2.html

The disease ontology is downloaded as RDF/XML:

curl -odoid.owl "http://www.berkeleybop.org/ontologies/doid.owl"

The XSLT Stylesheet

The stylesheet declares the extension jena, loads the RDF model ("$model"), searches for the OMIM identifiers in the Gene record and loads the RDF statements related to that OMIM-ID.
For example the following xpath expression:

jena:query(
   $model,
   $doiid,
   'http://www.geneontology.org/formats/oboInOwl#hasExactSynonym',
   ''
   )

returns a rdf/XML document containing the RDF statements having a subject=$doiid, a property "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym" and any object.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Statement>
    <rdf:subject rdf:resource="http://purl.obolibrary.org/obo/DOID_0050721"/>
    <rdf:predicate rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"/>
    <rdf:object>Phosphoserine phosphatase deficiency</rdf:object>
  </rdf:Statement>
</rdf:RDF>

The stylesheet:

The Java code

This is the java extension: the constructor loads the RDF model in memory. The function query(..) returns a RDF/XML document matching the query.

Makefile

config.mk:

Result

java -cp ${class.path} org.apache.xalan.xslt.Process \
 -IN notch2.xml \
 -XSL gene2html.xsl -EDUMP -OUT result.html

NOTCH2

Omim ID 610205

Label

Alagille syndrome

Synonym

Arteriohepatic dysplasia (disorder)

Sub-Class Of

Label

gastrointestinal system disease

Synonym

gastrointestinal disease

Sub-Class Of

Label

disease of anatomical entity

Sub-Class Of

Label: disease

Omim ID 102500

Label

Hajdu-Cheney syndrome

Synonym

Hajdu-Cheney syndrome (disorder)

Sub-Class Of

Label

autosomal dominant disease

Sub-Class Of

Label

autosomal genetic disease

Sub-Class Of

Label

monogenic disease

Sub-Class Of

Label

genetic disease

Sub-Class Of

Label: disease

That's it,

Pierre

14 August 2012

Apache Pig: first contact with some 'bio' data.

via wikipedia: "Apache Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems.".

here I've played with PIG using a local datastore (that is, different from a distributed environment) to handle some data from the UCSC database.

Download and Install

Install Pig:

$ wget "http://mirror.cc.columbia.edu/pub/software/apache/pig/pig-0.10.0/pig-0.10.0.tar.gz"
$ tar xvfz pig-0.10.0.tar.gz
$ rm pig-0.10.0.tar.gz
$ cd pig-0.10.0
$ export PIG_INSTALL=${PWD}
$ export JAVA_HOME=/your/path/to/jdk1.7

Download 'knownGene' from the UCSC:

$ wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz"
$ gunzip knownGene.txt.gz

Start the command line interface

run Pig’s Grunt shell in local mode:

$ pig -x local 
2012-08-15 10:37:25,247 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
2012-08-15 10:37:25,248 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/lindenb/tmp/HADOOP/pig-0.10.0/pig_1344933445243.log
2012-08-15 10:37:25,622 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

Getting the number of Genes by Chromosome

knownGenes = LOAD '/home/lindenb/tmp/HADOOP/knownGene.txt' as 
  ( name:chararray ,  chrom:chararray ,  strand:chararray ,  txStart:int , 
    txEnd:int ,  cdsStart:int ,  cdsEnd:int ,  exonCount:chararray ,
     exonStarts:chararray ,  exonEnds:chararray ,  proteinID:chararray , 
     alignID:chararray );

keep the genes coding for a protein:

coding = FILTER knownGenes BY cdsStart < cdsEnd ;

Remove some columns:

coding = FOREACH coding GENERATE chrom,cdsStart,cdsEnd,strand,name;

Group by chromosome:

C = GROUP coding by chrom;

Filter out some chromosomes:

C = FILTER C by NOT(
  group matches  '.*_random' OR
  group matches  'chrUn_.*' OR
  group matches '.*hap.*'
  );

Get the count of genes for each chromosome:

D= FOREACH C GENERATE
 group as CHROM,
 COUNT(coding.name) as numberOfGenes
 ;

And dump the data:

dump D;

Result:

(chr1,6139)
(chr2,3869)
(chr3,3406)
(chr4,2166)
(chr5,2515)
(chr6,3021)
(chr7,2825)
(chr8,1936)
(chr9,2338)
(chrM,1)
(chrX,2374)
(chrY,318)
(chr10,2470)
(chr11,3579)
(chr12,3066)
(chr13,924)
(chr14,2009)
(chr15,1819)
(chr16,2510)
(chr17,3396)
(chr18,862)
(chr19,3784)
(chr20,1570)
(chr21,663)
(chr22,1348)

Interestingly, noting seems to happen until your ask to dump the data.

Finding the overlapping genes

I want to get a list of pairs of genes overlapping and having an opposite strand. I've not been able to find a quick way to join two tables using a complex criteria.

Create a two identical lists of genes E1 and E2 . Add an extra column "1" that will be used to join both tables.

E1 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;
E2 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;

Join the tables using the extra column.

E3 = join E1 by pivot, E2 by pivot;

Extract and rename the fields from the join:

E3 = FOREACH E3 generate 
 $1 as chrom1, $2 as start1, $3 as end1, $4 as strand1, $5 as name1,
 $7 as chrom2, $8 as start2, $9 as end2, $10 as strand2, $11 as name2
 ;

At this point, the data in E3 look like this:

(...)
(chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc001aau.3)
(chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc001abe.4)
(chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc009vjk.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc009vjm.3)
(chr1,664484,665108,-,uc009vjm.3,chr1,12189,13639,+,uc010nxq.1)
(chr1,664484,665108,-,uc009vjm.3,chr1,367658,368597,+,uc010nxu.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,621095,622034,-,uc010nxv.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,324514,325605,+,uc021oeh.1)
(chr1,664484,665108,-,uc009vjm.3,chr1,327745,328213,+,uc021oei.1)
(...)

Extract the overlapping genes:

E3= FILTER E3 BY
    name1 < name2 AND
    chrom1==chrom2 AND
    strand1!=strand2 AND
    NOT(end1 < start2 OR end2 < start1);

and dump the result:

dump E3

After a few hours the result is computed:

(...)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjw.2)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjx.2)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc022bmo.1)
(chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bml.1)
(chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bmm.1)
(chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bml.1)
(chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bmm.1)
(chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129871010,-,uc004bqr.1)
(chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129856116,-,uc010mxg.1)
(chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129871010,-,uc004bqr.1)
(chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129856116,-,uc010mxg.1)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129940183,+,uc022bno.1)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129946390,+,uc011mab.2)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129979280,+,uc011mac.2)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129940183,+,uc022bno.1)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129946390,+,uc011mab.2)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129979280,+,uc011mac.2)
(chr9,130455527,130477918,-,uc004brm.3,chr9,130469310,130476184,+,uc004brn.1)
(chr9,131703812,131719311,+,uc004bwq.1,chr9,131707965,131709582,-,uc004bwr.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cus.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cut.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cuu.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cus.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cut.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cuu.3)
(...)

That was very slow. There might be a better way to do this and I wonder if using a hadoop filesystem would really speed the computation. At this point I'll continue to use a SQL database for such small amount of data.

That's it.

Pierre

YOKOFAKUN

17 May 2016

finding new intron-exon junctions using the public Encode RNASeq data

All in one:

20 December 2012

RDF/Jena: a simple extension for XSLT/XALAN. Testing with NCBI-Gene

The XSLT Stylesheet

The Java code

Makefile

Result

NOTCH2

Omim ID 610205

Omim ID 102500

14 August 2012

Apache Pig: first contact with some 'bio' data.

Download and Install

Start the command line interface

Getting the number of Genes by Chromosome

Finding the overlapping genes

About Me

Feeds

Blog Archive

Web2.0

Labels