29 November 2012

Reading/Writing a VCF file with the GATK-API.

This is a simple post to save my notes about reading a VCF file and writing it to another file using the java libraries of the GATK. The only way I found requires a SAMSequenceDictionary and always writes an index.:

The code

import java.io.*;
import org.broad.tribble.AbstractFeatureReader;
import org.broad.tribble.FeatureReader;
import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.codecs.vcf.*;
import org.broadinstitute.sting.utils.variantcontext.VariantContext;
import org.broadinstitute.sting.utils.variantcontext.writer.*;
import net.sf.samtools.SAMSequenceDictionary;
import net.sf.picard.reference.*;

import java.util.Iterator;
import java.util.Map;
/**
 * motivation:
 *      copy a VCF 
 * usage:
 * javac -cp ${GATK}  ReadVCF.java
 * java -cp ${GATK}:. ReadVCF ref.fa my.vcf
 */
public class ReadVCF
 {
 public static void main(String args[]) throws Exception
  {
  /** latest VCF specification */
  final VCFCodec vcfCodec = new VCFCodec();
  /** we don't need some indexed VCFs */
  boolean requireIndex=false;
  /* load a SAM sequence dictionary */
  SAMSequenceDictionary dict=new IndexedFastaSequenceFile(
    new File(args[0])).getSequenceDictionary();
  /* loop over each vcf */
  for(int i=1;i< args.length;++i)
   {
   /* input VCF */
   String filename=args[i];
   /* output VCF */
   File fileout=new File("tmp"+i+".vcf"); 
   VariantContextWriter writer=VariantContextWriterFactory.create(fileout,dict);
   /* get A VCF Reader */
   FeatureReader<VariantContext> reader = AbstractFeatureReader.getFeatureReader(
      filename, vcfCodec, requireIndex);
   /* read the header */
   VCFHeader header = (VCFHeader)reader.getHeader();
   /* write the header */
   writer.writeHeader(header);
   /** loop over each Variation */
   Iterator<VariantContext> it = reader.iterator();
              while ( it.hasNext() )
               {
               /* get next variation and save it */
     VariantContext vc = it.next();
     writer.add(vc);
    }
   /* we're done */
   reader.close();
   writer.close();
   }  
  }
 }

Makefile

GATK=GenomeAnalysisTKLite-2.2-15-g4828906/GenomeAnalysisTKLite.jar
VCF=gatk-master/public/testdata/exampleDBSNP.vcf
REF=./gatk-master/public/testdata/exampleFASTA.fasta
all: 
 javac -cp ${GATK} -nowarn ReadVCF.java
 java -cp ${GATK}:. ReadVCF $(REF) ${VCF}

That's it,
Pierre

21 November 2012

visualizing the dependencies in a Makefile

Update 2014: I wrote a C version at https://github.com/lindenb/makefile2graph.
I've just coded a tool to visualize the dependencies in a Makefile. The java source code is available on github at : https://github.com/lindenb/jsandbox/blob/master/src/sandbox/MakeGraphDependencies.java. This simple tool parses the ouput of
make -dq
( here option '-d' is 'Print lots of debugging information' and '-q' is 'Run no commands') and prints a graphiz-dot file.

Example

Below is a simple NGS workflow:
%.bam.bai : %.bam
 
file.vcf:  merged.bam.bai ref.fa
merged.bam : sorted1.bam sorted2.bam
sorted1.bam: lane1_1.fastq  lane1_2.fastq ref.fa
sorted2.bam: lane2_1.fastq  lane2_2.fastq ref.fa
Invoking the program:
make -d --dry-run | java -jar makegraphdependencies.jar
generates the following graphiz-dot file:
digraph G {
n9[label="sorted2.bam" ];
n3[label="merged.bam.bai" ];
n10[label="lane2_1.fastq" ];
n11[label="lane2_2.fastq" ];
n2[label="file.vcf" ];
n4[label="merged.bam" ];
n6[label="lane1_1.fastq" ];
n8[label="ref.fa" ];
n7[label="lane1_2.fastq" ];
n0[label="[ROOT]" ];
n5[label="sorted1.bam" ];
n1[label="Makefile" ];
n10->n9;
n11->n9;
n8->n9;
n4->n3;
n3->n2;
n8->n2;
n9->n4;
n5->n4;
n2->n0;
n1->n0;
n6->n5;
n8->n5;
n7->n5;
}
The result: (here using the google chart API for Graphviz)

That's it,
Pierre

13 November 2012

Creating a virtual RDF graph describing a set of OpenOffice spreadsheets with Apache Jena and Fuseki

In the current post, I will use the Jena API for RDF to implement a virtual RDF graph describing the content of a set of openoffice/libreoffice spreasheets.

Fact: An openoffice file (*.ods) is a Zip file

An openoffice file is nothing but a zip file:
$ unzip -t jeter.ods 
Archive:  jeter.ods
    testing: mimetype                 OK
    testing: meta.xml                 OK
    testing: settings.xml             OK
    testing: content.xml              OK
    testing: Thumbnails/thumbnail.png   OK
    testing: Configurations2/images/Bitmaps/   OK
    testing: Configurations2/popupmenu/   OK
    testing: Configurations2/toolpanel/   OK
    testing: Configurations2/statusbar/   OK
    testing: Configurations2/progressbar/   OK
    testing: Configurations2/toolbar/   OK
    testing: Configurations2/menubar/   OK
    testing: Configurations2/accelerator/current.xml   OK
    testing: Configurations2/floater/   OK
    testing: styles.xml               OK
    testing: META-INF/manifest.xml    OK
No errors detected in compressed data of jeter.ods.

The entry content.xml is a XML file describing the tables in the spreadsheet:
$ unzip -c jeter.ods content.xml |\
grep -v Archive |\
grep -v inflating | xmllint --format - |\
head -n 20


<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rpt="http://openoffice.org/2005/report" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:css3t="http://www.w3.org/TR/css3-text/" office:version="1.2">
  <office:scripts/>
  <office:font-face-decls>
    <style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
    <style:font-face style:name="DejaVu Sans" svg:font-family="'DejaVu Sans'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="Lohit Hindi" svg:font-family="'Lohit Hindi'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="WenQuanYi Micro Hei" svg:font-family="'WenQuanYi Micro Hei'" style:font-family-generic="system" style:font-pitch="variable"/>
  </office:font-face-decls>
  <office:automatic-styles>
    <style:style style:name="co1" style:family="table-column">
      <style:table-column-properties fo:break-before="auto" style:column-width="0.889in"/>
    </style:style>
    <style:style style:name="ro2" style:family="table-row">
      <style:table-row-properties style:row-height="0.178in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ro3" style:family="table-row">
      <style:table-row-properties style:row-height="0.1681in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ta1" style:family="table" style:master-page-name="Default">

Fact: Implementing a simple virtual RDF graph with Jena is easy

By virtual I mean that there is no RDFStore, the triples are created on the fly.
Implementing a simple virtual RDF graph with Jena is easy: you simply have to extend the class com.hp.hpl.jena.graph.impl.GraphBase and only implement the method graphBaseFind which returns all the RDF Triples matching a TripleMatch.

(...)
 @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
        {
        return ...;
        }
(...)

The code

My implementation of a RDFGraph for a set of OpenOffice Calc is not effective but it works fine: for each call of graphBaseFind, it creates an "Iterator<Triple>" scanning each content.xml entry of each openoffice file. This iterator creates some new Triples, add them to a list of Triples that will be filtered by the TripleMatcher.

Compilation

the Makefile:
CP=...#path to the jars of JENA/ARQ/etc... e.g: =`find ${ARQ} -name "*.jar" |  | tr "\n" ":"`
.PHONY: all
all:
 javac -cp ${CP} -sourcepath src src/oocalc/OpenOfficeCalcGraph.java
 jar cvf dist/openoffice2rdf.jar -C src .

Querying using sparql

Now that the Graph has been implemented and compiled, one can query it using ARQ, the sparql engine of Jena:

The spreadsheet

I've created the following spreadsheet and saved it in a file named "jeter.ods":
CHROMSTARTENDNAME
chr1100200rs654
chr1150250rs264
chr1200300rs610
chr1250350rs929
chr1300400rs408
chr1350450rs346
chr1400500rs430
chr1450550rs735
chr1500600rs575
chr1550650rs891
chr1600700rs627
chr1650750rs650
chr1700800rs715
chr1750850rs467
chr1800900rs882
chr1850950rs301
chr19001000rs643
chr19501050rs246
chr110001100rs178
chr110501150rs928
chr111001200rs213

The sparql query

The following SPARQL returns the informations about the cells in the 3rd row of the spreadsheet:


Invoke:
java -cp `find /home/lindenb/.ivy2/cache -name "*.jar" | tr "\n" ":"`:dist/openoffice2rdf.jar  \
 oocalc.OpenOfficeCalcGraph test.sparql /home/lindenb/jeter.ods

Result:
-----------------------------------------------------------------------------------------------------------------------------------
| s                                       | p                                                 | o                                 |
===================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:X                                          | "1"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:value                                      | "chr1"                            |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:X                                          | "2"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:value                                      | "150"^^xsd:float                  |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:X                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:value                                      | "250"^^xsd:float                  |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:X                                          | "4"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:value                                      | "rs264"                           |
-----------------------------------------------------------------------------------------------------------------------------------

Serving the OpenOffice spreadsheets as RDF over HTTP

Fuseki is a SPARQL server. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. We're going to deploy the OpenOfficeCalcGraph in Fuseki to query a set of OpenOffice files.

Download an install Fuseki

wget https://repository.apache.org/content/repositories/releases/org/apache/jena/jena-fuseki/0.2.5/jena-fuseki-0.2.5-distribution.tar.gz
tar xfz jena-fuseki-0.2.5-distribution.tar.gz
rm jena-fuseki-0.2.5-distribution.tar.gz

Tell Fuseki about our OpenOfficeCalcGraph

We need to create a config file for Fuseki. That was the most complicated part as the process is not clearly documented:

The line:
[] ja:loadClass "oocalc.OpenOfficeCalcGraph" .
loads the class oocalc.OpenOfficeCalcGraph. The class OpenOfficeCalcGraph contains a static initialisation method:
(...)
static { init() ; }
    private static void init()
        {
        (...)
In this static method, a Jena Assembler for OpenOfficeCalcGraph is registered under the resource named: "http://rdf.lindenb.org/build".
public static OpenOfficeAssembler assembler = new OpenOfficeAssembler();
(...)
private static final Resource buildRsrc=ResourceFactory.createResource(NS+"build");
(...)
Assembler.general.implementWith(buildRsrc,assembler);
(...)
An Assembler configures a Graph from a RDF config file. In our example, the config contains the path to the OpenOffice spreadsheets:
<#ooservice> rdf:type openoffice:build ;
    openoffice:file "/home/lindenb/jeter.ods" ;
    openoffice:file "/home/lindenb/jeter2.ods" ;
.
This config is read in the Assembler:
public static class OpenOfficeAssembler extends AssemblerBase implements Assembler
      {
      @Override
      public Object open( Assembler a, Resource root, Mode mode )
            {
            Property fileRsrc=ResourceFactory.createProperty(NS+"file");
            //read the configuration an get the files
            List<File> files=new ArrayList<File>();
            StmtIterator iter=root.listProperties(fileRsrc);
     (...)

Start Fuseki with the config file:

$ cd jena-fuseki-0.2.5
$ java -cp fuseki-server.jar:/path/to/openoffice2rdf.jar  org.apache.jena.fuseki.FusekiCmd \
    --debug  -v --config /path/to/openoffice.ttl
14:11:50 INFO  Config               :: Configuration file: ../openoffice.ttl
14:11:50 INFO  Config               :: Service: :service1
14:11:50 INFO  Config               ::   name = ds
14:11:50 INFO  Config               ::   query = /ds/query
14:11:50 INFO  Config               ::   query = /ds/sparql
14:11:50 INFO  Config               ::   update = /ds/update
14:11:50 INFO  Config               ::   upload = /ds/upload
14:11:50 INFO  Config               ::   graphStore(RW) = /ds/data
14:11:50 INFO  Config               ::   graphStore(R) = /ds/get
14:11:50 INFO  ooffice2rdf          :: Calling OpenOfficeCalcGraph init
14:11:50 INFO  Config               :: Service: OpenOffice Service (R)
14:11:50 INFO  Config               ::   name = openoffice
14:11:50 INFO  Config               ::   query = /openoffice/sparql
14:11:50 INFO  Config               ::   query = /openoffice/query
14:11:50 INFO  Config               ::   update = /openoffice/update
14:11:50 INFO  Config               ::   graphStore(R) = /openoffice/get
14:11:50 INFO  Config               ::   graphStore(R) = /openoffice/data
14:11:51 INFO  Server               :: Dataset path = /ds
14:11:51 INFO  Server               :: Dataset path = /openoffice
14:11:51 INFO  Server               :: Fuseki 0.2.5 2012-10-20T17:03:29+0100
14:11:51 INFO  Server               :: Started 2012/11/13 14:11:51 CET on port 3030
Open your browser at http://localhost:3030, select the control panel at http://localhost:3030/control-panel.tpl and select /openoffice:
Fuseki Control Panel
Dataset:

The following form is displayed:
SPARQL Query




Output:


XSLT style sheet (blank for none):




Force the accept header to text/plain regardless



You can now copy, paste and run the previous sparql query:
--------------------------------------------------------------------------------------------------------------------------------------------------
| s                                        | p                                                 | o                                               |
==================================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/value>                    | "chr1"                                          |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/value>                    | "150"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/value>                    | "250"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/value>                    | "rs264"                                         |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/value>                    | "1"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.od
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/value>                    | "2"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/value>                    | "3"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/value>                    | "4"^^<http://www.w3.org/2001/XMLSchema#float>   |
--------------------------------------------------------------------------------------------------------------------------------------------------

That's it,

Pierre

02 November 2012

Saving your tweets in a database using sqlite, rhino, scribe, javascript

In the current post, I 'll describe a simple method to save your tweets in a sqlite database using Mozilla Rhino.

Prerequisites

  • sqlite
  • Apache Rhino. I think it should be de-facto available when the java developer toolkit (JDK) is installed
  • Scribe, the simple OAuth library for Java . It also requires Apache codec

The config.js file

Open an account on https://dev.twitter.com/ and create an App to receive an API-key and an API-secret.
Create the following file 'config.js' filled with the correct parameters.

The javascript

The following javascript file opens a Oauth connection, retrieves the tweets and stores them into sqlite. I've commented the code, I hope it is clear enough.

Running the script using Rhino

scribe.libs=/path/to/scribe-1.3.2.jar:/path/to/commons-codec.jar
rhino.libs=/usr/share/java/js.jar:/usr/share/java/jline.jar
sqlite.libs=/path/to/sqlitejdbc-v056.jar
CLASSPATH=${rhino.libs}:${scribe.libs}:${sqlite.libs}

java -cp ${CLASSPATH} org.mozilla.javascript.tools.shell.Main -f twitter2sqlite.js
At the first time, the user is asked to authorize the application to use the twitter API

The script runs forever (Ctrl-C to break), listening to the new tweets.

As a test, I wrote the following tweet:


... and the tweet was later inserted in the database...

Sleep...

Inserted ({created_at:"Fri Nov 02 20:29:04 +0000 2012", id:264464160664981500, id_str:"264464160664981504", text:"wrote a tool to save my tweets: This is a test . ( #rhino, #jdbc, #sqlite, #scribe #javascript )", source:"web", truncated:false, in_reply_to_status_id:null, in_reply_to_status_id_str:null, in_reply_to_user_id:null, in_reply_to_user_id_str:null, in_reply_to_screen_name:null, geo:null, coordinates:null, place:null, contributors:null, retweet_count:0, entities:{hashtags:[{text:"rhino", indices:[51, 57]}, {text:"jdbc", indices:[59, 64]}, {text:"sqlite", indices:[66, 73]}, {text:"scribe", indices:[75, 82]}, {text:"javascript", indices:[83, 94]}], urls:[], user_mentions:[]}, favorited:false, retweeted:false})

Sleep...
Sleep...
Sleep...

Later, the tweets can be extracted using the sqlite command line:

$  sqlite3 tweets.sqlite 'select * from tweet'

264464160664981504|({created_at:"Fri Nov 02 20:29:04 +0000 2012", id:264464160664981500, id_str:"264464160664981504", text:"wrote a tool to save my tweets: This
264421310841638913|({created_at:"Fri Nov 02 17:38:47 +0000 2012", id:264421310841638900, id_str:"264421310841638913", text:"The tools for recalibration have cha
264264932097400832|({created_at:"Fri Nov 02 07:17:24 +0000 2012", id:264264932097400830, id_str:"264264932097400832", text:"@warandpeace you're welcome. Your sh
264158323287416832|({created_at:"Fri Nov 02 00:13:46 +0000 2012", id:264158323287416830, id_str:"264158323287416832", text:"Drawing of the day November 1, 2012.
264142732174438400|({created_at:"Thu Nov 01 23:11:49 +0000 2012", id:264142732174438400, id_str:"264142732174438400", text:"[delicious] PLOS Collections: How th
264064117558624256|({created_at:"Thu Nov 01 17:59:26 +0000 2012", id:264064117558624260, id_str:"264064117558624256", text:"I've added a stupid basic dependency
264025607724204034|({created_at:"Thu Nov 01 15:26:24 +0000 2012", id:264025607724204030, id_str:"264025607724204034", text:"in the desert lab, checking my on-go
264013563704795136|({created_at:"Thu Nov 01 14:38:33 +0000 2012", id:264013563704795140, id_str:"264013563704795136", text:"Drawing of the day November 1, 2012.
263996436679630848|({created_at:"Thu Nov 01 13:30:29 +0000 2012", id:263996436679630850, id_str:"263996436679630848", text:"RT @RealistComics: he's tall, dark a
263966759210590208|({created_at:"Thu Nov 01 11:32:34 +0000 2012", id:263966759210590200, id_str:"263966759210590208", text:"RT @guermonprez: #Aubry Un avion nor
263946369847398402|({created_at:"Thu Nov 01 10:11:33 +0000 2012", id:263946369847398400, id_str:"263946369847398402", text:"[delicious] OVal: object validation 
263946366919790593|({created_at:"Thu Nov 01 10:11:32 +0000 2012", id:263946366919790600, id_str:"263946366919790593", text:"[delicious] MyBatis #tweet: a first 
263941020729896960|({created_at:"Thu Nov 01 09:50:17 +0000 2012", id:263941020729896960, id_str:"263941020729896960", text:"RT @josh_wills: I have never been pr
263938670187388928|({created_at:"Thu Nov 01 09:40:57 +0000 2012", id:263938670187388930, id_str:"263938670187388928", text:"RT @softmodeling @peterneubauer: Usi
263936362716200960|({created_at:"Thu Nov 01 09:31:47 +0000 2012", id:263936362716200960, id_str:"263936362716200960", text:"declined to review an article about 
263934528186351616|({created_at:"Thu Nov 01 09:24:29 +0000 2012", id:263934528186351600, id_str:"263934528186351616", text:"@figshare Thanks, ( was http://t.co/
263815846139412480|({created_at:"Thu Nov 01 01:32:53 +0000 2012", id:263815846139412480, id_str:"263815846139412480", text:"Drawing of the day October 30, 2012.
263731855919026176|({created_at:"Wed Oct 31 19:59:09 +0000 2012", id:263731855919026180, id_str:"263731855919026176", text:"[delicious] An integrated map of gen
263726281647067136|({created_at:"Wed Oct 31 19:36:59 +0000 2012", id:263726281647067140, id_str:"263726281647067136", text:"RT @bryan_howie: 1000 Genomes paper 
263695076516052992|({created_at:"Wed Oct 31 17:33:00 +0000 2012", id:263695076516053000, id_str:"263695076516052992", text:"\"Forget your Past\" ( abandoned Bul

That's it
Pierre