After watching David Haussler's talk "Beacon Project and Data Sharing APIs", I wanted to play with Avro and the models and APIs defined by the Global Alliance for Genomics and Health (GA4GH) coalition. Here is my notebook.
(Wikipedia) Avro: "Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services."
First, we download the Java tools and libraries for Apache Avro:
curl -L -o avro-tools-1.7.7.jar "http://www.eng.lsu.edu/mirrors/apache/avro/avro-1.7.7/java/avro-tools-1.7.7.jar"
Next, we download the schemas defined by the GA4GH from GitHub:
curl -L -o schema.zip "https://github.com/ga4gh/schemas/archive/v0.5.1.zip"
unzip schema.zip
rm schema.zip
$ find -name "*.avdl"
./schemas-0.5.1/src/main/resources/avro/readmethods.avdl
./schemas-0.5.1/src/main/resources/avro/common.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadata.avdl
./schemas-0.5.1/src/main/resources/avro/wip/metadatamethods.avdl
./schemas-0.5.1/src/main/resources/avro/wip/variationReference.avdl
./schemas-0.5.1/src/main/resources/avro/variants.avdl
./schemas-0.5.1/src/main/resources/avro/variantmethods.avdl
./schemas-0.5.1/src/main/resources/avro/beacon.avdl
./schemas-0.5.1/src/main/resources/avro/references.avdl
./schemas-0.5.1/src/main/resources/avro/referencemethods.avdl
./schemas-0.5.1/src/main/resources/avro/reads.avdl
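The archive ships Avro IDL files (.avdl), while the `compile protocol` command used in the next step reads JSON protocol files (.avpr). Here is a minimal sketch of one way to do that conversion with the `idl` subcommand of avro-tools. The function name `avdl_to_avpr` and the `AVRO_JAR` and `DRY_RUN` variables are hypothetical helpers of mine, not part of avro-tools, and this may not be exactly how the conversion was done for this notebook.

```shell
#!/bin/sh
# Hypothetical helper (not part of avro-tools): convert every .avdl file
# found directly in a directory into a .avpr file next to it, using the
# avro-tools "idl" subcommand.
# Set DRY_RUN=1 to print the commands instead of running them.
AVRO_JAR="${AVRO_JAR:-avro-tools-1.7.7.jar}"

avdl_to_avpr() {
    dir="$1"
    for avdl in "$dir"/*.avdl; do
        # Skip the unexpanded pattern when the directory has no .avdl file.
        [ -e "$avdl" ] || continue
        avpr="${avdl%.avdl}.avpr"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "java -jar $AVRO_JAR idl $avdl $avpr"
        else
            java -jar "$AVRO_JAR" idl "$avdl" "$avpr"
        fi
    done
}
```

It would be called as `avdl_to_avpr schemas-0.5.1/src/main/resources/avro`. The compile step in this notebook only lists variants.avpr as input, so converting variants.avdl alone is probably enough here.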
These schemas can be compiled to Java using avro-tools:
$ java -jar avro-tools-1.7.7.jar compile protocol schemas-0.5.1/src/main/resources/avro/ ./generated
Input files to compile:
schemas-0.5.1/src/main/resources/avro/variants.avpr
$ find generated/org/ -name "*.java"
generated/org/ga4gh/GAPosition.java
generated/org/ga4gh/GAVariantSetMetadata.java
generated/org/ga4gh/GACall.java
generated/org/ga4gh/GAException.java
generated/org/ga4gh/GACigarOperation.java
generated/org/ga4gh/GAVariantSet.java
generated/org/ga4gh/GAVariants.java
generated/org/ga4gh/GAVariant.java
generated/org/ga4gh/GACallSet.java
generated/org/ga4gh/GACigarUnit.java
As a test, the following Java source uses the classes generated by Avro to create nine variants and serialize them to Avro:
Compile, archive and execute:
# compile classes
javac -d generated -cp avro-tools-1.7.7.jar -sourcepath generated:src generated/org/ga4gh/*.java src/test/TestAvro.java
# archive
jar cvf generated/ga4gh.jar -C generated org -C generated test
# run
java -cp avro-tools-1.7.7.jar:generated/ga4gh.jar test.TestAvro > variant.avro
We use avro-tools to convert the generated file variant.avro to JSON:
java -jar avro-tools-1.7.7.jar tojson variant.avro
Output:
The complete Makefile
That's it,
Pierre