YOKOFAKUN: pdf


06 March 2011

Creating a pdf of your favorite tweets with Apache FOP.

This post describes how I created a PDF document from a set of twitter statuses.

I wrote an XSLT stylesheet that transforms a Twitter status, represented as XML, into XSL-FO.

This stylesheet is available on github at: https://github.com/.../twitter/status2fo.xsl.

The stylesheet transforms the XML file generated by the Twitter API (e.g. http://api.twitter.com/1/statuses/show/44175380516585472.xml) to XSL-FO.

The XSL-FO is then processed by Apache FOP to generate a PDF:

fop -pdf result.pdf -xml status.xml -xsl status2fo.xsl
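
To make the first half of that pipeline concrete, here is a minimal Java sketch of an XML to XSL-FO transformation using only the JDK. It is not the status2fo.xsl stylesheet from the post: the inline stylesheet is a deliberately tiny stand-in, and the element names status, text and user/screen_name are assumptions about the Twitter XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StatusToFo {
    // Hypothetical, heavily simplified stylesheet (NOT the author's status2fo.xsl).
    // Assumes the input looks like <status><text/><user><screen_name/></user></status>.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
        + "<xsl:template match='/status'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='page' page-height='10cm' page-width='15cm'>"
        + "<fo:region-body/>"
        + "</fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='page'>"
        + "<fo:flow flow-name='xsl-region-body'>"
        + "<fo:block font-weight='bold'>@<xsl:value-of select='user/screen_name'/></fo:block>"
        + "<fo:block><xsl:value-of select='text'/></fo:block>"
        + "</fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the inline stylesheet to one status document and returns the XSL-FO text. */
    static String toFo(String statusXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(statusXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up status used only for illustration.
        String status = "<status><text>Hello FOP</text>"
            + "<user><screen_name>someone</screen_name></user></status>";
        System.out.println(toFo(status));
    }
}
```

The printed XSL-FO is the kind of intermediate document that fop builds internally when it is given both -xml and -xsl.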

Result

Bioinfo tweets


That's it,

Pierre

22 February 2011

A flex scanner extracting the metadata from a PDF file.

Four years ago, I played with the Adobe XMP library to extract the XMP metadata from a set of PDF files.

Today, I was surprised to find that I could simply display the XMP data contained in a PDF from Nature by using the following command line:

curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
strings |\
grep -A 100 "<x:xmpmeta"

I've generalized this process by implementing a GNU flex scanner that prints the XML content between the <x:xmpmeta> and </x:xmpmeta> tags. The source code is available on github at: https://github.com/lindenb/ccsandbox/blob/master/src/xmpextractor.l.
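
For readers who do not use flex, the same idea can be sketched in plain Java. This is not a port of xmpextractor.l, only an illustration of the scanning principle; it assumes the XMP packet is stored uncompressed in the file and that the file fits in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class XmpScan {
    static final String OPEN = "<x:xmpmeta";
    static final String CLOSE = "</x:xmpmeta>";

    /** Returns every <x:xmpmeta>...</x:xmpmeta> block found in the raw bytes of the stream. */
    static List<String> extract(InputStream in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // ISO-8859-1 maps each byte to exactly one char, so binary PDF data
            // cannot break the decoding and offsets stay byte-accurate.
            String raw = new String(buf.toByteArray(), StandardCharsets.ISO_8859_1);
            List<String> packets = new ArrayList<>();
            int from = 0;
            while (true) {
                int start = raw.indexOf(OPEN, from);
                if (start < 0) break;
                int end = raw.indexOf(CLOSE, start);
                if (end < 0) break;
                end += CLOSE.length();
                // Re-interpret the matched bytes as UTF-8 (assumed encoding of the packet).
                byte[] packet = raw.substring(start, end).getBytes(StandardCharsets.ISO_8859_1);
                packets.add(new String(packet, StandardCharsets.UTF_8));
                from = end;
            }
            return packets;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Reads a PDF on stdin, like the scanner in the post.
        for (String packet : extract(System.in)) {
            System.out.println(packet);
        }
    }
}
```

Unlike the flex scanner, this version buffers the whole input before searching, which is simpler but less suitable for very large files.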

Compilation

flex -f -B --read xmpextractor.l
gcc -o xmpextractor lex.yy.c

Testing with "PLOS"

Harper MA, Chen Z, Toy T, Machado IMP, Nelson SF, et al. (2011) Phenotype Sequencing: Identifying the Genes That Cause a Phenotype Directly from Pooled Sequencing of Independent Mutants. PLoS ONE 6(2): e16517. doi:10.1371/journal.pone.0016517:

curl -s "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0016517&representation=PDF" |\
./xmpextractor

<?xml version="1.0" encoding="UTF-8"?>
<XMP>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP toolkit 2.9.1-13, framework 1.6">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:iX="http://ns.adobe.com/iX/1.0/">
<rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" pdf:Producer="Acrobat Distiller 7.0 (Windows)"/>
<rdf:Description xmlns:xap="http://ns.adobe.com/xap/1.0/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xap:CreateDate="2011-02-14T08:21:58+08:00" xap:CreatorTool="3B2 Total Publishing System 7.51n/W" xap:ModifyDate="2011-02-17T14:32:08+08:00" xap:MetadataDate="2011-02-17T14:32:08+08:00"/>
<rdf:Description xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xapMM:DocumentID="uuid:f98295b5-0980-42f2-884e-6cecd2d75c90" xapMM:InstanceID="uuid:d226393a-bfaf-4a2b-8c28-08215008694e"/>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" dc:format="application/pdf">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">pone.0016517 1..16</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

Hum, nothing really interesting here.

Testing with "Nature Reviews Cardiology"

A test with Percutaneous coronary intervention in the elderly. Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184

curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
./xmpextractor

<?xml version="1.0" encoding="UTF-8"?>
<XMP><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:identifier>doi:10.1038/nrcardio.2010.184</dc:identifier>
<dc:creator>
<rdf:Seq>
<rdf:li>Tracy Y. Wang</rdf:li>
<rdf:li>Antonio Gutierrez</rdf:li>
<rdf:li>Eric D. Peterson</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Alt>
</dc:description>
<dc:publisher>
<rdf:Bag>
<rdf:li>Nature Publishing Group</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">
 © 2010 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.</rdf:li>
</rdf:Alt>
</dc:rights>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Percutaneous coronary intervention in the elderly</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Adobe PDF Library 8.0</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
<prism:copyright>© 2010 Nature Publishing Group</prism:copyright>
<prism:doi>10.1038/nrcardio.2010.184</prism:doi>
<prism:eIssn>1759-5010</prism:eIssn>
<prism:endingPage>90</prism:endingPage>
<prism:issn>1759-5002</prism:issn>
<prism:number>2</prism:number>
<prism:publicationName>Nature Publishing Group</prism:publicationName>
<prism:rightsAgent>permissions@nature.com</prism:rightsAgent>
<prism:startingPage>79</prism:startingPage>
<prism:volume>8</prism:volume>
<prism:publicationDate>
<rdf:Bag>
<rdf:li>2010-12-07</rdf:li>
</rdf:Bag>
</prism:publicationDate>
<prism:url>
<rdf:Bag>
<rdf:li>http://dx.doi.org/10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</prism:url>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreateDate>2011-01-10T10:09:23+05:30</xmp:CreateDate>
<xmp:CreatorTool/>
<xmp:Label>Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</xmp:Label>
<xmp:MetadataDate>2011-01-14T18:25:45+05:30</xmp:MetadataDate>
<xmp:ModifyDate>2011-01-14T18:25:45+05:30</xmp:ModifyDate>
<xmp:Identifier>
<rdf:Bag>
<rdf:li>doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</xmp:Identifier>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
<xmpRights:Marked>True</xmpRights:Marked>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:51421664-c9c6-4657-9fbe-318ac969ca26</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:c65fa1fb-b6f3-4a61-8615-07b04871749e</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

That's more interesting, isn't it?

That's it,

Pierre

08 February 2011

Visualizing my twitter network with Zoom.it

I wrote a small Java tool to download my twitter network as a GEXF file. This tool is available on github at:

https://github.com/lindenb/(...)/TwitterGraph.java

java -jar twittergraph.jar -o twittergraph.gexf 7431072 #my twitter ID

This tool doesn't use the OAuth API, so every time it reaches the twitter API quota (150 requests per hour) it has to wait for a few minutes and then retry the connection. In the end it took one night to download the data from my network (~390 friends).
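
The wait-and-retry behaviour can be sketched as follows. This is not the code of TwitterGraph.java: QuotaExceededException and callWithRetry are hypothetical names introduced for the illustration, and the request is simulated rather than sent to twitter.

```java
import java.util.concurrent.Callable;

class RetryOnQuota {
    /** Hypothetical exception signalling that the API quota was reached. */
    static class QuotaExceededException extends RuntimeException {
        QuotaExceededException(String msg) {
            super(msg);
        }
    }

    /** Runs the request; on a quota error, sleeps waitMillis and tries again, up to maxAttempts. */
    static <T> T callWithRetry(Callable<T> request, long waitMillis, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                return request.call();
            } catch (QuotaExceededException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last allowed attempt
                }
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", ie);
                }
            } catch (Exception e) {
                // Any other failure is not retried.
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated request: fails twice with a quota error, then succeeds.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new QuotaExceededException("quota reached");
            }
            return "data for user 7431072";
        }, 10L, 5);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

With a real quota of 150 requests per hour, the wait would be minutes rather than the 10 ms used in this demonstration.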

<gexf
   xmlns="http://www.gexf.net/1.1draft"
   xmlns:viz="http://www.gexf.net/1.1draft/viz"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   version="1.1"
   xsi:schemaLocation="http://www.gexf.net/1.1draft http://www.gexf.net/1.1draft/gexf.xsd">
  <meta lastmodifieddate="2011-02-04">
    <creator>Gephi 0.7</creator>
    <description/>
  </meta>
  <graph defaultedgetype="directed" timeformat="double" mode="dynamic">
    <attributes class="node" mode="static">
      <attribute id="name" title="name" type="string"/>
      <attribute id="screenName" title="screenName" type="string"/>
      <attribute id="imageUrl" title="imageUrl" type="string">
        <default>http://a3.twimg.com/sticky/default_profile_images/default_profile_1_reasonably_small.png</default>
      </attribute>
      <attribute id="location" title="location" type="string"/>
      <attribute id="description" title="description" type="string"/>
      <attribute id="protectedProfile" title="protectedProfile" type="boolean"/>
      <attribute id="friends" title="friends" type="integer"/>
      <attribute id="followers" title="followers" type="integer"/>
      <attribute id="listed" title="listed" type="integer"/>
      <attribute id="utc_offset" title="utc offset" type="integer"/>
      <attribute id="statuses_count" title="statuses count" type="integer"/>
    </attributes>
    <nodes>
      <node id="6612402" label="sciencebase">
        <attvalues>
          <attvalue for="name" value="David Bradley"/>
          <attvalue for="screenName" value="sciencebase"/>
          <attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/1142396198/twitter-blue-bradley_normal.jpg"/>
          <attvalue for="location" value="Cambridge, UK"/>
          <attvalue for="description" value="Science Writer David Bradley based in Cambridge, UK. Physical and life sciences news and views + technology, internet, web commentary."/>
          <attvalue for="protectedProfile" value="false"/>
          <attvalue for="friends" value="2022"/>
          <attvalue for="followers" value="9197"/>
          <attvalue for="listed" value="1065"/>
          <attvalue for="utc_offset" value="0"/>
          <attvalue for="statuses_count" value="7526"/>
        </attvalues>
      </node>
      <node id="19344270" label="EMBOcomm">
        <attvalues>
          <attvalue for="name" value="Suzanne Beveridge"/>
          <attvalue for="screenName" value="EMBOcomm"/>
          <attvalue for="imageUrl" value="http://a0.twimg.com/profile_images/1189685782/S_Beveridge5100_normal.JPG"/>
          <attvalue for="location" value="Heidelberg"/>
          <attvalue for="description" value="Follow me for the latest from EMBO, the European Molecular Biology Organization"/>
          <attvalue for="protectedProfile" value="false"/>
          <attvalue for="friends" value="396"/>
          <attvalue for="followers" value="697"/>
          <attvalue for="listed" value="59"/>
          <attvalue for="utc_offset" value="3600"/>
          <attvalue for="statuses_count" value="632"/>
        </attvalues>
      </node>
      <node id="20153702" label="walshtp">
        <attvalues>
          <attvalue for="name" value="Tom Walsh"/>
          <attvalue for="screenName" value="walshtp"/>
          <attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/644287976/IMG_0815_normal.JPG"/>
          <attvalue for="location" value="Dundee, Scotland"/>
          <attvalue for="description" value="Scientific programmer and sysadmin. "/>
          <attvalue for="protectedProfile" value="false"/>
          <attvalue for="friends" value="129"/>
          <attvalue for="followers" value="99"/>
          <attvalue for="listed" value="8"/>
          <attvalue for="utc_offset" value="0"/>
          <attvalue for="statuses_count" value="783"/>
        </attvalues>
      </node>
      <node id="15150655" label="konradfoerstner">
        <attvalues>
          <attvalue for="name" value="Konrad Förstner"/>
          <attvalue for="screenName" value="konradfoerstner"/>
          <attvalue for="imageUrl" value="http://a3.twimg.com/profile_images/643611092/konrad_avantar2_normal.jpeg"/>
          <attvalue for="location" value="here and there"/>
          <attvalue for="description" value="Idealist, Scientist, Includist, Data analyst, Open Source|Data|Access, Coder, Command line friend, CouchSurfer, Konrad"/>
          <attvalue for="protectedProfile" value="false"/>
          <attvalue for="friends" value="266"/>
          <attvalue for="followers" value="167"/>
          <attvalue for="listed" value="17"/>
          <attvalue for="utc_offset" value="3600"/>
          <attvalue for="statuses_count" value="1948"/>
        </attvalues>
      </node>

(...)
      
      <edge id="E3811" source="14899756" target="14295341">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4816" source="14899756" target="19542750">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4830" source="14899756" target="60065276">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E339" source="14899756" target="617133">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4807" source="14899756" target="15276911">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4822" source="14899756" target="26506721">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4824" source="14899756" target="27023131">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4819" source="14899756" target="22406785">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4808" source="14899756" target="16170580">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E1237" source="14899756" target="4339911">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4828" source="14899756" target="56564230">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4826" source="14899756" target="33838201">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
      <edge id="E4815" source="14899756" target="19002481">
        <attvalues>
          <attvalue for="weight" value="1.0"/>
        </attvalues>
      </edge>
    </edges>
  </graph>
</gexf>
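
A file with this overall structure can be produced with the StAX writer that ships with the JDK. The sketch below is an assumption about how one might do it, not the code of TwitterGraph.java; it writes only node ids, labels and edges, and the edge in main is made up for the example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class GexfWriter {
    static final String NS = "http://www.gexf.net/1.1draft";

    /** nodes: {id, label} pairs; edges: {sourceId, targetId} pairs. */
    static String write(String[][] nodes, String[][] edges) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument();
            w.writeStartElement("gexf");
            w.writeDefaultNamespace(NS);
            w.writeAttribute("version", "1.1");
            w.writeStartElement("graph");
            w.writeAttribute("defaultedgetype", "directed");

            w.writeStartElement("nodes");
            for (String[] node : nodes) {
                w.writeEmptyElement("node");
                w.writeAttribute("id", node[0]);
                w.writeAttribute("label", node[1]);
            }
            w.writeEndElement(); // nodes

            w.writeStartElement("edges");
            int edgeId = 0;
            for (String[] edge : edges) {
                w.writeEmptyElement("edge");
                w.writeAttribute("id", "E" + (edgeId++));
                w.writeAttribute("source", edge[0]);
                w.writeAttribute("target", edge[1]);
            }
            w.writeEndElement(); // edges

            w.writeEndElement(); // graph
            w.writeEndElement(); // gexf
            w.writeEndDocument();
            w.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[][] nodes = {{"6612402", "sciencebase"}, {"19344270", "EMBOcomm"}};
        String[][] edges = {{"6612402", "19344270"}}; // illustrative edge, not taken from the real data
        System.out.println(write(nodes, edges));
    }
}
```

A streaming writer keeps memory use flat, which matters when the network is downloaded slowly over many hours and written as it arrives.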

The GEXF file was then opened with Gephi, processed with the ForceAtlas algorithm and exported as a PDF file.

The PDF file was uploaded on scribd: http://www.scribd.com/doc/48415306/My-Twitter-Network

My Twitter Network

I then downloaded the PDF from scribd.com, quickly copied the URL of the generated PDF and pasted it into http://zoom.it/.

Here is the result ! :-)

That's it !

Pierre

11 July 2010

PDFBox: insert/extract metadata into/from a PDF document

The Apache PDFBox project provides an API for handling PDF documents. In this post I'll show how I've used the PDFBox API to insert some XMP metadata into a PDF document and to extract it again.

Extracting metadata from a PDF document

Reading the metadata is as simple as:

InputStream in=new FileInputStream(pdfFile);
PDFParser parser=new PDFParser(in);
parser.parse();
PDMetadata metadata = parser.getPDDocument().getDocumentCatalog().getMetadata();
if(metadata!=null)
{
    System.out.println(metadata.getInputStreamAsString());
}

Inserting metadata into a PDF document

The metadata to be inserted is stored in an XML file.

<?xml version="1.0" encoding="UTF-8"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
        <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"  xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dc="http://purl.org/dc/elements/1.1/">
                <rdf:Description rdf:about="">
                        <dc:creator rdf:resource="mailto:plindenbaum@yahoo.fr"/>
                        <dc:title>Hello World</dc:title>
                        <dc:date>2010-07-11</dc:date>
                </rdf:Description>
                <foaf:Person rdf:about="mailto:plindenbaum@yahoo.fr">
                        <foaf:name>Pierre Lindenbaum</foaf:name>
                        <foaf:depiction rdf:resource="http://a3.twimg.com/profile_images/51679789/photoIG_bigger.jpg"/>
                </foaf:Person>
        </rdf:RDF>
</x:xmpmeta>

This XML file is loaded as a DOM object in memory:

DocumentBuilderFactory f= DocumentBuilderFactory.newInstance();
f.setExpandEntityReferences(true);
f.setIgnoringComments(true);
f.setIgnoringElementContentWhitespace(true);
f.setValidating(false);
f.setCoalescing(true);
f.setNamespaceAware(true);
DocumentBuilder builder=f.newDocumentBuilder();
xmpDoc= builder.parse(xmpIn);
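
Because the factory is namespace-aware, elements of the loaded document can be looked up by namespace URI rather than by prefix. The following JDK-only sketch illustrates this on a cut-down version of the XMP shown above; the class and method names are made up for the example and PDFBox is not involved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmpTitle {
    static final String DC = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of the first dc:title element, or null if there is none. */
    static String title(String xmp) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // required for the namespace-based lookup below
            Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xmp)));
            NodeList titles = doc.getElementsByTagNameNS(DC, "title");
            if (titles.getLength() == 0) {
                return null;
            }
            return titles.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xmp =
            "<x:xmpmeta xmlns:x='adobe:ns:meta/'>"
            + "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
            + " xmlns:dc='http://purl.org/dc/elements/1.1/'>"
            + "<rdf:Description rdf:about=''>"
            + "<dc:title>Hello World</dc:title>"
            + "</rdf:Description>"
            + "</rdf:RDF>"
            + "</x:xmpmeta>";
        System.out.println(title(xmp));
    }
}
```

The lookup keeps working if a document binds the Dublin Core namespace to a prefix other than dc.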

The pdf source is opened and the DOM document is inserted as a metadata. The pdf is then saved:

InputStream in=new FileInputStream(pdfIn);
PDFParser parser=new PDFParser(in);
parser.parse();
document= parser.getPDDocument();
PDDocumentCatalog cat = document.getDocumentCatalog();
PDMetadata metadata = new PDMetadata(document);
metadata.importXMPMetadata(new XMPMetadata(xmpDoc));
cat.setMetadata(metadata);
document.save(pdfOut);

Source code: ExtractXMP.java

import java.io.*;
import org.apache.pdfbox.pdfparser.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.common.*;
import org.apache.jempbox.xmp.XMPMetadata;

public class ExtractXMP
{
    /** Parses a PDF from the stream and prints its XMP metadata, if any, to stdout. */
    static private void extract(InputStream in)
        throws Exception
    {
        PDDocument document=null;
        try
        {
            PDFParser parser=new PDFParser(in);
            parser.parse();
            document= parser.getPDDocument();
            if(document.isEncrypted())
            {
                System.err.println("Document is Encrypted!");
            }
            PDDocumentCatalog cat = document.getDocumentCatalog();
            PDMetadata metadata = cat.getMetadata();
            if(metadata!=null)
            {
                System.out.println(metadata.getInputStreamAsString());
            }
        }
        catch(Exception err)
        {
            throw err;
        }
        finally
        {
            if(document!=null) try { document.close();} catch(Throwable err2) {}
        }
    }

    static public void main(String args[])
    {
        try
        {
            // parse the command-line options
            int optind=0;
            while(optind<args.length)
            {
                if(args[optind].equals("-h"))
                {
                    System.err.println("Pierre Lindenbaum PhD. 2010");
                    System.err.println("-h this screen");
                    System.err.println("pdf1 pdf2 pdf3 ....");
                    return;
                }
                else if (args[optind].equals("--"))
                {
                    ++optind;
                    break;
                }
                else if (args[optind].startsWith("-"))
                {
                    System.err.println("bad argument " + args[optind]);
                    System.exit(-1);
                }
                else
                {
                    break;
                }
                ++optind;
            }
            if(optind==args.length)
            {
                // no file given: read the PDF from stdin
                extract(System.in);
            }
            else
            {
                // process each PDF file given on the command line
                while(optind< args.length)
                {
                    String filename=args[optind++];
                    InputStream in=new FileInputStream(filename);
                    extract(in);
                    in.close();
                }
            }

        }
        catch(Throwable err)
        {
            err.printStackTrace();
        }
    }
}

Source code: InsertXMP.java

import java.io.*;
import org.apache.pdfbox.pdfparser.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.common.*;
import org.apache.jempbox.xmp.XMPMetadata;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class InsertXMP
{

    static public void main(String args[])
    {
        PDDocument document=null;
        InputStream in=null;
        try
        {
            String xmpIn=null;
            String pdfIn=null;
            String pdfOut=null;
            Document xmpDoc=null;
            // parse the command-line options
            int optind=0;
            while(optind<args.length)
            {
                if(args[optind].equals("-h"))
                {
                    System.err.println("Pierre Lindenbaum PhD. 2010");
                    System.err.println("-h this screen");
                    System.err.println("-pdfin|-i <pdf-in>");
                    System.err.println("-xmpin|-x <xmp-in>");
                    System.err.println("-pdfout|-o <pdf-out>");
                    return;
                }
                else if(args[optind].equals("-xmpin") || args[optind].equals("-x"))
                {
                    xmpIn= args[++optind];
                }
                else if(args[optind].equals("-pdfin") || args[optind].equals("-i"))
                {
                    pdfIn= args[++optind];
                }
                else if(args[optind].equals("-pdfout") || args[optind].equals("-o"))
                {
                    pdfOut= args[++optind];
                }
                else if (args[optind].equals("--"))
                {
                    ++optind;
                    break;
                }
                else if (args[optind].startsWith("-"))
                {
                    System.err.println("bad argument " + args[optind]);
                    System.exit(-1);
                }
                else
                {
                    break;
                }
                ++optind;
            }
            // check the arguments
            if(optind!=args.length)
            {
                System.err.println("Illegal number of arguments");
                return;
            }
            if(pdfIn==null)
            {
                System.err.println("pdf-in missing");
                return;
            }
            if(pdfOut==null)
            {
                System.err.println("pdf-out missing");
                return;
            }
            if(pdfIn.equals(pdfOut))
            {
                System.err.println("pdf-out is same as pdf-in");
                return;
            }
            if(xmpIn==null)
            {
                System.err.println("XMP missing");
                return;
            }
            else
            {
                // load the XMP file as a namespace-aware DOM document
                DocumentBuilderFactory f= DocumentBuilderFactory.newInstance();
                f.setExpandEntityReferences(true);
                f.setIgnoringComments(true);
                f.setIgnoringElementContentWhitespace(true);
                f.setValidating(false);
                f.setCoalescing(true);
                f.setNamespaceAware(true);
                DocumentBuilder builder=f.newDocumentBuilder();
                xmpDoc= builder.parse(xmpIn);
            }

            // open the source PDF, attach the metadata and save the result
            in=new FileInputStream(pdfIn);
            PDFParser parser=new PDFParser(in);
            parser.parse();
            document= parser.getPDDocument();
            if(document.isEncrypted())
            {
                System.err.println("Warning ! Document is Encrypted!");
            }
            PDDocumentCatalog cat = document.getDocumentCatalog();
            PDMetadata metadata = new PDMetadata(document);
            metadata.importXMPMetadata(new XMPMetadata(xmpDoc));
            cat.setMetadata(metadata);
            document.save(pdfOut);
        }
        catch(Throwable err)
        {
            err.printStackTrace();
        }
        finally
        {
            if(document!=null) try { document.close();} catch(Throwable err2) {}
            if(in!=null) try { in.close();} catch(Throwable err2) {}
        }
    }
}

Example

The following Makefile downloads a PDF file, compiles both programs, then extracts, inserts and re-extracts the metadata:

CLASSPATH=pdfbox-app-1.2.1.jar:pdfbox-app-1.2.1.jar:.
test: InsertXMP.class ExtractXMP.class article.pdf
	echo "Metadata in article"
	java -cp ${CLASSPATH} ExtractXMP article.pdf
	echo "Insert Metadata in article"
	java -cp ${CLASSPATH} InsertXMP -i article.pdf -o article_meta.pdf -x metadata.xmp
	echo "Metadata in new article"
	java -cp ${CLASSPATH} ExtractXMP article_meta.pdf
InsertXMP.class:InsertXMP.java
	javac -cp ${CLASSPATH} InsertXMP.java
ExtractXMP.class:ExtractXMP.java
	javac -cp ${CLASSPATH} ExtractXMP.java

article.pdf:
	wget -O $@ "http://www.biomedcentral.com/content/pdf/1471-2156-10-16.pdf"

Output

javac -cp pdfbox-app-1.2.1.jar:pdfbox-app-1.2.1.jar:. InsertXMP.java
javac -cp pdfbox-app-1.2.1.jar:pdfbox-app-1.2.1.jar:. ExtractXMP.java

wget -O article.pdf "http://www.biomedcentral.com/content/pdf/1471-2156-10-16.pdf"
--2010-07-11 13:15:10-- http://www.biomedcentral.com/content/pdf/1471-2156-10-16.pdf
Resolving www.biomedcentral.com... 213.219.33.18
Connecting to www.biomedcentral.com|213.219.33.18|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1295524 (1.2M) [application/pdf]
Saving to: `article.pdf'

100%[================================================================================>] 1,295,524 123K/s in 11s

2010-07-11 13:15:21 (113 KB/s) - `article.pdf' saved [1295524/1295524]

Metadata in article
java -cp pdfbox-app-1.2.1.jar:pdfbox-app-1.2.1.jar:. ExtractXMP article.pdf

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Acrobat Distiller 7.0 (Windows)</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xap="http://ns.adobe.com/xap/1.0/">
<xap:CreateDate>2009-05-05T19:28:38Z</xap:CreateDate>
<xap:CreatorTool>FrameMaker 7.1</xap:CreatorTool>
<xap:ModifyDate>2009-05-06T02:18:59+05:30</xap:ModifyDate>
<xap:MetadataDate>2009-05-06T02:18:59+05:30</xap:MetadataDate>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">1471-2156-10-16.fm</rdf:li>
</rdf:Alt>
</dc:title>
<dc:creator>
<rdf:Seq>
<rdf:li>Ezhilan</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
<xapMM:DocumentID>uuid:d1d0f8d9-8321-4e4b-828b-d31b75daba0f</xapMM:DocumentID>
<xapMM:InstanceID>uuid:39d7db98-d873-4b33-be85-87319547e81c</xapMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>

<?xpacket end="w"?>

Insert Metadata in article

java -cp pdfbox-app-1.2.1.jar:pdfbox-app-1.2.1.jar:. InsertXMP -i article.pdf -o article_meta.pdf -x meta.xmp

Metadata in new article

java -cp pdfbox-app-1.2.1.jar:pdfbox-app-1.2.1.jar:. ExtractXMP article_meta.pdf
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="">
<dc:creator rdf:resource="mailto:plindenbaum@yahoo.fr"/>
<dc:title>Hello World</dc:title>
<dc:date>2010-07-11</dc:date>
</rdf:Description>
<foaf:Person rdf:about="mailto:plindenbaum@yahoo.fr">
<foaf:name>Pierre Lindenbaum</foaf:name>
<foaf:depiction rdf:resource="http://profile.ak.facebook.com/profile5/1306/97/s501154465_2583.jpg"/>
</foaf:Person>
</rdf:RDF>
</x:xmpmeta>

That's it

Pierre

06 November 2009

My PDFs anywhere.

A short post: I was asked to write a web server that lets people access their PDFs when they are away from the laboratory. Users enter a DOI, a PMID or the URL of the article page, and the system tries to retrieve the PDF using a set of pre-defined patterns (e.g. the PDF of http://www.pnas.org/content/X/Y/Z is http://www.pnas.org/content/X/Y/Z.full.pdf ). This idea was suggested by Chris Miller on FriendFeed. I've also included a bookmarklet that identifies the current page or the selected text and invokes the web server. The server looks like this:

Password required

Examples

doi:10.1073/pnas.0904867106 (PNAS open access)
doi:10.1073/pnas.0904756106 (PNAS restricted access)
http://www.nature.com/nature/journal/v461/n7267/full/nature08515.html (Nature, restricted access)
http://bioinformatics.oxfordjournals.org/cgi/content/full/25/21/2735 (Bioinformatics , open access)
http://bioinformatics.oxfordjournals.org/cgi/content/full/25/21/2744 (Bioinformatics, restricted access)
19805134 (PMID)
http://www.ncbi.nlm.nih.gov/pubmed/19805134
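
The pattern-based lookup can be sketched in Java as follows. Only the PNAS rule comes from this post; the class name, the handling of suffixes such as .abstract, and the placeholder URL in main are assumptions, and this is not the code of the actual server.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfUrlGuesser {
    // Rule from the post: http://www.pnas.org/content/X/Y/Z -> http://www.pnas.org/content/X/Y/Z.full.pdf
    // Tolerating trailing suffixes such as ".abstract" is an assumption made for this sketch.
    private static final Pattern PNAS = Pattern.compile(
        "^(http://www\\.pnas\\.org/content/[^/]+/[^/]+/[^/.]+)(?:\\.\\w+)*$");

    /** Returns the guessed PDF URL, or an empty Optional if no known pattern matches. */
    static Optional<String> guess(String articleUrl) {
        Matcher m = PNAS.matcher(articleUrl.trim());
        if (m.matches()) {
            return Optional.of(m.group(1) + ".full.pdf");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder volume/issue/page values, not a real article.
        System.out.println(guess("http://www.pnas.org/content/1/2/3"));
        System.out.println(guess("http://example.org/some/page"));
    }
}
```

A fuller implementation would hold one such rule per publisher and would first resolve a DOI or a PMID to the article URL.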

This bookmarklet will bring up a new window containing the 'fetchpdf' form from any page on the Web: GetPDF

Of course, I cannot give the URL of my server, but the source code is available here:

http://code.google.com/p/cephb/source/browse/...

That's it
Pierre

08 October 2008

Building a presentation with inkscape + batik. My notebook.

OK, I hate PowerPoint ...

... and I hate OpenOffice/Impress

Next week, I'll present a talk about how to handle a bibliography with the tools available on the web (RSS, social bookmarking, zotero, etc...). Today I tried to build the slides using inkscape (the SVG editor) and apache batik (a Java-based toolkit for applications that want to use images in the SVG format).

Each slide was drawn using Inkscape. The background was designed using this hack. Each slide was then converted to PDF using the Batik rasterizer. (Problem: Inkscape already supports SVG 1.2 with new tags such as flowRoot, but Batik does not. This problem was solved by converting the texts to paths. Also, be careful before moving the files: the paths to the pictures are relative in Inkscape.)

java -jar ${BATIK_PATH}/batik-rasterizer.jar -bg 0.0.0.0 -m "application/pdf"  *.svg 

All the slides were then merged into a single file using ghostscript.

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=slides.pdf -dBATCH *.pdf

Here is the result (this is still a draft, sorry the references of the pictures are still missing):

(DRAFT) Bibliography 2.0

Pierre

22 May 2007

Is there any XMP in scientific pdf ? (No)

Roderic Page from iPhylo has introduced XMP in his blog. XMP is an Adobe format used to store metadata in files, such as PDFs. Adobe also provides an API to extract the XMP from the files.

I've downloaded the toolkit to see if any meta information could be extracted from scientific papers. The Adobe toolkit needs expat (an XML parser) to be installed, and it comes with a sample application, 'DumpScannedXMP', which finds all XMP packets in a file and prints their content.

I've tested this with some papers found on the net.
./target/i80386linux/debug/DumpScannedXMP 3851.pdf
RoXaN, a Novel Cellular Protein Containing TPR, LD, and Zinc Finger Motifs, Forms a Ternary Complex with Eukaryotic Initiation Factor 4G and Rotavirus NSP3: from Journal of Virology 2003

// ==============================================================

// Dumping raw input for "/home/pierre/3851.pdf" (879724..881254)

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xap="http://ns.adobe.com/xap/1.0/">
         <xap:CreateDate>2004-03-16T15:35:47Z</xap:CreateDate>
         <xap:CreatorTool>XPP</xap:CreatorTool>
         <xap:ModifyDate>2007-05-22T12:24:37Z</xap:ModifyDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
            </rdf:Alt>
         </dc:description>
         <dc:creator>
            <rdf:Seq>
               <rdf:li/>
            </rdf:Seq>
         </dc:creator>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
            </rdf:Alt>
         </dc:title>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Keywords/>
         <pdf:Producer/>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
         <xapMM:DocumentID>uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858</xapMM:DocumentID>
         <xapMM:InstanceID>uuid:e0500db1-1dd1-11b2-0a00-000000004869</xapMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>

Dumping XMPMeta object ""  (0x0)

   http://ns.adobe.com/xap/1.0/  xap:  (0x80000000 : schema)
      xap:CreateDate = "2004-03-16T15:35:47Z"
      xap:CreatorTool = "XPP"
      xap:ModifyDate = "2007-05-22T12:24:37Z"

   http://purl.org/dc/elements/1.1/  dc:  (0x80000000 : schema)
      dc:format = "application/pdf"
      dc:description  (0x1E00 : isLangAlt isAlt isOrdered isArray)
         [1] = ""  (0x50 : hasLang hasQual)
               ? xml:lang = "x-default"  (0x20 : isQual)
      dc:creator  (0x600 : isOrdered isArray)
         [1] = ""
      dc:title  (0x1E00 : isLangAlt isAlt isOrdered isArray)
         [1] = ""  (0x50 : hasLang hasQual)
               ? xml:lang = "x-default"  (0x20 : isQual)

   http://ns.adobe.com/pdf/1.3/  pdf:  (0x80000000 : schema)
      pdf:Keywords = ""
      pdf:Producer = ""

   http://ns.adobe.com/xap/1.0/mm/  xapMM:  (0x80000000 : schema)
      xapMM:DocumentID = "uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858"
      xapMM:InstanceID = "uuid:e0500db1-1dd1-11b2-0a00-000000004869"

Pretty serialization, 1478 bytes :

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Public XMP Toolkit Core 3.5">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xap="http://ns.adobe.com/xap/1.0/">
         <xap:CreateDate>2004-03-16T15:35:47Z</xap:CreateDate>
         <xap:CreatorTool>XPP</xap:CreatorTool>
         <xap:ModifyDate>2007-05-22T12:24:37Z</xap:ModifyDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
            </rdf:Alt>
         </dc:description>
         <dc:creator>
            <rdf:Seq>
               <rdf:li/>
            </rdf:Seq>
         </dc:creator>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
            </rdf:Alt>
         </dc:title>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Keywords/>
         <pdf:Producer/>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
         <xapMM:DocumentID>uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858</xapMM:DocumentID>
         <xapMM:InstanceID>uuid:e0500db1-1dd1-11b2-0a00-000000004869</xapMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

Compact serialization, 990 bytes :

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Public XMP Toolkit Core 3.5">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xap="http://ns.adobe.com/xap/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
    xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/"
   xap:CreateDate="2004-03-16T15:35:47Z"
   xap:CreatorTool="XPP"
   xap:ModifyDate="2007-05-22T12:24:37Z"
   dc:format="application/pdf"
   pdf:Keywords=""
   pdf:Producer=""
   xapMM:DocumentID="uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858"
   xapMM:InstanceID="uuid:e0500db1-1dd1-11b2-0a00-000000004869">
   <dc:description>
    <rdf:Alt>
     <rdf:li xml:lang="x-default"/>
    </rdf:Alt>
   </dc:description>
   <dc:creator>
    <rdf:Seq>
     <rdf:li/>
    </rdf:Seq>
   </dc:creator>
   <dc:title>
    <rdf:Alt>
     <rdf:li xml:lang="x-default"/>
    </rdf:Alt>
   </dc:title>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>

A test with a more recent paper (RNAmmer: consistent and rapid annotation of ribosomal RNA genes, NAR 2007) yields just as little information.

So, is there any interesting XMP in scientific PDFs? No.

Pierre
