YOKOFAKUN: adobe

22 February 2011

A flex scanner extracting the metadata from a PDF file.

Four years ago, I played with the Adobe XMP library to extract the XMP metadata from a set of PDF files.

Today, I was surprised to find that I could display the XMP data contained in a PDF from Nature simply by using the following command line:

curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
strings |\
grep -A 100 "<x:xmpmeta"

I've generalized this process by implementing a GNU flex scanner that prints the XML content between the opening <x:xmpmeta> and closing </x:xmpmeta> tags. The source code is available on GitHub at: https://github.com/lindenb/ccsandbox/blob/master/src/xmpextractor.l.
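
For readers who do not want to read the flex source, here is a minimal Python sketch of the same idea. It is not a translation of xmpextractor.l: the function names extract_xmpmeta and wrap_blocks are hypothetical, and the sketch assumes the XMP block is stored uncompressed in the PDF, as it is in the examples of this post. It would find nothing in a file whose metadata stream is compressed.

```python
import re

# Matches from an opening <x:xmpmeta ...> tag to the first closing </x:xmpmeta>.
# re.DOTALL lets '.' cross newlines; '.*?' is non-greedy so blocks are not merged.
XMPMETA = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def extract_xmpmeta(data: bytes) -> list:
    """Return every <x:xmpmeta>...</x:xmpmeta> block found in raw bytes."""
    return XMPMETA.findall(data)


def wrap_blocks(blocks: list) -> bytes:
    """Wrap the blocks in a single <XMP> root, like the output shown in this post."""
    header = b'<?xml version="1.0" encoding="UTF-8"?>\n<XMP>\n'
    return header + b"\n".join(blocks) + b"\n</XMP>\n"


# Tiny demonstration on fake "PDF" bytes (an assumption, not a real file).
sample = b'%PDF-1.4\x00junk<x:xmpmeta xmlns:x="adobe:ns:meta/"><rdf:RDF/></x:xmpmeta>\x00more'
print(wrap_blocks(extract_xmpmeta(sample)).decode("utf-8"))
```

To use it on a real file, read the PDF as bytes (for example with open(path, "rb").read()) and pass the result to extract_xmpmeta.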

Compilation

flex -f -B --read xmpextractor.l
gcc -o xmpextractor lex.yy.c

Testing with "PLOS"

Harper MA, Chen Z, Toy T, Machado IMP, Nelson SF, et al. (2011) Phenotype Sequencing: Identifying the Genes That Cause a Phenotype Directly from Pooled Sequencing of Independent Mutants. PLoS ONE 6(2): e16517. doi:10.1371/journal.pone.0016517:

curl -s "http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0016517&representation=PDF" |\
./xmpextractor

<?xml version="1.0" encoding="UTF-8"?>
<XMP>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP toolkit 2.9.1-13, framework 1.6">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:iX="http://ns.adobe.com/iX/1.0/">
<rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" pdf:Producer="Acrobat Distiller 7.0 (Windows)"/>
<rdf:Description xmlns:xap="http://ns.adobe.com/xap/1.0/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xap:CreateDate="2011-02-14T08:21:58+08:00" xap:CreatorTool="3B2 Total Publishing System 7.51n/W" xap:ModifyDate="2011-02-17T14:32:08+08:00" xap:MetadataDate="2011-02-17T14:32:08+08:00"/>
<rdf:Description xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" xapMM:DocumentID="uuid:f98295b5-0980-42f2-884e-6cecd2d75c90" xapMM:InstanceID="uuid:d226393a-bfaf-4a2b-8c28-08215008694e"/>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="uuid:a0b7f786-d005-411e-b6e8-61fac5a69e23" dc:format="application/pdf">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">pone.0016517 1..16</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

Hmm, nothing really interesting here.

Testing with "Nature Reviews Cardiology"

A test with "Percutaneous coronary intervention in the elderly", Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184:

curl -s "http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
./xmpextractor

<?xml version="1.0" encoding="UTF-8"?>
<XMP><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:identifier>doi:10.1038/nrcardio.2010.184</dc:identifier>
<dc:creator>
<rdf:Seq>
<rdf:li>Tracy Y. Wang</rdf:li>
<rdf:li>Antonio Gutierrez</rdf:li>
<rdf:li>Eric D. Peterson</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Alt>
</dc:description>
<dc:publisher>
<rdf:Bag>
<rdf:li>Nature Publishing Group</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">
 © 2010 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.</rdf:li>
</rdf:Alt>
</dc:rights>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Percutaneous coronary intervention in the elderly</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Adobe PDF Library 8.0</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
<prism:copyright>© 2010 Nature Publishing Group</prism:copyright>
<prism:doi>10.1038/nrcardio.2010.184</prism:doi>
<prism:eIssn>1759-5010</prism:eIssn>
<prism:endingPage>90</prism:endingPage>
<prism:issn>1759-5002</prism:issn>
<prism:number>2</prism:number>
<prism:publicationName>Nature Publishing Group</prism:publicationName>
<prism:rightsAgent>permissions@nature.com</prism:rightsAgent>
<prism:startingPage>79</prism:startingPage>
<prism:volume>8</prism:volume>
<prism:publicationDate>
<rdf:Bag>
<rdf:li>2010-12-07</rdf:li>
</rdf:Bag>
</prism:publicationDate>
<prism:url>
<rdf:Bag>
<rdf:li>http://dx.doi.org/10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</prism:url>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:CreateDate>2011-01-10T10:09:23+05:30</xmp:CreateDate>
<xmp:CreatorTool/>
<xmp:Label>Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</xmp:Label>
<xmp:MetadataDate>2011-01-14T18:25:45+05:30</xmp:MetadataDate>
<xmp:ModifyDate>2011-01-14T18:25:45+05:30</xmp:ModifyDate>
<xmp:Identifier>
<rdf:Bag>
<rdf:li>doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</xmp:Identifier>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/">
<xmpRights:Marked>True</xmpRights:Marked>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:51421664-c9c6-4657-9fbe-318ac969ca26</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:c65fa1fb-b6f3-4a61-8615-07b04871749e</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>

That's more interesting, isn't it?
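
Once the XMP is available as XML, the bibliographic fields can be read with any XML parser. The following is a small sketch using Python's standard xml.etree.ElementTree. The function name summarize and the SAMPLE string (a trimmed copy of the output above) are only there for illustration, and the sketch assumes the title, creators and DOI are stored as elements, as Nature does here.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the XMP output above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
}

# Trimmed copy of the Nature Reviews Cardiology metadata shown above.
SAMPLE = """<XMP><x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
 xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:creator><rdf:Seq>
<rdf:li>Tracy Y. Wang</rdf:li>
<rdf:li>Antonio Gutierrez</rdf:li>
<rdf:li>Eric D. Peterson</rdf:li>
</rdf:Seq></dc:creator>
<dc:title><rdf:Alt>
<rdf:li xml:lang="x-default">Percutaneous coronary intervention in the elderly</rdf:li>
</rdf:Alt></dc:title>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
 xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
<prism:doi>10.1038/nrcardio.2010.184</prism:doi>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta></XMP>"""


def summarize(xmp_xml: str) -> dict:
    """Pull the title, the author list and the DOI out of an XMP document."""
    root = ET.fromstring(xmp_xml)
    title = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    authors = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]
    doi = root.find(".//prism:doi", NS)
    return {
        "title": title.text if title is not None else None,
        "authors": authors,
        "doi": doi.text if doi is not None else None,
    }


print(summarize(SAMPLE))
```

Missing fields come back as None (or an empty list for the authors), which is what one would get for the PLoS file above, since it carries no dc:creator and no prism:doi.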

That's it,

Pierre

22 May 2007

Is there any XMP in scientific PDFs? (No)

Roderic Page of iPhylo has introduced XMP on his blog. XMP is an Adobe format used to store metadata in files such as PDFs. Adobe also provides an API to extract the XMP from files.

I've downloaded the toolkit to see if any meta information could be extracted from scientific papers. The Adobe toolkit requires expat (an XML parser) to be installed, and it comes with a sample application, 'DumpScannedXMP', that finds all XMP packets in a file and prints their content.
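
To give an idea of what "finding all XMP packets" means, here is a rough Python sketch. It is not how DumpScannedXMP is implemented (I have not reproduced its algorithm). It only relies on the fact that a packet is delimited by an <?xpacket begin=...?> and an <?xpacket end=...?> processing instruction, and the name find_packets is hypothetical.

```python
import re

# A packet runs from <?xpacket begin=...?> to the next <?xpacket end=...?>.
# re.DOTALL is needed because the packet spans many lines.
PACKET = re.compile(rb"<\?xpacket begin=.*?\?>.*?<\?xpacket end=.*?\?>", re.DOTALL)


def find_packets(data: bytes):
    """Yield (start_offset, end_offset, packet_bytes) for each packet in data."""
    for match in PACKET.finditer(data):
        yield match.start(), match.end(), match.group(0)


# Fake file content, for illustration only.
data = b'AAAA<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n<x:xmpmeta/>\n<?xpacket end="r"?>BBBB'
for start, end, packet in find_packets(data):
    print(f"packet at bytes {start}..{end} ({len(packet)} bytes)")
```

The byte offsets are reported because a file may contain several packets. As with any raw scan, this only sees packets that are stored uncompressed.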

I've tested this with some papers found on the net. The first one is "RoXaN, a Novel Cellular Protein Containing TPR, LD, and Zinc Finger Motifs, Forms a Ternary Complex with Eukaryotic Initiation Factor 4G and Rotavirus NSP3", from Journal of Virology, 2003:
./target/i80386linux/debug/DumpScannedXMP 3851.pdf

// ==============================================================

// Dumping raw input for "/home/pierre/3851.pdf" (879724..881254)

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xap="http://ns.adobe.com/xap/1.0/">
         <xap:CreateDate>2004-03-16T15:35:47Z</xap:CreateDate>
         <xap:CreatorTool>XPP</xap:CreatorTool>
         <xap:ModifyDate>2007-05-22T12:24:37Z</xap:ModifyDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
            </rdf:Alt>
         </dc:description>
         <dc:creator>
            <rdf:Seq>
               <rdf:li/>
            </rdf:Seq>
         </dc:creator>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
            </rdf:Alt>
         </dc:title>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Keywords/>
         <pdf:Producer/>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
         <xapMM:DocumentID>uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858</xapMM:DocumentID>
         <xapMM:InstanceID>uuid:e0500db1-1dd1-11b2-0a00-000000004869</xapMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>

Dumping XMPMeta object ""  (0x0)

   http://ns.adobe.com/xap/1.0/  xap:  (0x80000000 : schema)
      xap:CreateDate = "2004-03-16T15:35:47Z"
      xap:CreatorTool = "XPP"
      xap:ModifyDate = "2007-05-22T12:24:37Z"

   http://purl.org/dc/elements/1.1/  dc:  (0x80000000 : schema)
      dc:format = "application/pdf"
      dc:description  (0x1E00 : isLangAlt isAlt isOrdered isArray)
         [1] = ""  (0x50 : hasLang hasQual)
               ? xml:lang = "x-default"  (0x20 : isQual)
      dc:creator  (0x600 : isOrdered isArray)
         [1] = ""
      dc:title  (0x1E00 : isLangAlt isAlt isOrdered isArray)
         [1] = ""  (0x50 : hasLang hasQual)
               ? xml:lang = "x-default"  (0x20 : isQual)

   http://ns.adobe.com/pdf/1.3/  pdf:  (0x80000000 : schema)
      pdf:Keywords = ""
      pdf:Producer = ""

   http://ns.adobe.com/xap/1.0/mm/  xapMM:  (0x80000000 : schema)
      xapMM:DocumentID = "uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858"
      xapMM:InstanceID = "uuid:e0500db1-1dd1-11b2-0a00-000000004869"

Pretty serialization, 1478 bytes :

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Public XMP Toolkit Core 3.5">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xap="http://ns.adobe.com/xap/1.0/">
         <xap:CreateDate>2004-03-16T15:35:47Z</xap:CreateDate>
         <xap:CreatorTool>XPP</xap:CreatorTool>
         <xap:ModifyDate>2007-05-22T12:24:37Z</xap:ModifyDate>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
            </rdf:Alt>
         </dc:description>
         <dc:creator>
            <rdf:Seq>
               <rdf:li/>
            </rdf:Seq>
         </dc:creator>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
            </rdf:Alt>
         </dc:title>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Keywords/>
         <pdf:Producer/>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
         <xapMM:DocumentID>uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858</xapMM:DocumentID>
         <xapMM:InstanceID>uuid:e0500db1-1dd1-11b2-0a00-000000004869</xapMM:InstanceID>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>

Compact serialization, 990 bytes :

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Public XMP Toolkit Core 3.5">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xap="http://ns.adobe.com/xap/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
    xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/"
   xap:CreateDate="2004-03-16T15:35:47Z"
   xap:CreatorTool="XPP"
   xap:ModifyDate="2007-05-22T12:24:37Z"
   dc:format="application/pdf"
   pdf:Keywords=""
   pdf:Producer=""
   xapMM:DocumentID="uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858"
   xapMM:InstanceID="uuid:e0500db1-1dd1-11b2-0a00-000000004869">
   <dc:description>
    <rdf:Alt>
     <rdf:li xml:lang="x-default"/>
    </rdf:Alt>
   </dc:description>
   <dc:creator>
    <rdf:Seq>
     <rdf:li/>
    </rdf:Seq>
   </dc:creator>
   <dc:title>
    <rdf:Alt>
     <rdf:li xml:lang="x-default"/>
    </rdf:Alt>
   </dc:title>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
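
The pretty and compact serializations above carry the same data: in the compact form, simple properties become attributes of rdf:Description instead of child elements. A reader of XMP therefore has to look in both places. Here is a minimal Python sketch of that, where the function name simple_properties and the two trimmed sample strings are only illustrative; structured values such as rdf:Alt or rdf:Seq are deliberately skipped.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed versions of the two serializations shown above.
PRETTY = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:xap="http://ns.adobe.com/xap/1.0/">
   <xap:CreatorTool>XPP</xap:CreatorTool>
  </rdf:Description>
  <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <pdf:Producer/>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

COMPACT = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:xap="http://ns.adobe.com/xap/1.0/"
    xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
   xap:CreatorTool="XPP"
   pdf:Producer=""/>
 </rdf:RDF>
</x:xmpmeta>"""


def simple_properties(xmp_xml: str) -> dict:
    """Map '{namespace}name' to its text for simple properties, in either form."""
    props = {}
    root = ET.fromstring(xmp_xml)
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Compact form: properties are attributes (rdf:about itself is skipped).
        for name, value in desc.attrib.items():
            if not name.startswith(f"{{{RDF}}}"):
                props[name] = value
        # Pretty form: properties are child elements that have no children.
        for child in desc:
            if len(child) == 0:
                props[child.tag] = child.text or ""
    return props


for name, value in simple_properties(COMPACT).items():
    print(name, "=", repr(value))
```

Both samples give the same dictionary, and the empty string for pdf:Producer makes the problem of this post visible: the field is present but nobody filled it in.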

A test with a more recent paper, "RNAmmer: consistent and rapid annotation of ribosomal RNA genes" (NAR, 2007), yields no more information than this.

So, is there any interesting XMP in scientific PDFs? No.

Pierre
