4 years ago, I played with the adobe XMP library to extract the XMP metadata from a set of PDF files.
Today, I was suprised to simply display the XMP data contained in a PDF from Nature by using the following command line:
I've generalized this process by implementing a
GNU-flex scanner that prints the XML content between two
<xmp:xmpmeta/> tags. The source code is available on github at:
https://github.com/lindenb/ccsandbox/blob/master/src/xmpextractor.l.
Compilation
flex -f -B --read xmpextractor.l
gcc -o xmpextractor lex.yy.c
Testing with "PLOS"
Harper MA, Chen Z, Toy T, Machado IMP, Nelson SF, et al. (2011) Phenotype Sequencing: Identifying the Genes That Cause a Phenotype Directly from Pooled Sequencing of Independent Mutants. PLoS ONE 6(2): e16517. doi:10.1371/journal.pone.0016517:
Hum, nothing really interesting here.
Testing with "Nature Reviews Cardiology"
A test with
Percutaneous coronary intervention in the elderly. Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184curl -s "
http://www.nature.com/nrcardio/journal/v8/n2/pdf/nrcardio.2010.184.pdf" |\
./xmpextractor
<?xml version="1.0" encoding="UTF-8"?>
<XMP><x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00 ">
<rdf:RDF xmlns:rdf="
http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:dc="
http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:identifier>doi:10.1038/nrcardio.2010.184</dc:identifier>
<dc:creator>
<rdf:Seq>
<rdf:li>Tracy Y. Wang</rdf:li>
<rdf:li>Antonio Gutierrez</rdf:li>
<rdf:li>Eric D. Peterson</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Alt>
</dc:description>
<dc:publisher>
<rdf:Bag>
<rdf:li>Nature Publishing Group</rdf:li>
</rdf:Bag>
</dc:publisher>
<dc:rights>
<rdf:Alt>
<rdf:li xml:lang="x-default">
 © 2010 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.</rdf:li>
</rdf:Alt>
</dc:rights>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Percutaneous coronary intervention in the elderly</rdf:li>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:pdf="
http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>Adobe PDF Library 8.0</pdf:Producer>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:prism="
http://prismstandard.org/namespaces/basic/2.0/">
<prism:copyright>© 2010 Nature Publishing Group</prism:copyright>
<prism:doi>10.1038/nrcardio.2010.184</prism:doi>
<prism:eIssn>1759-5010</prism:eIssn>
<prism:endingPage>90</prism:endingPage>
<prism:issn>1759-5002</prism:issn>
<prism:number>2</prism:number>
<prism:publicationName>Nature Publishing Group</prism:publicationName>
<prism:rightsAgent>permissions@nature.com</prism:rightsAgent>
<prism:startingPage>79</prism:startingPage>
<prism:volume>8</prism:volume>
<prism:publicationDate>
<rdf:Bag>
<rdf:li>2010-12-07</rdf:li>
</rdf:Bag>
</prism:publicationDate>
<prism:url>
<rdf:Bag>
<rdf:li>
http://dx.doi.org/10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</prism:url>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmp="
http://ns.adobe.com/xap/1.0/">
<xmp:CreateDate>2011-01-10T10:09:23+05:30</xmp:CreateDate>
<xmp:CreatorTool/>
<xmp:Label>Nature Reviews Cardiology 8, 79 (2010). doi:10.1038/nrcardio.2010.184</xmp:Label>
<xmp:MetadataDate>2011-01-14T18:25:45+05:30</xmp:MetadataDate>
<xmp:ModifyDate>2011-01-14T18:25:45+05:30</xmp:ModifyDate>
<xmp:Identifier>
<rdf:Bag>
<rdf:li>doi:10.1038/nrcardio.2010.184</rdf:li>
</rdf:Bag>
</xmp:Identifier>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpRights="
http://ns.adobe.com/xap/1.0/rights/">
<xmpRights:Marked>True</xmpRights:Marked>
</rdf:Description>
<rdf:Description rdf:about="doi:10.1038/nrcardio.2010.184"
xmlns:xmpMM="
http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:51421664-c9c6-4657-9fbe-318ac969ca26</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:c65fa1fb-b6f3-4a61-8615-07b04871749e</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
</XMP>
That's more interesting isn't it ?
That's it,
Pierre