22 May 2007

Is there any XMP in scientific PDFs? (No)

Roderic Page of iPhylo introduced XMP on his blog. XMP is an Adobe format used to store metadata inside files such as PDFs. Adobe also provides an API to extract the XMP from those files.

I've downloaded the toolkit to see whether any metadata could be extracted from scientific papers. The Adobe toolkit needs expat (an XML parser) to be installed, and it comes with a sample application, 'DumpScannedXMP', which finds all the XMP packets in a file and prints their content.
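As a rough illustration of what scanning a file for XMP packets involves, here is a minimal Python sketch. It is not the toolkit's code: the names find_xmp_packets and scan_file are made up for this example, and it assumes the packet is stored uncompressed in a single-byte-compatible encoding such as UTF-8, so that the xpacket markers can be found with a plain byte search.

```python
import re
from typing import List

# An XMP packet is wrapped between two processing instructions:
#   <?xpacket begin="..." id="..."?>  ...  <?xpacket end="r"?>
# so it can be located with a byte-level scan, without parsing the PDF.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packets(data: bytes) -> List[bytes]:
    """Return the body of every XMP packet found in the raw bytes."""
    return [match.group(1).strip() for match in PACKET_RE.finditer(data)]


def scan_file(path: str) -> List[bytes]:
    """Read a file (hypothetical helper) and return its XMP packets."""
    with open(path, "rb") as handle:
        return find_xmp_packets(handle.read())
```

Calling scan_file("3851.pdf") would return a list of byte strings, one per packet, or an empty list if the file carries no uncompressed XMP.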

I've tested this with some papers found on the net. Here is the output for "RoXaN, a Novel Cellular Protein Containing TPR, LD, and Zinc Finger Motifs, Forms a Ternary Complex with Eukaryotic Initiation Factor 4G and Rotavirus NSP3" (Journal of Virology, 2003):
./target/i80386linux/debug/DumpScannedXMP 3851.pdf

// ==============================================================

// Dumping raw input for "/home/pierre/3851.pdf" (879724..881254)

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-701">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xap="http://ns.adobe.com/xap/1.0/">
<xap:CreateDate>2004-03-16T15:35:47Z</xap:CreateDate>
<xap:CreatorTool>XPP</xap:CreatorTool>
<xap:ModifyDate>2007-05-22T12:24:37Z</xap:ModifyDate>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default"/>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li/>
</rdf:Seq>
</dc:creator>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default"/>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Keywords/>
<pdf:Producer/>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
<xapMM:DocumentID>uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858</xapMM:DocumentID>
<xapMM:InstanceID>uuid:e0500db1-1dd1-11b2-0a00-000000004869</xapMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="r"?>

Dumping XMPMeta object "" (0x0)

http://ns.adobe.com/xap/1.0/ xap: (0x80000000 : schema)
xap:CreateDate = "2004-03-16T15:35:47Z"
xap:CreatorTool = "XPP"
xap:ModifyDate = "2007-05-22T12:24:37Z"

http://purl.org/dc/elements/1.1/ dc: (0x80000000 : schema)
dc:format = "application/pdf"
dc:description (0x1E00 : isLangAlt isAlt isOrdered isArray)
[1] = "" (0x50 : hasLang hasQual)
? xml:lang = "x-default" (0x20 : isQual)
dc:creator (0x600 : isOrdered isArray)
[1] = ""
dc:title (0x1E00 : isLangAlt isAlt isOrdered isArray)
[1] = "" (0x50 : hasLang hasQual)
? xml:lang = "x-default" (0x20 : isQual)

http://ns.adobe.com/pdf/1.3/ pdf: (0x80000000 : schema)
pdf:Keywords = ""
pdf:Producer = ""

http://ns.adobe.com/xap/1.0/mm/ xapMM: (0x80000000 : schema)
xapMM:DocumentID = "uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858"
xapMM:InstanceID = "uuid:e0500db1-1dd1-11b2-0a00-000000004869"

Pretty serialization, 1478 bytes :

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Public XMP Toolkit Core 3.5">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xap="http://ns.adobe.com/xap/1.0/">
<xap:CreateDate>2004-03-16T15:35:47Z</xap:CreateDate>
<xap:CreatorTool>XPP</xap:CreatorTool>
<xap:ModifyDate>2007-05-22T12:24:37Z</xap:ModifyDate>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default"/>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li/>
</rdf:Seq>
</dc:creator>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default"/>
</rdf:Alt>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Keywords/>
<pdf:Producer/>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/">
<xapMM:DocumentID>uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858</xapMM:DocumentID>
<xapMM:InstanceID>uuid:e0500db1-1dd1-11b2-0a00-000000004869</xapMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>

Compact serialization, 990 bytes :

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Public XMP Toolkit Core 3.5">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:xap="http://ns.adobe.com/xap/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xapMM="http://ns.adobe.com/xap/1.0/mm/"
xap:CreateDate="2004-03-16T15:35:47Z"
xap:CreatorTool="XPP"
xap:ModifyDate="2007-05-22T12:24:37Z"
dc:format="application/pdf"
pdf:Keywords=""
pdf:Producer=""
xapMM:DocumentID="uuid:e0500da6-1dd1-11b2-0a00-ecd00f090858"
xapMM:InstanceID="uuid:e0500db1-1dd1-11b2-0a00-000000004869">
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default"/>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li/>
</rdf:Seq>
</dc:creator>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default"/>
</rdf:Alt>
</dc:title>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>


A test with a more recent paper, "RNAmmer: consistent and rapid annotation of ribosomal RNA genes" (NAR 2007), gives just as little information.
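The check done by eye on the dumps (which properties actually carry a value?) could be automated. The following is only a sketch under a few assumptions: the packet body is already available as a string, the helper name non_empty_properties is invented for this example, and only the two serializations shown in the dumps (properties as child elements, or as attributes of rdf:Description) are handled.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def non_empty_properties(xmp: str) -> dict:
    """Map each property, keyed as '{namespace}name', to its non-empty values."""
    root = ET.fromstring(xmp)
    found = {}
    for desc in root.iter("{%s}Description" % RDF_NS):
        # Compact serialization: simple properties are attributes.
        for name, value in desc.attrib.items():
            # Skip RDF bookkeeping attributes such as rdf:about.
            if not name.startswith("{%s}" % RDF_NS) and value.strip():
                found[name] = [value.strip()]
        # Pretty serialization: properties are child elements,
        # possibly wrapping rdf:Alt / rdf:Seq lists of rdf:li items.
        for prop in desc:
            values = [t.strip() for t in prop.itertext() if t.strip()]
            if values:
                found[prop.tag] = values
    return found
```

Applied to the packet dumped above, such a filter would keep the dates, the creator tool, the format and the two IDs, and drop dc:title, dc:creator, dc:description, pdf:Keywords and pdf:Producer, which are all empty.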

So, is there any interesting XMP in scientific PDFs? No.


Pierre

4 comments:

Stew said...

Unfortunately the tools to write XMP into PDFs haven't been particularly easy to incorporate into publishing pipelines until fairly recently (reading XMP is much easier).

We've been taking another look at it at Nature recently, though. As Rod pointed out it makes sense to keep metadata with the object it's all about.

Anonymous said...

If the PDF file already contains XMP (most do if created by Adobe tools) then the libraries can update the XMP with new metadata. Currently there are no free libraries to inject XMP into a PDF that doesn't have XMP already - you would need to license the PDF library.

Anonymous said...

Hi Pierre:

You are right, but we are at least beginning to look into this area of adding XMP into media files. See this post on CrossTech.

Do feel free to comment back there.

Cheers,

Tony

Unknown said...

Hello:
The problem is laziness. We used to download our PDFs and store them in a folder without bothering with bibliographic software. Now I have tons of PDFs with no reference data, i.e. no XMP. It would be great to drag and drop a PDF file with XMP and, voilà, have every single field of a reference manager filled in.
Daniele Pontillo, MD