Inspiring discovery through free access to biodiversity knowledge.

The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community.
BHL also serves as the literature component of the Encyclopedia of Life.


Prior Work


From: Mike Lichtenberg
Sent: Monday, June 16, 2008 3:45 PM
To: Chris Freeland

Subject: Identifying illustrations
I may have identified a way to determine if a page contains an illustration.

The _abbyy.xml file (contained in the _abbyy.gz file, which we have NOT been downloading) does identify what it sees as illustrations on a page. The format of this file is something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<document … >
<page … >
<block blockType="Text" … >

<page … >
<block blockType="Picture" … >


I spot-checked a couple of items, and it does seem to do a good job of identifying pages with illustrations, although I noticed a few false positives. For example, many books have a label attached to one of the first few pages that identifies the book as “Property of…”. The _abbyy.xml file has those labels tagged as “Picture”. But “real” illustrations do seem to be found correctly, so it’s much better than any alternative that I’ve seen so far.
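As a rough illustration of the idea above, here is a minimal Python sketch that lists which pages contain at least one block with blockType="Picture". It is not part of the original email. The function names and the sample document are made up for illustration, and it assumes the element and attribute names are exactly the ones shown above (page, block, blockType). It strips any XML namespace from tag names in case the real files declare one.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace such as '{uri}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def pages_with_pictures(xml_bytes):
    """Return the 1-based positions of pages that hold a Picture block."""
    root = ET.fromstring(xml_bytes)
    pages = [el for el in root.iter() if local_name(el.tag) == "page"]
    hits = []
    for position, page in enumerate(pages, start=1):
        for el in page.iter():
            if local_name(el.tag) == "block" and el.get("blockType") == "Picture":
                hits.append(position)
                break  # one Picture block is enough to flag the page
    return hits


# Hypothetical, heavily simplified stand-in for a real _abbyy.xml file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
  <page><block blockType="Text"/></page>
  <page><block blockType="Picture"/><block blockType="Text"/></page>
  <page><block blockType="Text"/></page>
</document>"""

print(pages_with_pictures(SAMPLE))
```

Note that this flags every "Picture" block, so the "Property of…" labels mentioned above would still show up as false positives.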

There are a couple caveats before we incorporate this into our harvesting process. These don’t appear insurmountable, but will need to be investigated.

First, the file is compressed with GZip, so we’d need a way to automatically decompress it.
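For what it's worth, automatic decompression does not need much code. The sketch below uses Python's standard gzip module. The function names are hypothetical, and it assumes the _abbyy.gz file holds a single gzip-compressed XML document, as described above.

```python
import gzip


def decompress_abbyy_bytes(compressed):
    """Decompress gzip data already held in memory."""
    return gzip.decompress(compressed)


def read_abbyy_file(gz_path):
    """Read a gzip-compressed file from disk and return its raw bytes."""
    with gzip.open(gz_path, "rb") as handle:
        return handle.read()
```

Both functions return bytes, which can be handed directly to an XML parser. For very large files it may be preferable to parse the stream returned by gzip.open incrementally rather than reading it all at once.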

Second, there is nothing in the _abbyy.xml that identifies a particular page, so we’d be relying on matching the pages to the DJVU and SCANDATA information simply by page position in the file (3rd page in the _abbyy.xml file is the same as the 3rd page in the DJVU file). This is essentially what we’ve done to match the DJVU and SCANDATA information, and that has worked pretty well. I’d assume it would hold true for one more file type, but again, I don’t know that for a fact.
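The positional matching described above could look something like the following Python sketch. The function name and both inputs are hypothetical: it assumes the page identifiers have already been extracted in file order from the DJVU or SCANDATA information, and that there is one true/false flag per page of the _abbyy.xml file, also in file order. Because nothing but position ties the two together, the sketch refuses to match when the page counts differ.

```python
def match_by_position(page_ids, has_picture_flags):
    """Pair each known page id with the flag at the same position.

    page_ids: identifiers in file order (e.g. from SCANDATA).
    has_picture_flags: one bool per page of the _abbyy.xml, in file order.
    """
    if len(page_ids) != len(has_picture_flags):
        # Position is the only link between the files, so a count
        # mismatch means the pairing cannot be trusted.
        raise ValueError(
            "page count mismatch: %d ids vs %d flags"
            % (len(page_ids), len(has_picture_flags))
        )
    return list(zip(page_ids, has_picture_flags))


# Hypothetical example: the 2nd page is the only one with an illustration.
print(match_by_position(["p0001", "p0002", "p0003"], [False, True, False]))
```

A page count check like this is only a sanity test. It would not catch a case where both files have the same number of pages but in a different order.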




Code4Lib "lightning talk" presentation about finding pages with images. Here are the slides:

Here is a direct link to a PDF of the same:

This is a particularly interesting approach in that it doesn't rely on the ABBYY files. And here's the code (it's written in Ruby):

Revised: chrisfreeland Jun 7, 2012 7:51 am (3 revisions)
Contributions are licensed under a Creative Commons Attribution Share-Alike 3.0 License.