Art of Life - Prior Work

=Prior Work=

2008
**From:** Mike Lichtenberg
**Sent:** Monday, June 16, 2008 3:45 PM
**To:** Chris Freeland
**Subject:** Identifying illustrations

I may have identified a way to determine if a page contains an illustration.

The _abbyy.xml file (contained in the _abbyy.gz file, which we have NOT been downloading) does identify what it sees as illustrations on a page. The format of this file is something like this:

    …   …

I spot-checked a couple of items, and it does seem to do a good job of identifying pages with illustrations, although I noticed a few false positives. For example, many books have a label attached to one of the first few pages that identifies the book as “Property of…”. The _abbyy.xml file has those labels tagged as “Picture”. But “real” illustrations do seem to be found correctly, so it’s much better than any alternative that I’ve seen so far.

There are a couple caveats before we incorporate this into our harvesting process. These don’t appear insurmountable, but will need to be investigated.

First, the file is compressed with GZip, so we’d need a way to automatically decompress it.

Second, there is nothing in the _abbyy.xml that identifies a particular page, so we’d be relying on matching the pages to the DJVU and SCANDATA information simply by page position in the file (3rd page in the _abbyy.xml file is the same as the 3rd page in the DJVU file). This is essentially what we’ve done to match the DJVU and SCANDATA information, and that has worked pretty well. I’d assume it would hold true for one more file type, but again, I don’t know that for a fact.

MIKE
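The two caveats in the email above (GZip decompression and matching pages purely by position) can be sketched roughly as follows. This is only an illustration, not the harvesting code that was actually used. The file format in the email is elided, so the element and attribute names here (`page`, `block`, `blockType="Picture"`) are assumptions based on the ABBYY FineReader XML layout, and the function names are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def pages_with_pictures(abbyy_stream):
    """Return 1-based positions of pages holding at least one Picture block.

    Assumes (not confirmed by the email) that each page is a <page>
    element and each region is a <block> with a blockType attribute.
    """
    hits = []
    position = 0
    # iterparse streams the file, so large books need not fit in memory.
    for _event, elem in ET.iterparse(abbyy_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        position += 1  # pages carry no identifier, so count them in order
        has_picture = any(
            local_name(node.tag) == "block"
            and node.get("blockType") == "Picture"
            for node in elem.iter()
        )
        if has_picture:
            hits.append(position)
        elem.clear()  # release the finished page's children
    return hits


def pages_with_pictures_in_gz(path):
    """Decompress an _abbyy.gz file on the fly and scan it."""
    with gzip.open(path, "rb") as stream:
        return pages_with_pictures(stream)
```

The returned positions would then be matched to the DJVU and SCANDATA pages by the same ordinal position, as the email proposes. False positives such as the “Property of…” labels would still be reported, since they are tagged the same way as real illustrations.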

2010

 * Image extraction from Internet Archive (prototype): http://ia600408.us.archive.org/~rkumar/extractimgs.php?id=kewgardens00monciala&dir=/6/items/kewgardens00monciala

2012
Code4Lib "lightning talk" presentation about finding pages with images. Here are the slides: []

Here is a direct link to a PDF of the same: []

This is a particularly interesting approach in that it doesn't rely on the ABBYY files. And here's the code (it's written in Ruby): []
