Data Exports


//For information about harvesting our data via OAI, please see our Developer Tools and API page.//


A series of files is now available for download that will enable libraries and other data providers to identify digitized titles available within BHL. These files also include metadata about each volume scanned, as well as information about the millions of scientific names that have been identified throughout the BHL corpus and the pages on which those names occur.
**Data Exports**

Download documentation: []

//**To download these files, Right-Click and choose "Save link as..." or "Save target as..."**//
**Data exports are updated at the beginning of every month in the following formats:**

MODS

 * [|Download BHL Titles in MODS XML] (11MB+)
 * [|Download BHL Items/Volumes in MODS XML] (25MB+)
 * [|Download BHL Parts in MODS XML] (18MB+)
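
Because the MODS files are tens of megabytes, it may be more comfortable to stream them than to load them whole. The following is a minimal sketch, assuming each record is a `<mods>` element in the standard MODS v3 namespace and carries a `<titleInfo><title>`; the wrapping `<modsCollection>` element and the sample titles are illustrative assumptions, not a description of the actual export.

```python
import io
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace; assumed to be the one used by the export.
MODS_NS = "{http://www.loc.gov/mods/v3}"


def iter_mods_titles(stream):
    """Yield the first title of each <mods> record, one record at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == MODS_NS + "mods":
            yield elem.findtext(f"{MODS_NS}titleInfo/{MODS_NS}title")
            elem.clear()  # release the record's children to keep memory low


# Hypothetical two-record sample standing in for a downloaded file.
sample = (
    '<modsCollection xmlns="http://www.loc.gov/mods/v3">'
    "<mods><titleInfo><title>Example title A</title></titleInfo></mods>"
    "<mods><titleInfo><title>Example title B</title></titleInfo></mods>"
    "</modsCollection>"
)
titles = list(iter_mods_titles(io.BytesIO(sample.encode("utf-8"))))
print(titles)
```

For a real download, pass a file opened in binary mode (`open(path, "rb")`) instead of the in-memory sample.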

BibTeX

 * [|Download BHL Titles in BibTeX format] (9MB+)
 * [|Download BHL Items/Volumes in BibTeX format] (12MB+)
 * [|Download BHL Parts in BibTeX format] (9MB+)

Custom
//(note: character encoding for all of the text files is Unicode UTF-8)//
 * [|Download contents of Title table as a tab-delimited text file] (29MB+)
 * [|Download contents of TitleIdentifier table as a tab-delimited text file] (8MB+)
 * [|Download contents of DOI table as a tab-delimited text file] (6MB+)
 * [|Download contents of Item (volumes) table as a tab-delimited text file] (31MB+)
 * [|Download contents of Subject table as a tab-delimited text file] (12MB+)
 * [|Download contents of Creator table as a tab-delimited text file] (11MB+)
 * [|Download contents of Part table as a tab-delimited text file] (37MB+)
 * [|Download contents of PartCreator table as a tab-delimited text file] (7MB+)
 * [|Download .zip file of all tables (including page and name data)] (2GB+) Not for the faint of heart! This file is very large because it includes data on each of our millions of pages, as well as the millions of occurrences of scientific names identified in the BHL corpus through indexing by TaxonFinder.
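
The tab-delimited files can be read with a standard CSV reader set to a tab delimiter and UTF-8 encoding. This is a minimal sketch, assuming the first line of each file is a header row; the column names `TitleID` and `FullTitle`, the sample row, and the helper names are hypothetical and used only for illustration.

```python
import csv
import io


def read_export(stream):
    """Yield one dict per data row of a tab-delimited export with a header row."""
    # QUOTE_NONE keeps any double quotes inside a field as literal text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def read_export_file(path):
    """Read a downloaded export file from disk as UTF-8."""
    with open(path, encoding="utf-8", newline="") as handle:
        yield from read_export(handle)


# Hypothetical sample standing in for a downloaded file.
sample = 'TitleID\tFullTitle\n1\tFlora of "somewhere"\n2\tAn example journal\n'
rows = list(read_export(io.StringIO(sample)))
print(len(rows), rows[0]["FullTitle"])
```

Reading row by row in this way avoids holding a whole table in memory, which matters mostly for the page and name tables in the .zip file.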

=Data Quality=


=Downloading all files for a book - file types and descriptions=

|| **File Type** || **Description** || **Image Quality** ||
|| .djvu || Similar to PDF, a proprietary compressed document format. || Low; sufficient for printing and reading text ||
|| .gif || A looping, animated thumbnail of the first 20 pages of a book. Usually a 100x152 pixel GIF. || N/A (Contains only the first 20 pages of the book) ||
|| .pdf || The presentation version on BHL in PDF format. || Low; sufficient for printing and reading text ||
|| .abbyy.gz || Gzipped version of the full ABBYY FineReader XML output, which includes all character-level information (confidence, location, etc.) || N/A ||
|| bw.pdf || A black and white PDF compiled using binarized versions of the images. The binarized images are not made available. || Low; sufficient for printing with low cost of ink or printing only text images ||
|| dc.xml || OAI record in Dublin Core (bibliographic description) XML || N/A ||
|| djvu.txt || The raw OCR text || N/A ||
|| djvu.xml || OCR text formatted as XML that includes word coordinates and rough page formatting (column, paragraph, line, word, etc.) || N/A ||
|| files.xml || The manifest that records all of the files available for this book. It also gives two checksums and a format definition for each file, and provides the only mechanism for validating that the component data has been downloaded successfully || N/A ||
|| jp2.zip || A ZIP archive of all of the cleaned, cropped, etc. JP2 page images. These JPEG 2000 images have been compressed by 85% but are perfectly fine for most uses. || High; best for use and printing of plates, illustrations, detailed figures and tables ||
|| marc.xml || The MARC (bibliographic description) data in XML || N/A ||
|| meta.mrc || The binary MARC record as retrieved using Z39.50 || N/A ||
|| meta.xml || Internet Archive's internal "management" metadata; a proprietary XML format, this file includes information about the scan event (date, # of pages, operator, station, etc.), the contributor, basic bib data (title, author, subject, language), and a set of identifiers || N/A ||
|| metasource.xml || A proprietary XML file recording where the MARC record came from (catalog, operator, zquery, etc.) || N/A ||
|| names.xml || A list, by page, of all the scientific names found in the book, in XML format || N/A ||
|| orig_jp2.tar || The highest-resolution files available for large books whose total storage space exceeds the maximum size for a ZIP archive. Images are deskewed, cropped, rotated and compressed by 80%. TAR archives average 2.07 GB. || High; best for use and printing of plates, illustrations, detailed figures and tables ||
|| scandata.xml || A proprietary XML file recording information about each page image (page number, sequence number, handSide, cropBox dimensions, original width & height, etc.) || N/A ||
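
Since files.xml is described as the only mechanism for validating a download, a check along the following lines may be useful. It is a minimal sketch, assuming the manifest lists each file as a `<file name="...">` element containing an `<md5>` child; those element and attribute names, the sample file names, and the helper functions are assumptions for illustration and should be checked against an actual manifest.

```python
import hashlib
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def read_manifest(manifest_xml):
    """Map file name -> expected MD5 hex digest (assumed manifest layout)."""
    expected = {}
    for entry in ET.fromstring(manifest_xml).iter("file"):
        name = entry.get("name")
        md5 = entry.findtext("md5")
        if name and md5:
            expected[name] = md5.strip().lower()
    return expected


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks, so large archives fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest_xml, directory):
    """Return names of files that are missing or whose MD5 does not match."""
    failures = []
    for name, expected in read_manifest(manifest_xml).items():
        path = Path(directory) / name
        if not path.is_file() or md5_of(path) != expected:
            failures.append(name)
    return failures


# Hypothetical demonstration: one file present and intact, one never downloaded.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample_djvu.txt").write_bytes(b"raw OCR text")
    good_md5 = hashlib.md5(b"raw OCR text").hexdigest()
    manifest = (
        "<files>"
        f'<file name="sample_djvu.txt"><md5>{good_md5}</md5></file>'
        '<file name="sample_marc.xml"><md5>00000000000000000000000000000000</md5></file>'
        "</files>"
    )
    failures = verify(manifest, tmp)
print(failures)
```

Only one of the two checksums is compared here; the second one could be checked the same way with the matching `hashlib` function.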


**Below are some definitions that might be helpful in understanding the file types:**
|| **Term** || **Definition** ||
|| Dublin Core || A metadata format for describing resources; a small, fundamental set of text elements through which most resources can be described and cataloged. ||
|| OCR || "Optical Character Recognition"; the conversion of images of text into text characters ||
|| MARC || A bibliographic data format describing standards for the representation and communication of bibliographic and related information in machine-readable form, and related documentation ||
|| OAI || "Open Archives Initiative"; develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements. ||
|| ZIP || A data compression and archive format; contains one or more files that have been compressed to reduce file size ||
|| GZIPPED || Compressed with gzip, a software application used for file compression ||
|| Z39.50 || A client-server protocol for searching and retrieving information from remote computer databases ||
|| TAR || An archiver that creates and handles file archives in various formats; can be used to create file archives, to extract files from previously created archives, to store additional files, or to update or list files which were already stored. ||

