Inspiring discovery through free access to biodiversity knowledge.

The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community.
BHL also serves as the literature component of the Encyclopedia of Life.


For information about harvesting our data via OAI, please see our Developer Tools and API page.

Data Licensing
The BHL makes its metadata available for public use under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license. This Creative Commons license allows you to reuse, modify, repurpose, and distribute the metadata for any purpose, commercial or non-commercial, with no need to ask for permission.

Metadata, in this case, refers to:
  • Library catalog records, i.e. bibliographic data, used to describe the books and journals in the BHL collection (e.g. title and author data).
  • Page-level data such as page numbers and page types (e.g. "Title page" and "Illustration").
  • Scientific name data, e.g. "Zea mays". (For more information about how this data is generated, see the Data Exports page).

Go ahead, take our metadata and do something creative with it! If you do repurpose BHL metadata, please share your story with us. We often like to feature stories of reuse on our BHL blog.

Data Exports
A series of files is now available for download that will enable libraries and other data providers to identify digitized titles available within BHL. These files also include metadata about each volume scanned, as well as information about the millions of scientific names that have been identified throughout the BHL corpus and the pages on which those names occur.

Download documentation:

Data exports are updated at the beginning of every month in the following formats:
To download these files, Right-Click and choose "Save link as..." or "Save target as..."




(note: character encoding for all of the text files is Unicode UTF-8)
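
As a minimal sketch of working with the text exports, the following Python reads a UTF-8 file into a list of dictionaries. It assumes the file is tab-delimited with a header row, which is not stated above; the function name `read_export`, the file name, and the column name in the usage note are hypothetical, so check the download documentation for the actual layout.

```python
import csv
from pathlib import Path


def read_export(path, delimiter="\t"):
    """Read a delimited UTF-8 export file into a list of dicts keyed by header.

    Assumption: the first row is a header row and fields are tab-separated.
    """
    # "utf-8-sig" reads plain UTF-8 and also strips a leading byte-order mark
    # if one happens to be present.
    with Path(path).open("r", encoding="utf-8-sig", newline="") as handle:
        # QUOTE_NONE treats quotation marks in titles as ordinary characters.
        reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        return list(reader)
```

For example, `rows = read_export("title.txt")` followed by `rows[0]["FullTitle"]` would return the value of that column in the first record, provided the file and column exist under those (hypothetical) names.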

Data Quality

BHL is moving to implement the KBART standard so that our data can be better integrated into discovery layer tools in the future. Our data for digitized legacy materials is sourced and aggregated "as is" from our consortium library partner catalogs, and we lack the resources to refine it at this time. Until BHL can implement KBART, any BHL data present in discovery layer tools is likely incomplete. In the meantime, our bibliographic records are available via our website, some consortium partner library catalogs, Internet Archive’s “biodiversity” collection, and the Digital Public Library of America (DPLA). Projects to integrate our records into OCLC and Europeana are underway. If you have questions about working with BHL bibliographic data, please contact us.


Downloading all files for a book - file types and descriptions

Display of all file types obtained from "Download All"

File type descriptions, with image quality where applicable:
  • Similar to PDF, a proprietary compressed document format. Image quality: low; sufficient for printing and reading text.
  • A looping, animated thumbnail of the first 20 pages of a book. Usually a 100x152 pixel GIF. Image quality: N/A (contains only the first 20 pages of the book).
  • The presentation version on BHL in PDF format. Image quality: low; sufficient for printing and reading text.
  • GZipped version of the full ABBYY FineReader XML output, which includes all character-level information (confidence, location, etc.).
  • A black and white PDF compiled using binarized versions of the images. The binarized images are not made available. Image quality: low; sufficient for printing with low cost of ink or printing only text images.
  • OAI record in Dublin Core (bibliographic description) XML.
  • The raw OCR text.
  • OCR text formatted as XML that includes word coordinates and rough page formatting (column, paragraph, line, word, etc.).
  • The manifest that records all of the files available for this book; it also gives 2 checksums and a format definition for each file, and provides the only mechanism for validating that the component data has been downloaded successfully.
  • A ZIP archive of all of the cleaned, cropped, etc. JP2 page images. These JPEG 2000 format images have been compressed by 85% but are perfectly fine for most uses. Image quality: high; best for use and printing of plates, illustrations, detailed figures and tables.
  • The MARC (bibliographic description) data in XML.
  • The binary MARC record as retrieved using Z39.50.
  • Internet Archive's internal "management" metadata; a proprietary XML format, this file includes information about the scan event (date, # of pages, operator, station, etc.), the contributor, basic bib data (title, author, subject, language), and a set of identifiers.
  • A proprietary XML file recording where the MARC record came from (catalog, operator, zquery, etc.).
  • A list, by page, of all the scientific names found in the book, presented in XML format.
  • The highest resolution files available for large books whose total storage space exceeds the maximum size for a ZIP archive. Images are deskewed, cropped, rotated and compressed by 80%. TAR archives average 2.07 GB. Image quality: high; best for use and printing of plates, illustrations, detailed figures and tables.
  • A proprietary XML file recording information about each page image (page number, sequence number, handSide, cropBox dimensions, original width & height, etc.).
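
The manifest described above is the only way to confirm a complete download, so a check along the following lines may be useful. This is a sketch under stated assumptions, not BHL's or Internet Archive's own tooling: it assumes the manifest is an XML document whose `<file>` elements carry a `name` attribute and an `<md5>` child, and that the listed files sit in the same directory as the manifest. The function names `md5_of` and `verify_manifest` and the path in the usage note are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Map each listed file name to 'ok', 'mismatch', 'missing', or 'no checksum'.

    Assumed layout (not confirmed by the text above):
    <files><file name="..."><md5>...</md5></file>...</files>
    """
    manifest_path = Path(manifest_path)
    base_dir = manifest_path.parent
    results = {}
    for entry in ET.parse(manifest_path).getroot().iter("file"):
        name = entry.get("name")
        if name is None:
            continue  # skip entries that do not name a file
        expected = entry.findtext("md5")
        target = base_dir / name
        if not expected:
            # Entries without an <md5> element cannot be checked.
            results[name] = "no checksum"
        elif not target.is_file():
            results[name] = "missing"
        elif md5_of(target) == expected.strip().lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_manifest("downloads/example_files.xml")` would return a dictionary such as one mapping each file name to its status; any entry other than `ok` suggests the file should be downloaded again or inspected by hand.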

Below are some definitions that might be helpful in understanding the file types:
Dublin Core
A set of metadata elements that provide a small and fundamental group of text elements through which most resources can be described and cataloged; a metadata format for describing resources.
OCR
Stands for "Optical Character Recognition"; the conversion of images of text into text characters.
MARC
A bibliographic data format describing standards for the representation and communication of bibliographic and related information in machine-readable form, and related documentation.
OAI
"Open Archives Initiative"; develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements.
ZIP
A data compression and archive format; contains one or more files that have been compressed to reduce file size.
GZip
A software application used for file compression.
Z39.50
A client-server protocol for searching and retrieving information from remote computer databases.
TAR
An archiver that creates and handles file archives in various formats; can be used to create file archives, to extract files from previously created archives, to store additional files, or to update or list files that were already stored.


Revised: lipscombb Nov 21, 2017 1:35 pm (46 revisions)