Inspiring discovery through free access to biodiversity knowledge.

The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community.
BHL also serves as the literature component of the Encyclopedia of Life.

  

For information about harvesting our data via OAI, please see our Developer Tools and API page.

Data Licensing
The BHL makes its metadata available for public use under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication license. This Creative Commons license allows you to reuse, modify, repurpose, and distribute the metadata for all purposes, commercial and non-commercial, with no need to ask for permission.

Metadata, in this case, refers to:
  • Library catalog records, i.e. bibliographic data, used to describe the books and journals in the BHL collection (e.g. title and author data).
  • Page-level data such as page numbers and page types (e.g. "Title page" and "Illustration").
  • Scientific name data, e.g. "Zea mays". (For more information about how this data is generated, see the Data Exports page.)

Go ahead, take our metadata and do something creative with it! If you do repurpose BHL metadata, please share your story with us. We often feature stories of reuse on our BHL blog.

Data Exports
A series of files is now available for download that will enable libraries and other data providers to identify digitized titles available within BHL. These files also include metadata about each volume scanned, as well as information about the millions of scientific names that have been identified throughout the BHL corpus and the pages on which those names occur.

Download documentation: http://www.biodiversitylibrary.org/data/BHLExportSchema.pdf

Data exports are updated at the beginning of every month. To download a file, right-click its link and choose "Save link as..." or "Save target as...". The exports are available in the following formats:
  • MODS
  • BibTeX
  • Custom

(Note: the character encoding for all of the text files is Unicode UTF-8.)
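Because the export text files are UTF-8, it helps to state the encoding explicitly when reading them rather than relying on the platform default. The following is a minimal sketch, not BHL's own tooling. It assumes a tab-delimited file with a header row, which this page does not confirm (check the export documentation linked above), and the column names in the sample are hypothetical.

```python
import csv
import io


def read_export_rows(stream, delimiter="\t"):
    """Yield one dict per data row, keyed by the column names in the header row.

    Assumption: the export is delimited text with a header row. The delimiter
    is a guess and can be overridden.
    """
    # QUOTE_NONE keeps quotation marks inside titles as literal characters.
    reader = csv.DictReader(stream, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def open_export(path):
    """Open an export file as UTF-8 text, as the note above requires."""
    return open(path, "r", encoding="utf-8", newline="")


if __name__ == "__main__":
    # Hypothetical sample data standing in for a downloaded export file.
    sample = "TitleID\tShortTitle\n1\tFlore fran\u00e7aise\n".encode("utf-8")
    stream = io.TextIOWrapper(io.BytesIO(sample), encoding="utf-8", newline="")
    for row in read_export_rows(stream):
        print(row["TitleID"], row["ShortTitle"])
```

For a real download, pass the result of `open_export("path/to/file")` to `read_export_rows` in place of the in-memory sample.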



Downloading all files for a book - file types and descriptions


[Figure: display of all file types obtained from "Download All"]


| File Type | Description | Image Quality |
| --- | --- | --- |
| .djvu | Similar to PDF, a proprietary compressed document format. | Low; sufficient for printing and reading text |
| .gif | A looping, animated thumbnail of the first 20 pages of a book. Usually a 100x152 pixel GIF. | N/A (contains only the first 20 pages of the book) |
| .pdf | The presentation version on BHL in PDF format. | Low; sufficient for printing and reading text |
| .abbyy.gz | Gzipped version of the full ABBYY FineReader XML output, which includes all character-level information (confidence, location, etc.). | N/A |
| bw.pdf | A black and white PDF compiled using binarized versions of the images. The binarized images are not made available. | Low; sufficient for low-ink printing or for printing text-only images |
| dc.xml | OAI record in Dublin Core (bibliographic description) XML. | N/A |
| djvu.txt | The raw OCR text. | N/A |
| djvu.xml | OCR text formatted as XML that includes word coordinates and rough page formatting (column, paragraph, line, word, etc.). | N/A |
| files.xml | The manifest that records all of the files available for this book. It also gives two checksums and a format definition for each file, and provides the only mechanism for validating that the component data has been downloaded successfully. | N/A |
| jp2.zip | A ZIP archive of all of the cleaned, cropped, etc. JP2 page images. These JPEG 2000 images have been compressed by 85% but are perfectly fine for most uses. | High; best for use and printing of plates, illustrations, detailed figures and tables |
| marc.xml | The MARC (bibliographic description) data in XML. | N/A |
| meta.mrc | The binary MARC record as retrieved using z39.50. | N/A |
| meta.xml | Internet Archive's internal "management" metadata. A proprietary XML format, this file includes information about the scan event (date, number of pages, operator, station, etc.), the contributor, basic bibliographic data (title, author, subject, language), and a set of identifiers. | N/A |
| metasource.xml | A proprietary XML file recording where the MARC record came from (catalog, operator, zquery, etc.). | N/A |
| names.xml | A list, by page, of all the scientific names found in the book, presented in XML format. | N/A |
| orig_jp2.tar | The highest resolution files available for large books whose total storage space exceeds the maximum size for a ZIP archive. Images are deskewed, cropped, rotated and compressed by 80%. TAR archives average 2.07 GB. | High; best for use and printing of plates, illustrations, detailed figures and tables |
| scandata.xml | A proprietary XML file recording information about each page image (page number, sequence number, handSide, cropBox dimensions, original width and height, etc.). | N/A |
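Since files.xml is described as the only mechanism for validating a download, a checksum comparison against it might look like the sketch below. This is an illustration under assumptions, not a documented BHL procedure: it assumes the manifest has a root element containing `file` elements, each with a `name` attribute and an `md5` child holding a hexadecimal digest. Inspect a real manifest before relying on that layout. The function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(manifest_path, directory):
    """Compare downloaded files with the checksums listed in the manifest.

    Returns a dict mapping each listed file name to one of:
    'ok', 'mismatch', 'missing', or 'no checksum'.
    Assumed manifest layout: <file name="..."><md5>...</md5></file>
    """
    directory = Path(directory)
    results = {}
    for entry in ET.parse(manifest_path).getroot().iter("file"):
        name = entry.get("name")
        if name is None:
            continue  # skip entries that do not follow the assumed layout
        expected = entry.findtext("md5")
        target = directory / name
        if expected is None:
            results[name] = "no checksum"
        elif not target.is_file():
            results[name] = "missing"
        elif file_md5(target) == expected.strip().lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

A call such as `verify_download("book_files.xml", "downloads/book")` (both paths hypothetical) reports which files should be fetched again. Entries without a checksum, which may include the manifest itself, are reported separately rather than treated as failures.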

Below are some definitions that might be helpful in understanding the file types:

| Term | Definition |
| --- | --- |
| Dublin Core | A set of metadata elements that provide a small and fundamental group of text elements through which most resources can be described and cataloged; a metadata format for describing resources. |
| OCR | Stands for "Optical Character Recognition": the conversion of images of text into text characters. |
| MARC | A bibliographic data format describing standards for the representation and communication of bibliographic and related information in machine-readable form, and related documentation. |
| OAI | Stands for "Open Archives Initiative", which develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. OAI has its roots in the open access and institutional repository movements. |
| ZIP | A data compression and archive format; a ZIP archive contains one or more files that have been compressed to reduce file size. |
| GZIPPED | Compressed with gzip, a software application used for file compression. |
| z39.50 | A client-server protocol for searching and retrieving information from remote computer databases. |
| TAR | An archiver that creates and handles file archives in various formats; it can be used to create file archives, extract files from previously created archives, store additional files, or update or list files that were already stored. |
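The gzip, ZIP and TAR formats defined above can all be read with Python's standard library, without installing separate tools. The sketch below shows one way to do that. The helper names are hypothetical, and the assumption that a gzipped file holds UTF-8 text should be checked against the actual download.

```python
import gzip
import tarfile
import zipfile


def read_gzipped_text(path, encoding="utf-8"):
    """Decompress a gzipped file and return its contents as text.

    Assumption: the compressed payload is text in the given encoding.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return handle.read()


def list_archive_members(path):
    """Return the member names of a ZIP or TAR archive without extracting it."""
    path = str(path)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:
            return archive.getnames()
    raise ValueError(f"{path} is neither a ZIP nor a TAR archive")
```

Listing members first is a cheap way to confirm that a large image archive downloaded completely before extracting it. For very large gzipped XML files, reading the whole file into memory may not be practical, and iterating over the handle returned by `gzip.open` is an alternative.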




Revised: joelrichard Apr 10, 2017 9:41 am (42 revisions)
Contributions to https://biodivlib.wikispaces.com/ are licensed under a Creative Commons Attribution Share-Alike 3.0 License.