Inspiring discovery through free access to biodiversity knowledge.

The Biodiversity Heritage Library improves research methodology by collaboratively making biodiversity literature openly available to the world as part of a global biodiversity community.
BHL also serves as the literature component of the Encyclopedia of Life.


Purposeful gaming and BHL: engaging the public in improving and enhancing access to digital texts

Although this project ended in November 2015, both the Smorball and Beanstalk games will continue to be available in 2016 at http://smorballgame.org and http://beanstalkgame.org, and player input will continue to improve OCR output from BHL. Thank you for playing and helping improve access to science resources!

About Smorball - Smorball wins "Best Serious Game" award at Boston Festival of Indie Games!

Players of the more challenging Smorball game are asked to type the words they see as quickly and accurately as possible to help coach their team, the Eugene Melonballers, to victory and win the coveted Dalahäst Trophy in the fictional sport of Smorball. Each word typed correctly defeats an opposing smorbot and brings the Melonballers closer to the championships.




About Beanstalk -
Players of the more relaxed Beanstalk game must type the words presented to them correctly in order to grow their beanstalk from a tiny tendril to a massive cloudscraper. The more words they type correctly, the faster the beanstalk grows. Players who accurately transcribe the most words will ascend to the top of the leaderboard as a result of their valuable contributions.



Both Smorball and Beanstalk were designed by Tiltfactor and are licensed as Free and Open Source Software (FOSS).

We're not currently integrating material from other institutions in our build of the games, but the good news is that the games and their supporting software are open source, so you can fairly easily host your own.

There are a few steps to hosting your own Smorball or Beanstalk games:
1. Prepare your material. The games are OCR correction games, so they take data in the form of single words on which different OCR engines disagree. Each "difference" sent to the games must have a page image URL, a location on that page image, and two strings that represent what each of the two OCR engines thinks the word is. It's from these two strings that the games estimate whether or not the player has typed the right answer.
2. Host the game(s) and the game backend. You can find the game code here: and the code for the game database and data management server here:
3. Configure the games. If you want to run Beanstalk, make sure your version of Beanstalk has its own high score database (via If you want the Facebook and Twitter buttons in your Smorball to go to your social media accounts, generate Facebook and Twitter developer API keys, etc.
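
The material described in step 1 can be pictured with a short Python sketch. This is not the games' actual data format or code: the class name, field names, the (x, y, width, height) layout of the location, and the matching rule are all assumptions made for illustration. It shows one way to pair up the words on which two OCR outputs of the same page disagree, and one simple way two candidate strings could be used to judge a typed answer.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Difference:
    """One disputed word, per step 1. All field names are hypothetical."""
    page_image_url: str
    location: Tuple[int, int, int, int]  # assumed (x, y, width, height) in pixels
    ocr_a: str  # what the first OCR engine thinks the word is
    ocr_b: str  # what the second OCR engine thinks the word is


def disputed_words(words_a: List[str], words_b: List[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (index_in_a, word_a, word_b) where two OCR word lists disagree.

    Only one-for-one substitutions are reported. Insertions, deletions, and
    unequal-length runs are skipped because they cannot be paired word by word.
    """
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for offset in range(i2 - i1):
                yield i1 + offset, words_a[i1 + offset], words_b[j1 + offset]


def matches_candidate(typed: str, diff: Difference) -> bool:
    """Illustrative rule only: accept input equal to either OCR reading."""
    return typed.strip() in (diff.ocr_a, diff.ocr_b)
```

Each tuple yielded by disputed_words would still need a page image URL and a word location (taken from the OCR engines' coordinate output) before it could be turned into a Difference record. A real scoring rule would likely be more forgiving than exact equality, since both OCR readings can be wrong.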

This project, which has been generously funded by the Institute of Museum and Library Services (IMLS), aims to significantly improve access to digital texts by applying purposeful gaming to the data enhancement tasks needed for content found within the Biodiversity Heritage Library (BHL). The project tackles a major challenge for digital libraries: full-text searching is significantly hampered by poor output from Optical Character Recognition (OCR) software. Historic literature has proven to be particularly problematic because its varying fonts, typesetting, and layouts make it difficult to render accurately. The European Union’s IMPACT project, a 2008-2012 effort to improve access to texts, states that OCR “does in many cases not produce satisfying results for historical documents. Recognition rates are poor or even useless. No commercial or other OCR engine is able to cope satisfactorily with the wide range of printed materials published between the start of the Gutenberg age in the 15th century and the start of the industrial production of books in the middle of the 19th century.” This state of affairs illustrates the pressing need to identify additional solutions to OCR for improving access to digital texts.

The BHL is an international consortium of the world’s leading natural history libraries, including the Missouri Botanical Garden’s Peter H. Raven Library, that have collaborated to digitize the public domain literature documenting the world’s biological diversity. This has resulted in the single largest open-licensed source of biodiversity literature, made available both through the Internet Archive and through a customized portal. BHL is a perfect testbed for investigating alternate solutions to the generation of digital outputs, both because it is a significantly large corpus (41 million pages of scanned texts accompanied by 41 million OCR outputs) and because most of its content is historic literature (the majority of BHL content was published between the 1450s and the 1900s). OCR is also largely ineffective on handwritten texts such as field notebooks, a growing content type in the BHL.

Purposeful Gaming and BHL will demonstrate whether or not digital games are a successful tool for analyzing and improving digital outputs from OCR and transcription activities because large numbers of users can be harnessed quickly and efficiently to focus on the review and correction of particularly problematic words by being presented the task as a game.

The project runs from December 1, 2013 through November 30, 2015 and will be conducted by the Missouri Botanical Garden's Center for Biodiversity Informatics (CBI) in partnership with Harvard University, Cornell University, and the New York Botanical Garden.
A sample of poor OCR output from an 18th century publication.
This page is from Linnaeus' Species Plantarum, published in 1753.
An image of the original text is on the left. The OCR is on the right.

A sample of poor OCR output from a handwritten text. This page is from the Diaries of William Brewster, 1865-1919.

This project was made possible in part by the Institute of Museum and Library Services [LG-05-13-0352-13].

Project Team
Missouri Botanical Garden
  • Trish Rose-Sandler, Data Project Coordinator, Center for Biodiversity Informatics
  • William Ulate, Senior Project Coordinator, Center for Biodiversity Informatics
  • Mike Lichtenberg, Programmer, Center for Biodiversity Informatics
  • Stephen Kappel, Programmer, Center for Biodiversity Informatics
  • Doug Holland, Director, Peter H. Raven Library
  • Mike Blomberg, Imaging Lab Coordinator, Peter H. Raven Library
  • Chuck Miller, Vice President of Information Technology and Chief Information Officer

Ernst Mayr Library of the Museum of Comparative Zoology at Harvard University
  • James Hanken, Director of the Museum of Comparative Zoology
  • Constance Rinaldo, Librarian of the Ernst Mayr Library
  • Joe deVeer, Project Manager
  • Robert Young, Special Collections Librarian
  • Patrick Randall, Outreach and Communications

The LuEsther T. Mertz Library, New York Botanical Garden
  • Susan Fraser, Director
  • Susan Lynch, Systems Librarian
  • John Mignault, Systems Librarian (previous)
  • Kevin Nolan, Digital Projects Manager
  • Lisa Studier, Metadata Cataloger
  • Yumi Choi, Catalog Librarian
  • Andrew Tschinkel, Scanning Technician
  • Paul Silverman, Scanning Technician

Cornell University Library
  • Martin Schlabach, Librarian
  • Kevin Nixon, Professor of Botany
  • Holly Mistlebauer

Original Proposal & Schedule

Project Narrative

Schedule of Completion

Workflow diagram

Word comparison across outputs



Media Coverage

Games Coverage

Featured Texts

Choice of Game Designer

Initial Grant Award


  • We recently joined the Crowdsourcing Consortium for Libraries and Archives (CCLA). Supported by the Institute of Museum and Library Services, the goal of CCLA is to create a forum that enables all interested stakeholders to join a national conversation about the most pressing needs and challenges regarding the development and deployment of crowdsourcing technologies in the cultural heritage domain.
  • Here is a page by Chris Freeland that covers the history of the thinking behind using games with BHL content.
  • Excellent summary post by Ben W. Brumfield on QC for Collaborative (Crowdsourced) Manuscript Transcription.
  • Discussion minutes, software developed, and presentations recorded from the Notes from Nature/iDigBio Hackathon to Further Enable Public Participation in the Online Transcription of Biodiversity Specimen Labels, held on December 16-20 at the University of Florida in Gainesville.

Contact Us

  • For more information, please contact the project's Principal Investigator, Trish Rose-Sandler, at 314-577-9473 x6396.

  • Max J. Seidman, Dr. Mary Flanagan, Trish Rose-Sandler, and Mike Lichtenberg, “Are games a viable solution to crowdsourcing improvements to faulty OCR? – The Purposeful Gaming and BHL experience”, Code4Lib Journal, Issue 33, July 2016.