(Above) A collage of images from Countway books recently digitized and added to the MHL
This month, the Center for the History of Medicine contributed its two-millionth page-image to the Medical Heritage Library. That number translates into almost 6,000 volumes that have been digitized in their entirety (and downloaded over 90,000 times), or nearly two-thirds of our forecast total contributions to the project.
Those who are interested in the process of library digitization might also be interested to learn more about what those statistics mean in terms of logistics and workflow. What does it take to produce millions of page-images from a collection of hundreds of thousands of rare and fragile books? How much time is required? What are the biggest challenges involved? In this two-part series of blog posts, we will examine a large-scale digitization project from the inside.
At the outset of the project, the Center committed to contributing more than just basic black-and-white scans to the MHL. Along with the other contributing members, we chose to produce high-quality, full-color images accompanied by plain text files created with optical character recognition (OCR) software. This allows our users to experience these works either as close approximations of their original physical states (i.e., full-color, page-turning ebooks), or as simple text files that can be searched, manipulated, and easily read on portable devices. Because we committed ourselves to this level of quality in production, a great deal of effort needs to be put into every book that we send through the process. Keeping up with this kind of work requires dedicated staffing.
Here at the Center, the MHL team consists of six members, two of whom work exclusively on the project. We have two project administrators (Scott Podolski and Kathryn Hammond Baker), two selectors (Jack Eckert and Joan Thomas), one dedicated cataloger/workflow manager (Jay Moschella), and one dedicated part-time employee who works on various aspects of the project including workflow and QC (Sarah Spira).
To stay on top of the work required, our staff needs to meet regularly with one another, and to remain in constant contact with other MHL contributors and service providers, including project funders, MHL overseers, staff members from our numerous partner libraries, our scanning center, our moving company, and others. Without a commitment to open communication and cooperation, a logistically complex digitization project like the MHL would simply not be feasible.
(Above) A 16th c. volume bound in fragile period manuscript waste. (Below) Books bound in 19th c. sheepskin display the characteristic types of degradation that prevent us from digitizing them. (Photos: Stephen Jennings)
The Center holds over 200,000 volumes in its rare book collections, and the process of selection for this project weeds out those volumes that are not suitable for digitization. Our selectors base their decisions on several important criteria, including subject relevance, the existence of publicly accessible copies already available online, and whether or not the works are still in copyright. A major factor in determining whether or not we can digitize a volume is condition. A large number of our volumes are printed on acidic, embrittled paper, or are bound in extremely fragile or damaged cases that predate industrialized binding techniques. Unfortunately, such materials cannot be safely sent through the imaging process without risking further deterioration, and need to be held out until funding for future preservation efforts is made available. Selectors therefore need to screen each volume carefully.
As is the case in many large library systems, a significant portion of our catalog consists of “legacy” records, which were created over the years according to now-outdated metadata standards. But findability, even in the age of keyword searching, is still largely dependent on uniform input standards, and therefore cleaning up our catalog and making sure all of our records are consistent and up to date is an essential task. All works on a single topic, like the common cold, for example, will be easier for users to find if they are collated under a single, intuitively searchable heading.
58 linear ft. (or one full shipment) of cataloged books ready for shipment.
But right now, many of these works might still be categorized separately from one another under similar, but distinct headings, like “colds,” “viruses,” “the common cold,” “rhinovirus,” “sickness,” (and so on), while many other works might have no subject headings at all. And just as storing all works on a single topic in one place on the shelves of a library helps patrons track down what they need in person, appending uniform subject headings in a digital library helps patrons all around the world to retrieve better and more useful search results.
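For readers who like to see the idea in concrete terms, the consolidation of variant headings can be sketched in a few lines of Python. This is only an illustration, not the Center's actual cataloging tooling: the preferred form "Common cold", the variant list, the record layout, and the function names are all assumptions drawn from the example headings above. Headings that are related but distinct (such as "viruses") are flagged for a cataloger rather than rewritten automatically.

```python
# Illustrative sketch only. The preferred heading, the variant list, and the
# record layout (a dict with a "subjects" list) are assumptions.
PREFERRED_HEADINGS = {
    "colds": "Common cold",
    "the common cold": "Common cold",
}

# Related but distinct headings: a person should decide whether these
# records belong under the uniform heading.
REVIEW_HEADINGS = {"rhinovirus", "viruses", "sickness"}


def normalize_heading(heading):
    """Return the uniform heading for a known variant, else the trimmed input."""
    cleaned = heading.strip()
    return PREFERRED_HEADINGS.get(cleaned.lower(), cleaned)


def normalize_record(record):
    """Return a copy of the record with variants merged and duplicates dropped."""
    subjects = []
    for heading in record.get("subjects", []):
        uniform = normalize_heading(heading)
        if uniform not in subjects:
            subjects.append(uniform)
    return {**record, "subjects": subjects}


def needs_review(record):
    """Flag records with no headings, or with any heading on the review list."""
    subjects = record.get("subjects", [])
    if not subjects:
        return True
    return any(s.strip().lower() in REVIEW_HEADINGS for s in subjects)


if __name__ == "__main__":
    # Hypothetical legacy records, invented for this example.
    legacy_records = [
        {"title": "Example volume A", "subjects": ["colds", "The common cold"]},
        {"title": "Example volume B", "subjects": ["viruses"]},
        {"title": "Example volume C", "subjects": []},
    ]
    for record in legacy_records:
        cleaned = normalize_record(record)
        print(cleaned["title"], cleaned["subjects"], needs_review(cleaned))
```

The point of the sketch is the division of labor: clear-cut variants can be merged mechanically, while records with ambiguous headings, or with no headings at all, still need a cataloger's judgment.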
Many of these older records were also created using outdated standards for descriptive metadata, which might mean that their titles are incomplete, that authors or editors were not properly entered according to modern standards, or that any number of errors or omissions that could hinder findability might still be present. Here at the Center, we catalog between 600 and 800 volumes every 5 weeks before sending them off to be digitized. While this represents a very serious investment in updating our metadata, we feel that the results (more easily discoverable, well-cataloged titles) more than justify our efforts.
Part two of this blog post will look at the process of physically moving large numbers of rare books between libraries for digitization, the work done behind the scenes at the imaging lab, and how we carry out quality control. Please contact us at the Center for the History of Medicine if you have any questions about this project or the work that we put into it.