Preserving and providing greater access to:
The Foundation of Knowledge
The principal purpose of this project is to demonstrate the effectiveness and efficiency of applying wavelet compression technology to delivering conventionally published information, in original page format, via the Internet. As a working server has already been built using local resources, the focus of this effort will be to expand content, develop a schema of best practices, and gain empirical knowledge of actual productivity potential. Based on what has been done, a rate of one three- to four-hundred-page volume per scanner per workday should be possible. The condition of the material being imaged, and the motivation and care of the person scanning it, will obviously have a variable effect on output rate, and this is something to be assessed.
With ten reasonably fast, large-format (11x17 inch) scanners, such as the Epson 1640XL, which we have used successfully for this purpose, there could be a potential output of twenty-five hundred volumes per year. Variation in volume size, condition, and page count, combined with rate-mitigating factors such as employee absence, scanning mistakes, and encoding errors, will probably mean a realistic output of two thousand or so titles per year. Production of fully searchable, network-portable surrogates at this rate can rapidly improve both access and artifact survival among at-risk materials.
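The throughput figures above work out as a quick back-of-the-envelope calculation. This is a minimal sketch of the text's own numbers; the 250-workday year and the twenty-percent haircut for absences, rescans, and encoding errors are assumptions chosen to match the round figures given:

```python
# Annual throughput estimate from the proposal's figures.
SCANNERS = 10
VOLUMES_PER_SCANNER_PER_DAY = 1   # one 300-400 page volume per scanner per workday
WORKDAYS_PER_YEAR = 250           # an assumption implied by the text's round numbers
OVERHEAD_FACTOR = 0.8             # assumed haircut for absences, mistakes, re-encodes

ideal = SCANNERS * VOLUMES_PER_SCANNER_PER_DAY * WORKDAYS_PER_YEAR
realistic = int(ideal * OVERHEAD_FACTOR)

print(ideal)      # 2500 volumes per year, the text's ideal figure
print(realistic)  # 2000 volumes per year, the text's realistic figure
```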
Criteria for selecting titles to make available in this fashion will include current demand and availability, threat of damage from use, rarity, printing date, and teaching-faculty request. In general, this technology will allow users easier access to rare items as well as insulate older, often-used works from further patron wear and tear. This last point is important in terms of maintaining public-domain access rights to sufficiently old materials, as often only works actually printed prior to the cutoff date qualify for general redistribution without permission from the publisher.
Typically the publication will be scanned at three hundred dots per inch in full color. Volumes will be captured in full view, two pages per scan, which is usually possible except for folios. This reduces handling of materials as well as the time required to image a work; e.g., a three-hundred-page book can be captured in one hundred and fifty scans. The rate of acquisition will generally average around a scan per minute, so capturing a three-hundred-page volume can be accomplished within two and one half to three hours.
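The capture arithmetic above, two pages per scan at roughly a scan per minute, can be laid out as follows (a minimal sketch using only the figures in the text):

```python
# Capture-time estimate for one volume, per the figures in the text.
pages = 300
pages_per_scan = 2                 # full-view capture, two pages at a time
scans = pages // pages_per_scan    # 150 scans for a 300-page book

minutes_per_scan = 1               # average acquisition rate from the text
capture_hours = scans * minutes_per_scan / 60

print(scans, capture_hours)        # 150 scans, 2.5 hours at the average rate
```

The observed two-and-a-half to three hours per volume is consistent with this average plus some per-volume handling overhead.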
After the image-capture process, the uncompressed TIFF files are copied to the DjVu encoder machine, where they are converted to the wavelet-based, layered DjVu format. Next, with the same computer and software, the DjVu files will be scanned for text via optical character recognition (OCR) and image thumbnails will be generated. On a contemporary machine these two steps will generally take less than an hour.
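The encoding pass might be scripted along the following lines. This is a hypothetical sketch using the open-source DjVuLibre tools (`c44` for wavelet encoding of individual pages, `djvm -c` for bundling them into one document) as stand-ins for the project's actual encoder; the OCR and thumbnail steps are handled by separate software and are not shown. The `convert` step assumes ImageMagick, since `c44` reads PPM/JPEG input rather than TIFF:

```python
def encode_commands(tiff_pages: list[str], bundle: str) -> list[list[str]]:
    """Build the command lines to convert scanned TIFF pages into one DjVu file.

    A sketch only: tool names and this orchestration are assumptions,
    not the project's actual production setup.
    """
    cmds, djvu_pages = [], []
    for tif in tiff_pages:
        base = tif.rsplit(".", 1)[0]
        ppm, djvu = base + ".ppm", base + ".djvu"
        cmds.append(["convert", tif, ppm])  # ImageMagick: c44 does not read TIFF
        cmds.append(["c44", ppm, djvu])     # wavelet-encode a single page
        djvu_pages.append(djvu)
    cmds.append(["djvm", "-c", bundle] + djvu_pages)  # bundle pages into one document
    return cmds

# Each command list could then be run with subprocess.run(cmd, check=True).
```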
At this point the files only need to be placed in a directory on a web server, with a suitably modified HTML template, to become accessible over the Internet. This task should not take more than an hour. The final operation is to preserve the original scanned images by transferring them to archival storage such as optical or magnetic disk. This step will generally require a half hour or less and does not require continuous worker presence.
As most of the titles will have MARC records, all that is required to make an item findable, and simultaneously usable by as many patrons as desire access, is a Persistent Uniform Resource Locator (PURL) in the 856 field of the item's record. In addition to the online-catalog access this step provides, there will be a front end for materials formatted in this fashion that will allow complex searches as well as casual browsing.
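Such an 856 field might look like the following. This is a hypothetical illustration: in MARC 21, the indicators `4 0` mark an HTTP link to the resource itself, subfield `$u` carries the URL, and `$z` holds a public note; the address shown is a placeholder, not a real PURL.

```
856 40 $u http://purl.example.edu/found/12345
       $z Electronic version (DjVu)
```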
Free patron access to published information is the laudable, and continuing, work of the institutional library. Adding difficulty to this task is the need to preserve the availability of materials as time and patron use increase their fragility. Remedial actions such as replacement, rebinding, or purchasing reprints to retain access become larger-scale and generally more costly as time moves forward. Preservation options such as microfilm have their own production, storage, and replacement costs, combined with a general reduction in ease of access. Hence the quest for some means of employing digital technology to reduce the time and resources devoted to maintaining older materials while still allowing full access to these items. The goal of the project described here is to show that the means to employ digital techniques at a meaningful scale, and in a cost-effective manner, is now available.
Key to presenting conventionally published materials via the Internet is file size: files must be small enough to move from server to client without an inordinate wait for display. Currently, file size is most commonly reduced by running optical character recognition on scanned image files to render the text of the work in ASCII characters, which must then be checked to ensure that the software correctly interpreted the original text. Illustrations within a publication present another challenge and are often either omitted or severely degraded in quality for the sake of reduced file size. In general this is a labor-intensive reformatting of the original that can still leave a researcher wondering how true the digital version is to the original. Fortunately there is now a viable alternative to this arduous approach.
The technology that we have successfully utilized to make items easily available over the Internet is the layered, wavelet-based DjVu format (pronounced déjà vu), which was developed in the late 1990s by an AT&T research team. The format is now licensed to the imaging company LizardTech and is currently used by several companies, as well as government agencies, for document management. Recently The New Yorker chose this format to make all of its prior issues available.
A typical three-hundred-page volume should require no more than five hours of student time to capture and process. At six dollars per student hour, the operation comes to thirty dollars per volume, or ten cents per page. While this is more costly than rebinding, it will almost always be less expensive than replacement or preservation photocopy, and there should be no subsequent expense as with physical maintenance of titles. A volume of this size generally requires less than ten megabytes of storage, so a single two-hundred-gigabyte drive could harbor twenty thousand three-hundred-page items. Drives of this size are common and inexpensive, so server storage cost is not a significant factor. Clearly, making a million pages that are currently available only on paper directly viewable, as well as searchable, via the Internet for about one hundred thousand dollars is a prudent pecuniary investment as well as a wise approach to preserving and augmenting access to information.
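The cost and storage figures above can be checked with simple arithmetic (amounts kept in integer cents to avoid rounding):

```python
# Per-volume cost, per the text: $6/hour of student time, 5 hours per volume.
wage_cents_per_hour = 600
hours_per_volume = 5
pages_per_volume = 300

cost_per_volume_cents = wage_cents_per_hour * hours_per_volume   # $30.00 per volume
cost_per_page_cents = cost_per_volume_cents // pages_per_volume  # 10 cents per page

# Storage: ~10 MB per volume on a 200 GB drive (1 GB taken as 1000 MB).
volumes_per_drive = 200 * 1000 // 10                             # 20,000 volumes

# Scaling to a million pages at 10 cents per page.
million_page_cost_dollars = 1_000_000 * cost_per_page_cents // 100  # $100,000
```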
If you are in a position to move this project forward with a monetary contribution, we have established a Foundation of Knowledge Fund to receive such gifts. For more information, or to leave a comment regarding this effort, the following email address is provided: