GREENSTONE DIGITAL LIBRARY

FROM PAPER TO COLLECTION

Dr Michel Loots, Dan Camarzan and Ian H. Witten

Human Info NGO, Belgium
Simple Words, Romania
University of Waikato, New Zealand

Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO. It is open-source software, available from http://greenstone.org under the terms of the GNU General Public License.

We want to ensure that this software works well for you. Please report any problems to [email protected]

Greenstone gsdl-2.50  March 2004

About this manual

This document explains how to create CD-ROM collections from paper documents. It describes in full detail the procedures and economics involved in the scanning and optical character recognition (OCR) processes, so that you end up with text in the right format to apply the Greenstone software. It also describes how to create and edit the material associated with a collection.

We have tried to be as plain as possible in our explanation. Reference to any trademark or company product is purely for illustrative purposes, and does not imply that we endorse or favor this product over any other.

Companion documents

The complete set of Greenstone documents includes five volumes:

Copyright

Copyright © 2002 2003 2004 2005 2006 2007 by the New Zealand Digital Library Project at the University of Waikato, New Zealand.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”

Acknowledgements

The scanning operation and other know-how relating to the creation of collaborative non-profit collections have been developed by Dr Michel Loots, MD, of Human Info NGO and HumanityCD, Dan Camarzan of Simple Words, and their team of collaborators in Brasov, Romania.

The Greenstone software is a collaborative effort between many people. Rodger McNab and Stefan Boddie are the principal architects and implementors. Contributions have been made by David Bainbridge, George Buchanan, Hong Chen, Michael Dewsnip, Katherine Don, Elke Duncker, Carl Gutwin, Geoff Holmes, Dana McKay, John McPherson, Craig Nevill-Manning, Dynal Patel, Gordon Paynter, Bernhard Pfahringer, Todd Reed, Bill Rogers, John Thompson, and Stuart Yeates. Other members of the New Zealand Digital Library project provided advice and inspiration in the design of the system: Mark Apperley, Sally Jo Cunningham, Matt Jones, Steve Jones, Te Taka Keegan, Michel Loots, Malika Mahoui, Gary Marsden, Dave Nichols and Lloyd Smith. We would also like to acknowledge all those who have contributed to the GNU-licensed packages included in this distribution: MG, GDBM, PDFTOHTML, PERL, WGET, WVWARE and XLHTML.

Contents

Introduction
Scanners and scanning
Scanners
Preparing the documents
The scanning process
Productivity and resources
OCR: Optical Character Recognition
The OCR process
Productivity and resources
Alternatives to OCR
Combining scanning and OCR
Three examples: 1000 to 100,000 pages
Typical small collection: 500 to 1000 pages
All publications from an organization: 5000 pages
A small library: 100,000 pages
Creating an electronic collection
Methods of collection building
Getting started in seven steps and 15 minutes
