A brief history of the Greenstone Digital Library Software
Ian H. Witten and David Bainbridge
University of Waikato, Hamilton, New Zealand
At the time of writing (January 2007)
Greenstone—a versatile open source multilingual digital library environment
with over a decade of pedigree—has a user base hailing from over 70
countries, is downloaded 4,500 times a month, runs on all popular operating
systems (even the iPod!), and has a reader's interface in over 40 languages.
How did this software project and the research team behind it reach this point?
Team members often give anecdotal stories about life behind the scenes at
conferences and workshops; this article gives a more definitive and coherent
account of the project.
The New Zealand Digital Library project grew
out of research on text compression (Bell et al.,
1990) and, later, index compression (Witten et al., 1994). Around this time we heard of digital libraries, and pointed
out the potential advantages of compression at the first-ever digital library
conference (Bell et al., 1994). The New Zealand
Digital Library Project was established in 1995, beginning with a collection of
50,000 computer science technical reports downloaded from the Internet (Witten et
al., 1995). At the time several research groups in
computer science departments were collecting technical reports and making them
available on the web: our main contribution was the use of full-text indexing
for effective search. We were assisted by equipment funding from the New
Zealand Lotteries Board and operating funding from the New Zealand Foundation
for Research, Science and Technology (1996–1998 and 2002–2007).
In 1997 we began to work with Human Info NGO
to help them produce fully-searchable CD-ROM collections of humanitarian
information. This necessitated making our server (and in particular the
full-text search engine it used), which had been developed under Linux, run on
Windows machines—including the early Windows 3.1 and 3.11 because,
although by then obsolete, they were prevalent in developing countries. This
was demanding but largely uninteresting technically: we had to develop
expertise in long-forgotten software systems, and it was hard to find suitable
compilers (eventually we obtained a "second-hand" one from a software auction).
The first publicly available CD-ROM, the Humanity Development Library 1.3, was issued in April 1998. A French collection, UNESCO's Sahel
point Doc, appeared a year later; all the
documents, along with the entire interface, help text, and full-text search mechanism,
were in French. The first multilingual collection came six months later: a
Spanish/English Biblioteca Virtual de Desastres/Virtual
Disaster Collection. Since then about 40 CD-ROM
collections have been published. They are produced by Human Info in Romania: we
wrote the software and were heavily involved in preparing the first few
CD-ROMs, and then transferred the technology to them so that they could proceed
independently. At this point we realized that we did not aspire to be a digital
library site ourselves, but rather to develop software that others could use
for their own digital libraries.
Towards the end of 1997 we adopted the term Greenstone: we decided that "New Zealand Digital Library Software" was not
only clumsy but could impede international acceptance and therefore sought a
new name. "Greenstone" turned out to be an inspired choice: snappy, memorable,
and un-nationalistic but with strong national connotations within New
Zealand—a form of nephrite jade, greenstone is a hallowed substance for Māori, valued more highly than gold.
Moreover, it is easy to spell and pronounce. Our earlier Weka (think mecca) machine learning workbench,
an acronym that in Māori spells the name of a flightless native bird, suffers from being
mispronounced weaka by some. And the term
Greenstone is not overly common—today we are the number one Google hit
for it. The decision to issue the software as open source, and to use the GNU
General Public Licence, was made around the same time. We did not discuss this
with University of Waikato authorities—New Zealand universities are
obsessed with commercialization and we would have been forced into an endless
round of deliberations on commercial licensing—but simply began to
release under GPL. Early releases were posted on our website greenstone.org (which was registered on 13 August 1998), but in November 2000 we
moved to the SourceForge site for distribution (partly due to the per-megabyte
charging scheme that our university levied for both outgoing and incoming web
traffic). Our employers were not particularly happy when our licensing fait
accompli became apparent years later, but have
grown to accept (and perhaps even appreciate) the status quo because of our
evident international success.
An early in-house project utilizing
Greenstone was the Niupepa collection of Māori-language newspapers. We began the
work of OCRing 20,000 page images in 1998, and made an initial demonstration
collection. In 2000–2001 we received (retrospective!) funding from the
Ministry of Education to continue the work. Virtually the entire Niupepa was
available online early in 2001, but the collection was not officially launched
until March 2002 at the Annual General Meeting of Te Rūnanga o Ngā
Kura Kaupapa Māori (the controlling body of Māori medium/theology
schools). Niupepa is still the largest collection of online Māori-language
documents, and is extensively used; Apperley et al.
(2002) give a comprehensive description of how it was developed. On 13
November 2000, in a moving ceremony, the Māori people presented our
project with a ceremonial toki (adze) as a gift
in recognition of our contributions to indigenous language preservation (see
Figure 1).
In 1999 the BBC in London were concerned
about the threat of Y2K bugs on their database of one million lengthy metadata
records for radio and television programmes. They decided to augment their
heavy-duty mainframe database with a fully-searchable Greenstone system that
could run on ordinary desktop machines. A Greenstone collection was duly built
and delivered (within two days of receiving the full dataset). We tried to get
them to the point where they could maintain it themselves, but they were not
interested: instead we updated it for them regularly (incidentally providing us
with a useful small source of revenue). They eventually moved to different
technology in early 2006, with the aim of making the metadata (and ultimately
the programme content) publicly available online in a way that resembles what
Amazon does for books—something that we think requires a tailor-made
portal rather than a general-purpose digital library system.
We became acquainted with
UNESCO through Human Info's long-term relationship with them. Although they
supported Human Info's goal of producing humanitarian CD-ROMs and distributing
them in developing countries, UNESCO were really interested in sustainable development, which requires empowering people in those countries to
produce and distribute their own digital library collections—following
that old Chinese proverb about giving a man a fish versus teaching him to fish.[1]
We had by then transferred our collection-building technology to Human Info,
and tried (though without success) to transfer it to the BBC, but this was a
completely different proposition: to put the power to build collections into the hands of those other than IT specialists,
typically librarians. We began by packaging up our Perl scripts and documenting
them so that others could use them, and slowly, painfully, came to terms with
the fact that operating at this level is anathema to librarians. In 2001 we
produced a web-based system called the "Collector" that was announced in a
paper whose title proudly proclaimed "Power to the people: end-user building of
digital library collections" (Witten et al.,
2001). However, this was never a great success: web-based submission to
repository systems (including Greenstone collections) is commonplace today, but
we were trying to allow users to design and configure digital library
collections over the web as well as populate them. The next year we began a
Java development that became known as the Greenstone Librarian Interface
(Bainbridge et al., 2003), which grew over the
years into a comprehensive system for designing and building collections and
includes its own metadata editor.
From the outset, UNESCO's goal was to
produce CD-ROMs containing the entire Greenstone software (not just individual
collections plus the run-time system, as in Human Info's products), so that it
could be used by people in developing countries who did not have ready access
to the Internet.[2] These were
the tangible outcomes of a series of small contracts with UNESCO: we felt that
the CD-ROMs were more of symbolic than actual significance because in practice
they rapidly became outdated by frequent new releases of the software appearing
on the Internet. They were produced every year from 2002 to 2006. The CD-ROMs
contained all the auxiliary software needed to run Greenstone as well, which
is not included in the Internet distributions because it can be obtained
from other sources (links are provided). When we and others started to give
workshops, tutorials, and courses on Greenstone we adopted a policy of putting
all instructional material—PowerPoint slides, exercises, sample files for
projects—on a workshop CD-ROM, and began to include this auxiliary
material on the UNESCO distributions. This ultimately led to their downfall,
for the company producing the CD-ROMs began to question the provenance of some
of the sample files they contained, and ultimately demanded explicit proof of
permission to reproduce all the information and software. Although everything
was, in principle, open source, so much had to be stripped out that the 2006
CD-ROM distribution was seriously emasculated. CD-ROM distributions for workshops,
however, continue because they are produced on a far more limited scale.
Good documentation was (rightly!) seen by
UNESCO as crucial. They were keen to make the Greenstone technology available
in Spanish, French, and Russian (Arabic and Chinese are also official UNESCO
languages, but for some reason never figured in our discussions). We already
had versions of the interface in these (and many other) languages, but UNESCO
wanted everything to be translated—not
just the documentation, which was extensive (four substantial manuals) but all
the installation instructions, README files, example collections, etc. We might
have demurred had we realized the extent to which such a massive translation
effort would threaten to hobble the potential for future development, and have
since suffered mightily in getting everything—including last-minute
interface tweaks—translated for each upcoming UNESCO CD-ROM release. The
cumbersome process of maintaining up-to-date translations in the face of
continual evolution of the software—which is, of course, to be expected
in open source systems—led us to devise a scheme for maintaining all
language fragments in a version control system so that the system could tell
what needed updating. This resulted in the Greenstone Translator's Interface, a
web portal where officially registered translators can examine the status of
the language interface for which they are responsible, and update it
(Bainbridge et al., 2003). Today the interface
has been translated into 43 languages (with a further 8 in progress), 28 of
which have a designated volunteer maintainer.
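The idea of letting the system work out which language fragments need updating can be illustrated with a small sketch. This is not the Translator's Interface code: the class names, the string keys, and the use of a single integer revision number per fragment are all assumptions made for illustration. Each translated fragment records the revision of the English text it was translated from, and a fragment is flagged when it is missing or older than the current English text.

```python
from dataclasses import dataclass


@dataclass
class SourceString:
    """An English interface fragment (hypothetical structure)."""
    text: str
    revision: int  # revision at which the English text last changed


@dataclass
class Translation:
    """A translated fragment (hypothetical structure)."""
    text: str
    source_revision: int  # revision of the English text it was translated from


def needs_update(source, translations):
    """Return the sorted keys that are untranslated or out of date."""
    stale = []
    for key, src in source.items():
        tr = translations.get(key)
        if tr is None or tr.source_revision < src.revision:
            stale.append(key)
    return sorted(stale)


# Illustrative data only; these keys and strings are invented.
english = {
    "search.button": SourceString("Search", 3),
    "help.title": SourceString("Help", 5),
    "prefs.title": SourceString("Preferences", 2),
}
french = {
    "search.button": Translation("Rechercher", 3),
    "help.title": Translation("Aide", 4),
}

stale = needs_update(english, french)
print(stale)
```

In this example "help.title" is flagged because the English text changed after it was translated, and "prefs.title" is flagged because it has no translation yet.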
Most people are surprised by the small size
of the Greenstone team. Historically, for most of the duration of the project
we have employed 1–2 programmers, although recently the number has crept
up to 3–4. Several faculty involved in aspects of digital library
research are associated with the project, but only two have viewed the
Greenstone software as their main interest—partly because although the
work is ground-breaking the research outputs are of questionable value in the
university evaluation and promotion process. Graduate students rarely
contribute to the code base directly because of concerns about retaining the
production-level code quality and programming conventions painstakingly
acquired over many years, although several students work in areas cognate to
digital libraries. Our external users tend to be librarians rather than
software specialists and we have received few major contributions or bug fixes
from them. To summarize, the Greenstone digital library software has been
created by a couple of skilled people working over a 10-year period—and
along the way there have been several changes of personnel. It's amazing what
excellent programmers can do.
With UNESCO's encouragement (and occasional
sponsorship), we have worked to enable developing countries to take advantage
of digital library technology by running hands-on workshops. This has enabled
team members to travel to many interesting places. In what other area, for example,
might a computer science professor get the opportunity to spend a week giving a
course at the UN International Criminal Tribunal for Rwanda in Arusha,
Tanzania, at the foot of Mount Kilimanjaro—or in Havana, Cuba?
Recognizing that devolution is essential for sustainability, we are now
attempting to distribute this effort by establishing regional Greenstone
Support Groups: the first, for South Asia, was launched in April 2006.
Greenstone won the 2004 IFIP Namur award,
which recognizes recipients for raising awareness internationally of the social
implications of information and communication technologies; and was a finalist
for the 2006 Stockholm Challenge, the world's leading ICT Prize for
entrepreneurs who use ICT to improve living conditions and increase economic
growth. Our project received the Vannevar Bush award for the best paper at the ACM
Digital Libraries Conference in 1999, the Literati
Club Highly Commended Award in 2003, and the best international paper award at
the Joint Conference on Digital Libraries in
2004.
Greenstone is promoted by UNESCO (Paris)
under its Information for All programme. It is
distributed with the FAO's (Rome) Information Management Resource Kit (2005), along with tutorial information on its use. It forms the
basis of the Institute for Information Technology in Education's course on Digital
Libraries in Education (2006). An extensive early
description appears in Witten and Bainbridge's book How to build a digital
library (Witten and Bainbridge, 2003). In
2002–2003 our principal developer at that time left the project to form
DL Consulting, an enterprise that specializes in building and customizing
Greenstone collections and has won several awards as the region's
fastest-growing exporter and ICT company.
Many early digital library projects focused
on interoperability. Although this is clearly a very important issue, we felt
that this attention was premature—we well remember a digital library
conference where interest was so strong that there were two panel discussions on
interoperability, the only catch being that they were parallel sessions, which
permitted no, er, interoperability. We adopted the informal motto "first
operability, then interoperability"; and focused on other issues such as
ingesting documents and metadata in a very wide variety of formats. More
recently we have added many interoperability features, which, as we had
expected, were not hard to retrofit: communication with Z39.50, SRW, OAI-PMH,
DSpace, and METS are just a few examples (Bainbridge et al., 2006).
We continually struggle with the fundamental
conflict between stability and evolution. We place a strong emphasis on
backwards compatibility: it is rare for new software releases to have any
effect at all on existing collections, and then only in minor respects. Only
recently have we made a concession to hardware obsolescence by making
alterations that no longer allow standard Greenstone collections to be served
on Windows 3.1/3.11.
In order to take advantage of new
developments in software technology we began a new project, Greenstone 3, which
is a complete redesign and reimplementation of the original digital library
software (Greenstone 2). It incorporates all the features of the existing
system, and is backwards compatible: that is, it can build and run existing
collections without modification. It is structured as a network of independent
modules that communicate using XML: thus it runs in a distributed fashion and
can be spread across different servers as necessary. This modular design increases
the flexibility and extensibility of Greenstone. However, although initial
versions of Greenstone 3 have been released, continual demands from users for
further development of Greenstone 2 have delayed progress on the new version.
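As a rough illustration of modules that communicate using XML, the sketch below passes a request message to a toy search module and reads back its XML reply. The element and attribute names, the in-memory document store, and the function names are all hypothetical; this is not Greenstone 3's actual message format or API, only a minimal picture of how independent components can exchange self-describing messages.

```python
import xml.etree.ElementTree as ET

# Toy document store standing in for a collection (invented data).
DOCUMENTS = {
    "d1": "text compression and indexing",
    "d2": "Maori-language newspapers",
}


def build_request(module, query):
    """Serialize a query addressed to a named module (hypothetical format)."""
    msg = ET.Element("message")
    req = ET.SubElement(msg, "request", {"to": module})
    ET.SubElement(req, "query").text = query
    return ET.tostring(msg, encoding="unicode")


def search_module(request_xml):
    """Parse an XML request and answer with an XML response."""
    req = ET.fromstring(request_xml).find("request")
    query = req.findtext("query", default="").lower()
    msg = ET.Element("message")
    resp = ET.SubElement(msg, "response", {"from": req.get("to", "")})
    for doc_id, text in sorted(DOCUMENTS.items()):
        if query and query in text.lower():
            ET.SubElement(resp, "document", {"id": doc_id})
    return ET.tostring(msg, encoding="unicode")


reply = search_module(build_request("search", "compression"))
matched_ids = [d.get("id") for d in ET.fromstring(reply).iter("document")]
print(matched_ids)
```

Because the caller and the module share only the message text, the module could equally run in another process or on another server, which is the property the modular design relies on.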
Greenstone 3 was originally envisaged purely
as a research framework: backwards compatibility would be possible but would require
IT skills. We have achieved this aim: it is now much easier for graduate and
undergraduate project students to build upon the digital library core (e.g. the
Language Learning Digital Library, Wu and Witten 2006). However, we have found
that maintaining two independent versions of Greenstone—in particular,
ensuring backwards compatibility when new and enhanced features are added to
Greenstone 2—is beyond our resources. Consequently we have committed to a
new vision: to develop Greenstone 3 to the point that, by default, its
installation and operation are, to the user, indistinguishable from those of Greenstone
2. This work will be included in the next release of Greenstone 3, slated for
March 2007.
References
Apperley, M., Keegan, T.T., Cunningham, S.J. and Witten, I.H. (2002) "Delivering the Maori-language newspapers on the Internet." Rere atu, taku manu! Discovering history, language and politics in the Maori-language newspapers, edited by J. Curnow, N. Hopa and J. McRae. Auckland University Press: 211-232.
Bainbridge, D., Thompson, J. and Witten, I.H. (2003) "Assembling and enriching digital library collections." Proc Joint Conference on Digital Libraries, Houston, Texas.
Bainbridge, D., Edgar, K.D., McPherson, J.R. and Witten, I.H. (2003) "Managing change in a digital library system with many interface languages." Proc European Conference on Digital Libraries ECDL2003, Trondheim, Norway.
Bainbridge, D., Ke, K.-Y.J. and Witten, I.H. (2006) "Document level interoperability for collection creators." Proc Joint Conference on Digital Libraries, pp. 105-106, Chapel Hill, NC.
Bell, T.C., Moffat, A. and Witten, I.H. (1994) "Compressing the digital library." Proc Digital Libraries '94, pp. 41-46, College Station, Texas, June.
Bell, T.C., Cleary, J.G. and Witten, I.H. (1990) Text compression. Prentice Hall, Englewood Cliffs, NJ.
Witten, I.H., Moffat, A. and Bell, T.C. (1994) Managing gigabytes: compressing and indexing documents and images. Van Nostrand Reinhold, New York.
Witten, I.H., Cunningham, S.J., Vallabh, M. and Bell, T.C. (1995) "A New Zealand digital library for computer science research." Proc Digital Libraries '95, pp. 25-30, Austin, Texas, June.
Witten, I.H., Bainbridge, D. and Boddie, S.J. (2001) "Power to the people: end-user building of digital library collections." Proc Joint Conference on Digital Libraries, Roanoke, VA.
Witten, I.H. and Bainbridge, D. (2003) How to build a digital library. Morgan Kaufmann, San Francisco, CA.
Wu, S. and Witten, I.H. (2006) "Towards a digital library for language learning." Proc European Conference on Digital Libraries, Alicante, Spain.
Timeline of significant events
Year | Month | Event
2007 |       | Greenstone distributed with IITE's course Digital Libraries in Education
2006 | May   | Finalist for the Stockholm Challenge
     | Apr   | Greenstone Support Group for South Asia launched
2005 | Nov   | Initial release of Greenstone3
     | Feb   | Greenstone distributed with FAO's Information Management Resource Kit
2004 | Jan   | IFIP Namur award
2002 | Jun   | DL Consulting incorporated
     |       | Begin development of the Greenstone Translator's Interface
2002 | Apr   | Began development of Greenstone3
     | Mar   | Official opening of the Niupepa collection
     |       | Begin development of the Greenstone Librarian Interface
     | Jun   | First UNESCO Greenstone CD-ROM
2001 |       | Development of the Collector
2000 | Nov   | Begin to distribute software on SourceForge
     | Nov   | Toki presented to the NZ Digital Library project on behalf of the entire Māori people
     | Aug   | Formally established cooperative effort with UNESCO and Human Info NGO
     | Apr   | Greenstone mailing list started
1999 | Dec   | BBC collection established
1998 | Aug   | Greenstone.org website established
     | Apr   | First CD-ROM collection released: Humanity Development Library
1997 |       | Decision to use the GPL; name "Greenstone" adopted
     |       | Began work with Human Info NGO to produce humanitarian CD-ROMs
1995 | May   | Digital library of Computer Science Technical Reports
Greenstone releases
Year | Month | Version
2006 | Dec   | 2.72
     | Oct   | 2.71
     | Mar   | 2.70
     | Jan   | 2.63
2005 | Jun   | 2.62
     | Apr   | 2.60
     | Mar   | 2.53
2004 | Oct   | 2.52
     | Jun   | 2.51
     | Feb   | 2.50
2003 | Dec   | 2.41
     | Jun   | 2.40
     | Mar   | 2.39
2002 | Jan   | 2.38
2001 | Oct   | 2.37
     | Jun   | 2.36
     | May   | 2.35
     | Apr   | 2.33
     | Feb   | 2.31
     | Feb   | 2.30
2000 | Dec   | 2.30
     | Sep   | 2.27
     | Jul   | 2.25
     | Jun   | 2.23
     | Jun   | 2.22
     | Apr   | 2.21
     | Feb   | 2.12
UNESCO Greenstone CD-ROMs
These contain the entire Greenstone software, and are
intended for use in developing countries with limited access to the Internet.
2006 May UNESCO CD-ROM v2.7 (Greenstone v2.70) English/French/Spanish/Russian
2005 May UNESCO CD-ROM v2.6 (Greenstone v2.60) English/French/Spanish/Russian
2004 Mar UNESCO CD-ROM v2.0 (Greenstone v2.50) English/French/Spanish/Russian
2003 Mar UNESCO CD-ROM v1.1 (Greenstone v2.39) English/French/Spanish
2002 Jun UNESCO CD-ROM v1.0 (Greenstone v2.38) English
Human Info NGO CD-ROMs
Prior to the year 2000 we worked with Human Info NGO to help them produce
humanitarian CD-ROMs using Greenstone. (Many more have been produced since;
a total of about 40 to date.)
Year | Month | Collection
2006 | Apr   | Appropriate Technology Knowledge Collection
2005 | May   | Gender and HIV/AIDS Electronic Library
     | ???   | Textes de Base sur L'Environment au Senegal (French)
     | Jan   | Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v3.0 (English/German/French/Spanish)
2004 | Nov   | Africa Collection for Transition: From Relief to Development v1.01
     | Sep   | UNECE Committee for Trade, Industry and Enterprise Development (English/French/Russian)
     | ???   | INEE Technical Kit on Education in Emergencies and Early Recovery
     | Jan   | Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v2.0 (English/German/French/Spanish)
2003 | ???   | Education, Work and the Future/Education Travail et Avenir (English/French) v2.0
     | Oct   | Revised Curricula for Technical Colleges and Polytechnics
     | Jul   | UNAIDS Library v2.0 (English/French/Spanish/Russian)
     | May   | Biblioteca Virtual de Salud para des Desastres/Health Library for Disasters v2.0 (Spanish/English)
     | Mar   | Food and Nutrition Library v2.2
     | ???   | Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v1.0 (English/German/French/Spanish)
     | Jan   | ICT Training Kit and Digital Library for African Educators
2002 | Aug   | Community Development Library for Sustainable Development and Basic Human Needs v2.1
     | Jul   | Food and Nutrition Library v2.0
     | Mar   | UNDP Energy for Sustainable Development Library
2001 | Dec   | UNAIDS Library of Current Documents v1.1 (English/French/Spanish/Russian)
     | Oct   | East African Development Library
     | ???   | Safe Motherhood Strategies (English/French/Spanish)
     | Jul   | Researching Education Development
     | Jun   | Biblioteca Virtual de Salud para des Desastres/Health Library for Disasters (Spanish/English)
     | Jun   | WHO Medicines Bookshelf
     | Jan   | Africa Collection for Transition
2000 | Dec   | World Environmental Library v1.1
     | ???   | Sahel point Doc v2.0 (French)
     | Jan   | Food and Nutrition Library v1.0
1999 | Dec   | Medical and Health Library v1.0
     | Dec   | Bibliothèque pour le Développement Durable et des Besoins Essentials v1.0 (French)
     | Nov   | Biblioteca Virtual de Desastres/Virtual Disaster Library (Spanish, some English)
     | ???   | UNU Collection on Critical Global Issues v2.0
     | Mar   | Sahel point Doc (French)
     | Feb   | Humanity Development Library v2.0
1998 | ???   | UNU Collection on Critical Global Issues v1.0
     | Apr   | Humanity Development Library v1.3
Greenstone workshops
As well as tutorials
at conferences in the US and Europe, many workshops have been given on
Greenstone in developing countries. Here are some that have been given by
people closely associated with the project; there have been many others. They
range from half a day to 6 days; most are 1–3 days. Many have been
sponsored by UNESCO.
Year | Month   | Location
2007 | May     | Trinidad and Tobago National Library
     | Feb     | Vellore, India
2006 | Dec     | Calcutta, India
     | Dec     | New Delhi, India
     | Nov–Dec | Kozhikode, India
     | Oct     | Vladimir, Russia
     | Aug     | Tirunelveli, India
     | Jun     | Hawaii, US
     | Mar–Apr | Madras, India
     | Mar     | Durban, South Africa
     | Feb     | Bangkok, Thailand
2005 | Nov     | Cape Town, South Africa
     | Nov–Dec | Arusha, Tanzania
     | Sep     | Suva, Fiji
     | Aug     | Bangalore, India
     | Jul     | Siena, Italy
     | May     | Ho Chi Minh City, Vietnam
     | May     | Kozhikode, India
2004 | ???     | Bombay, India
     |         | Havana, Cuba
     | ???     | Trivandrum, Kerala
     | Aug–Sep | Windhoek, Namibia
     | Jul     | Suva, Fiji
     | Jun     | Cape Town, South Africa
     | Mar     | Dakar, Senegal
     | Mar     | Cape Town, South Africa
     | Feb     | Gaborone, Botswana
     | Feb     | Almaty, Kazakhstan
2003 | Nov     | Dakar, Senegal
     | Nov     | Suva, Fiji
     | May     | Bangalore, India (IISC)
[1] In New Zealand, by
the way, they say "give a man a fish and he'll eat for a day; teach a man to
fish and he'll sit in a boat and drink beer for the rest of his life."
[2] Incidentally, UNESCO refused to use our toki logo on the CD-ROMs because they felt that in some developing
countries axes are irrevocably linked to genocide. Our protests that this
object is clearly ceremonial fell on deaf ears. Dealing with international
agencies is sometimes very frustrating.