A brief history of the
Greenstone Digital Library Software

 

Ian H. Witten and David Bainbridge

 

University of Waikato, Hamilton, New Zealand

 

 

 

At the time of writing (January 2007) Greenstone—a versatile open source multilingual digital library environment with over a decade of pedigree—has a user base hailing from over 70 countries, is downloaded 4,500 times a month, runs on all popular operating systems (even the iPod!), and has a reader’s interface in over 40 languages. How did this software project and the research team behind it reach this point? Team members often give anecdotal stories about life behind the scenes at conferences and workshops; this article gives a more definitive and coherent account of the project.

The New Zealand Digital Library project grew out of research on text compression (Bell et al., 1990) and, later, index compression (Witten et al., 1994). Around this time we heard of digital libraries, and pointed out the potential advantages of compression at the first-ever digital library conference (Bell et al., 1994). The New Zealand Digital Library Project was established in 1995, beginning with a collection of 50,000 computer science technical reports downloaded from the Internet (Witten et al., 1995). At the time several research groups in computer science departments were collecting technical reports and making them available on the web: our main contribution was the use of full-text indexing for effective search. We were assisted by equipment funding from the New Zealand Lotteries Board and operating funding from the New Zealand Foundation for Research, Science and Technology (1996–1998 and 2002–2007).

In 1997 we began to work with Human Info NGO to help them produce fully-searchable CD-ROM collections of humanitarian information. This necessitated making our server (and in particular the full-text search engine it used), which had been developed under Linux, run on Windows machines—including the early Windows 3.1 and 3.11 because, although by then obsolete, they were prevalent in developing countries. This was demanding but largely uninteresting technically: we had to develop expertise in long-forgotten software systems, and it was hard to find suitable compilers (eventually we obtained a “second-hand” one from a software auction). The first publicly available CD-ROM, the Humanity Development Library 1.3, was issued in April 1998. A French collection, UNESCO’s Sahel point Doc, appeared a year later; all the documents, along with the entire interface, help text, and full-text search mechanism, were in French. The first multilingual collection came six months later: a Spanish/English Biblioteca Virtual de Desastres/Virtual Disaster Collection. Since then about 40 CD-ROM collections have been published. They are produced by Human Info in Romania: we wrote the software and were heavily involved in preparing the first few CD-ROMs, and then transferred the technology to them so that they could proceed independently. At this point we realized that we did not aspire to be a digital library site ourselves, but rather to develop software that others could use for their own digital libraries.

Towards the end of 1997 we adopted the term Greenstone: we decided that “New Zealand Digital Library Software” was not only clumsy but could impede international acceptance and therefore sought a new name. “Greenstone” turned out to be an inspired choice: snappy, memorable, and un-nationalistic but with strong national connotations within New Zealand—a form of nephrite jade, greenstone is a hallowed substance for Māori, valued more highly than gold. Moreover, it is easy to spell and pronounce. Our earlier Weka (think mecca) machine learning workbench, an acronym that in Māori spells the name of a flightless native bird, suffers from being mispronounced weaka by some. And the term Greenstone is not overly common—today we are the number one Google hit for it. The decision to issue the software as open source, and to use the GNU General Public Licence, was made around the same time. We did not discuss this with University of Waikato authorities—New Zealand universities are obsessed with commercialization and we would have been forced into an endless round of deliberations on commercial licensing—but simply began to release under GPL. Early releases were posted on our website greenstone.org (which was registered on 13 August 1998), but in November 2000 we moved to the SourceForge site for distribution (partly due to the per-megabyte charging scheme that our university levied for both outgoing and incoming web traffic). Our employers were not particularly happy when our licensing fait accompli became apparent years later, but have grown to accept (and perhaps even appreciate) the status quo because of our evident international success.

An early in-house project utilizing Greenstone was the Niupepa collection of Māori-language newspapers. We began the work of OCRing 20,000 page images in 1998, and made an initial demonstration collection. In 2000–2001 we received (retrospective!) funding from the Ministry of Education to continue the work. Virtually the entire Niupepa was available online early in 2001, but the collection was not officially launched until March 2002 at the Annual General Meeting of Te Rūnanga o Ngā Kura Kaupapa Māori (the controlling body of Māori medium/theology schools). Niupepa is still the largest collection of online Māori-language documents, and is extensively used; Apperley et al. (2002) give a comprehensive description of how it was developed. On 13 November 2000, in a moving ceremony, the Māori people presented our project with a ceremonial toki (adze) as a gift in recognition of our contributions to indigenous language preservation (see Figure 1).

In 1999 the BBC in London were concerned about the threat of Y2K bugs on their database of one million lengthy metadata records for radio and television programmes. They decided to augment their heavy-duty mainframe database with a fully-searchable Greenstone system that could run on ordinary desktop machines. A Greenstone collection was duly built and delivered (within two days of receiving the full dataset). We tried to get them to the point where they could maintain it themselves, but they were not interested: instead we updated it for them regularly (incidentally providing us with a useful small source of revenue). They eventually moved to different technology in early 2006, with the aim of making the metadata (and ultimately the programme content) publicly available online in a way that resembles what Amazon does for books—something that we think requires a tailor-made portal rather than a general-purpose digital library system.

This toki (adze) was a gift from the Māori people in recognition of our project’s contributions to indigenous language preservation, and resides in the project laboratory at the University of Waikato. In Māori culture there are several kinds of toki, with different purposes. This one is a ceremonial adze, toki pou tangata, a symbol of chieftainship. The rau (blade) is sharp, hard, and made of pounamu or greenstone: hence the Greenstone software, at the cutting edge of digital library technology. There are three figures carved into the toki. The forward-looking one looks out to where the rau is pointing to ensure that the toki is appropriately targeted. The backward-looking one at the top is a sentinel that guards where the rau can’t see. There is a third head at the bottom of the handle which makes sure that the chief’s decisions (to which the toki lends authority) are properly grounded in reality. The name of this taonga, or art-treasure, is Toki Pou Hinengaro, which translates roughly as “the adze that shapes the excellence of thought.”
Figure 1. The Greenstone toki
We became acquainted with UNESCO through Human Info’s long-term relationship with them. Although they supported Human Info’s goal of producing humanitarian CD-ROMs and distributing them in developing countries, UNESCO were really interested in sustainable development, which requires empowering people in those countries to produce and distribute their own digital library collections, following that old Chinese proverb about giving a man fish versus teaching him to fish.[1] We had by then transferred our collection-building technology to Human Info, and tried (though without success) to transfer it to the BBC, but this was a completely different proposition: to put the power to build collections into the hands of those other than IT specialists, typically librarians. We began by packaging up our Perl scripts and documenting them so that others could use them, and slowly, painfully, came to terms with the fact that operating at this level is anathema to librarians. In 2001 we produced a web-based system called the “Collector” that was announced in a paper whose title proudly proclaimed “Power to the people: end-user building of digital library collections” (Witten et al., 2001). However, this was never a great success: web-based submission to repository systems (including Greenstone collections) is commonplace today, but we were trying to allow users to design and configure digital library collections over the web as well as populate them. The next year we began a Java development that became known as the Greenstone Librarian Interface (Bainbridge et al., 2003), which grew over the years into a comprehensive system for designing and building collections and includes its own metadata editor.

From the outset, UNESCO’s goal was to produce CD-ROMs containing the entire Greenstone software (not just individual collections plus the run-time system, as in Human Info’s products), so that it could be used by people in developing countries who did not have ready access to the Internet.[2] These were the tangible outcomes of a series of small contracts with UNESCO: we felt that the CD-ROMs were more of symbolic than actual significance because in practice they rapidly became outdated by frequent new releases of the software appearing on the Internet. They were produced every year from 2002 to 2006. The CD-ROMs also contained all the auxiliary software needed to run Greenstone, which is not included in the Internet distributions because it can be obtained from other sources (links are provided). When we and others started to give workshops, tutorials, and courses on Greenstone we adopted a policy of putting all instructional material (PowerPoint slides, exercises, sample files for projects) on a workshop CD-ROM, and began to include this auxiliary material on the UNESCO distributions. This ultimately led to their downfall, for the company producing the CD-ROMs began to question the provenance of some of the sample files they contained, and ultimately demanded explicit proof of permission to reproduce all the information and software. Although everything was, in principle, open source, so much had to be stripped out that the 2006 CD-ROM distribution was seriously emasculated. CD-ROM distributions for workshops, however, continue because they are produced on a far more limited scale.

Good documentation was (rightly!) seen by UNESCO as crucial. They were keen to make the Greenstone technology available in Spanish, French, and Russian (Arabic and Chinese are also official UNESCO languages, but for some reason never figured in our discussions). We already had versions of the interface in these (and many other) languages, but UNESCO wanted everything to be translated—not just the documentation, which was extensive (four substantial manuals) but all the installation instructions, README files, example collections, etc. We might have demurred had we realized the extent to which such a massive translation effort would threaten to hobble the potential for future development, and have since suffered mightily in getting everything—including last-minute interface tweaks—translated for each upcoming UNESCO CD-ROM release. The cumbersome process of maintaining up-to-date translations in the face of continual evolution of the software—which is, of course, to be expected in open source systems—led us to devise a scheme for maintaining all language fragments in a version control system so that the system could tell what needed updating. This resulted in the Greenstone Translator’s Interface, a web portal where officially registered translators can examine the status of the language interface for which they are responsible, and update it (Bainbridge et al., 2003). Today the interface has been translated into 43 languages (with a further 8 in progress), 28 of which have a designated volunteer maintainer.
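The idea of letting the system tell translators what needs updating can be illustrated with a minimal sketch. It is not the Translator’s Interface itself: the Fragment record, the integer revision numbers, the fragment keys, and the function name are all hypothetical, chosen only to show how comparing revisions flags a translation as missing or out of date.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One interface text fragment (hypothetical record layout)."""
    key: str
    text: str
    revision: int  # revision at which this text was last changed


def fragments_needing_update(source, translation):
    """Return keys whose translation is missing or older than the source text."""
    stale = []
    for key, src in source.items():
        translated = translation.get(key)
        if translated is None or translated.revision < src.revision:
            stale.append(key)
    return sorted(stale)


# Illustrative data only; these keys and revisions are invented.
english = {
    "search.button": Fragment("search.button", "Search", 10),
    "help.title": Fragment("help.title", "Help", 42),
    "about.text": Fragment("about.text", "About this collection", 17),
}
french = {
    "search.button": Fragment("search.button", "Rechercher", 12),
    "help.title": Fragment("help.title", "Aide", 30),
}

print(fragments_needing_update(english, french))
```

In this example the French "help.title" was translated before the English text last changed, and "about.text" has no French version at all, so both are reported; "search.button" is up to date.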

Most people are surprised by the small size of the Greenstone team. Historically, for most of the duration of the project we have employed 1–2 programmers, although recently the number has crept up to 3–4. Several faculty involved in aspects of digital library research are associated with the project, but only two have viewed the Greenstone software as their main interest—partly because although the work is ground-breaking the research outputs are of questionable value in the university evaluation and promotion process. Graduate students rarely contribute to the code base directly because of concerns about retaining the production-level code quality and programming conventions painstakingly acquired over many years, although several students work in areas cognate to digital libraries. Our external users tend to be librarians rather than software specialists and we have received few major contributions or bug fixes from them. To summarize, the Greenstone digital library software has been created by a couple of skilled people working over a 10-year period—and along the way there have been several changes of personnel. It’s amazing what excellent programmers can do.

With UNESCO’s encouragement (and occasional sponsorship), we have worked to enable developing countries to take advantage of digital library technology by running hands-on workshops. This has enabled team members to travel to many interesting places. In what other area, for example, might a computer science professor get the opportunity to spend a week giving a course at the UN International Criminal Tribunal for Rwanda in Arusha, Tanzania, at the foot of Mount Kilimanjaro—or in Havana, Cuba? Recognizing that devolution is essential for sustainability, we are now attempting to distribute this effort by establishing regional Greenstone Support Groups: the first, for South Asia, was launched in April 2006.

Greenstone won the 2004 IFIP Namur award, which recognizes recipients for raising awareness internationally of the social implications of information and communication technologies; and was a finalist for the 2006 Stockholm Challenge, the world’s leading ICT Prize for entrepreneurs who use ICT to improve living conditions and increase economic growth. Our project received the Vannevar Bush award for the best paper at the ACM Digital Libraries Conference in 1999, the Literati Club Highly Commended Award in 2003, and the best international paper award at the Joint Conference on Digital Libraries in 2004.

Greenstone is promoted by UNESCO (Paris) under its Information for All programme. It is distributed with the FAO’s (Rome) Information Management Resource Kit (2005), along with tutorial information on its use. It forms the basis of the Institute for Information Technology in Education’s course on Digital Libraries in Education (2006). An extensive early description appears in Witten and Bainbridge’s book How to build a digital library (Witten and Bainbridge, 2003). In 2002–2003 our principal developer at that time left the project to form DL Consulting, an enterprise that specializes in building and customizing Greenstone collections and has won several awards as the region’s fastest-growing exporter and ICT company.

Many early digital library projects focused on interoperability. Although this is clearly a very important issue, we felt that this attention was premature—we well remember a digital library conference where interest was so strong that there were two panel discussions on interoperability, the only catch being that they were parallel sessions, which permitted no … er … interoperability. We adopted the informal motto “first operability, then interoperability”; and focused on other issues such as ingesting documents and metadata in a very wide variety of formats. More recently we have added many interoperability features, which, as we had expected, were not hard to retrofit: communication with Z39.50, SRW, OAI-PMH, DSpace, and METS are just a few examples (Bainbridge et al., 2006).

We continually struggle with the fundamental conflict between stability and evolution. We place a strong emphasis on backwards compatibility: it is rare for new software releases to have any effect at all on existing collections, and then only in minor respects. Only recently have we made a concession to hardware obsolescence by making alterations that no longer allow standard Greenstone collections to be served on Windows 3.1/3.11.

In order to take advantage of new developments in software technology we began a new project, Greenstone 3, which is a complete redesign and reimplementation of the original digital library software (Greenstone 2). It incorporates all the features of the existing system, and is backwards compatible: that is, it can build and run existing collections without modification. It is structured as a network of independent modules that communicate using XML: thus it runs in a distributed fashion and can be spread across different servers as necessary. This modular design increases the flexibility and extensibility of Greenstone. However, although initial versions of Greenstone 3 have been released, continual demands from users for further development of Greenstone 2 have delayed progress on the new version.
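To make the idea of independent modules exchanging XML concrete, here is a small sketch in Python. It does not reproduce Greenstone 3’s actual message format or module names: the request and response elements, the "to" attribute, and the search_module and route functions are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET


def search_module(request_xml: str) -> str:
    """Hypothetical module: read an XML request, answer with an XML response."""
    request = ET.fromstring(request_xml)
    query = request.findtext("query", default="").lower()
    # Stand-in for a real index; these titles are invented.
    documents = ["greenstone history", "text compression", "digital library software"]
    response = ET.Element("response", {"from": "search"})
    for title in documents:
        if query in title:
            ET.SubElement(response, "hit").text = title
    return ET.tostring(response, encoding="unicode")


# Registry of modules; in a distributed setting each could live on another server.
MODULES = {"search": search_module}


def route(message_xml: str) -> str:
    """Deliver a message to the module named in its 'to' attribute."""
    target = ET.fromstring(message_xml).get("to")
    return MODULES[target](message_xml)


request = '<request to="search"><query>library</query></request>'
reply = route(request)
hits = [hit.text for hit in ET.fromstring(reply).findall("hit")]
print(hits)
```

Because the caller and the module share nothing but the XML text, the call inside route could be replaced by a network request without changing either side, which is the property that lets such a system be spread across servers.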

Greenstone 3 was originally envisaged purely as a research framework: backwards compatibility would be possible but required IT skills. We have achieved this aim: it is now much easier for graduate and undergraduate project students to build upon the digital library core (e.g. the Language Learning Digital Library, Wu and Witten, 2006). However, we have found that maintaining two independent versions of Greenstone (in particular, ensuring backwards compatibility when new and enhanced features are added to Greenstone 2) is beyond our resources. Consequently we have committed to a new vision: to develop Greenstone 3 to the point that, by default, its installation and operation are, to the user, indistinguishable from those of Greenstone 2. This work will be included in the next release of Greenstone 3, slated for March 2007.

 

References

 

Apperley, M., Keegan, T.T., Cunningham, S.J. and Witten, I.H. (2002) “Delivering the Maori-language newspapers on the Internet.” Rere atu, taku manu! Discovering history, language and politics in the Maori-language newspapers, edited by J. Curnow, N. Hopa and J. McRae. Auckland University Press: 211-232.

Bainbridge, D., Thompson, J. and Witten, I.H. (2003) “Assembling and enriching digital library collections.” Proc Joint Conference on Digital Libraries, Houston, Texas.

Bainbridge, D., Edgar, K.D., McPherson, J.R. and Witten, I.H. (2003) “Managing change in a digital library system with many interface languages.” Proc European Conference on Digital Libraries ECDL2003, Trondheim, Norway.

Bainbridge, D., Ke, K.-Y.J. and Witten, I.H. (2006) “Document level interoperability for collection creators.” Proc Joint Conference on Digital Libraries, pp. 105-106, Chapel Hill, NC.

Bell, T.C., Moffat, A. and Witten, I.H. (1994) “Compressing the digital library.” Proc Digital Libraries '94, pp. 41-46, College Station, Texas, June.

Bell, T.C., Cleary, J.G. and Witten, I.H. (1990) Text compression. Prentice Hall, Englewood Cliffs, NJ.

Witten, I.H., Moffat, A. and Bell, T.C. (1994) Managing gigabytes: compressing and indexing documents and images. Van Nostrand Reinhold, New York.

Witten, I.H., Cunningham, S.J., Vallabh, M. and Bell, T.C. (1995) “A New Zealand digital library for computer science research.” Proc Digital Libraries '95, pp. 25-30, Austin, Texas, June.

Witten, I. H., Bainbridge, D. and Boddie, S.J. (2001) “Power to the people: end-user building of digital library collections.” Proc Joint Conference on Digital Libraries, Roanoke, VA.

Witten, I.H. and Bainbridge, D. (2003) How to build a digital library. Morgan Kaufmann, San Francisco, CA.

Wu, S. and Witten, I.H. (2006) “Towards a digital library for language learning.” Proc European Conference on Digital Libraries, Alicante, Spain.

 

 

Timeline of significant events

 

2007                     Greenstone distributed with IITE’s course Digital Libraries in Education
2006        May          Finalist for the Stockholm Challenge
            Apr          Greenstone Support Group for South Asia launched
2005        Nov          Initial release of Greenstone 3
            Feb          Greenstone distributed with FAO’s Information Management Resource Kit
2004        Jan          IFIP Namur award
2002        Jun          DL Consulting incorporated
                         Began development of the Greenstone Translator’s Interface
2002        Apr          Began development of Greenstone 3
            Mar          Official opening of the Niupepa collection
                         Began development of the Greenstone Librarian Interface
            Jun          First UNESCO Greenstone CD-ROM
2001                     Development of the Collector
2000        Nov          Began to distribute software on SourceForge
            Nov          Toki presented to the NZ Digital Library project on behalf of the entire Māori people
            Aug          Formally established cooperative effort with UNESCO and Human Info NGO
            Apr          Greenstone mailing list started
1999        Dec          BBC collection established
1998        Aug          Greenstone.org website established
            Apr          First CD-ROM collection released: Humanity Development Library
1997                     Decision to use the GPL; name “Greenstone” adopted
                         Began work with Human Info NGO to produce humanitarian CD-ROMs
1995        May          Digital library of Computer Science Technical Reports

 

Greenstone releases

 

2006        Dec          2.72
            Oct          2.71
            Mar          2.70
            Jan          2.63
2005        Jun          2.62
            Apr          2.60
            Mar          2.53
2004        Oct          2.52
            Jun          2.51
            Feb          2.50
2003        Dec          2.41
            Jun          2.40
            Mar          2.39
2002        Jan          2.38
2001        Oct          2.37
            Jun          2.36
            May          2.35
            Apr          2.33
            Feb          2.31
            Feb          2.30
2000        Dec          2.30
            Sep          2.27
            Jul          2.25
            Jun          2.23
            Jun          2.22
            Apr          2.21
            Feb          2.12

 

UNESCO Greenstone CD-ROMs

 

These contain the entire Greenstone software, and are intended for use in developing countries with limited access to the Internet.

2006        May          UNESCO CD-ROM v2.7 (Greenstone v2.70)        English/French/Spanish/Russian

2005        May          UNESCO CD-ROM v2.6 (Greenstone v2.60)        English/French/Spanish/Russian

2004        Mar           UNESCO CD-ROM v2.0 (Greenstone v2.50)        English/French/Spanish/Russian

2003        Mar           UNESCO CD-ROM v1.1 (Greenstone v2.39)        English/French/Spanish

2002        Jun           UNESCO CD-ROM v1.0 (Greenstone v2.38)        English

 

Human Info NGO CD-ROMs

 

Prior to the year 2000 we worked with Human Info NGO to help them produce humanitarian CD-ROMs using Greenstone. (Many more have been produced since: about 40 in total to date.)

 

2006        Apr          Appropriate Technology Knowledge Collection
2005        May          Gender and HIV/AIDS Electronic Library
            ???          Textes de Base sur L’Environment au Senegal (French)
            Jan          Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v3.0 (English/German/French/Spanish)
2004        Nov          Africa Collection for Transition: From Relief to Development v1.01
            Sep          UNECE Committee for Trade, Industry and Enterprise Development (English/French/Russian)
            ???          INEE Technical Kit on Education in Emergencies and Early Recovery
            Jan          Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v2.0 (English/German/French/Spanish)
2003        ???          Education, Work and the Future/Education Travail et Avenir (English/French) v2.0
            Oct          Revised Curricula for Technical Colleges and Polytechnics
            Jul          UNAIDS Library v2.0 (English/French/Spanish/Russian)
            May          Biblioteca Virtual de Salud para des Desastres/Health Library for Disasters v2.0 (Spanish/English)
            Mar          Food and Nutrition Library v2.2
            ???          Educational Aids/Lehr- und Lernmittel/Moyens didactiques/Material didáctico v1.0 (English/German/French/Spanish)
            Jan          ICT Training Kit and Digital Library for African Educators
2002        Aug          Community Development Library for Sustainable Development and Basic Human Needs v2.1
            Jul          Food and Nutrition Library v2.0
            Mar          UNDP Energy for Sustainable Development Library
2001        Dec          UNAIDS Library of Current Documents v1.1 (English/French/Spanish/Russian)
            Oct          East African Development Library
            ???          Safe Motherhood Strategies (English/French/Spanish)
            Jul          Researching Education Development
            Jun          Biblioteca Virtual de Salud para des Desastres/Health Library for Disasters (Spanish/English)
            Jun          WHO Medicines Bookshelf
            Jan          Africa Collection for Transition
2000        Dec          World Environmental Library v1.1
            ???          Sahel point Doc v2.0 (French)
            Jan          Food and Nutrition Library v1.0
1999        Dec          Medical and Health Library v1.0
            Dec          Bibliothèque pour le Développement Durable et des Besoins Essentials v1.0 (French)
            Nov          Biblioteca Virtual de Desastres/Virtual Disaster Library (Spanish, some English)
            ???          UNU Collection on Critical Global Issues v2.0
            Mar          Sahel point Doc (French)
            Feb          Humanity Development Library v2.0
1998        ???          UNU Collection on Critical Global Issues v1.0
            Apr          Humanity Development Library v1.3

 

Greenstone workshops

 

As well as tutorials at conferences in the US and Europe, many workshops have been given on Greenstone in developing countries. Here are some that have been given by people closely associated with the project; there have been many others. They range from half a day to 6 days; most are 1–3 days. Many have been sponsored by UNESCO.

 

2007        May          Trinidad and Tobago National Library
            Feb          Vellore, India
2006        Dec          Calcutta, India
            Dec          New Delhi, India
            Nov–Dec      Kozhikode, India
            Oct          Vladimir, Russia
            Aug          Tirunelveli, India
            Jun          Hawaii, US
            Mar–Apr      Madras, India
            Mar          Durban, South Africa
            Feb          Bangkok, Thailand
2005        Nov          Cape Town, South Africa
            Nov–Dec      Arusha, Tanzania
            Sep          Suva, Fiji
            Aug          Bangalore, India
            Jul          Siena, Italy
            May          Ho Chi Minh City, Vietnam
            May          Kozhikode, India
2004        ???          Bombay, India
                         Havana, Cuba
            ???          Trivandrum, Kerala
            Aug–Sep      Windhoek, Namibia
            Jul          Suva, Fiji
            Jun          Cape Town, South Africa
            Mar          Dakar, Senegal
            Mar          Cape Town, South Africa
            Feb          Gaborone, Botswana
            Feb          Almaty, Kazakhstan
2003        Nov          Dakar, Senegal
            Nov          Suva, Fiji
            May          Bangalore, India (IISC)

 



[1] In New Zealand, by the way, they say “give a man a fish and he’ll eat for a day; teach a man to fish and he’ll sit in a boat and drink beer for the rest of his life.”

[2] Incidentally, UNESCO refused to use our toki logo on the CD-ROMs because they feel that in some developing countries axes are irrevocably linked to genocide. Our protests that this object is clearly ceremonial fell on deaf ears. Dealing with international agencies is sometimes very frustrating.