Greenstone tutorial exercise

Prerequisite: Building and searching with different indexers
Sample files: incr_build.zip
Devised for Greenstone version: 3.08
Modified for Greenstone version: 3.11

Incrementally building a collection using the command line

To allow you to quickly try out and experiment with our tutorial exercises, we tend to keep the number of sample files small. Every time you rebuild these collections, for simplicity, the default settings used in Greenstone mean that the previous version built is removed in its entirety. We refer to this as a full-rebuild. When building larger collections, this is inefficient.

Greenstone also has the ability to rebuild collections incrementally: this means the previous version of the collection is retained and only the changes detected need to be incorporated. There are, however, quite a few aspects to incremental building to control. This is the focus of this tutorial exercise.

To gain the best level of understanding, this tutorial builds collections using the command line.

  1. In GLI, create a new collection called Incremental With Manifests and base it on the Demo Collection. The short name of this collection will become incremen, and this will be the name of the collection's folder on the file system.

  1. Use GLI's Workspace view to navigate to this tutorial's sample files folder, incr_build. It will contain a folder named import. Open this. In GLI's Gather panel, drag and drop the 3 subfolders into your new collection. (You can also carry out this step using a file browser to copy the contents of the incr_build\import sample files folder into web\sites\localsite\collect\incremen\import.)

  1. Go to the Design panel > Search Indexes and look for Indexing Levels. Make document level searching the default.

  1. Do not build the collection in GLI. We'll be building and rebuilding manually, from the command-line terminal. So close GLI once the files and folders have finished copying into your collection. You can choose to run the Greenstone server at any stage, however.

  1. Since this is the first time we're building our collection, we're going to do a complete build, and we'll use the command line to do so. Open a terminal. To open a terminal in Windows, press the Windows key+R and type cmd in the Run dialog that displays. To open a terminal on a Mac machine, click on menu Go → Utilities → Terminal. Use the terminal to cd into your Greenstone installation folder. For instance, if you have Greenstone installed on Windows as "Greenstone" within your account folder at C:\Users\me, then type the following in your terminal and hit Enter:

    cd C:\Users\me\Greenstone

    If there are any spaces in the filepath, put double quotes on either side of the filepath.

    On Linux or Macs, the general command is the same, but the installed location would be different and the slashes go the other way. For example, if installed in /Users/me/Greenstone3, you'd type the following and hit Enter:

    cd /Users/me/Greenstone3

    Now you're ready to set up the Greenstone environment in your terminal. On Windows, type the following into your terminal and hit Enter again:

    gs3-setup.bat

    On Linux and Mac:

    source ./gs3-setup.sh

    When using a terminal, you'll need to hit Enter after each command in order to execute the command you just finished typing. We won't repeat this instruction any more. Just remember to hit Enter after every complete command entered into a terminal.

    With the terminal now operating within your Greenstone installation folder, and with the Greenstone environment now set up and ready, type the following command to do a complete build of your new collection. Although the command contains the word "rebuild" in it, since this is the first time the collection's being built, it will just build it.

    perl -S full-rebuild.pl -site localsite incremen

    For the rest of this tutorial exercise, leave open this terminal in which you have set up your Greenstone's environment. We'll be using it throughout.

  1. If the Greenstone server is not running (as would happen if you had closed GLI and didn't start the standalone Greenstone Server Interface application), then run it from the Start Menu on Windows now. You could also run the Greenstone server by running the gs3-server.bat script in the terminal if you're trying this on a Windows machine, or by running the gs3-server.sh script from a Linux/Mac terminal. Press the Enter Library button.

  1. Preview the incremen (Incremental With Manifests) collection.

    Throughout this tutorial, when previewing an (incrementally) rebuilt collection, make sure to reload any web page in the collection in order to ensure you're seeing any changes you've made. A "force reload", also referred to as a "hard refresh", is better: either hold down Ctrl while clicking the reload/refresh button, or press Ctrl+F5 in some browsers or Ctrl+Shift+R in others to make the browser do a force reload.

    When previewing, try searching for "kouprey" and you should get results, as this term occurs in the document b18ase.

    Next, try searching for "groundnuts" and no documents should match.

Incrementally adding some additional new documents to a collection

  1. If you want, you can use GLI to drag and drop the fb33fe, fb34fe and wb34te folders, located in the incr_build/more-files subfolder of sample files, into your collection.

    Alternatively, you can use a File Browser to copy the folders fb33fe, fb34fe and wb34te, located in the incr_build/more-files sample files subfolder, into your collection's import folder at web\sites\localsite\collect\incremen\import.

    The above step will only have gathered 3 new documents into your collection. However, since the changes have not been built, previewing at this stage will make no difference.

  1. We want to build just the newly added documents into the collection if possible, instead of rebuilding everything. This time, instead of running full-rebuild, we'll be running the incremental-import and incremental-buildcol scripts to perform the two phases of a Greenstone build operation incrementally, these being the import and buildcol phases. Incremental building allows us to (re)build just what is necessary, rather than everything.

    Since we know exactly which files have been added and thus which files need to be built, we can write a manifest file specifying this. The manifest files used by the Greenstone incremental building process are just XML files that can be created and edited in a plain text editor, and which indicate which files need to be (re)processed by a Greenstone incremental build operation.

    We've already prepared the manifest files we'll be using in this tutorial exercise for you. Use a File Browser to copy the manifests subfolder from the incr_build sample files into your incremen collection folder that's located inside your Greenstone installation directory (at web\sites\localsite\collect\incremen).

    In a text editor, open the add-new-files.xml manifest file found in the newly copied manifests subfolder. Inspect the contents of this manifest file. It should contain:

    <?xml version="1.0" encoding="UTF-8"?>
    <Manifest>
    <Index>
    <Filename>fb33fe/fb33fe.htm</Filename>
    <Filename>fb34fe/fb34fe.htm</Filename>
    <Filename>wb34te/wb34te.htm</Filename>
    </Index>
    </Manifest>

    The above lists the 3 main documents to be added to and indexed by Greenstone (hence the keyword <Index>). Since these documents sit inside their own subfolders once copied into the import folder, the manifest file gives each document's path relative to the collection's import folder, e.g. "fb33fe/fb33fe.htm" shows that the fb33fe.htm HTML document is located in the folder fb33fe. Only the main documents to be added are listed, not the associated image files found at the same folder level: Greenstone tracks down all the image files referred to by the main HTML documents and processes them as files associated with the HTML.

  1. Return to the terminal you had left open. We can finally run the commands for the incremental build operation.

    Use the terminal to first run the incremental import stage:

    perl -S incremental-import.pl -incremental -manifest manifests/add-new-files.xml -site localsite incremen

    The build log output will end with the messages

    * 3 documents were considered for processing
    * 3 documents were processed and included in the collection

    This means just the 3 newly added documents were imported, just as specified by our manifest file add-new-files.xml.

    Once that incremental-import command has finished running, start off the incremental buildcol stage of the build process:

    perl -S incremental-buildcol.pl -activate -site localsite incremen

    The incremental import command specifies the manifest file that Greenstone is to consult in order to work out which files should be processed and how (whether each is to be Indexed, Deleted or Reindexed). By the buildcol stage, those files are ready for further incremental processing by the buildcol script. The -activate flag to the incremental buildcol script tells Greenstone to (re-)activate the updated collection if the Greenstone server is running.

  1. Preview the collection either by running the Greenstone Server Interface application, if it isn't already running, or by starting the Greenstone server from the command line with the command:

    ant start

    (To stop the Greenstone server at any point, use the command ant stop. To stop-then-start it, you'd use ant restart.)

    When the server is running, preview your library home page, located by default at http://localhost:8383/greenstone3/library. Visit the Incremental With Manifests collection and click on the Titles browser. There should be 3 additional documents now, and you should be able to search for terms that occur in them. For example, searching for "groundnuts" again should return a result this time, since this term occurs in the newly added document fb33fe.
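
The add-new-files.xml manifest used above is small enough to write by hand, but for a longer list of files it may be easier to generate one. The following is a minimal sketch in Python (3.9 or later), not part of Greenstone itself; the function name build_manifest is hypothetical, and the element names are taken from the manifest shown earlier.

```python
import xml.etree.ElementTree as ET


def build_manifest(operation, element, values):
    """Return manifest XML text with a single <operation> block.

    operation: "Index" or "Reindex" (used with element "Filename"),
               as in the manifests shown in this tutorial.
    values:    paths relative to the collection's import folder.
    """
    manifest = ET.Element("Manifest")
    block = ET.SubElement(manifest, operation)
    for value in values:
        ET.SubElement(block, element).text = value
    # Put each element on its own line, without indentation (Python 3.9+).
    ET.indent(manifest, space="")
    body = ET.tostring(manifest, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


new_files = ["fb33fe/fb33fe.htm", "fb34fe/fb34fe.htm", "wb34te/wb34te.htm"]
manifest_text = build_manifest("Index", "Filename", new_files)
print(manifest_text)
```

The printed text could then be saved into the collection's manifests folder and passed to incremental-import.pl with -manifest, exactly as with the prepared file.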

Incrementally deleting some documents from a collection

  1. Inspect the delete-some-files.xml manifest file (located in your incremen collection folder's manifests subfolder). It contains:

    <?xml version="1.0" encoding="UTF-8"?>
    <Manifest>
    <Delete>
    <OID>b18ase</OID>
    <OID>fb33fe</OID>
    </Delete>
    </Manifest>

    As per the above manifest file, the operation to be performed by an incremental build is a <Delete> operation on two documents. For the delete operation, the documents are not indicated by the <Filename> XML element, but by the <OID> element, which specifies the object identifier. We need to use the OID here because we're telling Greenstone precisely which document identifiers we wish to have removed from our collection. The identifier of every built document in a Greenstone collection is specified in the Identifier field of the document's doc.xml file, located in the collection's archives folder. The doc.xml file is the Greenstone-specific XML format in which Greenstone stores documents already imported.

    For instance, to find the identifier of the b18ase.htm document in your built collection, open up web\sites\localsite\collect\incremen\archives\b18ase.dir\doc.xml in a text editor. Then scroll down, looking for a piece of Greenstone extracted metadata labelled Identifier, which is the OID for this document:

    <Metadata name="Identifier">b18ase</Metadata>

    The above value for the document identifier is what's used in the delete-some-files.xml manifest file to refer to this document. This document is one of two that are to be deleted as per the manifest file. Make sure to close the doc.xml file if you have it open.

  1. Finally, let's incrementally rebuild the collection, specifying the manifest file that Greenstone should use this time to carry out the incremental build operation. As before, there are two steps.

    First run the modified incremental import command:

    perl -S incremental-import.pl -incremental -manifest manifests/delete-some-files.xml -site localsite incremen

    From the build output you will notice that 0 documents were considered for importing, because this time documents are only being deleted and none are being added.

    When the incremental-import has finished running, run the same incremental buildcol command as before (it doesn't change):

    perl -S incremental-buildcol.pl -activate -site localsite incremen

    If you were to scroll through the buildcol output in the terminal this time, you would see the following:

    GreenstoneXMLPlugin: processing fb33fe.dir\doc.xml
    GreenstoneXMLPlugin: processing b18ase.dir\doc.xml

    Only these 2 files were actually processed by buildcol, and that's because the manifest specified they were being deleted.

  1. When it has finished, preview the collection once more and check that the 2 documents have been removed. They should not turn up in the browse classifiers, nor in search results. For example, search for "kouprey" again and check that no documents match the query this time, since the term only occurred in document b18ase, which has now been removed from the collection. Likewise, searching for "groundnuts" should not return results either, because document fb33fe, in which it occurred, has also been removed.
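
Looking up identifiers by hand works for a couple of documents. As a rough illustration of the same lookup done programmatically, the Python sketch below (3.9 or later) reads the Identifier metadata from doc.xml content and writes a <Delete> manifest. The sample XML is a simplified, hypothetical stand-in: only the Metadata line comes from this tutorial, and a real doc.xml contains much more. The function names are likewise assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for archives/b18ase.dir/doc.xml.
# Only the Identifier line is taken from the tutorial; the surrounding
# elements are assumed for illustration.
SAMPLE_DOC_XML = """<Archive>
  <Section>
    <Description>
      <Metadata name="Title">Sample title</Metadata>
      <Metadata name="Identifier">b18ase</Metadata>
    </Description>
  </Section>
</Archive>"""


def find_identifier(doc_xml_text):
    """Return the text of the first Metadata element named Identifier, or None."""
    root = ET.fromstring(doc_xml_text)
    for meta in root.iter("Metadata"):
        if meta.get("name") == "Identifier":
            return meta.text
    return None


def build_delete_manifest(oids):
    """Return manifest XML text listing each OID inside a <Delete> block."""
    manifest = ET.Element("Manifest")
    delete = ET.SubElement(manifest, "Delete")
    for oid in oids:
        ET.SubElement(delete, "OID").text = oid
    ET.indent(manifest, space="")  # one element per line (Python 3.9+)
    body = ET.tostring(manifest, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


oid = find_identifier(SAMPLE_DOC_XML)
delete_manifest = build_delete_manifest([oid, "fb33fe"])
print(delete_manifest)
```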

Editing a document's text and metadata, and then incrementally rebuilding the collection

  1. Inspect the mod-text-and-meta.xml manifest file (located in incremen/manifests) in a text editor. It should contain:

    <?xml version="1.0" encoding="UTF-8"?>
    <Manifest>
    <Reindex>
    <Filename>fb34fe/fb34fe.htm</Filename>
    <Filename>b20cre/b20cre.htm</Filename>
    </Reindex>
    </Manifest>

    Note the <Reindex> used this time. It indicates which documents that are already in the collection are to be re-processed when the collection is incrementally rebuilt as per this manifest file.

  1. Open up the file fb34fe/fb34fe.htm of your incremen collection's import folder in a text editor and add, remove or change some text nested anywhere in between the HTML tags within the <BODY> tag. Be careful not to partially modify HTML element names or HTML entities (entities start with an ampersand, &, and end with a semi-colon, ;), as doing so can make your text contents invalid HTML.

    Save and close the edited file.

  1. Start up GLI. Open the incremen collection and go to the Enrich panel. Add or modify dc.Title metadata for the b20cre document. Do not accidentally build the collection using GLI.

  1. Quit GLI. Optionally run the Greenstone server application if it isn't already running.

    In the above two steps, we've modified the text contents of document fb34fe and the metadata associated with b20cre. Our mod-text-and-meta.xml manifest file already indicates that these two files are to be reindexed, so we can go ahead and incrementally rebuild the collection with this manifest file.

  1. Run the incremental rebuild operation to re-process just these two files. To do so, pass the mod-text-and-meta.xml manifest file this time.

    First run:

    perl -S incremental-import.pl -incremental -manifest manifests/mod-text-and-meta.xml -site localsite incremen

    At the end of importing, you'd see the following messages displayed in the terminal, because only the 2 documents marked to be reindexed as per the new manifest are processed:

    * 2 documents were considered for processing
    * 2 were processed and included in the collection

    Now run:

    perl -S incremental-buildcol.pl -activate -site localsite incremen

  1. Preview the collection once more. Check that the 2 documents contain your edits: try searching for any additional words you added and confirm that document fb34fe turns up in the results. Also check that the dc.Title metadata you modified can now be searched and appears as the title for the b20cre document in the Titles browsing classifier.
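
Each manifest-driven rebuild in this tutorial runs the same two commands, with only the manifest file changing. If you find yourself repeating them often, the pair can be assembled by a small helper. The Python sketch below is an illustration only: the function name incremental_commands is hypothetical, and it merely builds and prints the command lines shown above rather than running them.

```python
def incremental_commands(collection, manifest, site="localsite"):
    """Return the import and buildcol command lines as argument lists."""
    import_cmd = [
        "perl", "-S", "incremental-import.pl",
        "-incremental", "-manifest", manifest,
        "-site", site, collection,
    ]
    buildcol_cmd = [
        "perl", "-S", "incremental-buildcol.pl",
        "-activate", "-site", site, collection,
    ]
    # Order matters: import must finish before buildcol starts.
    return [import_cmd, buildcol_cmd]


commands = incremental_commands("incremen", "manifests/mod-text-and-meta.xml")
for cmd in commands:
    print(" ".join(cmd))
```

To actually execute the commands, each list could be passed to subprocess.run(cmd, check=True) from a terminal in which the Greenstone environment has already been set up.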

In this tutorial, we looked at cutting down the amount of time spent on rebuilding a collection by manually controlling the rebuild operation so that it processes only what has changed. We do so by means of a manifest that specifies exactly which files need to be rebuilt and how (whether any need to be Indexed, Deleted or Reindexed). Greenstone also has an automatic incremental rebuild feature, sparing you the need to specify a manifest file in the import phase. Omitting the manifest argument in the above exercises activates this behaviour. However, this is typically slower, because Greenstone now needs to scan the entire import folder and compare this with the information in the archives folder to determine what has changed.
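
To see why an automatic scan costs more than reading a manifest, consider what change detection involves in general: every file has to be visited and compared against a record of its earlier state. The Python sketch below illustrates that idea with content hashes on a temporary folder. It is a conceptual illustration only and not Greenstone's actual mechanism; the function names and the use of SHA-256 are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(folder):
    """Map each file path (relative to folder) to a SHA-256 digest of its bytes."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(old, new):
    """Return sorted lists of added, deleted and changed paths."""
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, deleted, changed


with tempfile.TemporaryDirectory() as tmp:
    import_dir = Path(tmp)
    # State at the time of the previous build.
    (import_dir / "b18ase").mkdir()
    (import_dir / "b18ase" / "b18ase.htm").write_text("kouprey")
    (import_dir / "fb34fe").mkdir()
    (import_dir / "fb34fe" / "fb34fe.htm").write_text("original text")
    before = snapshot(import_dir)

    # One file deleted, one edited, one added.
    (import_dir / "b18ase" / "b18ase.htm").unlink()
    (import_dir / "fb34fe" / "fb34fe.htm").write_text("edited text")
    (import_dir / "fb33fe").mkdir()
    (import_dir / "fb33fe" / "fb33fe.htm").write_text("groundnuts")
    after = snapshot(import_dir)

added, deleted, changed = diff_snapshots(before, after)
print("added:", added)
print("deleted:", deleted)
print("changed:", changed)
```

Note that the whole folder is read twice here even though only three files differ, which is the cost a manifest avoids.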

  1. Now repeat all the above exercises in the same sequence, but with a new collection called autoincr, also based on the Demo Collection. Remember to make document level searching the default. Build the collection the first time around with perl -S full-rebuild.pl -site localsite autoincr, as before. This time, however, don't pass any manifest file as an argument to the subsequent rebuild commands that use the incremental-import.pl script. Also, before running the rebuild commands for the delete operation, manually delete the following folders from the import directory: web/sites/localsite/collect/autoincr/import/b18ase and web/sites/localsite/collect/autoincr/import/fb33fe. With no manifest file to let Greenstone know which documents are "deleted", the folders must actually be removed for Greenstone to detect automatically that those documents should not be included in the rebuilt collection. So you'd be running these commands after each change this time:

    perl -S incremental-import.pl -incremental -site localsite autoincr
    perl -S incremental-buildcol.pl -activate -site localsite autoincr

    After each incremental build, preview your autoincr collection to check that the browsing classifiers contain the expected documents and that searching returns the expected results.

Automatic incremental indexing

Just as there is the command full-rebuild.pl to completely build a collection from scratch, there is also the command incremental-rebuild.pl. The final exercise you have just completed could equally have been achieved by running the following after each change:

perl -S incremental-rebuild.pl -site localsite autoincr

For every collection, the import phase can be run incrementally (either using a manifest file or automatically). However, whether the buildcol phase can be incremental depends on the indexer in use. The Lucene and Solr indexers support incremental indexing, but the MG and MGPP indexers do not. A warning is issued if you attempt to run the buildcol phase incrementally when the chosen indexer does not support this.


Copyright © 2005-2019 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”