Greenstone tutorial exercise
Incrementally building a collection using the command line
To allow you to quickly try out and experiment with our tutorial exercises, we tend to keep the number of sample files small. Every time you rebuild these collections, for simplicity, the default settings used in Greenstone mean that the previous version built is removed in its entirety. We refer to this as a full-rebuild. When building larger collections, this is inefficient.
Greenstone also has the ability to rebuild collections incrementally: this means the previous version of the collection is retained and only the changes detected need to be incorporated. There are, however, quite a few aspects to incremental building to control. This is the focus of this tutorial exercise.
To gain the best level of understanding, this tutorial builds collections using the command line.
- In GLI, create a new collection called Incremental With Manifests and base it on the Greenstone demo collection. The short name of this collection will become incremen, and this will be the name of the collection's folder on the file system.
- Use GLI's Workspace view to navigate to this tutorial's sample files folder, incr_build. It will contain a folder named import. Open this. In GLI's Gather panel, drag and drop the 3 subfolders into your new collection. (You can also carry out this step using a file browser to copy the contents of the incr_build\import sample files folder into collect\incremen\import.) Go to the Design panel and select Search Indexes. Press the Change... button in the top right to change the indexer in use to Lucene.
- Do not build the collection in GLI. We'll be building and rebuilding manually, from the command-line terminal. So close GLI. You can choose to run the Greenstone server at any stage, however.
- In a text editor, open your incremen collection's collect.cfg file located in collect\incremen\etc. Change the OIDtype setting's value to full_filename, which means the identifiers generated and used by Greenstone for this collection's documents will be based on their full filenames (their filename appended to any containing directories relative to the collection's import folder).
For any collection that you want to incrementally rebuild, make sure that it was similarly built with the OIDtype set to full_filename. A collection that is built with this setting will allow us to refer to the files by name in the <Filename> elements of any manifest file that we use to incrementally rebuild it. These <Filename> elements will then identify which files are to be indexed if newly added, and which are to be re-indexed, as should happen if a document or its metadata has been edited. (For specifying which files are to be deleted, the document identifier will be used instead of the filename.)
- Since this is the first time we're building our collection, we're going to do a complete build, and we'll use the command line to do so. Open a terminal. To open a terminal in Windows, press the Windows key + R and type cmd in the Run dialog that displays. To open a terminal on a Mac, choose Go → Utilities in Finder and open Terminal. Use the terminal to cd into your Greenstone installation folder. For instance, if you have Greenstone installed on Windows as "Greenstone" within your account folder at C:\Users\me, then type the following in your terminal and hit Enter:
cd C:\Users\me\Greenstone
On Linux or Macs, the general command is the same, but the installed location would be different and the slashes go the other way. For example, if installed in /Users/me/Greenstone3, you'd type the following and hit Enter:
cd /Users/me/Greenstone3
Now you're ready to set up the Greenstone environment in your terminal. On Windows, type the following into your terminal and hit Enter again:
setup.bat
On Linux and Mac:
source ./setup.bash
When using a terminal, you'll need to hit Enter after each command in order to execute the command you just finished typing. We won't repeat this instruction any more. Just remember to hit Enter after every complete command entered into a terminal.
With the terminal now operating within your Greenstone installation folder, and with the Greenstone environment now set up and ready, type the following command to do a complete build of your new collection. Although the command contains the word "rebuild", since this is the first time the collection is being built, it will just build it.
perl -S full-rebuild.pl incremen
Preview the collection. If the Greenstone server is not running (as would happen if you had closed GLI and didn't start the standalone Greenstone Server Interface application), then run it from the Start Menu on Windows now. You could also run the Greenstone server by running the gs2-server.bat script in the terminal if using Windows, or by running the gs2-server.sh script from a Linux/Mac terminal.
When previewing, try searching for "kouprey" and you should get results, as this term occurs in the document b18ase.
For the rest of this tutorial exercise, leave open the terminal in which you have set up your Greenstone environment. We'll be using it throughout.
Incrementally adding some additional new documents to a collection
- If you want, you can use GLI to drag and drop the fb33fe, fb34fe and wb34te folders, located in the incr_build/more-files subfolder of sample files, into your collection.
Alternatively, you can use a File Browser to copy the folders fb33fe, fb34fe and wb34te, located in the incr_build/more-files sample files subfolder, into your collection's import folder at collect\incremen\import.
The above step will only have gathered 3 new documents into your collection. However, since the changes have not been built, previewing at this stage will make no difference.
- We want to build just the newly added documents into the collection if possible, instead of rebuilding everything. This time, instead of running full-rebuild, we'll be running the incremental-import and incremental-buildcol scripts to perform the two phases of a Greenstone build operation incrementally, these being the import and buildcol phases. Incremental building allows us to (re)build just what is necessary, rather than everything.
Since we know exactly which files have been added and thus which files need to be built, we can write a manifest file specifying this. The manifest files used by the Greenstone incremental building process are just XML files that can be created and edited in a plain text editor, and which indicate which files need to be (re)processed by a Greenstone incremental build operation.
We've already prepared the manifest files we'll be using in this tutorial exercise for you. Use a File Browser to copy the manifests subfolder from the incr_build sample files into your incremen collection folder that's located inside your Greenstone installation directory (at collect\incremen).
In a text editor, open the add-new-files.xml manifest file found in the newly copied manifests subfolder. Inspect the contents of this manifest file. It should contain:
<?xml version="1.0" encoding="UTF-8"?>
<Manifest>
<Index>
<Filename>fb33fe/fb33fe.htm</Filename>
<Filename>fb34fe/fb34fe.htm</Filename>
<Filename>wb34te/wb34te.htm</Filename>
</Index>
</Manifest>
The above lists the 3 main documents to be added/indexed by Greenstone (hence the keyword <Index>). Since these documents are located inside their own subfolders when copied into the import folder, the manifest file also indicates the relative folder structure of these documents, e.g. "fb33fe/fb33fe.htm" shows that the fb33fe.htm HTML document is located in the folder fb33fe. Only the main documents to be added are listed, not the associated image files also found at the same folder level, as Greenstone will track down all the image files referred to by the main html documents to be indexed and will process them as files associated with the html.
- Return to the terminal you had left open. We can finally run the commands for the incremental build operation. Use the terminal to first run the incremental import stage:
perl -S incremental-import.pl -manifest manifests/add-new-files.xml incremen
Once that finishes running, start off the incremental buildcol stage of the build process:
perl -S incremental-buildcol.pl -activate incremen
The incremental import command specifies the manifest file that Greenstone is to consult in order to work out which files should be processed and how (Indexed, Deleted or Reindexed). By the buildcol stage, the specific files would then be ready for further incremental processing by the buildcol script. The -activate flag to the incremental buildcol script tells Greenstone to (re-)activate the updated collection if the Greenstone server is running.
- Preview the collection either by running the Greenstone Server Interface application, if it isn't already running, or by starting the Greenstone server from the command line with the command:
gsicontrol.bat web-start
(To stop the Greenstone server at any point, use the command gsicontrol.bat web-stop. To stop-and-start it, you'd use gsicontrol.bat web-restart. On Linux/Mac, use the equivalent script gsicontrol.sh for each command, e.g. ./gsicontrol.sh web-start.)
When the server is running, preview your library home page, located by default at http://localhost:8282/greenstone/cgi-bin/library.cgi. Visit the Incremental With Manifests collection and click on the Titles browser. There should be 3 additional documents now, and you should be able to search for terms that occur in them. For example, searching for "groundnuts" should return results, since this term occurs in the newly added document fb33fe.
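Because a manifest is plain XML, one like add-new-files.xml could also be generated rather than typed by hand. The following is a minimal Python sketch of that idea, not part of Greenstone itself: the helper name build_manifest is hypothetical, and the output simply mirrors the layout of the manifest shown in this section.

```python
from xml.sax.saxutils import escape


def build_manifest(operation, filenames):
    """Return manifest text listing each filename under one operation element.

    build_manifest is a hypothetical helper for illustration only.
    'operation' is the element name, e.g. "Index" as used in add-new-files.xml.
    Filenames are relative to the collection's import folder.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<Manifest>", f"<{operation}>"]
    for name in filenames:
        # escape() protects characters such as & and < inside the element text
        lines.append(f"<Filename>{escape(name)}</Filename>")
    lines += [f"</{operation}>", "</Manifest>"]
    return "\n".join(lines) + "\n"


new_docs = ["fb33fe/fb33fe.htm", "fb34fe/fb34fe.htm", "wb34te/wb34te.htm"]
manifest_text = build_manifest("Index", new_docs)
print(manifest_text)
```

The printed text could then be saved into the collection's manifests subfolder and passed to incremental-import.pl with -manifest, exactly as with the prepared file.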
Incrementally deleting some documents from a collection
- Inspect the delete-some-files.xml manifest file (located in your incremen collection folder's manifests subfolder). It contains:
<?xml version="1.0" encoding="UTF-8"?>
<Manifest>
<Delete>
<OID>b18ase-b18ase_htm</OID>
<OID>fb33fe-fb33fe_htm</OID>
</Delete>
</Manifest>
As per the above manifest file, the operation to be performed by an incremental build is a <Delete> operation on two documents. For the delete operation, the documents are not indicated by the <Filename> XML element, but by the <OID> element which specifies the object identifier. We need to use the OID here because we're telling Greenstone precisely what the identifiers of the documents are that we wish to have removed from our collection. The identifiers of every built document in a Greenstone collection are specified in the Identifier field of the document's doc.xml file located in the collection's archives folder. The doc.xml file is the Greenstone-specific XML format in which Greenstone stores documents already imported.
For instance, to find the identifier of the b18ase.htm document in your built collection, open up collect\incremen\archives\b18ase-b.dir\doc.xml in a text editor. Then scroll down, looking for a piece of Greenstone extracted metadata labelled Identifier, which is the OID for this document:
<Metadata name="Identifier">b18ase-b18ase_htm</Metadata>
The above value for the document identifier is what's used in the delete-some-files.xml manifest file to refer to this document. This document is one of two that are to be deleted as per the manifest file. Make sure to close the doc.xml file if you have it open.
- So then, let's first physically remove these two documents from our collection, so that the contents of the import folder match what the manifest specifies: use a file browser to remove the folders b18ase and fb33fe from the collection's import folder.
- Finally, let's incrementally rebuild the collection, specifying the manifest file that Greenstone should use this time to carry out the incremental build operation. As before, there are two steps. First run the modified incremental import command:
perl -S incremental-import.pl -manifest manifests/delete-some-files.xml incremen
When that has finished running, run the same incremental buildcol command as before (it doesn't change):
perl -S incremental-buildcol.pl -activate incremen
- When it has finished, preview the collection once more and check that the 2 documents have been removed. They should not turn up in the browse classifiers, nor in search results. For example, search for "kouprey" again: this time no documents should match the query, since the term only occurs in document b18ase, which has now been removed.
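Looking up identifiers by scrolling through doc.xml gets tedious for more than a handful of documents. As a hedged illustration, the Python sketch below pulls the Identifier value out of a doc.xml-like string and builds a <Delete> manifest from it. Only the Identifier <Metadata> line is taken from this tutorial; the surrounding elements in the sample, and the helper names find_identifier and build_delete_manifest, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical, heavily trimmed stand-in for an archives doc.xml file.
# Only the Identifier line comes from the tutorial; the wrapper elements
# are assumed here so that the sample is well-formed XML.
SAMPLE_DOC_XML = """<Archive>
<Section>
<Description>
<Metadata name="Title">Example title</Metadata>
<Metadata name="Identifier">b18ase-b18ase_htm</Metadata>
</Description>
</Section>
</Archive>"""


def find_identifier(doc_xml_text):
    """Return the first <Metadata name="Identifier"> value, or None if absent."""
    root = ET.fromstring(doc_xml_text)
    # iter() walks the whole tree, so the exact nesting does not matter
    for meta in root.iter("Metadata"):
        if meta.get("name") == "Identifier":
            return (meta.text or "").strip()
    return None


def build_delete_manifest(oids):
    """Return manifest text with one <OID> per identifier inside <Delete>."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<Manifest>", "<Delete>"]
    lines += [f"<OID>{escape(oid)}</OID>" for oid in oids]
    lines += ["</Delete>", "</Manifest>"]
    return "\n".join(lines) + "\n"


oid = find_identifier(SAMPLE_DOC_XML)
delete_manifest_text = build_delete_manifest([oid, "fb33fe-fb33fe_htm"])
print(delete_manifest_text)
```

For a real collection, the sample string would be replaced by the contents of the relevant doc.xml under the archives folder, for instance by loading the file with ET.parse instead of ET.fromstring.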
Editing a document's text and metadata, and then incrementally rebuilding the collection
- Inspect the mod-text-and-meta.xml manifest file (located in incremen/manifests) in a text editor. It should contain:
<?xml version="1.0" encoding="UTF-8"?>
<Manifest>
<Reindex>
<Filename>fb34fe/fb34fe.htm</Filename>
<Filename>b20cre/b20cre.htm</Filename>
</Reindex>
</Manifest>
Note the <Reindex> used this time. It indicates which documents that are already in the collection are to be re-processed when the collection is incrementally rebuilt as per this manifest file.
- Open up the file fb34fe/fb34fe.htm of your incremen collection's import folder in a text editor and add, remove or change some text nested anywhere in between the HTML tags within the <BODY> tag. Be careful not to partially modify HTML element names or HTML entities (entities start with an ampersand, &, and end with a semi-colon, ;), as doing so can make your text contents invalid HTML.
Save and close the edited file.
- Start up GLI. Open the incremen collection and go to the Enrich panel. Add or modify dc.Title metadata for the b20cre document. Do not accidentally build the collection using GLI.
- Quit GLI.
In the above two steps, we've modified the text contents of document fb34fe and the metadata associated with b20cre. Our mod-text-and-meta.xml manifest file already indicates that these two files are to be reindexed, so we can go ahead and incrementally rebuild the collection with this manifest file.
- Run the incremental rebuild operation to re-process just these two files. To do so, pass the mod-text-and-meta.xml manifest file this time. First run:
perl -S incremental-import.pl -manifest manifests/mod-text-and-meta.xml incremen
Followed by:
perl -S incremental-buildcol.pl -activate incremen
- Preview the collection once more. Check that the 2 documents contain your edits: try searching for any additional words you added. Also check that the dc.Title metadata you modified can now be searched, and that it appears as the title for the b20cre document in the Titles browsing classifier.
In this tutorial, we looked at cutting down the amount of time spent on rebuilding a collection by manually controlling the rebuild operation so that it processes only what has changed. We do so by means of a manifest that specifies exactly which files need to be rebuilt and how (whether they need to be Indexed, Deleted or Reindexed). Greenstone also has an automatic incremental rebuild feature, sparing you the need to specify a manifest file in the import phase. Omitting the manifest argument in the above exercises activates this behaviour. However, this is typically slower, because Greenstone then needs to scan the entire import folder and compare it with the information in the archives folder to determine what has changed.
Now repeat all the above exercises in the same sequence once again, but with a new collection called autoincr, also based on the Demo collection. This time, don't pass in the manifest file as an argument to the incremental-import.pl script. After each incremental build, preview your autoincr collection to check that the Browsing classifiers contain the expected documents and that searching returns the expected results.
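To picture why the automatic mode has more work to do than a manifest-driven build, the comparison between the import folder and what was previously imported can be thought of as a diff of two snapshots. The Python sketch below is purely conceptual and is not how Greenstone is implemented: classify_changes is a hypothetical name, and the "fingerprint" values stand in for whatever evidence of change is used (a timestamp or checksum, say).

```python
def classify_changes(previous, current):
    """Compare two {relative_path: fingerprint} snapshots.

    Returns three sorted lists of paths: to index (newly added),
    to reindex (present in both but changed) and to delete (gone).
    Conceptual illustration only; every path in both snapshots is visited,
    which is the scanning cost a manifest avoids.
    """
    to_index = sorted(p for p in current if p not in previous)
    to_delete = sorted(p for p in previous if p not in current)
    to_reindex = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return to_index, to_reindex, to_delete


# Snapshot taken at the last build versus the import folder now (made-up values)
previous = {
    "b18ase/b18ase.htm": "v1",
    "b20cre/b20cre.htm": "v1",
    "fb34fe/fb34fe.htm": "v1",
}
current = {
    "b20cre/b20cre.htm": "v2",
    "fb34fe/fb34fe.htm": "v1",
    "wb34te/wb34te.htm": "v1",
}

to_index, to_reindex, to_delete = classify_changes(previous, current)
print("Index:", to_index)
print("Reindex:", to_reindex)
print("Delete:", to_delete)
```

The three resulting lists correspond to the Index, Reindex and Delete operations that a manifest states explicitly.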
Incrementally indexing automatically
Just as there is the command full-rebuild.pl to completely build a collection from scratch, there is also the command incremental-rebuild.pl. The final exercise you have just completed could equally have been achieved by running:
perl -S incremental-rebuild.pl autoincr
For every collection, the import phase can be run incrementally (either using a manifest file or automatically). However, whether the buildcol phase can be incremental depends on the indexer in use. Lucene and Solr indexers support incremental indexing, but the MG and MGPP indexers do not. A warning is issued if you attempt to run the buildcol phase incrementally when the chosen indexer does not support this.
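The two-phase command sequence used throughout this exercise can be summarised in a small helper that assembles the command lines, with or without a manifest. This is a sketch only: incremental_commands is a hypothetical name, and the commands it produces are the ones already shown in this tutorial. It prints the commands rather than running them.

```python
def incremental_commands(collection, manifest=None):
    """Return the import and buildcol command lines as lists of arguments.

    Hypothetical helper for illustration. With manifest=None the -manifest
    argument is left out, which corresponds to the automatic incremental mode.
    """
    import_cmd = ["perl", "-S", "incremental-import.pl"]
    if manifest is not None:
        import_cmd += ["-manifest", manifest]
    import_cmd.append(collection)
    buildcol_cmd = ["perl", "-S", "incremental-buildcol.pl", "-activate", collection]
    return [import_cmd, buildcol_cmd]


# Manifest-driven build, as in the first incremental exercise
for cmd in incremental_commands("incremen", "manifests/add-new-files.xml"):
    print(" ".join(cmd))

# Automatic incremental build, as in the autoincr exercise
for cmd in incremental_commands("autoincr"):
    print(" ".join(cmd))
```

To actually execute the commands, each list could be handed to subprocess.run(cmd, check=True) from a terminal in which the Greenstone environment has already been set up.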