Greenstone tutorial exercise

Prerequisite: A large collection of HTML files—Tudor
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.09

Enhanced collection of HTML files—Tudor

We return to the Tudor collection and add metadata that expresses a subject hierarchy. Then we build a classifier that exploits it by allowing readers to browse the documents about Monarchs, Relatives, Citizens, and Others separately.

Adding hierarchically-structured metadata and a Hierarchy classifier

  1. Open up your tudor collection (the original version, not the webtudor version, in case you've already done that tutorial), switch to the Enrich panel and select the citizens folder (a subfolder of → tudor). Set its dc.Subject and Keywords metadata to Tudor period|Citizens. The vertical bar ("|") is a hierarchy marker. Selecting a folder and adding metadata has the effect of setting this metadata value for all files contained in this folder, its subfolders, and so on. A popup alerts you to this fact. Click <OK> to close the popup.

  1. Repeat for the monarchs and relative folders, setting their dc.Subject and Keywords metadata to Tudor period|Monarchs and Tudor period|Relatives respectively. Note that the hierarchy appears in the Existing values for dc.Subject and Keywords area.

    If you don't want to see the popup each time you add folder level metadata, tick the Do not show this warning again checkbox; it won't be displayed again.

  1. Finally, select all remaining files—the ones that are not in the citizens, monarchs, or relative folders—by selecting the first and shift-clicking the last. Set their dc.Subject and Keywords metadata to Tudor period|Others and click outside the cell for the metadata to be assigned. This is done in a single operation (there is a short delay before it completes).

    When multiple files are selected in the left hand collection tree, all metadata values for all files are shown on the right hand side. Items that are common to all files are displayed in black—e.g. dc.Subject and Keywords—while others that pertain to only one or some of the files are displayed in grey—e.g. any extracted metadata.

    Metadata inherited from a parent folder is indicated by a folder icon to the left of the metadata name. Select one of the files in the relative folder to see this.

  1. Switch to the Design panel and select Browsing Classifiers from the left-hand list. Set the menu item for Select classifier to add to Hierarchy; then click <Add Classifier...>.

  1. A window pops up to control the classifier's options. Change the metadata to dc.Subject and Keywords and then click <OK>.

  1. For tidiness' sake, remove the classifier for Source metadata (included by default) from the list of currently assigned classifiers, because this adds little to the collection.

  1. Now switch to the Create panel, build the collection, and preview it. Choose the new subjects link that appears in the navigation bar, and click the bookshelves to navigate around the four-entry hierarchy that you have created.
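
To see why the vertical bar produces a browsable hierarchy, it can help to picture how the four metadata values nest. The following is only an illustrative Python sketch, not Greenstone's own code: the function name build_hierarchy and the dict-of-dicts representation are assumptions made for this example.

```python
def build_hierarchy(values, separator="|"):
    """Nest metadata values into a dict-of-dicts tree.

    Each separator-delimited part of a value becomes one level of the tree.
    (Hypothetical helper for illustration; not part of Greenstone.)
    """
    tree = {}
    for value in values:
        node = tree
        for part in value.split(separator):
            # Reuse the existing branch if present, otherwise create it.
            node = node.setdefault(part.strip(), {})
    return tree


# The four dc.Subject and Keywords values assigned in the steps above.
subjects = [
    "Tudor period|Citizens",
    "Tudor period|Monarchs",
    "Tudor period|Relatives",
    "Tudor period|Others",
]

hierarchy = build_hierarchy(subjects)

# Print the tree with one level of indentation per hierarchy level.
for top, children in hierarchy.items():
    print(top)
    for child in children:
        print("  " + child)
```

All four values share the first part, Tudor period, so they collapse into a single top-level entry with four children. This corresponds to the four-entry hierarchy you navigate via the bookshelves.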

Adding a hierarchical phrase browser (PHIND)

Next we'll add an interactive hierarchical phrase browsing classifier to this collection. Java applet support is being or has been phased out in various browsers and browser versions. As a result, the following steps will not work in Microsoft Edge and some other browsers.

  1. Switch to the Design panel and choose the Browsing Classifiers item from the left-hand list.

  1. Choose Phind from the Select classifier to add menu. Click <Add Classifier...>. A window pops up asking for configuration options: leave the values at their preset defaults (this will base the phrase index on the full text) and click <OK>.

  1. Build the collection again, preview it, and try out the new Phrase browse option in the navigation bar. An interesting PHIND search term for this collection is "king". Note that even though it is called a phrase browser, only single terms can be used as the starting point for browsing.

    The Phind phrase browser is a Java applet. To view applets in a browser, you need a JRE installed and, from Java 7 onwards, you must also add your Greenstone digital library home URL (http://localhost:8383 by default) to the Exception Site List via the Security tab of the Java Control Panel. To locate the Java Control Panel application, search Windows for "Configure Java" or go to Start → All Programs → Java → Configure Java. You may need to clear the browser history in Microsoft Edge or Internet Explorer and relaunch the browser for changes to the Exception Site List to take effect. We have found that installing web browsers before installing a JRE allows the browsers to find the JRE and run applets. If you install a browser after the JRE, the browser should prompt you to install the JRE again when you try to view a Java applet. For further information on how to enable Java in a web browser and how to locate the Java Control Panel for your operating system, consult the Java documentation.
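
The idea of starting from a single term and browsing outwards to the phrases that contain it can be illustrated with a toy Python sketch. This is not how Phind is implemented: the function phrases_containing, the simple whitespace tokenisation, and the sample sentence are all assumptions invented for this example.

```python
from collections import Counter


def phrases_containing(term, text, max_len=3):
    """Count every word sequence of 2..max_len words that contains `term`.

    Toy illustration only; the real Phind classifier builds its own
    phrase index and does not work this way.
    """
    words = text.lower().split()
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if term in gram:
                counts[" ".join(gram)] += 1
    return counts


# Made-up sample text, not taken from the Tudor collection.
sample = "the king of england met the king of france"

result = phrases_containing("king", sample, max_len=2)
for phrase, count in result.most_common():
    print(count, phrase)
```

In this sketch the starting point is always one word, and the longer phrases are discovered from it. This mirrors the restriction noted above that only single terms can be used to begin browsing.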

Partitioning the full-text index based on metadata values

Next we partition the full-text index into four separate pieces. To do this we first define four subcollections obtained by "filtering" the documents according to a criterion based on their dc.Subject and Keywords metadata. Then an index is assigned to each subcollection. This will enable users to restrict a search to a subset of the documents.

  1. Switch to the Design panel, and click Partition Indexes.

  1. Ensure that the Define Filters tab is selected (the default). Define a subcollection filter with name monarchs that matches against dc.Subject and Keywords, and type Monarchs as the regular expression to match with. Click <Add Filter>. This filter includes any file whose dc.Subject and Keywords metadata contains the word Monarchs.

  1. Define another filter, relatives, which matches dc.Subject and Keywords against the word Relatives. Define a third and a fourth, citizens and others, which match it against the words Citizens and Others respectively.

  1. Having defined the subcollection filters, we partition the index into corresponding parts. Click the Assign Partitions tab. Select the citizens subcollection and click <Add Partition>. Next select monarchs, and click <Add Partition>. Repeat for the other two subcollections, so that you end up with four partitions, one based on each subcollection filter.

    The order they appear in the Assigned Subcollection Partitions list is the order they will appear in the drop down menu on the search page. You can change the order by using the <Move Up> and <Move Down> buttons.

  1. Build and preview the collection.

  1. The form search page includes a pulldown menu that allows you to select one of these partitions for searching. For example, try searching the relatives partition for mary and then search the monarchs partition for the same thing.

  1. To allow users to search the collection as a whole as well as each subcollection individually, return to the Partition Indexes section of the Design panel and select the Assign Partitions tab. Select all four subcollections by either checking their boxes or pressing the Select All button, and click <Add Partition>.

  1. To ensure that the combined index appears first in the list on the reader's web page, use the <Move Up> button to get it to the top of the list here in the Design panel. Then build and preview the collection.

  1. The text in the drop down box on the search page is based on the filters each partition was built on. To change the text that is displayed, go to the Search section of the Format panel. The single filter partitions have sensible default text, but the combined partition does not. Set the Display text for the combined partition to "all". Preview the collection.

  1. Search for the term Mary again, as that is likely to be common in all five index partitions, and check that the numbers of words (not documents) in the search results for the four individual indexes add up to the number of words for the all index.
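
The relationship between filters, partitions, and the word counts checked in the last step can be sketched in Python. This is a minimal illustration under stated assumptions, not Greenstone's indexing code: the document texts are invented, and the helper names partition and count_word are hypothetical.

```python
import re

# Subcollection filters as defined above: name -> regular expression
# matched against the dc.Subject and Keywords value.
filters = {
    "monarchs": r"Monarchs",
    "relatives": r"Relatives",
    "citizens": r"Citizens",
    "others": r"Others",
}

# Invented sample documents; the real collection holds HTML files.
documents = [
    {"subject": "Tudor period|Monarchs",
     "text": "Mary I was queen. Mary reigned five years."},
    {"subject": "Tudor period|Relatives",
     "text": "Mary Boleyn was the sister of Anne."},
    {"subject": "Tudor period|Citizens",
     "text": "Thomas More was a lawyer."},
    {"subject": "Tudor period|Others",
     "text": "A glossary mentioning Mary once."},
]


def partition(docs, filters):
    """Group documents by the filters whose pattern matches their subject."""
    return {
        name: [d for d in docs if re.search(pattern, d["subject"])]
        for name, pattern in filters.items()
    }


def count_word(docs, word):
    """Count case-insensitive whole-word occurrences across documents."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(d["text"])) for d in docs)


partitions = partition(documents, filters)
per_partition = {name: count_word(docs, "mary")
                 for name, docs in partitions.items()}
total = count_word(documents, "mary")

print(per_partition)
print("sum of partitions:", sum(per_partition.values()), "all:", total)
```

The per-partition counts add up to the total only because every document matches exactly one filter. If a document's metadata matched two filters, it would be counted in both partitions and the sum would exceed the count for the combined index.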

Controlling the building process

Finally we look at how the building process can be controlled. Developing a new collection usually involves numerous cycles of building, previewing, adjusting some enrich and design features, and so on. While prototyping, it is best to temporarily reduce the number of documents in the collection. This can be accomplished through the maxdocs parameter to the building process.

  1. Switch to the Create panel. Expand the top panel to be able to see the options for collection building. Scroll to view them all. Select maxdocs and set its numeric counter to 3. (When in GLI's Expert Mode, the maxdocs option for the import process is located under the Import Options of the Create panel.) Now build.

  1. Preview the newly rebuilt collection's titles page. Previously this listed more than a dozen pages per letter of the alphabet, but now there are just three—the first three files encountered by the building process.

  1. Go back to the Create panel and turn off the maxdocs option. Rebuild the collection so that all the documents are included.
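
The effect of maxdocs can be pictured as cutting off the stream of files after the first few have been processed. The Python sketch below is only an analogy under that assumption: the function import_documents and the file names are hypothetical, and Greenstone's building process does far more than this.

```python
from itertools import islice


def import_documents(paths, maxdocs=None):
    """Return the paths that would be processed.

    maxdocs=None means no limit, which corresponds to turning the
    option off. (Hypothetical helper for illustration only.)
    """
    return list(islice(paths, maxdocs))


# Ten made-up file names standing in for the collection's documents.
all_files = [f"page{i:02d}.html" for i in range(1, 11)]

prototype = import_documents(all_files, maxdocs=3)
full = import_documents(all_files)

print(prototype)
print(len(full), "documents in the full build")
```

Only the first three files encountered are kept, which is why the titles page of the prototype build shows just three entries. A smaller build of this kind keeps each build-and-preview cycle short while you are still adjusting the design.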

Copyright © 2005-2019 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”