Greenstone tutorial exercise
Building and searching with different indexers
Greenstone supports three indexers: MG, MGPP and Lucene.
MG is the original indexer used by Greenstone and is described in the book "Managing Gigabytes". It performs section-level indexing and compresses the source documents. MG is implemented in C.
MGPP is a re-implementation of MG that provides word-level indexes and enables proximity, phrase and field searching. MGPP is implemented in C++.
Lucene (http://lucene.apache.org/) is a Java-based, full-featured text indexing and searching system developed by Apache. It provides a similar range of search functionality to MGPP, with the addition of single-character wildcards and range searching. It was added to Greenstone to facilitate incremental collection building, which MG and MGPP can't provide, and is the default indexer for new collections.
Build with Lucene
- Start a new collection (File → New...) called Demo Lucene, base it on the Demo Collection (lucene-jdbm-demo), and fill out its fields appropriately.
- In the Gather panel, select Documents in Greenstone Collections, then select and open up localsite → Demo Collection (lucene-jdbm-demo). This will display the documents in the Greenstone demo collection. Drag all 11 folders in the demo folder into the new collection.
- Go to the Enrich panel and look at the metadata that is associated with each directory. Then go to the Search Indexes section in the Design panel. In the top right area of the panel, you will see that the Lucene indexer is already in use. This is because the Demo Collection (lucene-jdbm-demo), on which this collection is based, uses the Lucene indexer.
- Build and preview the collection.
Search with Lucene
- Lucene provides single letter and multiple letter wildcards and range searching. The query syntax can be quite complicated (for more information, please see http://www.lucenetutorial.com/lucene-query-syntax.html). Here we will learn how to use the wildcards while constructing queries.
- * is a multiple letter wildcard. To perform a multiple letter wildcard search, append * to the end of the query term. For example, econom* will match words such as econometrics, economist, economical and economy, which share the common part econom but have different word endings.
- ? is a single letter wildcard. To perform a single letter wildcard search, use ? instead. For example, a search for economi?? will only match words that have exactly two letters after economi, such as economist, economics and economies.
- Please note that stopwords are used by default with the Lucene indexer, so searching for words like the will match 0 documents. This is explained in a message on the search page, which states that such words are too common and were ignored.
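The wildcard behaviour described above can be tried out away from Greenstone. The following is a minimal Python sketch, not Lucene's own implementation: it relies on the standard library's fnmatch module, whose * and ? behave the same way for simple patterns like these, and on a hypothetical word list (VOCABULARY) standing in for the terms of a real collection.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary for illustration; a real index would hold
# the terms that actually occur in the collection.
VOCABULARY = [
    "econometrics",
    "economic",
    "economical",
    "economics",
    "economies",
    "economist",
    "economy",
]


def expand_wildcard(pattern, vocabulary=VOCABULARY):
    """Return the vocabulary words matched by a pattern.

    '*' stands for any run of letters (including none) and
    '?' stands for exactly one letter.
    """
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


if __name__ == "__main__":
    print(expand_wildcard("econom*"))
    print(expand_wildcard("economi??"))
```

With this word list, econom* selects every word, while economi?? selects only the words with exactly two letters after economi; economic and economical are left out because they have one and three letters there respectively.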
Build with MGPP
- Start a new collection called Greenstone Demo MGPP and also base it on the Demo Collection (lucene-jdbm-demo).
- In the Gather panel, drag all 11 folders from Documents in Greenstone Collections → localsite → Demo Collection (lucene-jdbm-demo) into the new collection.
- In the Search Indexes section of the Design panel, you will notice that the active indexer is Lucene. Click the Change... button at the top right corner of the panel. A new window will pop up for selecting the indexer. When you select an indexer, a brief description of it appears in the box below. Select MGPP and click OK. Please note that the Assigned Indexes section may have changed accordingly.
- There are three options at the bottom of the panel — Stem, Casefold and Accent fold. Notice that all three are enabled.
- In the Indexing Levels section, select section as well if it isn't already selected, but make document the default.
- Build and preview the collection.
Search with MGPP
- MGPP supports stemming, casefolding and accentfolding. By default, searching in collections built with the MGPP indexer is set to whole word must match and ignore case differences. So searching for econom will return 0 documents, while searching for fao and for FAO returns the same result: a word count of 89 and 11 matched documents. Go to the text search page by clicking the text search button at the top right corner. You can see that stem is off, which means the word endings option is set to whole word must match, and that case (folding) is set to ignore case differences.
- Sometimes we may want to ignore word endings while searching so as to match different variations of the term. Change the stem option from off to on. This will change the search settings from the default, which is that the whole word must match, to ignore word endings. Now try searching for econom again. This time, 9 documents are found. Please note that word endings are determined according to the third-party stemming tables incorporated in Greenstone, not by the user. Thus the searches may not do precisely what is expected, especially when cultural variations or dialects are concerned. In addition, not all languages support stemming; only English and French have stemming at the moment. Change the stem option back to off (whole word must match) to avoid confusion later on.
- Sometimes we may want to search for the exact term, that is, to distinguish upper case from lower case. In the form search page, switch case folding to off (upper/lower case must match). Now try searching for fao and FAO respectively. Notice that the search results are different this time, with fao not returning any results. Change the case folding option back to on (ignore case differences) to avoid confusion later on.
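As a rough illustration of the two settings explored above, the sketch below compares a query term with an indexed word under each combination of the stem and case folding options. It is only a toy written for this exercise: the function names, the suffix list and the way the stem is computed are all assumptions, and they are far cruder than the stemming tables Greenstone actually uses.

```python
# Toy suffix list, invented for this example only. Longer suffixes
# come first so that "economical" loses "ical" rather than nothing.
TOY_SUFFIXES = ("ically", "ical", "ics", "ist", "ies", "ic", "y", "s")


def toy_stem(word):
    """Strip the first matching toy suffix, keeping at least 3 letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def term_matches(query, word, stem=False, casefold=True):
    """Return True if the query term matches the word.

    stem=False      -> whole word must match
    stem=True       -> ignore word endings (using toy_stem)
    casefold=True   -> ignore case differences
    casefold=False  -> upper/lower case must match
    """
    if casefold:
        query, word = query.lower(), word.lower()
    if stem:
        query, word = toy_stem(query), toy_stem(word)
    return query == word
```

With the default settings (stem off, case folding on), term_matches("econom", "economy") is False and term_matches("fao", "FAO") is True. Passing stem=True makes the first comparison succeed, and passing casefold=False makes the second one fail, which mirrors the behaviour seen on the search page.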
Use search mode hotkeys with query term
MGPP has several hotkeys for setting the search modes for a query term. These hotkeys explicitly set the stem option and the case option for the query being constructed. Use them in the plain text search or form search.
- #s and #u are hotkeys for the stem option. Appending #s to a query term will specifically enable the ignore word endings function. For example, click on the Form search button and try searching for econom#s. 9 documents are found, which is the same as in the previous section.
- Appending #u to a query term will explicitly set the current search to whole word must match. Note that hotkeys are applied per term: a hotkey only affects the query term it is attached to. For example, if a query expression contains more than one term, some terms can have hotkeys and others not, and the hotkeys can be different for different terms. This provides fine-grained control of the query, whereas changing the controls for a search field on the advanced search page will apply to all the query terms in that field.
- Hotkeys #i and #c control the case sensitivity. Appending #i to a query term will explicitly set the search to ignore case differences (i.e. case insensitive). For example, a search for fao#i returns 11 documents.
- In contrast, appending #c will explicitly turn off casefolding, so that upper/lower case must match.
- Finally, the hotkeys can also be used in combination. For example, you can append #uc to a query term so as to match the whole term (without stemming) and in its exact form (distinguishing upper case from lower case). As another example, try searching for econom#si and compare the results against those for econom#sc and for Econom#sc. The first search is case insensitive and the last two searches are both case sensitive. The number of results for the last two searches should add up to the number of search results for the first search.
A quick reference of the search mode hotkeys in MGPP
#s ignore word endings
#u whole word must match
#i ignore case differences
#c upper/lower case must match
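
The quick reference above can also be read as a small set of per-term overrides. The following Python sketch shows one way such a suffix could be interpreted; it is an illustration only, not MGPP's parser, and the function name parse_hotkeys and its defaults (stem off, case folding on, matching the default search settings described earlier) are assumptions made for this example.

```python
def parse_hotkeys(term, stem=False, casefold=True):
    """Split 'word#flags' and apply the hotkeys to the given defaults.

    Returns a (word, stem, casefold) tuple, where
      #s -> stem = True       (ignore word endings)
      #u -> stem = False      (whole word must match)
      #i -> casefold = True   (ignore case differences)
      #c -> casefold = False  (upper/lower case must match)
    """
    word, _, flags = term.partition("#")
    for flag in flags:
        if flag == "s":
            stem = True
        elif flag == "u":
            stem = False
        elif flag == "i":
            casefold = True
        elif flag == "c":
            casefold = False
        else:
            raise ValueError(f"unknown hotkey: #{flag}")
    return word, stem, casefold
```

For example, parse_hotkeys("econom#si") returns ("econom", True, True), while parse_hotkeys("econom#uc") returns ("econom", False, False). A term without hotkeys, such as fao, simply keeps the defaults it was given, which reflects the point that hotkeys only affect the term they are attached to.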