Greenstone tutorial exercise
Building and searching with different indexers
Greenstone supports three indexers: MG, MGPP and Lucene.
MG is the original indexer used by Greenstone and is described in the book "Managing Gigabytes". It does section-level indexing and compression of the source documents. MG is implemented in C.
MGPP is a re-implementation of MG that provides word-level indexes and enables proximity, phrase and field searching. MGPP is implemented in C++ and is the default indexer for new collections.
Lucene (http://lucene.apache.org/) is a Java-based, full-featured text indexing and searching system developed by Apache. It provides a similar range of search functionality to MGPP, with the addition of single-character wildcards and range searching. It was added to Greenstone to facilitate incremental collection building, which MG and MGPP can't provide.
Build with Lucene
- Start a new collection (File → New...) called Demo Lucene and base it on the Greenstone demo (demo) collection. Fill out its fields appropriately.
- In the Gather panel, select Documents in Greenstone Collections, then select and open up Greenstone demo (demo). This will display the documents in the Greenstone demo collection. Drag all 11 folders in the demo folder into the new collection.
If you haven't installed the Greenstone demo (demo) collection yet, you can download the demo.zip file from the link above, unzip it and put it into the collect folder in your Greenstone installation.
- Go to the Enrich panel and look at the metadata that is associated with each directory. Then go to the Search Indexes section in the Design panel. You'll see that the MGPP indexer is in use because the Greenstone Demo collection, which this collection is based on, uses the MGPP indexer.
- Click the Change... button at the top right corner of the panel. A new window will pop up for selecting the indexer. When you select an indexer, a brief description appears in the box below. Select Lucene and click OK. Please note that the Assigned Indexes section may have changed accordingly.
Build and preview the collection.
Search with Lucene
- Lucene provides single letter and multiple letter wildcards and range searching. The query syntax can be quite complicated (for more information please see http://www.lucenetutorial.com/lucene-query-syntax.html). Here we will learn how to use the wildcards while constructing queries.
- * is a multiple letter wildcard. To perform a multiple letter wildcard search, append * to the end of the query term. For example, econom* will search for words like econometrics, economist, economical, economy, which have the common part econom but different word endings.
- To perform a single letter wildcard search, use ? instead. For example, searching for economi?? will only match words that have exactly two letters after economi, such as economist, economics, and economies.
- Please note that stopwords are used by default with the Lucene indexer, so searching for words like the will match 0 documents. This is explained in a message on the search page, which states that such words are too common and were ignored.
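The difference between the two wildcards can be tried out away from Greenstone with a small Python sketch. It uses Python's standard fnmatch module, which happens to give * and ? the same meaning as described above. This is only an illustration of the wildcard semantics and is not how Lucene evaluates queries; the word list and the matches helper are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical word list standing in for terms in an index.
WORDS = [
    "economy", "economic", "economics", "economies",
    "economist", "economical", "econometrics",
]


def matches(pattern: str, words: list[str]) -> list[str]:
    """Return the words matching a pattern where * means any run of
    letters (including none) and ? means exactly one letter."""
    return [w for w in words if fnmatchcase(w, pattern)]


# Multiple letter wildcard: every word starting with "econom" matches.
multi = matches("econom*", WORDS)

# Single letter wildcard: exactly two letters must follow "economi".
single = matches("economi??", WORDS)

print(multi)
print(single)
```

Here econom* keeps all seven words, while economi?? keeps only economics, economies and economist, because economic has one letter after economi and economical has three.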
Build with MGPP
- Start a new collection called Greenstone Demo MGPP and also base it on the Greenstone demo (demo) collection.
- In the Gather panel, drag all 11 folders from Documents in Greenstone Collections → Greenstone demo (demo) into the new collection.
- In the Search Indexes section of the Design panel, you will notice that the active indexer is MGPP, since this is the default. (If not, you'd click the Change... button, select MGPP and click OK, in which case the Assigned Indexes section and its options may change accordingly.)
- There are three options at the bottom of the panel — Stem, Casefold and Accent fold. Notice that all three are enabled. Once an option is enabled, it will also appear in the collection's PREFERENCES page and can be turned on or off from there.
- In the Indexing Levels section, also select section, if it isn't already.
Build and preview the collection.
Search with MGPP
- MGPP supports stemming, casefolding and accentfolding. By default, searching in collections built with the MGPP indexer is set to whole word must match and ignore case differences. So searching for econom will return 0 documents. Searching for fao and for FAO returns the same result: a word count of 85 and 11 matched documents.
Go to the PREFERENCES page by clicking the PREFERENCES button at the top right corner. You can see that the Word endings: option is set to whole word must match and the Case differences: option is set to ignore case differences.
- Sometimes we may want to ignore word endings while searching so as to match different variations of the term. Go to the PREFERENCES page and change the Word endings: option from whole word must match to ignore word endings. Click the set preferences button. Click Search. This time try searching for econom again. 9 documents are found.
Please note that word endings are determined according to the third-party stemming tables incorporated in Greenstone, not by the user. Thus the searches may not do precisely what is expected, especially when cultural variations or dialects are concerned. Besides, stemming is not available for all languages; only English and French have stemming at the moment.
Go to the PREFERENCES page and change back to whole word must match to avoid confusion later on. Click the set preferences button.
- Sometimes we may want to search for the exact term, that is, to distinguish upper case from lower case. Back on the PREFERENCES page, change the Case differences: option from ignore case differences to upper/lower case must match. Click the set preferences button. Click Search. Now try searching for fao and then for FAO. Notice the difference in the results?
Go back to the PREFERENCES page and change the Case differences: option back to ignore case differences to avoid confusion later on. Click the set preferences button.
Use search mode hotkeys with query term
MGPP has several hotkeys for setting the search modes for a query term. These hotkeys explicitly set the Word endings: option and the Case differences: option for the query being constructed.
- #s and #u are hotkeys for the Word endings: option. Appending #s to a query term will specifically enable the ignore word endings function. For example, try searching for econom#s. 9 documents are found, which is the same as in the previous section. Remember that we have set it back to whole word must match. This means using hotkeys will override the current preference settings.
- Appending #u to a query term will explicitly set the current search to whole word must match. Note that using hotkeys will only affect that query term. That is, hotkeys are used per term. For example, if a query expression contains more than one term, some terms can have hotkeys and others not, and the hotkeys can be different for different terms. This provides a fine-grained control of the query, whereas changing settings in the PREFERENCES page will affect the query as a whole.
- Hotkeys #i and #c control the case sensitivity. Appending #i to a query term will explicitly set the search to ignore case differences (i.e. case insensitive).
- In contrast, appending #c will specifically turn off casefolding, that is, upper/lower case must match. For example, searching for fao#c returns 0 documents.
- Finally, the hotkeys can also be used in combination. For example, you can append #uc to a query term so as to match the whole term (without stemming) and in its exact form (distinguishing upper case from lower case).
For example, try searching for econom#si and compare against the results when searching for econom#sc and for Econom#sc. The first search is case insensitive and the last two searches are both case sensitive.
A quick reference of the search mode hotkeys in MGPP
#s ignore word endings
#u whole word must match
#i ignore case differences
#c upper/lower case must match
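The way hotkeys override the PREFERENCES settings for a single term can be sketched in a few lines of Python. This is a minimal illustration of the behaviour described in this exercise, not Greenstone's or MGPP's actual query parser; SearchMode, DEFAULT_MODE and parse_term are hypothetical names invented for the example, and the defaults are the ones stated above (whole word must match, ignore case differences).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchMode:
    stem: bool      # True = ignore word endings, False = whole word must match
    casefold: bool  # True = ignore case differences, False = case must match


# Assumed defaults, taken from the tutorial text: no stemming, case ignored.
DEFAULT_MODE = SearchMode(stem=False, casefold=True)


def parse_term(term: str, prefs: SearchMode = DEFAULT_MODE) -> tuple[str, SearchMode]:
    """Split a query term such as 'econom#sc' into the bare word and the
    search mode that applies to that term only."""
    word, _, flags = term.partition("#")
    stem, casefold = prefs.stem, prefs.casefold  # start from the preferences
    for flag in flags:
        if flag == "s":
            stem = True
        elif flag == "u":
            stem = False
        elif flag == "i":
            casefold = True
        elif flag == "c":
            casefold = False
        else:
            raise ValueError(f"unknown hotkey: #{flag}")
    return word, SearchMode(stem, casefold)


# Each term of a query is handled independently of the others.
for term in ["econom#s", "fao#c", "econom#uc", "fao"]:
    print(term, "->", parse_term(term))
```

A term without hotkeys, such as fao, simply keeps whatever the preferences say, which is why hotkeys give per-term control while the PREFERENCES page affects the query as a whole.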