Greenstone tutorial exercise
Scanned image collection
Here we build a small replica of Niupepa, the Maori Newspaper collection, using five newspapers taken from two newspaper series. It allows full text searching and browsing by title and date. When a newspaper is viewed, a preview image and its corresponding plain text are presented side by side, with a "go to page" navigation feature at the top of the page.
The collection involves a mixture of plugins, classifiers, and format statements. The bulk of the work is done by PagedImagePlugin, a plugin designed precisely for the kind of data we have in this example. For each document, an "item" file is prepared that specifies a list of image files that constitute the document, tagged with their page number and (optionally) accompanied by a text file containing the machine-readable version of the image, which is used for full text searching. Three newspapers in our collection (all from the series "Te Whetu o Te Tau") have text representations, and two (from "Te Waka o Te Iwi") have images only. Item files can also specify metadata. In our example the newspaper series is recorded as ex.Title and its date of publication as ex.Date. Issue ex.Volume and ex.Number metadata is also recorded, where appropriate. This metadata is extracted as part of the building process.
- Start a new collection called Paged Images and fill out the fields with appropriate information: it is a collection sourced from an excerpt of Niupepa documents.
- In the Gather panel, open the sample_files → niupepa → sample_items folder and drag the two subfolders into your collection on the right-hand side. A popup window asks whether you want to add PagedImagePlugin to the collection: click <Add Plugin>, because this plugin will be needed to process the item files.
PagedImagePlugin will process the item files, creating a document for each one with a separate section for each page listed. Thumbnail and screen-resolution sized images of each page image will be generated.
- Go to the Create panel, build the collection and preview the result. Search for "waka" and view one of the titles listed (all three appear as Te Whetu o Te Tau). Browse by Titles and view one of the Te Waka o Te Iwi newspapers. Note that only the Te Whetu o Te Tau newspapers have text; Te Waka o Te Iwi papers don't.
This collection was built with Greenstone's default settings. You can locate items of interest, but the information is less clearly and attractively presented than in the full Niupepa collection.
Grouping documents by series title and displaying dates within each group
Under Titles, documents from the same series are repeated without any distinguishing features such as date, volume or number. It would be better to group them by series title and display other information within each group. This can be accomplished using the -bookshelf_type option to the List classifier, and tuning the classifier's format statement.
- In the Design panel, under the Browsing Classifiers section, delete the List classifier for ex.Source. This classifier is not much use.
- Select the classifier for dc.Title;ex.Title and click <Configure Classifier...>. Set bookshelf_type to always. This will create a bookshelf for each Title in the collection. Note that setting this option to duplicate_only instead creates a bookshelf only when more than one document shares a Title.
- Build the collection, and preview the Titles list.
- Now we change the format statement for Titles to display more information about the documents. In the Format Features section of the Format panel, select the dc.Title;ex.Title classifier (CL1) in the Choose Feature list, and VList in the Affected Component list. Click <Add Format> to add this format statement to your collection.
Delete the contents of the HTML Format String box, and add the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → titles_tweak.txt.)
<td valign="top">[link][icon][/link]</td>
<td valign="top">
{If}{[numleafdocs],[ex.Title] ([numleafdocs]),
Volume [ex.Volume] Number [ex.Number] Date [format:ex.Date]}
</td>
- Refresh in the web browser to view the new Titles list. As a consequence of using the bookshelf_type option of the List classifier, bookshelf icons appear when titles are browsed. This revised format statement shows, in brackets, how many items each bookshelf contains. It works by exploiting the fact that only bookshelf icons define [numleafdocs] metadata. For document nodes, Title is not displayed. Instead, Volume, Number and Date information are displayed.
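The branching in the Titles format statement can be illustrated outside Greenstone. The following is a minimal Python sketch, not Greenstone code: the function name, the dictionaries and the sample values are hypothetical, and it assumes that a missing or empty [numleafdocs] value makes {If} take its second branch. The date is left in internal form because the [format:] step is not reproduced here.

```python
def render_titles_row(node: dict) -> str:
    """Mimic {If}{[numleafdocs], bookshelf text, document text}."""
    # Assumption: a missing or empty numleafdocs value counts as "not defined".
    if node.get("numleafdocs"):
        # Bookshelf node: series title plus the number of documents it holds.
        return f"{node['ex.Title']} ({node['numleafdocs']})"
    # Document node: the title is omitted; volume, number and date are shown.
    return (
        f"Volume {node['ex.Volume']} Number {node['ex.Number']} "
        f"Date {node['ex.Date']}"
    )


# Hypothetical sample nodes; the date value is only a placeholder.
bookshelf = {"ex.Title": "Te Whetu o Te Tau", "numleafdocs": 3}
document = {
    "ex.Title": "Te Whetu o Te Tau",
    "ex.Volume": "1",
    "ex.Number": "2",
    "ex.Date": "18580601",
}

print(render_titles_row(bookshelf))
print(render_titles_row(document))
```

Only the bookshelf dictionary carries numleafdocs, so only it takes the first branch, which is the same distinction the format statement relies on.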
Browsing documents by Date
- Back in the Design panel, under the Browsing Classifiers section, add a DateList classifier, leaving its metadata option set to ex.Date.
- Build the collection, and preview the Dates list.
- The Dates list groups documents by date. Greenstone's internal date format is YYYYMMDD, for example 18580601, and this is crucial for the DateList classifier to correctly parse date metadata and generate an ordered date list. However, the date has been made to look nice by adding a [format:] macro to Date metadata in the format statement.
- In the Format Features section of the Format panel, select All Features in the Choose Feature list, and DateList in the Affected Component list. Click <Add Format> to add this format statement to your collection. Replace the last line
<td>{Or}{[format:dc.Date],[format:exp.Date],[format:ex.Date]}</td>
with
<td>{Or}{[dc.Date],[exp.Date],[ex.Date]}</td>
Refresh in the web browser to view the new Dates list. The dates are now shown in internal format.
- Change the format statement back to reinstate the nicely formatted dates.
This can be done by selecting DateList in the list of assigned format statements and clicking <Reset to Default>.
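Why the YYYYMMDD form matters for ordering, and what a display-time formatting step adds, can be sketched in a few lines of Python. This is an illustration only: the sample dates other than 18580601 are invented, pretty_date is a hypothetical helper, and its output style is an assumption rather than the exact rendering produced by Greenstone's [format:] macro.

```python
import calendar
from datetime import datetime


def pretty_date(internal: str) -> str:
    """Turn an internal YYYYMMDD string into a human-readable date."""
    d = datetime.strptime(internal, "%Y%m%d")
    # Month name comes from the calendar module (English in the default locale).
    return f"{d.day} {calendar.month_name[d.month]} {d.year}"


# Hypothetical date values in the internal format.
dates = ["18580601", "18571015", "18580115"]

# Because year, month and day are fixed-width and most significant first,
# plain string sorting already gives chronological order.
by_string = sorted(dates)
by_calendar = sorted(dates, key=lambda s: datetime.strptime(s, "%Y%m%d"))

print(by_string == by_calendar)
print([pretty_date(d) for d in by_string])
```

Keeping the stored value in internal form and formatting it only for display means the ordering never depends on how the date is presented.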
Displaying scanned images and suppressing dummy text
When you reach a newspaper, only its associated text is displayed. When either of the Te Waka o Te Iwi newspapers is accessed, the document view presents the message "This document has no text." No scanned image information (screen-view resolution or otherwise) is shown, even though it has been computed and stored with the document. This can be fixed by a format statement that modifies the default behaviour for DocumentText.
- In the Format Features section of the Format panel, select the DocumentText format statement. The default format string displays the document's plain text, which, if there is none, is set to "This document has no text." Change this to the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → doc_tweak.txt)
<table><tr>
<td valign=top>[srclink][screenicon][/srclink]</td>
<td valign=top>[Text]</td>
</tr></table>
Including [screenicon] has the effect of embedding the screen-sized image generated by switching the screenview option on in PagedImagePlugin. It is hyperlinked to the original image by the construct [srclink]...[/srclink]. This is a large image but it may be scaled by your browser.
This modification displays the screenview image, but does nothing about the dummy text "This document has no text.", which will still be displayed. To get rid of this, edit the DocumentText format statement again and replace
<td valign=top>[Text]</td>
with
{If}{[NoText],,<td valign=top>[Text]</td>}
- Preview the collection and view one of the Te Waka o Te Iwi documents. The line "This document has no text." should now be gone.
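The effect of the empty middle argument in {If}{[NoText],,...} can be sketched as follows. This is a hedged Python illustration with hypothetical names and sample data; it assumes that image-only pages carry a non-empty NoText value, as the format statement above implies, and it uses the placeholder string "screenicon" where Greenstone would embed the image.

```python
def render_page_row(section: dict) -> str:
    """Build the table row: image cell always, text cell only when text exists."""
    cells = ["<td valign=top>screenicon</td>"]
    # Assumption: NoText is set (non-empty) on sections that have no text.
    if not section.get("NoText"):
        cells.append(f"<td valign=top>{section['Text']}</td>")
    return "<table><tr>" + "".join(cells) + "</tr></table>"


# Hypothetical sections: one image-only page, one page with placeholder text.
image_only = {"NoText": "1", "Text": "This document has no text."}
with_text = {"Text": "placeholder page text"}

print(render_page_row(image_only))
print(render_page_row(with_text))
```

The dummy message is still stored with the image-only section; it is simply never emitted, which mirrors what the revised format statement does.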
Searching at page level
- The newspaper documents are split into sections, one per page. For large documents, it is useful to be able to search on sections rather than documents. This allows users to more easily locate the relevant information in the document.
- Go to the Search Indexes section of the Design panel. Remove the ex.Source index and check the section checkbox to build the indexes on section level as well as document level. Make section level the default by selecting its Default radio button.
- Set the display text used for the level drop-down menu by going to the Search section on the Format panel. Set the document level text to "newspaper", and the section level text to "page".
- Build and preview the collection. Compare searching at "newspaper" level with searching at "page" level. A useful search term for this collection is "aroha".
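The difference between the two index levels can be pictured with a toy inverted index. The Python sketch below is not how Greenstone's indexers work internally; build_index and the page texts are invented for illustration, and only the idea of indexing per document versus per section comes from the steps above.

```python
from collections import defaultdict

# Hypothetical page texts keyed by document id and page number.
PAGES = {
    "10-1-1": {1: "he aroha nui", 2: "nga korero"},
    "10-1-2": {5: "te aroha", 6: "aroha mai"},
}


def build_index(docs: dict, level: str) -> dict:
    """Map each word to the units (documents or sections) that contain it."""
    index = defaultdict(set)
    for doc_id, pages in docs.items():
        for page_no, text in pages.items():
            # At document level every page reports its parent document;
            # at section level each page is its own retrievable unit.
            unit = doc_id if level == "document" else (doc_id, page_no)
            for word in text.lower().split():
                index[word].add(unit)
    return index


newspaper_hits = build_index(PAGES, "document")["aroha"]
page_hits = build_index(PAGES, "section")["aroha"]

print(sorted(newspaper_hits))
print(sorted(page_hits))
```

A document-level search returns each matching newspaper once, while a section-level search points straight at the matching pages, which is why the finer level suits long documents.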
Tidying up search results
You will notice that when searching for individual pages, a thumbnail of the newspaper image is displayed in the search results. For pages of text like these, thumbnails are not very useful. Let's tell PagedImagePlugin not to generate them.
- In the Design panel, under the Document Plugins section, select PagedImagePlugin from the Assigned Plugins list and click <Configure Plugin...>. Switch on the create_thumbnail option and set its value to false.
- Rebuild and preview the collection, doing a search at page level.
Search results at newspaper level display the original filename. Let's remove that also.
- Go to the Format Features section of the Format panel in the Librarian Interface, choose All Features in the Choose Feature list, and select the VList format statement from the list of assigned format statements. Remove the following from the last line of the format string:
{If}{[ex.Source],<br><i>([ex.Source])</i>}
Preview the collection.
You might notice that newspaper level search results only display the newspaper Title, and not any volume information, while page level search results only show a large scan of the newspaper page, the Title of the page (the page number), and not the Title of the newspaper. We'll modify the format statement to show Volume and Number information, and for page results, the newspaper title as well as the page number.
- In the Format Features section, select Search in Choose Feature, and VList in Affected Component. Click <Add Format> to add this format to the collection. The previous changes modified VList, so they will apply to all VLists that don't have specific format statements. These next changes are made to SearchVList, so they will only apply to search results. The extracted Title for the current section is specified as [ex.Title] while the Title for the parent section is [parent:ex.Title]. Since the same SearchVList format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases. Set the format statement to the following text (it can be copied and pasted from the file sample_files → niupepa → formats → search_tweak.txt):
<td valign="top">[link][icon][/link]</td>
<td valign="top">
{If}{[parent:ex.Title],[parent:ex.Title] Volume [parent:ex.Volume] Number [parent:ex.Number]: Page [ex.Title],
[ex.Title] Volume [ex.Volume] Number [ex.Number]}
<br/><i>({Or}{[format:parent:ex.Date],[format:ex.Date],undated})</i>
</td>
Preview the search results. Items display newspaper Title, Volume, Number and Date, and pages also display the page number.
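The two cases that the SearchVList format statement has to handle can be traced with a small Python sketch. The function and the sample metadata are hypothetical; the sketch assumes that [parent:...] values are defined only for page (section) results and that {Or} yields its first defined argument. Dates are left unformatted because the [format:] step is not reproduced.

```python
def search_label(md: dict) -> str:
    """Mimic the SearchVList text for newspaper-level and page-level hits."""
    if md.get("parent:ex.Title"):
        # Page result: newspaper details come from the parent, and the
        # section's own ex.Title is the page number.
        head = (
            f"{md['parent:ex.Title']} Volume {md['parent:ex.Volume']} "
            f"Number {md['parent:ex.Number']}: Page {md['ex.Title']}"
        )
    else:
        # Newspaper result: all details come from the document itself.
        head = f"{md['ex.Title']} Volume {md['ex.Volume']} Number {md['ex.Number']}"
    # {Or}: first defined value wins, with "undated" as the fallback.
    date = md.get("parent:ex.Date") or md.get("ex.Date") or "undated"
    return f"{head} ({date})"


# Hypothetical metadata; the date value is only a placeholder.
newspaper = {
    "ex.Title": "Te Whetu o Te Tau",
    "ex.Volume": "1",
    "ex.Number": "2",
    "ex.Date": "18580601",
}
page = {
    "ex.Title": "5",
    "parent:ex.Title": "Te Whetu o Te Tau",
    "parent:ex.Volume": "1",
    "parent:ex.Number": "2",
}

print(search_label(newspaper))
print(search_label(page))
```

The page example deliberately has no date on either level, so it exercises the "undated" fallback at the end of the {Or} list.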
The collection you have just built involves a fairly complex document structure. There are two series of newspapers, Te Waka and Te Whetu.
In the Te Waka series there are two actual newspapers, Volume 1 Numbers 1 and 2. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 4 pages, numbered 5, 6, 7, 8. The page numbers increase consecutively through each volume, despite the fact that the volume is divided into different Numbers. Each page in the Te Waka series is represented by a single file, a GIF image of the page.
The Te Whetu series has three actual newspapers, Volume 1 Numbers 1, 2, and 3. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 5 pages, numbered 5, 6, 7, 8, 9; Number 3 has 5 pages, numbered 10, 11, 12, 13, 14. Again the page numbers increase consecutively through each volume. Each page in this series is represented by two files, a GIF image of the page and a text file containing the OCR’d text that appears on it.
The key to this structure is in the respective .item files. Here is a synopsis of the information they contain:
(9-1-1) Te Waka Volume 1 Number 1
p.1 gif
p.2 gif
p.3 gif
p.4 gif
(9-1-2) Te Waka Volume 1 Number 2
p.5 gif
p.6 gif
p.7 gif
p.8 gif
(10-1-1) Te Whetu Volume 1 Number 1
p.1 gif text
p.2 gif text
p.3 gif text
p.4 gif text
(10-1-2) Te Whetu Volume 1 Number 2
p.5 gif text
…
p.9 gif text
(10-1-3) Te Whetu Volume 1 Number 3
p.10 gif text
…
p.14 gif text
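The synopsis above can also be written down as a small data model, which makes the page-numbering rule easy to check. This Python sketch is illustrative only: the Issue class and its field names are invented, and it models the synopsis rather than the actual syntax of the .item files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    item_id: str      # identifier shown in brackets in the synopsis
    series: str
    volume: int
    number: int
    first_page: int
    last_page: int
    has_text: bool    # True when each page also has a text file

    @property
    def pages(self) -> range:
        return range(self.first_page, self.last_page + 1)

    @property
    def file_count(self) -> int:
        # One GIF per page, plus one text file per page where present.
        return len(self.pages) * (2 if self.has_text else 1)


ISSUES = [
    Issue("9-1-1", "Te Waka", 1, 1, 1, 4, False),
    Issue("9-1-2", "Te Waka", 1, 2, 5, 8, False),
    Issue("10-1-1", "Te Whetu", 1, 1, 1, 4, True),
    Issue("10-1-2", "Te Whetu", 1, 2, 5, 9, True),
    Issue("10-1-3", "Te Whetu", 1, 3, 10, 14, True),
]


def pages_run_consecutively(issues: list, series: str) -> bool:
    """Check that page numbers continue across Numbers within a series."""
    ordered = sorted((i for i in issues if i.series == series), key=lambda i: i.number)
    numbers = [p for issue in ordered for p in issue.pages]
    return numbers == list(range(1, len(numbers) + 1))


total_files = sum(issue.file_count for issue in ISSUES)
print(pages_run_consecutively(ISSUES, "Te Waka"), pages_run_consecutively(ISSUES, "Te Whetu"))
print(total_files)
```

Under this model the Te Waka issues contribute 8 image files and the Te Whetu issues contribute 14 images plus 14 text files.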