Greenstone tutorial exercise
A collection of Word and PDF files
You will need some source files like those in the sample_files → Word_and_PDF folder.
- Start a new collection called reports (File → New...) and base it on -- New Collection --.
- Copy all the .doc, .rtf, .pdf and .ps files from sample_files → Word_and_PDF → Documents into the collection. There are 9 files in all: you can select multiple files by clicking on the first one and shift-clicking on the last one, and then drag them all across together. (This is the usual technique for multiple selection.)
- Switch to the Create panel, and build and preview the collection.
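If you would rather check a folder listing with a script before dragging the files across, the selection rule from the copy step can be sketched in Python. This is only an illustration and is not part of Greenstone: the function name select_source_files and the sample listing are hypothetical.

```python
from pathlib import PurePath

# Extensions named in the exercise: Word, RTF, PDF and PostScript.
WANTED_EXTENSIONS = {".doc", ".rtf", ".pdf", ".ps"}


def select_source_files(names):
    """Return the names whose extension is one of the wanted types.

    The comparison ignores case, so "MEMO.RTF" is selected too.
    """
    return [n for n in names if PurePath(n).suffix.lower() in WANTED_EXTENSIONS]


if __name__ == "__main__":
    # Hypothetical folder listing, for illustration only.
    listing = ["pdf01.pdf", "word03.doc", "notes.txt", "paper.ps", "memo.RTF"]
    print(select_source_files(listing))
```

The order of the input is preserved, so the result can be compared directly with what you see in the file browser.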
Viewing the extracted metadata
- Again, this collection contains no manually assigned metadata. All the information that appears (title and filename) is extracted automatically from the documents themselves. Because of this, the quality of some of the title metadata is suspect.
- Back in the Librarian Interface, click the Enrich tab to view the automatically extracted metadata. You will need to scroll down to see the extracted metadata, whose names begin with "ex.".
- Check whether the ex.Title metadata is correct for some of the documents by opening them. You can open a document from the Librarian Interface by double clicking on it.
- The extracted Title metadata for some documents is incorrect. For example, the Titles for pdf01.pdf and word03.doc (the same document in different formats) omit the second line of the title. The Title for pdf03.pdf has the wrong text altogether.
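The "ex." naming convention makes it easy to tell extracted metadata apart from anything else attached to a document. The following Python sketch shows the idea on a plain list of (name, value) pairs. It is a toy model and not Greenstone code: the function name split_by_namespace is hypothetical, and the truncated ex.Title value is invented for illustration.

```python
EXTRACTED_PREFIX = "ex."


def split_by_namespace(metadata):
    """Split (name, value) pairs into extracted and non-extracted lists.

    A pair counts as extracted when its name starts with "ex.".
    """
    extracted = [(n, v) for n, v in metadata if n.startswith(EXTRACTED_PREFIX)]
    other = [(n, v) for n, v in metadata if not n.startswith(EXTRACTED_PREFIX)]
    return extracted, other


if __name__ == "__main__":
    # Hypothetical metadata for one document; the ex.Title value is made up.
    sample = [
        ("ex.Title", "Greenstone: A comprehensive open-source"),
        ("ex.Source", "word03.doc"),
        ("dc.Title", "Greenstone: A comprehensive open-source digital library software system"),
    ]
    extracted, other = split_by_namespace(sample)
    print("extracted:", extracted)
    print("other:", other)
```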
Manually adding metadata to documents in a collection
- In the Enrich panel, manually add Dublin Core dc.Title metadata to those documents which have incorrect ex.Title metadata. Select word03.doc and double-click to open it. Copy the title of this document ("Greenstone: A comprehensive open-source digital library software system") and return to the Librarian Interface. Scroll up or down in the metadata table until you can see dc.Title. Click in the value box and paste in the metadata.
- Now add dc.Creator information for the same document. You can add more than one value for the same field: when you press Enter in a metadata value field, a new empty field of the same type will be generated. Add each author separately as dc.Creator metadata.
- Close the document (in Microsoft Word) when you have finished copying metadata from it. External programs opened when viewing documents must be closed before building the collection, otherwise errors can occur.
- Next add dc.Title and dc.Creator metadata for a few of the other documents.
- As you add more values, you will notice that they appear in the Existing values for ... box below the metadata table. If you are adding the same metadata value to more than one document, you can select it from this list. For example, pdf01.pdf and word03.doc share the same Title, and many documents have authors in common.
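The steps above rely on two ideas: a field such as dc.Creator can hold several values for one document, and values already used anywhere in the collection are offered for reuse. A minimal Python sketch of that data model follows. It is an assumption-laden toy, not the Librarian Interface's implementation: the class MetadataTable and the author names are hypothetical.

```python
class MetadataTable:
    """Toy model of the metadata table: document -> field -> list of values."""

    def __init__(self):
        self._table = {}

    def add(self, document, field, value):
        """Add one value; a field may hold several values per document."""
        values = self._table.setdefault(document, {}).setdefault(field, [])
        if value not in values:  # ignore exact duplicates
            values.append(value)

    def values(self, document, field):
        """Return the values of one field for one document, in entry order."""
        return list(self._table.get(document, {}).get(field, []))

    def existing_values(self, field):
        """Return every distinct value of a field across all documents, sorted."""
        seen = set()
        for fields in self._table.values():
            seen.update(fields.get(field, []))
        return sorted(seen)


if __name__ == "__main__":
    table = MetadataTable()
    # Placeholder author names, for illustration only.
    table.add("word03.doc", "dc.Creator", "Author A")
    table.add("word03.doc", "dc.Creator", "Author B")
    table.add("pdf01.pdf", "dc.Creator", "Author A")
    print(table.values("word03.doc", "dc.Creator"))
    print(table.existing_values("dc.Creator"))
```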
If you build and preview your collection at this point, you will see that the titles list now shows your new Titles. However, the dc.Creator metadata is not displayed. You need to alter the collection design to use this metadata.
- In the Librarian Interface, look at the Document Plugins section of the Design panel, by clicking on this in the list to the left. Here you can add, configure or remove plugins to be used in the collection. Removing unused plugins is optional, but it speeds up processing a little. In this case we have only Word, PDF, RTF, and PostScript documents, and can remove the ZIPPlugin, TextPlugin, HTMLPlugin, EmailPlugin, PowerPointPlugin, ExcelPlugin, ImagePlugin, ISISPlugin and NulPlugin plugins. To delete a plugin, select it and click <Remove Plugin>. GreenstoneXMLPlugin is required for any type of source collection and should not be removed.
- The next step in the Design panel is Search Indexes. These specify what parts of the collection are searchable (e.g. searching by title and author). Delete the ex.Source index, which is not particularly useful, by selecting it and clicking <Remove Index>.
- By default the titles index (dc.Title,ex.dc.Title,ex.Title) covers three metadata fields, so searching it searches dc.Title, ex.dc.Title and ex.Title metadata together. If you wanted to restrict searching to just the manually added dc.Title metadata, you would edit this index and deselect ex.dc.Title and ex.Title in the list of metadata.
- You can add indexes based on any metadata. Add a new index based on dc.Creator by clicking <New Index>. Select dc.Creator in the list of metadata, and click <Add Index>.
- The Browsing Classifiers section adds "classifiers," which provide the collection with browsing functions. Go to this section and observe that Greenstone has provided two List classifiers, based on dc.Title;Title and ex.Source metadata. These correspond to the titles and filenames buttons on the collection's access bar. Remove the ex.Source classifier by selecting it and clicking <Remove Classifier>.
- Now add an AZCompactList classifier for dc.Creator. Select AZCompactList from the Select classifier to add drop-down list and click <Add Classifier...>. A popup window for Configuring Arguments appears. Select dc.Creator from the metadata drop-down list and click <OK>.
- Switch to the Create panel, and build the collection.
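To see why the choice of fields in an index matters, here is a small Python sketch of an index defined over several metadata fields, as the titles index is. It is a deliberately simplified model (a case-insensitive substring match) and says nothing about how Greenstone's indexers work internally. The function names and the sample values are hypothetical.

```python
def parse_index_spec(spec):
    """Turn "dc.Title,ex.dc.Title,ex.Title" into a list of field names."""
    return [field.strip() for field in spec.split(",") if field.strip()]


def search_index(docs, spec, term):
    """Return the documents with a value containing term in any indexed field.

    docs maps a document name to a dict of field name -> list of values.
    """
    fields = parse_index_spec(spec)
    needle = term.lower()
    hits = []
    for name, metadata in docs.items():
        texts = [value for field in fields for value in metadata.get(field, [])]
        if any(needle in text.lower() for text in texts):
            hits.append(name)
    return hits


if __name__ == "__main__":
    # Hypothetical metadata values, for illustration only.
    docs = {
        "word03.doc": {"dc.Title": ["Greenstone: A comprehensive open-source digital library software system"]},
        "pdf03.pdf": {"ex.Title": ["Some automatically extracted heading"]},
    }
    print(search_index(docs, "dc.Title,ex.dc.Title,ex.Title", "extracted"))
    print(search_index(docs, "dc.Title", "extracted"))
```

With the three-field specification the second document is found through its ex.Title value. With dc.Title alone it is not, which mirrors the effect of deselecting ex.dc.Title and ex.Title.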
Next, go to the Format panel, and select the Search section to the left. On the right, set the display text value for Index: dc.Creator to "creators".
- Press the <Preview Collection> button. Check that all the facilities work properly. There should be three full-text indexes, called text, titles, and creators. The titles list should display all the document Titles. The creators list should show one bookshelf for each author you have assigned as dc.Creator, and clicking on that bookshelf should take you to all the documents they authored.
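The creators list described above amounts to grouping documents by each dc.Creator value. The Python sketch below models that behaviour as one "bookshelf" per author. It is a simplification that ignores the classifier's configuration options, and it is not Greenstone's code: the function name build_bookshelves and the author names are hypothetical.

```python
from collections import defaultdict


def build_bookshelves(docs, field="dc.Creator"):
    """Group document names under each value of the given metadata field.

    docs maps a document name to a dict of field name -> list of values.
    A document with several creators appears on several bookshelves.
    """
    shelves = defaultdict(list)
    for name, metadata in docs.items():
        for creator in metadata.get(field, []):
            shelves[creator].append(name)
    # Sort bookshelves by author and documents by name for a stable listing.
    return {creator: sorted(names) for creator, names in sorted(shelves.items())}


if __name__ == "__main__":
    # Placeholder author names, for illustration only.
    docs = {
        "word03.doc": {"dc.Creator": ["Author B", "Author A"]},
        "pdf01.pdf": {"dc.Creator": ["Author A"]},
        "pdf03.pdf": {},
    }
    for creator, names in build_bookshelves(docs).items():
        print(creator, "->", names)
```

A document with no dc.Creator value, like the third one here, does not appear on any bookshelf.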
The titles list shows all documents which have been assigned dc.Title metadata, or have automatically extracted ex.Title. For many documents, extracted Titles may be fine, and it is impractical to add the same metadata again as dc.Title. Specifying a list of metadata names in the classifier allows us to use both.
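One way to picture a classifier that is given a list of metadata names is as a fallback: take the label from the first name that has a value, and leave the document out if none does. The Python sketch below implements that reading. The "first name with a value wins" rule is an assumption made for illustration, not a statement of how Greenstone resolves the list, and the function names and sample titles are hypothetical.

```python
def first_available(metadata, names):
    """Return the first value found under the given names, or None.

    Assumption: names are tried in order and the first one with a value wins.
    """
    for name in names:
        values = metadata.get(name, [])
        if values:
            return values[0]
    return None


def titles_list(docs, names):
    """Return sorted (title, document) pairs for documents that have a title."""
    entries = []
    for doc, metadata in docs.items():
        label = first_available(metadata, names)
        if label is not None:
            entries.append((label, doc))
    return sorted(entries)


if __name__ == "__main__":
    # Hypothetical titles, for illustration only.
    docs = {
        "word03.doc": {"dc.Title": ["Corrected title"], "ex.Title": ["Truncated title"]},
        "pdf02.pdf": {"ex.Title": ["An acceptable extracted title"]},
    }
    for title, doc in titles_list(docs, ["dc.Title", "ex.Title"]):
        print(title, "->", doc)
```

Under this reading, a manually added dc.Title takes priority where it exists, and documents whose extracted title is already fine need no extra work.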
- If you have already done the Enhanced Word document handling exercise, some of the documents will have extracted ex.Creator metadata, and some will have dc.Creator. To use both of these in the Creators classifier, make the metadata field read dc.Creator,ex.Creator.
Build the collection again and preview it. Now extracted Creators should appear in the creators list.
We will experiment with the format statements and customize the appearance of this collection in the Formatting the Word and PDF collection exercise.