Greenstone tutorial exercise

Prerequisite: A collection of Word and PDF files
Devised for Greenstone version: 2.70w|3.06
Modified for Greenstone version: 2.86|3.08

Enhanced Word document handling

The standard way Greenstone processes Word documents is to convert them to HTML using a third-party program, wvWare, which does not always convert them well. If you are using Windows and have Microsoft Word installed, you can use Windows native scripting to get a better conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.

  1. In your digital library, preview the reports collection. Look at the HTML versions of the Word documents and notice that they have no structure: they have been converted to flat documents.

Using Windows native scripting

  1. In the Librarian Interface, open up the reports collection. Switch to the Design panel and select the Document Plugins section on the left-hand side. Double click the WordPlugin plugin and switch on the windows_scripting option.

    In the Search Indexes section, check the section checkbox, if it is not already checked, to build the indexes at section level as well as document level.

  1. Build the collection. You will notice that Microsoft Word is started up for each Word document: the document is saved as HTML from Word itself, to get a better conversion. Preview the collection. In the titles list, notice that word03.doc and word06.doc now have a book icon, rather than a page icon. These documents now appear with hierarchical structure.

    The default behaviour for WordPlugin with windows_scripting is to section the document based on "Heading 1", "Heading 2", "Heading 3" styles. If you open up the word03.doc or word06.doc documents in Word, you will see that the sections use these Heading styles.

    Note: to view style information in Word 2003, select Format → Styles and Formatting from the menu, and a side bar will appear on the right-hand side. (In Word 2007 and later, find the Change Styles button on the far right of the menu ribbon, then click the tiny Expand icon to its bottom right to display the styles side bar.) Click on a section heading and its formatting information will be displayed in this side bar.

  1. Some of the documents do not use styles (e.g. word01.doc) and no structure can be extracted from them. Some documents use user-defined styles. WordPlugin can be configured to use these styles instead of Heading 1, Heading 2 etc. Next we will configure WordPlugin to use the styles found in word05.doc.
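To picture how heading styles turn a flat document into a hierarchy, here is a minimal Python sketch. It is illustrative only and is not Greenstone's own code: the function name build_sections, the (style, text) paragraph representation and the dictionary layout are all assumptions made for this example. It only mirrors the default behaviour described above, where "Heading 1", "Heading 2" and "Heading 3" mark section levels.

```python
# Illustrative sketch only; names and data layout are hypothetical.
HEADING_LEVELS = {"Heading 1": 1, "Heading 2": 2, "Heading 3": 3}


def build_sections(paragraphs):
    """Turn a flat list of (style, text) pairs into a nested section tree."""
    root = {"title": None, "level": 0, "text": [], "children": []}
    stack = [root]  # stack[-1] is the section currently being filled
    for style, text in paragraphs:
        level = HEADING_LEVELS.get(style)
        if level is None:
            # Not a heading style: treat as body text of the current section.
            stack[-1]["text"].append(text)
            continue
        # Close any open sections at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        section = {"title": text, "level": level, "text": [], "children": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    return root


sample = [
    ("Normal", "Preface text"),
    ("Heading 1", "Introduction"),
    ("Normal", "Some body text"),
    ("Heading 2", "Background"),
    ("Heading 1", "Methods"),
]
tree = build_sections(sample)
print([s["title"] for s in tree["children"]])  # ['Introduction', 'Methods']
```

A document with no heading styles at all, like word01.doc, would leave every paragraph in the root section, which corresponds to the flat result seen earlier.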

Modes in the Librarian Interface

  1. The Librarian Interface operates in three modes. Go to File → Preferences... → Mode and see the modes and what functionality they provide access to. Librarian is the default mode. Check that this is indeed the currently active mode.

Defining styles

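  1. Open up word05.doc in Word (by double-clicking on it in the Gather pane), and examine the title and section heading styles. You will see that various user-defined header styles are set, corresponding to the ManualTitle, ChapterTitle, AppendixTitle, SectionHeading and SubsectionHeading values used later in this exercise.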

  1. In the Document Plugins section of the Design panel, select WordPlugin and click <Configure Plugin...>. Four types of header can be set:

    • level1_header (level1Header1|level1Header2|...)
    • level2_header (level2Header1|level2Header2|...)
    • level3_header (level3Header1|level3Header2|...)
    • title_header (titleHeader1|titleHeader2|...)

    These header options define which styles should be considered as title, level 1, level 2 and level 3 styles.

    Ensure that the windows_scripting option is checked, and set the four header options to the following values. Spaces in the Word styles are removed when converting to HTML styles, and these options must match the HTML styles:

    level1_header: (ChapterTitle|AppendixTitle)
    level2_header: SectionHeading
    level3_header: SubsectionHeading
    title_header : ManualTitle

    Once these are set, click <OK>.

  1. Close any documents that are still open in Word, as this can prevent the build process from completing correctly.

  1. Build the collection and preview it. Look in particular at word05.doc. You will see that this document is now also hierarchically structured.

    If you have documents with different formatting styles, you can use (...|...) to specify all of the different styles.
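The (...|...) syntax reads like a regular-expression alternation. The following Python sketch shows one way such options could map a style name to a header level. It is an illustration under stated assumptions, not WordPlugin's implementation: the helper names html_style_name and classify are hypothetical, whole-name matching is an assumption, and the spaced Word style names in the usage lines (such as "Chapter Title") are invented for the example. Only the option values are taken from this exercise.

```python
import re

# Option values as entered in this exercise.
HEADER_OPTIONS = {
    "title": "ManualTitle",
    1: "(ChapterTitle|AppendixTitle)",
    2: "SectionHeading",
    3: "SubsectionHeading",
}


def html_style_name(word_style):
    """Word style names lose their spaces when saved as HTML styles."""
    return word_style.replace(" ", "")


def classify(word_style, options=HEADER_OPTIONS):
    """Return 'title', 1, 2 or 3 for a matching style, otherwise None."""
    name = html_style_name(word_style)
    for level, pattern in options.items():
        # Assumption: the whole style name must match the option value.
        if re.fullmatch(pattern, name):
            return level
    return None


# Hypothetical Word style names, written with spaces.
print(classify("Chapter Title"))       # 1
print(classify("Subsection Heading"))  # 3
print(classify("Normal"))              # None
```

Matching the whole name keeps SubsectionHeading from being mistaken for SectionHeading, which a plain substring search would do.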

Removing pre-defined table of contents

  1. If you look at the HTML version of word06.doc, you will see that it now has two tables of contents. One is generated by Greenstone based on the document's styles; the other was already defined in the Word document. WordPlugin can be configured to remove predefined tables of contents and tables of figures. The tables must be defined with Word styles in order for this to work.

  1. To remove the tables of contents and figures from word06.doc, switch on the delete_toc option in WordPlugin. Set the toc_header option to (MsoToc1|MsoToc2|MsoToc3|MsoTof|TOA). In this document, the table of contents and list of figures use these style names. Click <OK>.

  1. Build and preview the collection. word06.doc should now have only one table of contents.
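As a rough picture of what removing a predefined table of contents involves, the Python sketch below drops every paragraph whose style matches the toc_header value. This is a simplified illustration, not how WordPlugin is implemented: the function name strip_toc and the (style, text) paragraph representation are assumptions, and only the toc_header value comes from this exercise.

```python
import re

# toc_header value as entered in this exercise.
TOC_HEADER = "(MsoToc1|MsoToc2|MsoToc3|MsoTof|TOA)"


def strip_toc(paragraphs, toc_header=TOC_HEADER):
    """Drop (style, text) paragraphs whose style matches toc_header."""
    pattern = re.compile(toc_header)
    return [(style, text) for style, text in paragraphs
            if not pattern.fullmatch(style)]


# Hypothetical document content for illustration.
doc = [
    ("MsoToc1", "1 Introduction ..... 3"),
    ("MsoToc2", "1.1 Background ..... 4"),
    ("MsoTof", "Figure 1 ..... 5"),
    ("Heading1", "Introduction"),
    ("MsoNormal", "Body text"),
]
print(strip_toc(doc))
# [('Heading1', 'Introduction'), ('MsoNormal', 'Body text')]
```

The sketch also shows why the tables must be defined with Word styles: a table of contents typed as ordinary text would carry no distinguishing style to match.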

Extracting document properties as metadata

  1. When the windows_scripting option is set, Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the metadata_fields option.

  1. In the Enrich panel, look at the metadata that has been extracted for word05.doc and word06.doc. Now open the documents in Word and look at which properties have been set. In Word 2003, choose File → Properties. In Word 2007/2010, click the Word icon at the top left, then choose Prepare → Properties. In Word 2013, choose File → Info; the Properties section is on the right. The documents have Title, Author, Subject, and Keywords properties. WordPlugin can be configured to look for these properties and extract them.

  1. In the Design panel, under Document Plugins, configure WordPlugin once again. Switch on the metadata_fields configuration option and set its value to the following, making sure not to enter any trailing spaces:

    Title,Author<Creator>,Subject,Keywords<Subject>

    This will make WordPlugin try to extract Title, Author, Subject and Keywords metadata. Title and Subject will be saved with the same name, while Author will be saved as Creator metadata, and Keywords as Subject metadata.

  1. Make sure you have closed all the documents that were opened, then rebuild the collection.

  1. Look at the metadata for the two documents again in the Enrich panel. You should now see ex.Creator and ex.Subject metadata items. This metadata can now be used in display or browsing classifiers etc.
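The metadata_fields value used above follows a simple pattern: a comma-separated list of Word property names, each optionally followed by a target metadata name in angle brackets. The Python sketch below shows how such a value can be read. It is an illustration only, not Greenstone's parser, and the function name parse_metadata_fields is hypothetical.

```python
def parse_metadata_fields(spec):
    """Map each Word property name to the metadata name it is saved under."""
    mapping = {}
    for item in spec.split(","):
        if item.endswith(">") and "<" in item:
            # "Author<Creator>" means: save property Author as Creator.
            prop, _, target = item[:-1].partition("<")
        else:
            # No rename given: keep the property's own name.
            prop, target = item, item
        mapping[prop] = target
    return mapping


fields = parse_metadata_fields("Title,Author<Creator>,Subject,Keywords<Subject>")
for prop, target in fields.items():
    print(prop, "->", target)
# Title -> Title
# Author -> Creator
# Subject -> Subject
# Keywords -> Subject
```

Both Subject and Keywords end up under the Subject name, which matches the description given with the metadata_fields value earlier in this section.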


Copyright © 2005-2016 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”