Greenstone tutorial exercises (2016)
Modified for Greenstone version: 2.87
If you are working from a Greenstone Tutorial CD-ROM, DVD or USB flash drive, the sample files for these exercises are in the folder sample_files; otherwise they can be downloaded from SourceForge.
The text sometimes uses Windows terminology, but the exercises work equally well on other systems if you make appropriate changes to the pathnames.
-
Working with a pre-packaged collection (UNAIDS)
- Installing a pre-packaged Greenstone collection
Browsing around a Greenstone collection
Searching within a Greenstone collection
Leaving the Greenstone digital library
Exercise: Use the UNAIDS collection to answer these questions
-
Working with a pre-packaged collection (Digital Libraries in Education)
- Installing a pre-packaged collection
Browsing around a Greenstone collection
Exercise: Read the Help page; then answer these questions
Exercise: Use the How to build a digital library collection to answer these questions.
-
Installing Greenstone
- Installing Greenstone on a Windows system
-
Updating a Greenstone installation
- Removing Greenstone from a Windows system
Reinstalling Greenstone on a Windows system
Amalgamating different Greenstone collections
-
Building a small collection of HTML files
- Running the Greenstone Librarian Interface
Starting a new collection
Adding documents to the collection
Building the collection
Viewing the extracted metadata
Viewing the internal links and external links
Setting up a shortcut in the Librarian interface
-
A simple image collection
- Adding Title and Description metadata
Change Format Features to display new metadata
Changing the size of image thumbnails
Adding a browsing classifier based on Description metadata
Creating a searchable index based on Description metadata
-
A collection of Word and PDF files
- Viewing the extracted metadata
Manually adding metadata to documents in a collection
Document Plugins
Search indexes
Browsing classifiers
-
Formatting the Word and PDF collection
- Tidying up the default format statement
Linking to the Greenstone version or original version of documents
Making bookshelves show how many items they contain
Displaying multi-valued metadata
Advanced multi-valued metadata
-
Processing newer versions of PDF with PDFBox
- Obtaining and installing the PDFBox extension for Greenstone
Turning on the PDFBox extension functionality in GLI
-
Enhanced PDF handling
- Modes in the Librarian Interface
Splitting PDFs into sections
Using image format
Using process_exp to control document processing (advanced)
Opening PDF files with query terms highlighted
-
Enhanced Word document handling
- Using Windows native scripting
Modes in the Librarian Interface
Defining styles
Removing pre-defined table of contents
Extracting document properties as metadata
-
Associated files: combining different versions of the same document together
- Associating one document with another
Linking to associated documents
-
Exporting a collection to CD-ROM/DVD
-
A large collection of HTML files—Tudor
- Extracting more metadata from the HTML
Looking at different views of the files in the Gather and Enrich panels
-
Enhanced collection of HTML files—Tudor
- Adding hierarchically-structured metadata and a Hierarchy classifier
Adding a hierarchical phrase browser (PHIND)
Partitioning the full-text index based on metadata values
Controlling the building process
-
Formatting the HTML collection—Tudor
-
Section tagging for HTML documents
-
Downloading files from the web
-
Pointing to documents on the web
-
Bibliographic collection
- Using fielded searching
Exploding the database
Reformatting the collection to use the exploded metadata
-
CDS/ISIS collection
-
Customization: macro files and stylesheets
- Collection specific customisation
Changing the colour of the page title and page text
Make your own Greenstone home page
How to determine which images to replace (advanced)
-
Looking at a multimedia collection
-
Building a multimedia collection
- Manually correcting metadata
Browsing by media type
Suppressing dummy text
Using AZCompactList rather than List
Making bookshelves show how many items they contain
Adding a Phind phrase browser
Branding the collection with an image
Using UnknownPlugin
Cleaning up a title browser using regular expressions
Using non-standard macro files
Using different icons for different media types
Changing the collection's background image
Building a full-size version of the collection
Adding an image collage browser
-
Scanned image collection
- Grouping documents by series title and displaying dates within each group
Browsing documents by Date.
Displaying scanned images and suppressing dummy text
Searching at page level
Tidying up search results
-
Advanced scanned image collection
- Adding another newspaper to the collection
XML based item file
Using process_exp to control document processing
Switching between images and text
-
Open Archives Initiative (OAI) collection
- Tweaking the presentation with format statements
-
Setting up your Greenstone OAI Server
- Validating the Greenstone OAI server
-
Downloading over OAI
- Downloading using the Librarian Interface
Downloading using the command line
Building the downloaded documents in GLI
-
Use METS as Greenstone's Internal Representation
-
Moving a collection from DSpace to Greenstone
- Adding indexing and browsing capabilities to match DSpace's
-
Moving a collection from Greenstone to DSpace
- Using Greenstone from the command line
-
Editing metadata sets
- Running GEMS
Creating a new metadata set
Adding a new element to a metadata set
-
Building and searching with different indexers
- Build with Lucene
Search with Lucene
Build with MGPP
Search with MGPP
Use search mode hotkeys with query term
A quick reference of the search mode hotkeys in MGPP
-
Incremental building of a collection
-
The Depositor
- Enabling The Depositor
Setting a user group
Use the Depositor to do incremental addition
Batch addition with the Depositor
-
Incrementally building a collection using the command line
- Incrementally adding some additional new documents to a collection
Incrementally deleting some documents from a collection
Editing a document's text and metadata, and then incrementally rebuilding the collection
Incrementally indexing automatically
Working with a pre-packaged collection (UNAIDS)
Devised for Greenstone version: UNAIDS 2.0 CD-ROM
You will need the Greenstone UNAIDS CD-ROM
Installing a pre-packaged Greenstone collection
- On inserting the UNAIDS CD-ROM, for many computers installation will begin automatically. If not, "auto-run"—a configurable setting under Windows—is disabled on your computer and you need to double-click Setup.exe on the CD-ROM.
My Computer → UNAIDS20 → Setup.exe
- The InstallShield Wizard begins to install the UNAIDS pre-packaged collection. Select the English language and click <OK>.
- On the welcome screen, click the <Next> button.
- Choose Run from CD-ROM (standard) as the setup type. This is the default and is already selected. Then click <Next>.
- Click <Next> again to install the UNAIDS collection in the default folder, which is C:\Program Files\UNAIDS Library 2.0 [CD-ROM].
The Installation Wizard copies the required files from the CD-ROM to disk.
- Click <OK> to confirm completion of UNAIDS collection (twice).
InstallShield quits—the UNAIDS Library is installed.
CD-ROMs like this one that contain pre-packaged Greenstone collections do not include the full Greenstone software. Instead they embody a mini version of Greenstone that allows you to view the collection but not to build new ones.
Browsing around a Greenstone collection
- Launch the prebuilt library by clicking:
Start → All Programs → UNAIDS Library 2.0 [CD-ROM] → UNAIDS Library 2.0 (Standard Version).
To access Greenstone through the Local Library Server, it is sometimes necessary to turn off the proxy settings of the browser. Greenstone normally detects this and pops up a window alerting you to the problem.
- Click <Enter Library> in the dialog box and your browser (typically Internet Explorer by default) will display the Greenstone home page.
- Within the web browser, click titles a-z (in the centre of the navigation bar near the top of the page).
- Access the first book in the list of titles by clicking the book icon next to the title:
About UNAIDS.
- Use the scroll bar to view the full length of the page.
- In the table of contents near the top, click the page icon next to the heading Guiding principles of UNAIDS to view this section.
- Click the page icon next to the heading Global and local impact to view the next section.
This style of interaction can be continued to further expand and contract folders and switch to a different section.
- To fully expand the contents of this introduction chapter, click Expand Document or Chapter in the upper left portion of the page, under the picture of the document's front cover.
- You can return to the currently selected page of document titles by clicking the book icon next to the title of the book at the top of the table of contents (this signifies closing the book). You can also reach the document titles using titles a-z in the navigation bar, which in this case shows the titles beginning with A-D.
If the table of contents is open at the top level—showing all the chapters—then clicking Expand Document or Chapter expands the full document. For long documents, which take some time to load in, Greenstone seeks confirmation for this action: clicking 'continue' loads the full document.
- Browse around and peruse some other documents in the collection.
Searching within a Greenstone collection
- Access the search page by clicking search in the navigation bar.
- In the query box under Search for chapters in any language which contain some of the words, enter the term gender then click <Begin Search>.
After a short pause, the web browser loads a fresh page showing the results of the search.
- Click the page icon for the first matching document in the result set (Five Year Implementation Review of the Vienna Declaration and Programme of Action) to view the document. Because the search was at the chapter level, you are taken directly to the matching chapter within the document.
- Experiment further with searching, and with the interface in general. For example, there is a detailed Help page. It contains a Preferences section through which you can control some search settings.
The Preferences options in the UNAIDS collection are intentionally minimalist. Most collections have a separate Preferences button that offers more features.
The home page of the UNAIDS library collection cycles through a sequence of front cover images, updated every 5 seconds or so. Clicking a particular image takes you directly to that document.
Leaving the Greenstone digital library
- There are two ways of leaving Greenstone:
- Exit from the Greenstone Software server. Click on the Greenstone Software in the task bar, then choose Exit from the Browser Selection and Settings menu (or click on the exit hotspot, the red cross at the top right). The Greenstone Software exits, but your web browser continues to run.
- Exit from your web browser. Leave your web browser in the usual way. The Greenstone server detects when you exit from the browser and generates a popup window that asks whether to close down the server as well. (The reason is that other people may be using Greenstone over the network, and should not be rudely terminated.)
Exercise: Use the UNAIDS collection to answer these questions
- How many publications are there in the collection?
- How many documents are there that mention Australia in the title?
- How many top-level subject categories are there?
- What does AAVP stand for?
- What does AIDS stand for?
- Considering lower case variants only, how many times does the word "condom" appear in the collection?
How many times for "condoms"?
- If case sensitivity does not matter, how many times does the word "condom" appear in the collection?
How many times for "condoms"?
- If word endings are ignored, how many times do "condom" and variants such as "condoms" appear in the collection?
- How many chapters contain some variations of the word "condom"?
Does this make it a useful search term?
- What year saw the first reported case of AIDS in New Zealand?
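The three kinds of matching that these questions contrast (exact case, case ignored, word endings ignored) can be illustrated with a small sketch. This is only an illustration, not how Greenstone is implemented: the sample sentence is invented, and naive_stem is a hypothetical stand-in that merely strips a trailing "s", whereas a real stemmer handles many more endings.

```python
import re


def tokenize(text):
    """Split text into runs of letters (a deliberate simplification)."""
    return re.findall(r"[A-Za-z]+", text)


def naive_stem(word):
    """Hypothetical stand-in for stemming: lowercase, then drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def count_exact(text, term):
    """Count words that match the term exactly, including case."""
    return sum(1 for w in tokenize(text) if w == term)


def count_ignore_case(text, term):
    """Count words that match the term when case is ignored."""
    return sum(1 for w in tokenize(text) if w.lower() == term.lower())


def count_ignore_endings(text, term):
    """Count words that share a (naive) stem with the term."""
    target = naive_stem(term)
    return sum(1 for w in tokenize(text) if naive_stem(w) == target)


# Invented sample text, not taken from the UNAIDS collection.
sample = "Condom use rose. Condoms were distributed; each condom was free."

print(count_exact(sample, "condom"))           # lower-case form only
print(count_ignore_case(sample, "condom"))     # any capitalisation
print(count_ignore_endings(sample, "condom"))  # also counts "Condoms"
```

Each relaxation can only keep or increase the count, which is why the settings on the Preferences page matter when you compare your answers.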
Working with a pre-packaged collection (Digital Libraries in Education)
Devised for Greenstone version: IITE Digital Libraries in Education CD-ROM
You will need the Greenstone Digital Libraries in Education CD-ROM
Installing a pre-packaged collection
- Insert your CD-ROM for the course Digital libraries in education into a Windows computer. If the installation process does not start up straightaway (because the AutoPlay feature is disabled on your computer), navigate to your CD-ROM/DVD drive (normally D:), open the folder prebuilt, and double click on Setup.exe.
- During installation you are offered a choice of folder to install in: we recommend the default, which is C:\GSDL.
- You are also presented with the option to run Greenstone from the CD-ROM or to copy the entire CD-ROM. We recommend the latter: please check the box that says Install all collection files. It will take at least a couple of minutes to copy the files across.
- Finally, the installer offers to install the Netscape browser for you. Do not request this except in the unlikely event that you do not already have a web browser on your computer.
CD-ROMs like this one that contain pre-packaged Greenstone collections do not include the full Greenstone software. Instead they embody a mini version of Greenstone that allows you to view the collection but not to build new ones.
Browsing around a Greenstone collection
- To run Greenstone, open the Windows Start menu, choose Programs, select Greenstone, then the sub-menu item Digital Libraries in Education, and finally click <Enter Library>.
- Click the Digital libraries in Education collection's icon. This takes you to the collection's home page, often called the "about" page.
The home page contains an access bar with buttons called search, contents, authors a-z, modules, and acronyms. This access bar is the key to finding information in any Greenstone collection.
- Click <authors a-z>. A list of bookshelf icons appears. Click the one called Marchionini, G. to see the two course readings by Gary Marchionini.
- One of these items is a PDF file and the other is an HTML file. Click them both in turn to open up the documents.
- Click the <contents> button in the access bar. This shows two bookshelves, one for this Study Guide and the other for the Course Readings. Choose one and look at what it contains.
- Clicking a bookshelf that is open closes it. Close the bookshelf you have just opened and then choose the other one and examine its contents.
- Click <acronyms> in the access bar and find the meaning of the acronym "LOM".
- Click <search> and search for the word "LOM". Check out the difference between searching text and searching titles (use the pull-down box on the search page).
- Click the collection icon Digital Libraries in Education at the top left. This takes you back to the collection's about page.
Beneath the access bar on the collection's about page is a search box (just the same as the one that appears on the search page), a description of the collection under the heading About this collection, and instructions on how to find information in this collection.
Above the access bar is the collection's icon, saying Digital Libraries in Education. On the right is an icon saying about, above which are three buttons, home, help, and preferences.
- Click <home>. This returns you to the Greenstone home page.
- Return to the collection (by clicking its icon), and click <help>. This gives more information about how to access the collection.
- Click <preferences>. This takes you to a page where you can change some of the settings.
- Now explore the collection by navigating freely around it. Click liberally: all images that appear on the screen are clickable. If you hold the mouse stationary over an image, most browsers will soon pop up a brief "mouse-over" message that tells you what will happen if you click. Experiment! Choose common words like "the" or "and" to search for—that should evoke some response, and nothing will break. (Note: unlike many search systems, Greenstone indexes all words, including these ones.)
Exercise: Read the Help page; then answer these questions
- What does this collection contain?
- Name five ways to navigate to a target document in this collection.
- How many documents in the collection are written by Erik Duval?
- Compare the number of times the words "he" and "she" appear in the collection.
- How many times does the word "metadata" appear in titles? In the text itself?
- What's the difference between a some and an all search?
- What does "MODS" stand for?
- How do you switch the interface from English to Russian? Does it stay in Russian when you go to the Greenstone home page?
- Find a search term that yields different results depending on whether you have ignore word endings or whole word must match set on the Preferences page.
- What's the difference between Graphical and Textual interface format (on the Preferences page)?
Exercise: Use the How to build a digital library collection to answer these questions.
- How many sentences contain the word education?
- What story from the School Journal collection is featured in the book?
- How many acronyms used in the book begin with the word Standard?
- What does tapu mean?
- How many times does the word library appear? The word libraries?
- How many times does Library appear with an initial capital letter?
- How many times does some derivative of the word form appear?
- Name an English poem that was probably written in about 1000 A.D.
- Who is Alan Kay?
- On what page is the first mention of some aspect of Chinese culture?
Most of these questions would be rather difficult to answer from the printed book.
Installing Greenstone
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.87
Installing Greenstone on a Windows system
There are various ways of getting Greenstone:
- From a UNESCO CD-ROM (version 2.70) or an FAO IMARK CD-ROM (an earlier version, 2.51)
These CD-ROMs contain the Greenstone software, plus documented example collections, four language interfaces (English, French, Spanish, Russian), the Export to CD-ROM package, the ImageMagick graphics package, the Java runtime environment, and an installer that installs all of these.
- From the IITE Digital Libraries in Education CD-ROM, or a Greenstone workshop CD-ROM
In addition to all the above software, these CD-ROMs contain the tutorial exercises and a set of sample files to be used for these exercises.
CD-ROMs with Greenstone version 2.62 or earlier also include the Greenstone Language Pack, which gives reader's interfaces in many languages (currently about 40). This has its own installer which you have to invoke separately, after you have installed Greenstone.
CD-ROMs with version 2.70 or later now come with reader's interfaces in all available languages. Textual images have been removed from the interface; they are now done using CSS (Cascading Style Sheets). The Greenstone Language Pack is no longer needed. Instead, these CD-ROMs come with the Classic Interface Pack, which contains the old text images for use with a backwards compatibility macro file.
All these CD-ROMs contain the full Greenstone software, which allows you to view collections and build new ones. They are not the same as CD-ROMs that contain a pre-packaged Greenstone collection, which only allow you to view that collection.
- From http://www.greenstone.org/download
Most people download the Windows distribution from http://www.greenstone.org/download, which contains the latest version of Greenstone. To avoid a single massive download the documented example collections can be downloaded separately. To reduce the download size these collections are distributed in unbuilt form and need to be built. There is also the set of sample files used in these exercises.
Most Greenstone CD-ROMs start the installation process as soon as they are inserted into the drive, assuming that the AutoPlay feature is enabled on your computer. If installation does not begin by itself, locate the file setup.exe on the CD and double-click it to start the installation process. (On the IMARK CD-ROM this file resides in the folder software_tools → Greenstone.) If you download Greenstone over the web, what you get is the installer: just double-click it.
If Greenstone has been installed on your computer before, you should completely remove the old version before installing a new one. (However, you need not remove any pre-packaged collections that you may have installed.) To do this, see Updating a Greenstone installation.
Here is what you need to do to install Greenstone. Older versions of the installer follow much the same sequence but use slightly different wording.
- Select the language for this installation. We choose English
- Welcome to the Greenstone Digital Library Software Installer. It is recommended that you uninstall any previous installations of Greenstone2 before running this installer. Click <Next>
- License Agreement. Click <Accept>
- Choose location to install Greenstone. Leave at the default and click <Next>
- Components. Clicking the question mark button to the right of a component displays its description in a popup window. Leave the default (all components selected) and click <Next>
- Enable administration pages. Read the description on this page. If you tick the box to enable them, click <Next> to set the admin password, choose a suitable password, and click <Next> again (if your computer will not be serving collections online, the password doesn't matter)
- Click <Install> to start the installation. Click <Show Details> to show the details of this installation
- Files are copied across
- Installation is complete.
To invoke the Greenstone Reader's Interface, go to the Greenstone-2.87 item under All Programs on the Windows Start menu and select Greenstone Server. Once the server window is displayed, click <Enter Library>.
To invoke the Greenstone Librarian Interface, go to the same item and select Librarian Interface (GLI).
Updating a Greenstone installation
Prerequisite:
Installing Greenstone
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.87
These tutorial exercises assume that you are using Greenstone 2.60 or above.
Before updating to a new version of Greenstone, ensure that the computer is not running the Greenstone Librarian Interface or the Greenstone local library server. Normally, quitting your web browser, or quitting the Librarian Interface, also quits the server.
Removing Greenstone from a Windows system
Completely remove the existing version before you install a new version of Greenstone.
- Ensure that you are not running Greenstone.
- If the installed Greenstone version is 2.81 or above, remove it by going to the Greenstone home directory (by default C:\Users\<username>\Greenstone2, where <username> is your user name) and running Uninstall.bat. If the version is lower than 2.81, remove it through the Windows Control Panel (from the Settings item on the Start menu): click Add or Remove Programs, select Greenstone Digital Library Software, and Remove it. (To do this you may need Windows "Administrator" privileges.)
- For version 2.81 and above, the uninstaller has an option for keeping all your Greenstone collections; leave it selected, which is the default. For versions lower than 2.81, at the end of the uninstallation procedure you will be asked whether you would like all your Greenstone collections to be removed: you should probably say No if you wish to preserve your work.
Occasionally, problems are encountered if older Greenstone installations are not fully removed. To clean up your system, move your Greenstone collect folder, which contains all your collections, to the desktop. Then check for the folder where Greenstone is usually installed (C:\Program Files\gsdl or C:\Program Files\Greenstone, or C:\Users\<username>\Greenstone2 for version 2.81 and above) and remove it completely if it exists.
Reinstalling Greenstone on a Windows system
- The reinstallation procedure is exactly the same as the original installation procedure, described in Installing Greenstone. If you already have ImageMagick, you do not need to install it again.
There have been some superficial changes to the installation procedure in moving to Greenstone Version 2.60, because it uses a different installer program.
There is another important difference that you should be aware of: Versions 2.60 and above are installed in the folder Program Files\Greenstone, whereas prior versions were placed in the folder Program Files\gsdl (these are both default locations that you could have changed during installation.) When upgrading to Version 2.60, if you want to save existing collections you must explicitly move the contents of your collect folder from the old place to the new one. Future Greenstone versions will be installed in the new place, Program Files\Greenstone, so this problem will not happen again.
Amalgamating different Greenstone collections
- If you have previously installed the Greenstone Digital Library software in a non-standard place, you should amalgamate your collections by moving them from the collect folder in the old place into the folder Program Files\Greenstone\collect.
- If you have installed collections from pre-packaged Greenstone CD-ROMs, they reside in a different place: C:\GSDL\collect. To amalgamate these with your main Greenstone installation, move them into the folder Program Files\Greenstone\collect. The mini version of Greenstone that is associated with the pre-packaged collections is no longer necessary. To uninstall it, select Uninstall on the Greenstone menu of the Windows Start menu.
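If you have many collections to amalgamate, the move can be scripted. The following is a minimal sketch, assuming each collection is a sub-folder of a collect folder as described above; the function name amalgamate_collections is hypothetical, and the paths in the usage note are just the default locations mentioned in this exercise. Collections whose names already exist at the destination are skipped rather than overwritten.

```python
import shutil
from pathlib import Path


def amalgamate_collections(old_collect, new_collect):
    """Move each collection folder from old_collect into new_collect.

    A collection is skipped if a folder with the same name already
    exists in new_collect. Returns the names of the collections moved.
    """
    old_collect, new_collect = Path(old_collect), Path(new_collect)
    new_collect.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() takes a snapshot, so moving folders does not disturb the loop.
    for item in sorted(old_collect.iterdir()):
        if not item.is_dir():
            continue  # only folders are treated as collections
        target = new_collect / item.name
        if target.exists():
            continue  # never overwrite an existing collection
        shutil.move(str(item), str(target))
        moved.append(item.name)
    return moved
```

For example, amalgamate_collections(r"C:\GSDL\collect", r"C:\Program Files\Greenstone\collect") would gather collections installed from pre-packaged CD-ROMs into the main installation. Make sure Greenstone is not running first, and adjust the paths if you installed somewhere else.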
Building a small collection of HTML files
Sample files:
simple_html.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
You will need some HTML files, such as those in the simple_html folder in sample_files.
Running the Greenstone Librarian Interface
- Start the Greenstone Librarian Interface:
Start → All Programs → Greenstone-2.87 → Librarian Interface (GLI)
If you are using Windows Vista or Windows 7 and have installed Greenstone into C:\Program Files\Greenstone, a User Account Control dialog may appear as you try to start the Greenstone Librarian Interface. Click <Yes> to continue. After a short pause a startup screen appears, and then after a slightly longer pause the main Greenstone Librarian Interface appears. (A command prompt is also opened in the background.)
Starting a new collection
- Start a new collection within the Librarian Interface:
File → New...
- You will create a collection based on a few HTML web pages from the Tudor collection.
A window pops up. Fill it out with appropriate values, for example:
Collection title: Small HTML Collection
Description of content: A small collection of HTML pages.
Leave the setting for Base this collection on: at its default: -- New Collection --, and click <OK>.
- Next you must gather together the files that will constitute the collection. A suitable set has been prepared ahead of time in sample_files → simple_html → html_files. Using the left-hand side of the Librarian Interface's Gather panel, interactively navigate to the sample_files → simple_html folder.
Adding documents to the collection
- Now drag the html_files folder from the left-hand side and drop it on the right. The progress bar at the bottom shows some activity. Gradually, duplicates of all the files will appear in the collection panel. A popup may appear saying that geov2.js is an unrecognised filetype and can't be processed by GLI. Tick the checkbox if you do not want to see this message again.
You can inspect the files that have been copied by double-clicking on the folder in the right-hand side.
- Since this is our first collection, we won't complicate matters by manually assigning metadata or altering the collection's design. Instead we rely on default behaviour. So pass directly to the Create panel by clicking its tab.
Building the collection
- To start building the collection, click the <Build Collection> button.
- Once the collection has built successfully, a window pops up to confirm this. Click <OK>.
- Click the <Preview Collection> button to look at the end result. This loads the relevant page into your web browser (starting it up if necessary).
Viewing the extracted metadata
- Back in the Librarian Interface, click the Enrich tab to view the metadata associated with the documents in the collection.
- Presently there is no manually assigned metadata, but the act of building the collection has extracted metadata from the documents. Double click the html_files folder to expand its content. Then single-click aragon.html to display all its metadata in the right-hand side of the panel. The initial fields, starting "dc.", are empty. These are Dublin Core metadata fields for manually entered data.
- Use the scroll bar on the extreme right to view the bottom part of the list. There you will see fields starting "ex." that express the extracted metadata: for example ex.Title, based on the text within the HTML Title tags, and ex.Language, the document's language (represented using the ISO standard 2-letter mnemonic) which Greenstone determines by analyzing the document's text.
- Close the collection by clicking File → Close. This automatically saves the collection to disk.
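To make the idea of extracted metadata concrete, here is a rough sketch of how a title could be pulled from the HTML Title tags and stored under the name ex.Title. It uses Python's standard html.parser module and is not Greenstone's own extraction code; the class TitleExtractor, the function extract_metadata and the sample page are all hypothetical. Language detection is left out because it needs statistical analysis of the text.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_metadata(html_text):
    """Return a dict of extracted metadata; only ex.Title is handled here."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace so multi-line titles become one line.
    title = " ".join("".join(parser.title_parts).split())
    return {"ex.Title": title}


# Invented sample page for illustration.
page = ("<html><head><title>Katharine of Aragon</title></head>"
        "<body><p>Text</p></body></html>")
print(extract_metadata(page))
```

The "ex." prefix marks the value as extracted from the document itself, in contrast to the "dc." fields, which you fill in by hand.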
Viewing the internal links and external links
- Hyperlinks in a Greenstone collection work like this: if the link is to a document that is also in the collection, clicking it takes you to that document in the collection; if the link is to a document that is not in the collection, clicking it takes you to that document on the web.
Go back to the web browser and click the titles link near the top of the page. Open the file boleyn.html and look for the link to Katharine of Aragon (in the 5th paragraph of the Biography section). This links to a document inside the collection, aragon.html. View this document by clicking the link. For an external link, return to boleyn.html and click letters written by Anne (in the Primary Sources section). This takes you out on to the web.
If you want a warning message to be displayed first, open the file Greenstone → etc → main.cfg and uncomment the line cgiarg shortname=el argdefault=prompt (remove the # at the start of a line to uncomment it). Note that if you are already browsing a collection, you will need to go back to the home page and re-enter the collection, or even clear your browser history, to see this take effect (due to caching of the el argument). Alternatively, try restarting the Greenstone web server.
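Uncommenting the main.cfg line is easy to do in a text editor, but the same edit can be sketched in code. The function below is an illustration only: uncomment_setting is a hypothetical helper, and the sample configuration text is invented rather than copied from a real main.cfg. It removes the leading # from the first commented-out line that matches the setting and leaves everything else untouched.

```python
def uncomment_setting(config_text, setting):
    """Return config_text with the first commented-out `setting` line uncommented."""
    out = []
    done = False
    for line in config_text.splitlines(keepends=True):
        body = line.strip()
        if not done and body.startswith("#") and body.lstrip("#").strip() == setting:
            # Keep the original line ending (\n or \r\n) intact.
            ending = line[len(line.rstrip("\r\n")):]
            line = setting + ending
            done = True
        out.append(line)
    return "".join(out)


# Invented sample content, assumed to resemble the relevant part of main.cfg.
sample_cfg = (
    "# Uncomment to warn before following external links\n"
    "#cgiarg shortname=el argdefault=prompt\n"
)
print(uncomment_setting(sample_cfg, "cgiarg shortname=el argdefault=prompt"))
```

To apply it to a real installation you would read Greenstone → etc → main.cfg, pass its contents through the function, and write the result back, keeping a backup copy of the original file.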
Setting up a shortcut in the Librarian interface
- To set up a shortcut to the source files, in the Gather panel navigate to the folder in your local file space that contains the files you want to use—in our case, the sample_files folder. Select this folder and then right-click it, and choose Create Shortcut from the menu. In the Name field, enter the name you want the shortcut to have, or accept the default sample_files. Click <OK>. Close all the folders in the file tree in the left-hand pane, and you will see the shortcut to your source files.
A simple image collection
Sample files:
images.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
In this tutorial, we create a new collection that is based on the configuration of another collection.
- In a file browser, locate the folder sample_files → images → image-e. Copy this entire folder into your Greenstone → collect folder.
- In the Librarian Interface, start a new collection (File → New...) called backdrop. Fill out the fields with appropriate information. For Base this collection on:, select the item Simple image collection from the pull-down menu.
When you base a collection on an existing one, it inherits all the settings of the old one, including which metadata sets (if any) the collection uses.
- Copy the images (avoid the README.TXT file) provided in sample_files → images into your newly-formed collection.
- Change to the Create panel and build the collection.
- Preview the result.
- Click on Browse in the navigation bar to view a list of the photos ordered by filename and presented as a thumbnail accompanied by some basic data about the image. The structure of this collection is the same as Simple image collection, but the content is different.
- Back in the Librarian Interface, change to the Enrich panel and view the extracted metadata for Bear.jpg.
Adding Title and Description metadata
- We work with just the first three files (Bear.jpg, Cat.jpg and Cheetah.jpg) to get a flavour of what is possible. First, we need to add the Dublin Core metadata set, which is not used in the Simple image collection. Click the <Manage Metadata Sets...> button beneath the Collection file tree. A new window pops up showing the metadata sets used by the current collection. Click the <Add...> button to bring up another window showing the available metadata sets. Select the "Dublin Core Metadata Element Set" from the list and click <Add>. Click <Close> to return to the Enrich panel. Next, set each file's dc.Title field to be the same as its filename but without the filename extension. Click on Bear.jpg so its metadata fields are available, then click on its dc.Title field on the right-hand side. Type in Bear. Repeat the process for Cat.jpg and Cheetah.jpg.
- Add a description for each image as dc.Description metadata. What description should you enter? To remind yourself of a file's content, the Librarian Interface lets you open files by double-clicking them. It launches the appropriate application based on the filename extension: Word for .doc files, Acrobat for .pdf files and so on. Double-click Bear.jpg: on Windows, the image will normally be displayed by Windows Photo Viewer (although this depends on how your computer has been set up). Back in the Enrich panel, make sure that Bear.jpg is selected in the collection tree on the left-hand side. Enter the text Bear in the Rocky Mountains as the value for the dc.Description field. Repeat this process for Cat.jpg and Cheetah.jpg, adding a suitable description for each.
- Go to the Create panel and click <Build Collection>. Once it has finished building, preview the collection. You will not notice anything new. That's because we haven't changed the design of the collection to take advantage of the new metadata.
Change Format Features to display new metadata
- Now we customize the collection's appearance. Go to the Format panel and select Format Features from the left-hand list. Leave the feature selection controls at their default values, so that All Features is selected for Choose Feature, and VList is selected as the Affected Component. In the HTML Format String, edit the text as follows:
- Change _ImageName_: to Title:
- Change [Image] to [dc.Title]
- After [dc.Title]<br> add Description: [dc.Description]<br>
Metadata names are case-sensitive in Greenstone: it is important that you capitalize "Title" and "Description" (and don't capitalize "dc").
- The new format statement is displayed in the list of assigned format statements. The first substitution alters the fragment of text that appears to the right of the thumbnail image, the second alters the item of metadata that follows it. The addition displays the description after the Title.
- Preview the collection by clicking the <Preview Collection> button. When you click on Browse in the navigation bar the presentation has changed to "Title: Bear" and so on. Each image's description should appear beside the thumbnail, following the title.
After the first three items, the Title and Description become blank because we have only assigned Dublin Core metadata to these first three. (To get a full listing you would enter all the metadata.)
Changes in the Format panel take place immediately and you can see the result straightaway by clicking the Preview Collection button. If you modify anything in the Gather, Enrich or Design panels, you will need to rebuild the collection.
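To make the effect of the edited format string concrete, here is a toy Python sketch of placeholder substitution. It is an illustration only and not Greenstone's actual formatting code: the render function is hypothetical, and it handles nothing beyond plain [name] placeholders.

```python
import re


def render(format_string: str, metadata: dict) -> str:
    """Replace each [name] placeholder with its metadata value, or '' if unset."""
    return re.sub(
        r"\[([^\[\]]+)\]",
        lambda match: metadata.get(match.group(1), ""),
        format_string,
    )


fmt = "Title: [dc.Title]<br>Description: [dc.Description]<br>"
bear = {"dc.Title": "Bear", "dc.Description": "Bear in the Rocky Mountains"}
unlabelled = {}  # an image with no Dublin Core metadata assigned
wrong_case = {"dc.title": "Bear"}  # lower-case "title" is a different name

print(render(fmt, bear))        # Title: Bear<br>Description: Bear in the Rocky Mountains<br>
print(render(fmt, unlabelled))  # Title: <br>Description: <br>
print(render(fmt, wrong_case))  # Title: <br>Description: <br>
```

The second and third calls mirror the two situations described above: images without Dublin Core metadata show blank values, and a metadata name with the wrong capitalization matches nothing.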
Changing the size of image thumbnails
- Let's change the size of the thumbnail image and make it smaller. Thumbnail images are created by the ImagePlugin plug-in, so we need to access its configuration settings. To do this, switch to the Design panel and select Document Plugins from the list on the left. Double-click ImagePlugin to pop up a window that shows its settings. (Alternatively, select ImagePlugin with a single click and then click <Configure Plugin...> further down the screen). Currently most options are off, so standard defaults are used. Select thumbnailsize, set it to 50, and click <OK>.
- Build and preview the collection.
- Once you have seen the result of the change, return to the Design panel, select the configuration options for ImagePlugin, and switch the thumbnailsize option off so that the thumbnail reverts to its normal size when the collection is re-built.
Adding a browsing classifier based on Description metadata
- Now we'll add a new browsing option based on the descriptions. In the Design panel, select Browsing Classifiers from the left-hand list. Set the menu item for Select classifier to add to List, then click <Add Classifier...>.
- A window pops up to control the classifier's options. Set the metadata option to dc.Description. Next, click the partition_type_within_level check box and choose none from the drop-down list. Click <OK>.
- Build the collection, and preview it. Choose the new Descriptions link that appears in the navigation bar.
Only three items are shown, because only items with the relevant metadata (dc.Description in this case) appear in the list. The original browse list includes all photos in the collection because it is based on ex.Image, extracted metadata that reflects an image's filename, which is set for all images in the collection.
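The difference between the two browse lists can be sketched in a few lines of Python. This is a simplified model, not Greenstone's classifier code: list_classifier is a hypothetical name, "Other.jpg" stands in for any image beyond the first three, the Cat and Cheetah descriptions are made up, and sorting by the metadata value is an assumption.

```python
# Hypothetical stand-ins for the collection's images.
images = [
    {"ex.Image": "Bear.jpg", "dc.Description": "Bear in the Rocky Mountains"},
    {"ex.Image": "Cat.jpg", "dc.Description": "A cat"},
    {"ex.Image": "Cheetah.jpg", "dc.Description": "A cheetah"},
    {"ex.Image": "Other.jpg"},  # no dc.Description assigned
]


def list_classifier(items, field):
    """Keep only items that have a value for field, ordered by that value."""
    with_value = [item for item in items if item.get(field)]
    return sorted(with_value, key=lambda item: item[field])


descriptions = list_classifier(images, "dc.Description")
filenames = list_classifier(images, "ex.Image")
print(len(descriptions), len(filenames))  # 3 4
```

An item simply drops out of a list when it has no value for the field that list is built on.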
Creating a searchable index based on Description metadata
- Now we'll add an index so that the collection can be searched by descriptions. Switch to the Design panel and select Search Indexes from the left-hand list. Click the <New Index> button. Select dc.Description from the list of metadata to include in the index and click <Add Index>. Leave Indexing Levels at its default, "document".
- Switch to the Create panel, build the collection, then preview it. There is now a Search button in the navigation bar. As an example, search for the term "bear" in the Descriptions index (which is the only index at this point).
- To change the text that is displayed for the index (Descriptions), go to the Format panel back in the Librarian Interface. Select Search from the left-hand list. This panel allows you to change the text that is displayed on the search form. Change the Display text for the "dc.Description" index to "image descriptions" (or other suitable text). Press the <Preview Collection> button, or go back to the browser and reload the search page. Your new text will appear in the search form.
Note that if you use text instead of macros in the search metadata display text, you will need to do any translations yourself.
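As a rough picture of what an index over dc.Description provides, the Python sketch below builds a tiny word-to-file lookup. It is a conceptual illustration only: Greenstone's indexers work quite differently, the function names are hypothetical, "Other.jpg" is a made-up stand-in, and lower-casing words is a simplifying assumption.

```python
from collections import defaultdict

# "Other.jpg" is a hypothetical image with no description assigned.
images = [
    {"ex.Image": "Bear.jpg", "dc.Description": "Bear in the Rocky Mountains"},
    {"ex.Image": "Other.jpg"},
]


def build_index(items, field):
    """Map each lower-cased word of the field to the files that contain it."""
    index = defaultdict(set)
    for item in items:
        for word in item.get(field, "").lower().split():
            index[word].add(item["ex.Image"])
    return index


def search(index, term):
    """Return the sorted list of files whose indexed field contains term."""
    return sorted(index.get(term.lower(), set()))


index = build_index(images, "dc.Description")
print(search(index, "bear"))  # ['Bear.jpg']
print(search(index, "cat"))   # []
```

Files with no value for the indexed field contribute nothing, so they can never be found through that index.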
A collection of Word and PDF files
Sample files:
Word_and_PDF.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
You will need some source files like those in the sample_files → Word_and_PDF folder.
- Start a new collection called reports (File → New...) and base it on -- New Collection --.
- Copy all the .doc, .rtf, .pdf and .ps files from sample_files → Word_and_PDF → Documents into the collection. There are 9 files in all: you can select multiple files by clicking on the first one and shift-clicking on the last one, and drag them all across together. (This is the normal technique of multiple selection.)
- Switch to the Create panel, and build and preview the collection.
Viewing the extracted metadata
- Again, this collection contains no manually assigned metadata. All the information that appears—title and filename—is extracted automatically from the documents themselves. Because of this the quality of some of the title metadata is suspect.
- Back in the Librarian Interface, click the Enrich tab to view the automatically extracted metadata. You will need to scroll down to see the extracted metadata, which begins with "ex.".
- Check whether the ex.Title metadata is correct for some of the documents by opening them. You can open a document from the Librarian Interface by double clicking on it.
- The extracted Title metadata for some documents is incorrect. For example, the Titles for pdf01.pdf and word03.doc (the same document in different formats) have missed out the second line. The Title for pdf03.pdf has the wrong text altogether.
Manually adding metadata to documents in a collection
- In the Enrich panel, manually add Dublin Core dc.Title metadata to those documents which have incorrect ex.Title metadata. Select word03.doc and double-click to open it. Copy the title of this document ("Greenstone: A comprehensive open-source digital library software system") and return to the Librarian Interface. Scroll up or down in the metadata table until you can see dc.Title. Click in the value box and paste in the metadata.
- Now add dc.Creator information for the same document. You can add more than one value for the same field: when you press Enter in a metadata value field, a new empty field of the same type will be generated. Add each author separately as dc.Creator metadata.
- Close the document (in Microsoft Word) when you have finished copying metadata from it. External programs opened when viewing documents must be closed before building the collection, otherwise errors can occur.
- Next add dc.Title and dc.Creator metadata for a few of the other documents.
- You will notice as you add more values, they appear in the Existing values for ... box below the metadata table. If you are adding the same metadata value to more than one document, you can select it from this list. For example, pdf01.pdf and word03.doc share the same Title; and many documents have common authors.
If you build and preview your collection at this point, you will see that the Titles list now shows your new Titles. However, the dc.Creator metadata is not displayed. You need to alter the collection design to use this metadata.
Document Plugins
- In the Librarian Interface, look at the Document Plugins section of the Design panel by clicking on it in the list on the left. Here you can add, configure or remove plugins to be used in the collection. There is no need to remove any plugins, but doing so will speed up processing a little. In this case we have only Word, PDF, RTF, and PostScript documents, and can remove the ZIPPlugin, TextPlugin, HTMLPlugin, EmailPlugin, PowerPointPlugin, ExcelPlugin, ImagePlugin, ISISPlugin and NulPlugin plugins. To delete a plugin, select it and click <Remove Plugin>. GreenstoneXMLPlugin is required for any type of source collection and should not be removed.
Search indexes
- The next step in the Design panel is Search Indexes. These specify what parts of the collection are searchable (e.g. searching by title and author). Delete the ex.Source index, which is not particularly useful, by selecting it and clicking <Remove Index>.
- By default the titles index (dc.Title,ex.dc.Title,ex.Title) covers dc.Title, ex.dc.Title and ex.Title metadata, so searching this index searches all three. If you wanted to restrict searching to just the manually added dc.Title metadata, you would edit this index and deselect ex.dc.Title and ex.Title from the list of metadata.
- You can add indexes based on any metadata. Add a new index based on dc.Creator by clicking <New Index>. Select dc.Creator in the list of metadata, and click <Add Index>.
Browsing classifiers
- The Browsing Classifiers section adds "classifiers," which provide the collection with browsing functions. Go to this section and observe that Greenstone has provided two List classifiers, based on dc.Title;ex.Title and ex.Source metadata. These correspond to the Titles and Filenames buttons on the collection's access bar. Remove the ex.Source classifier by selecting it and clicking <Remove Classifier>.
- Now add an AZCompactList classifier for dc.Creator. Select AZCompactList from the Select classifier to add drop-down list and click <Add Classifier...>. A popup window for Configuring Arguments appears. Select dc.Creator from the metadata drop-down list and click <OK>.
- Switch to the Create panel, and build and preview the collection.
- Check that all the facilities work properly. There should be three full-text indexes, called Text, Titles, and Creators. The Titles list should display all the document Titles. The Creators list should show one bookshelf for each author you have assigned as dc.Creator, and clicking on that bookshelf should take you to all the documents they authored.
The Titles list shows all documents which have been assigned dc.Title metadata, or have automatically extracted ex.Title. For many documents, extracted Titles may be fine, and it is impractical to add the same metadata again as dc.Title. Specifying a list of metadata names in the classifier allows us to use both.
- If you have already done the Enhanced Word document handling exercise, some of the documents will have extracted ex.Creator metadata, and some will have dc.Creator. To use both of these in the Creators classifier, make the metadata field read dc.Creator,ex.Creator.
Build the collection again and preview it. Now extracted Creators should appear in the Creators list.
We will play around with the format statements and customize the outlook of this collection in the Formatting the Word and PDF collection exercise.
Formatting the Word and PDF collection
In this exercise, we play around with the format statements in the Word and PDF collection.
- Open the reports collection in the Librarian Interface and go to the Format Features section of the Format panel.
Tidying up the default format statement
- In this part of the exercise, we make the format statement simpler without changing the resulting display. Greenstone's default format statement is complex because it is designed to produce something reasonable under almost any conditions, and also because for practical reasons it needs to be backwards compatible with legacy collections. For this collection, we don't need all of the complexity. Make sure that the VList format statement is selected in the list of formats. The default VList format statement looks like the following:
<td valign="top">[link][icon][/link]</td>
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
<td valign="top">[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
This format statement is the default used for any vertical list, such as search results, classifiers, and document table of contents.
{Or}{[ex.thumbicon],[ex.srcicon]} chooses ex.thumbicon metadata if it's there, otherwise chooses ex.srcicon metadata. If neither is present, nothing is displayed. For this collection there is no ex.thumbicon metadata, so the choice is not needed. Replace {Or}{[ex.thumbicon],[ex.srcicon]} with [ex.srcicon]. There is no exp.Title metadata, so remove that element from {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}. The resulting format statement looks like the following:
<td valign=top>[link][icon][/link]</td>
<td valign=top>[ex.srclink][ex.srcicon][ex./srclink]</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[ex.Title],Untitled}
[/highlight] {If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
Preview the collection to make sure the display hasn't changed. You shouldn't notice any difference when looking at search results, classifiers etc.
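The reason the simplification is safe can be illustrated with a small Python model of {Or}: take the first alternative that has a value, otherwise the final fallback. This is a sketch of the behaviour described above, not Greenstone's implementation; or_choice and the sample titles are hypothetical.

```python
def or_choice(metadata, names, fallback=""):
    """Return the first non-empty metadata value among names, else fallback."""
    for name in names:
        value = metadata.get(name)
        if value:
            return value
    return fallback


# Made-up documents; none has exp.Title, as in this collection.
documents = [
    {"dc.Title": "Manual title", "ex.Title": "Extracted title"},
    {"ex.Title": "Extracted title"},
    {},
]

for doc in documents:
    full = or_choice(doc, ["dc.Title", "exp.Title", "ex.Title"], "Untitled")
    simplified = or_choice(doc, ["dc.Title", "ex.Title"], "Untitled")
    print(full, "|", simplified)
```

Because no document carries exp.Title, both versions always pick the same value, which is why the display does not change.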
Linking to the Greenstone version or original version of documents
- For collections with documents that undergo a conversion process during importing (e.g. Word, PDF, PowerPoint documents, but not text, HTML documents), the original file is stored in the collection along with the converted version. The default VList format statement links to both versions:
[link][icon][/link] links to the Greenstone HTML version, while [ex.srclink][ex.srcicon][ex./srclink] links to the original. Choose SearchVList in Format Features by selecting Search from the Choose Feature drop down list, and VList from the Affected Component list. Click <Add Format> to add the SearchVList format statement into the list of assigned formats. Experiment with removing either of the two links from the format statement. To see the results of your changes, preview the collection and do a search. You are making changes to SearchVList, which means the changes will only apply to search results. Storing and displaying the original allows users to see the correct format, but requires the user to have the relevant program installed. It also increases the size of the collection. The Greenstone version can be viewed in a browser, but may not look as nice.
Making bookshelves show how many items they contain
- Next, we'll customize the format statement for the Creators list. Classifier bookshelves have only a few pieces of metadata to display: ex.Title and numleafdocs. Whatever metadata the classifier has been built on, the bookshelf label is always stored as ex.Title. This is why a Creator is printed out for each bookshelf even though dc.Creator is not specified in the format statement.
[numleafdocs] is only defined for bookshelves, so this metadata can be used in an {If} statement to make bookshelves and documents display differently in the list.
Make each bookshelf in the Creator classifier show how many entries it contains. In the Format Features section of the Format panel, select the CL2 AZCompactList classifier (which is based on dc.Creator metadata) from the Choose Feature drop down list, and VList from the Affected Component list. Click the <Add Format> button to add this format into the list of assigned formats. Note that it gets added as CL2VList in this list: it is the VList format for the second (CL2) classifier. Append the following text to the bottom of the format statement:
{If}{[numleafdocs],<td><i>([numleafdocs])</i></td>}
Preview the collection. Click on the Creators list and notice that each bookshelf now shows, in brackets, how many documents it contains.
Since only bookshelves define [numleafdocs], only they will display this. By modifying CL2VList instead of VList, the change will only apply to the second classifier (Creators).
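The conditional can be modelled in Python as a check on whether numleafdocs has a value. This is only a sketch of the behaviour described above; count_cell and the sample entries are hypothetical.

```python
def count_cell(node):
    """Mimic {If}{[numleafdocs],<td><i>([numleafdocs])</i></td>}."""
    count = node.get("numleafdocs")
    if count:
        return f"<td><i>({count})</i></td>"
    return ""  # documents have no numleafdocs, so nothing is added


# Made-up entries: one bookshelf and one document.
bookshelf = {"ex.Title": "Example Author", "numleafdocs": 3}
document = {"ex.Title": "Example document", "ex.Source": "example.pdf"}

print(count_cell(bookshelf))       # <td><i>(3)</i></td>
print(repr(count_cell(document)))  # ''
```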
Displaying multi-valued metadata
- Next we modify the document entries in the Creator classifier to display all authors. Back in Format Features, select the CL2VList format in the list of assigned formats. After {If}{[ex.Source],<br> in the format statement, add [sibling:dc.Creator].
[ex.Source] is not defined for bookshelves, so it can also be used to differentiate bookshelves and documents. The resulting format statement looks like:
<td valign=top>[link][icon][/link]</td>
<td valign=top>[ex.srclink][ex.srcicon][ex./srclink]</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[ex.Title],Untitled}[/highlight]
{If}{[ex.Source],<br>[sibling:dc.Creator]
<i>([ex.Source])</i>}</td>
{If}{[numleafdocs],<td><i>([numleafdocs])</i></td>}
This will display the Greenstone link, the link to the original, then the Title. For bookshelves, it will also display how many documents the bookshelf contains. For documents, it will display all the Authors (Creators), and the source document. [sibling:dc.Creator] displays all the Creator metadata for the document, separated by a space (" "), while [dc.Creator] displays only the first author. Preview the Creators list and make sure that all authors are displayed for documents.
- You can change the separator between the authors. Modify the format statement, and replace [sibling:dc.Creator] with [sibling(All'<br/>'):dc.Creator]. This will add a new line after each author (<br/> specifies a line break in HTML). Preview the Creators list. If you have done the Enhanced Word document handling exercise, the collection will have both dc.Creator and ex.Creator metadata. To display both, you can use
[sibling:dc.Creator] [sibling:ex.Creator]
To display dc.Creator if it is present, otherwise display ex.Creator, use
{Or}{[sibling:dc.Creator],[sibling:ex.Creator]}
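A simple way to picture the sibling modifier is as a join over all values of one metadata field. The Python sketch below is an approximation under stated assumptions: the function names and author names are hypothetical, and the separator is placed between values only (whether Greenstone also emits one after the last value is not modelled).

```python
def sibling(metadata, name, separator=" "):
    """Join every value of a multi-valued field, like [sibling:name]."""
    return separator.join(metadata.get(name, []))


def first_value(metadata, name):
    """Return only the first value of a field, like plain [name]."""
    values = metadata.get(name, [])
    return values[0] if values else ""


manual = {"dc.Creator": ["Author One", "Author Two"]}
extracted_only = {"ex.Creator": ["Author Three"]}

print(first_value(manual, "dc.Creator"))        # Author One
print(sibling(manual, "dc.Creator"))            # Author One Author Two
print(sibling(manual, "dc.Creator", "<br/>"))   # Author One<br/>Author Two
# Equivalent of {Or}{[sibling:dc.Creator],[sibling:ex.Creator]}:
print(sibling(extracted_only, "dc.Creator") or sibling(extracted_only, "ex.Creator"))  # Author Three
```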
Advanced multi-valued metadata
- You may notice that the AZCompactList classifier's configuration dialog has two options after the metadata option: firstvalueonly and allvalues. Manually added metadata can be used to replace or enhance automatically extracted metadata, and these options control exactly which pieces of metadata a document is classified by. For example, say we have two documents. Document 1 has four Creators specified (dc.Creator = dcA, dc.Creator = dcB, ex.Creator = exA, ex.Creator = exB), while document 2 has three (ex.Creator = exA, ex.Creator = exB, ex.Creator = exC). The following table shows which metadata values each document is classified by, for the different classifier options:
AZCompactList options | Document 1 | Document 2 |
-metadata dc.Creator,ex.Creator | dcA, dcB | exA, exB, exC |
-metadata dc.Creator,ex.Creator -firstvalueonly | dcA | exA |
-metadata dc.Creator,ex.Creator -allvalues | dcA, dcB, exA, exB | exA, exB, exC |
- We'll now set the firstvalueonly option for the Creators classifier. Switch to the Browsing Classifiers section of the Design panel, select the AZCompactList for dc.Creator metadata in the Assigned Classifiers box and click <Configure Classifier...>. Select the firstvalueonly option.
Rebuild and preview the collection. Now the Creators list classifies documents based on the first author appearing in the dc.Creator metadata. If you set the metadata field of AZCompactList to dc.Creator,ex.Creator in the A collection of Word and PDF files exercise, the Creators list will now classify based on the first author appearing in either the dc.Creator metadata or the ex.Creator metadata.
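The rules behind the options table in this section can be written down as a short Python function. This is a reading of the table, not Greenstone's classifier code: classification_values is a hypothetical name, and the behaviour when both options are switched on is not covered by the table (the sketch arbitrarily lets allvalues win).

```python
def classification_values(doc, fields, firstvalueonly=False, allvalues=False):
    """Return the metadata values a document would be classified by."""
    if allvalues:
        # Use every value from every listed field.
        return [value for field in fields for value in doc.get(field, [])]
    for field in fields:
        values = doc.get(field, [])
        if values:
            # Use the first field that has any values at all.
            return values[:1] if firstvalueonly else values
    return []


fields = ["dc.Creator", "ex.Creator"]
doc1 = {"dc.Creator": ["dcA", "dcB"], "ex.Creator": ["exA", "exB"]}
doc2 = {"ex.Creator": ["exA", "exB", "exC"]}

for options in ({}, {"firstvalueonly": True}, {"allvalues": True}):
    print(options,
          classification_values(doc1, fields, **options),
          classification_values(doc2, fields, **options))
```

Each printed row corresponds to one row of the table: the default, -firstvalueonly, and -allvalues.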
Processing newer versions of PDF with PDFBox
By default the PDFPlugin can process PDF versions 1.4 and older. The PDFBox extension for Greenstone allows text from more recent PDF files to be extracted. The extension uses PDFBox, an open-source PDF conversion tool. This tutorial will cover how to install the PDFBox extension for Greenstone and how to switch on its functionality in the Greenstone Librarian Interface to process text from newer versions of PDF.
- The wiki release notes that go with the Greenstone binary you downloaded will contain the download link to the PDFBox extension that works with your binary. If you want to try the most up-to-date version of the extension, copy the link http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/pdf-box-java.zip and paste it into the address bar of a browser window. Then download the zip archive from the page that loads, if you're in Windows. If you are working on a *nix machine, you might instead prefer to download the compressed tar file of the same by copying and pasting the link http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/pdf-box-java.tar.gz into your browser.
- Move the downloaded file into your Greenstone installation's ext folder.
- You will now need to decompress the file you downloaded in this location. To do so on Windows XP, right-click on the file and choose Extract All... and go through the Extraction wizard. On Windows Vista and 7, double-clicking on the zip file will open an Explorer window showing you its contents. Click on an empty part inside that window and choose Extract All... to extract its contents. On Linux, to decompress the tar.gz file, run the command:
tar -xvzf <tar file name>
All going well, you will have a folder called pdf-box inside your Greenstone's ext folder.
- Before you can use the extension, make sure that all instances of GLI, the Greenstone Librarian interface, are closed.
Note that if you were running GLI through a console, you will want to start up a fresh console, then run the setup script again to set up the Greenstone environment once more, which will this time take the presence of the PDFBox extension into account. To run the setup script, your console needs to be pointing to your Greenstone installation directory. From here, you would run setup.bat if you're on Windows, or source ./setup.bash if you're on Linux.
- Launch GLI once more, in the manner you're accustomed to. On Windows, the easiest way is the shortcut to GLI available through the Windows Start menu.
- Create a new collection called newpdfs and drag and drop the PDF file in sample_files → pdfbox into here. The version of this PDF file is newer than what PDFPlugin can handle by default, but with the PDFBox extension installed, this file can now be processed. Also drag in the older PDF sample_files → Word_and_PDF → Documents → pdf03.pdf into the collection.
- Now that you've installed the PDFBox extension, this will be available as an option in the plugin's configuration dialog. To turn on the PDFBox extension, go to the Design panel, select Document Plugins from the left, and on the right double-click the PDFPlugin (alternatively, select this plugin and click the <Configure Plugin...> button below) to open the dialog to configure this plugin. In the Configure Plugin... dialog, scroll down to the section AutoLoadConverters and select the checkbox next to the pdfbox_conversion option. Click OK to close the dialog, switch to the Create panel and build your collection. This time, the PDF files will be processed by PDFBox, which will extract their text. Try this feature out on a collection of recent PDF files, by configuring its PDFPlugin with the pdfbox_conversion option turned on.
Enhanced PDF handling
Sample files:
Word_and_PDF.zip
Devised for Greenstone version: 2.70|3.06
Modified for Greenstone version: 2.87|3.08
Greenstone converts PDF files to HTML using third-party software: pdftohtml.pl. This lets users view these documents even if they don't have the PDF software installed. Unfortunately, sometimes the formatting of the resulting HTML files is not so good. This exercise explores some extra options to the PDF plugin which may produce a nicer version for display.
- In the Librarian Interface, start a new collection called "PDF collection" and base it on -- New Collection --. In the Gather panel, drag just the PDF documents from sample_files → Word_and_PDF → Documents into the new collection. Also drag in the PDF documents from sample_files → Word_and_PDF → difficult_pdf. Go to the Create panel and build the collection. Examine the output from the build process. You will notice that one of the documents could not be processed. The following messages are shown: "The file pdf05-notext.pdf was recognised but could not be processed by any plugin.", and "3 documents were processed and included in the collection. 1 was rejected".
- Preview the collection and view the documents. pdf05-notext.pdf does not appear as it could not be processed. pdf06-weirdchars.pdf was processed but looks very strange. The other PDF documents appear as one long document, with no sections.
Modes in the Librarian Interface
The Librarian Interface can operate in different modes. The default mode is Librarian mode. We can use Expert mode to work out why the pdf file could not be processed.
- Use the Preferences... item on the File menu, Mode tab, to switch to Expert mode and then build the collection again. The Create panel looks different in Expert mode because it gives more options: locate the <Build Collection> button, near the bottom of the window, and click it. Now a message appears saying that the file could not be processed, and why. Amongst all the output, we get the following message: "Error: PDF contains no extractable text. Could not convert pdf05-notext.pdf to HTML format". pdftohtml.pl cannot convert a PDF file to HTML if the PDF file has no extractable text.
- We recommend that you switch back to Librarian mode for subsequent exercises, to avoid confusion.
Splitting PDFs into sections
- In the Document Plugins section of the Design panel, configure PDFPlugin. Switch on the use_sections option. In the Search Indexes section, ensure that both the section and document boxes are checked. This will build the indexes on both the section level and the document level.
Build and preview the collection. View the text versions of some of the PDF documents. Note that these are now split into a series of pages, and a "go to page" box is provided.
The format is still a bit ugly though, and pdf05-notext.pdf is still not processed.
Using image format
- If conversion to HTML doesn't produce the result you'd like, PDF documents can be converted to a series of images, one per page. This requires ImageMagick and Ghostscript to be installed.
- In the Document Plugins section, configure PDFPlugin. Set the convert_to option to one of the image types, e.g. pagedimg_jpg. Switch off the use_sections option, as it is not used with image conversion.
- Build the collection and preview. All PDF documents (including pdf05-notext.pdf) have been processed and divided into sections, but each section displays "This document has no text.". When PDF documents are converted to images, no text is extracted.
- In order to view the documents properly, you will need to modify the format statement. In the Format Features section on the Format panel, select the DocumentText format statement. Replace
[Text]
with
[srcicon]
- Preview the collection. Images from the document are now displayed instead of the extracted text. Both pdf05-notext.pdf and pdf06-weirdchars.pdf display nicely now.
In this collection, we only have PDF documents and they have all been converted to images. If we had other document types in the collection, we should use a different format statement, such as:
{If}{[parent:FileFormat] eq PDF,[srcicon],[Text]}
FileFormat is an extracted metadata item which shows the format of the source document. We can use this to test whether the documents are PDF or not: for PDF documents, display [srcicon], for other documents, display [Text].
Using process_exp to control document processing (advanced)
- Processing all of the PDF documents using an image type may not give the best result for your collection. The images will look nice, but as no text is extracted, searching the full text will not be available for these documents. The best solution would be to process most of the PDF files as HTML, and only use the image format where HTML doesn't work.
- We achieve this by putting the problem files into a separate folder, and adding another PDFPlugin plugin with different options.
- Go to the Gather panel. Make a new folder called "notext": right-click in the collection panel and select New folder from the menu. Change the Folder Name to "notext", and click <OK>. Move the two PDF files that have problems with HTML (pdf05-notext.pdf and pdf06-weirdchars.pdf) into this folder by drag and drop. We will set up the plugins so that PDF files in this notext folder are processed differently to the other PDF files.
- Switch to the Document Plugins section of the Design panel. Add a second PDF plugin by selecting PDFPlugin from the Select plugin to add: drop-down list, and clicking <Add Plugin...>. This plugin will come after the first PDF plugin, so we configure it to process PDF documents as HTML. Set the convert_to option to html, and switch on the use_sections option. Click <OK>.
- Configure the first PDF plugin, and set the process_exp option to "notext.*\.pdf".
- The two PDF plugins should have options like the following:
plugin PDFPlugin -convert_to pagedimg_jpg -process_exp "notext.*\.pdf"
plugin PDFPlugin -convert_to html -use_sections
The pagedimg_jpg version must come earlier in the list than the html version. The process_exp for the first PDFPlugin will process any PDF files in the notext directory. The second PDFPlugin will process any PDF files that are not processed by the first one. Note that all plugins have the process_exp option, and this can be used to customize which documents are processed by which plugin.
- Edit the DocumentText format statement. PDF files processed as HTML will not have images to display, so we need to make sure they get text displayed instead. Change [srcicon] to {If}{[NoText] eq "1",[srcicon],[Text]}.
- Build and preview the collection. All PDF documents should look relatively nice. Try searching this collection. You will be able to search for the PDFs that were converted to HTML (try e.g. "bibliography"), but not the ones that were converted to images (try searching for "FAO" or "METS").
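The way plugin order and process_exp interact can be sketched in a few lines of Python. This is only an illustration, not Greenstone's own code: Greenstone evaluates process_exp as a Perl regular expression, Python's re module stands in for it here, and the fallback pattern given to the second plugin is an assumption made for the example.

```python
import re

# Hypothetical stand-ins for the two PDFPlugin entries, in list order.
# Only the first pattern comes from the tutorial; the second (fallback)
# pattern is an assumption chosen for illustration.
PLUGINS = [
    ("PDFPlugin -convert_to pagedimg_jpg", re.compile(r"notext.*\.pdf")),
    ("PDFPlugin -convert_to html -use_sections", re.compile(r"\.pdf$")),
]

def choose_plugin(path):
    """Return the first plugin whose pattern matches the path, or None."""
    for name, pattern in PLUGINS:
        if pattern.search(path):
            return name
    return None

for path in ["notext/pdf06-weirdchars.pdf", "pdf01.pdf", "word01.doc"]:
    print(path, "->", choose_plugin(path))
```

Because the first match wins, swapping the two entries would send every PDF to the html plugin, which is why the order in the list matters.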
Opening PDF files with query terms highlighted
- Next we'll customize the SearchVList format statement to highlight the query terms in a PDF file when it is opened from the search result list. This requires Acrobat Reader version 7.0 or higher, and currently only works on a Microsoft Windows platform.
- The search terms are kept in the macro variable _queryterms_, and we append #search="_queryterms_" to the end of a PDF file link to pass the query terms to the PDF.
PDFPlugin saves each PDF file in a unique directory. You can use
_httpcollection_/index/assoc/[archivedir]/[srclinkFile]
to refer to these files.
- Add SearchVList by selecting Search from the Choose Feature drop down list, and VList from the Affected Component list. Click <Add Format> to add the SearchVList format statement into the list of assigned formats. We need to test whether the file is a PDF file before linking to it, using {If}{[ex.FileFormat] eq 'PDF',,}. For PDF files, we use the above path format instead of the [ex.srclink] and [ex./srclink] variables to link to the file. The resulting format statement is:
<td valign="top">[link][icon][/link]</td>
<td valign="top">{If}{[ex.FileFormat] eq 'PDF', <a
href=\"_httpcollection_/index/assoc/[archivedir]/[srclinkFile]#search="_queryterms_"\">{Or}{[ex.thumbicon],[ex.srcicon]}</a>,
[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]}</td>
<td valign="top">[highlight]
{Or}{[dc.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
When a PDF icon is clicked in the search results, Acrobat opens the file with its search window showing and the query terms highlighted.
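As a rough sketch of what the format statement produces, the following Python function assembles the same kind of link by hand. The collection URL, archive directory and file name are made-up values standing in for _httpcollection_, [archivedir] and [srclinkFile]; escaping the result for use inside an HTML attribute is left out.

```python
def pdf_search_link(collection_url, archivedir, filename, query_terms):
    """Build a PDF link whose fragment asks the viewer to search for the terms."""
    path = f"{collection_url}/index/assoc/{archivedir}/{filename}"
    # The query terms are wrapped in double quotes, as in the format statement.
    return f'{path}#search="{query_terms}"'

# All four argument values below are hypothetical examples.
link = pdf_search_link(
    "http://localhost:8383/collect/reports",
    "HASH0123.dir",
    "pdf01.pdf",
    "digital library",
)
print(link)
```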
Enhanced Word document handling
The standard way Greenstone processes Word documents is to convert them to HTML format using a third-party program, wvWare. This sometimes doesn't do a very good job of conversion. If you are using Windows, and have Microsoft Word installed, you can take advantage of Windows native scripting to do a better job of conversion. If the original document was hierarchically structured using Word styles, these can be used to structure the resulting HTML. Word document properties can also be extracted as metadata.
- In your digital library, preview the reports collection. Look at the HTML versions of the Word documents and notice how they have no structure: they have been converted to flat documents.
Using Windows native scripting
- In the Librarian Interface, open up the reports collection. Switch to the Design panel and select the Document Plugins section on the left-hand side. Double click the WordPlugin plugin and switch on the windows_scripting option. In the Search Indexes section, check the section checkbox, if not already the case, to build the indexes on section level as well as document level.
- Build the collection. You will notice that the Microsoft Word program is started up for each Word document: the document is saved as HTML from Word itself, to get a better conversion. Preview the collection. In the Titles list, notice that word03.doc and word06.doc now have a book icon, rather than a page icon. These now appear with hierarchical structure. The default behaviour for WordPlugin with windows_scripting is to section the document based on "Heading 1", "Heading 2", "Heading 3" styles. If you open up the word03.doc or word06.doc documents in Word, you will see that the sections use these Heading styles. Note: to view style information in Word 2003, you can select Format → Styles and Formatting from the menu, and a side bar will appear on the right hand side. (In Word 2007 and later, find the Change Styles button on the far right of the menu ribbon. Click on the tiny Expand icon to its bottom right to display the styles side bar.) Click on a section heading and the formatting information will be displayed in this side bar.
- Some of the documents do not use styles (e.g. word01.doc) and no structure can be extracted from them. Some documents use user-defined styles. WordPlugin can be configured to use these styles instead of Heading 1, Heading 2 etc. Next we will configure WordPlugin to use the styles found in word05.doc.
Modes in the Librarian Interface
- The Librarian Interface operates in three modes. Go to File → Preferences... → Mode and see the modes and what functionality they provide access to. Librarian is the default mode. Check that this is indeed the currently active mode.
Defining styles
- Open up word05.doc in Word (by double-clicking on it in the Gather pane), and examine the title and section heading styles. You will see that various user-defined header styles are set such as:
- ManualTitle: Title of the manual
- ChapterTitle: Level 1 section heading
- SectionHeading: Level 2 section heading
- SubsectionHeading: Level 3 section heading
- AppendixTitle: Appendix section title
- In the Document Plugins section of the Design panel, select WordPlugin and click <Configure Plugin...>. Four types of header can be set which are:
- level1_header (level1Header1|level1Header2|...)
- level2_header (level2Header1|level2Header2|...)
- level3_header (level3Header1|level3Header2|...)
- title_header (titleHeader1|titleHeader2|...)
These header options define which styles should be considered as title, level 1, level 2 and level 3 styles. Ensure that the windows_scripting option is checked, and set the four header options to the following values (spaces in the Word styles are removed when converting to HTML styles, and these options must match the HTML styles):
level1_header: (ChapterTitle|AppendixTitle)
level2_header: SectionHeading
level3_header: SubsectionHeading
title_header : ManualTitle
Once these are set, click <OK>.
- Close any documents that are still open in Word, as this can prevent the build process from completing correctly.
- Build the collection and preview it. Look in particular at word05.doc. You will see that this document is now also hierarchically structured. If you have documents with different formatting styles, you can use (...|...) to specify all of the different styles.
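To see how the (...|...) alternatives pick out heading levels, here is a small Python sketch that classifies style names with the patterns set above. It assumes each option is treated as a regular expression that must match the whole style name; how WordPlugin actually applies the patterns internally is not covered by this tutorial.

```python
import re

# Header options as entered in the WordPlugin configuration above.
HEADER_OPTIONS = {
    "title": r"ManualTitle",
    "level1": r"(ChapterTitle|AppendixTitle)",
    "level2": r"SectionHeading",
    "level3": r"SubsectionHeading",
}

def classify_style(style_name):
    """Return the header level whose pattern matches the whole style name."""
    for level, pattern in HEADER_OPTIONS.items():
        # fullmatch keeps "SubsectionHeading" from matching "SectionHeading".
        if re.fullmatch(pattern, style_name):
            return level
    return None

for style in ["ChapterTitle", "AppendixTitle", "SubsectionHeading", "Normal"]:
    print(style, "->", classify_style(style))
```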
Removing pre-defined table of contents
- If you look at the HTML versions of word05.doc and word06.doc, you will see that they now have two tables of contents. One is generated by Greenstone based on the document's styles, the other was already defined in the Word document. WordPlugin can be configured to remove predefined tables of contents and tables of figures. The tables must be defined with Word styles in order for this to work.
- To remove the tables of contents and figures from word06.doc and the table of contents from word05.doc, switch on the delete_toc option in WordPlugin. Set the toc_header option to (MsoToc1|MsoToc2|MsoToc3|MsoTof|TOA). In these documents, the tables of contents and list of figures use these style names. Click <OK>.
- Build and preview the collection. Both word05.doc and word06.doc should now have only one table of contents.
Extracting document properties as metadata
- When the windows_scripting option is set, Word document properties can be extracted as metadata. By default, only the Title will be extracted. Other properties can be extracted using the metadata_fields option.
- In the Enrich panel, look at the metadata that has been extracted for word05.doc and word06.doc. Now open the documents in Word and look at what properties have been set. (In Word 2003, choose File → Properties. In Word 2007/2010, click the Word icon on the top left, then choose Prepare → Properties. In Word 2013, choose File → Info; the Properties section is on the right.) They have Title, Author, Subject, and Keywords properties. WordPlugin can be configured to look for these properties and extract them.
- In the Design panel, under Document Plugins, configure WordPlugin once again. Switch on the configuration option metadata_fields. Set the value to the following (but make sure not to enter any trailing spaces):
Title,Author<Creator>,Subject,Keywords<Subject>
This will make WordPlugin try to extract Title, Author, Subject and Keywords metadata. Title and Subject will be saved with the same name, while Author will be saved as Creator metadata, and Keywords as Subject metadata.
- Make sure you have closed all the documents that were opened, then rebuild the collection.
- Look at the metadata for the two documents again in the Enrich panel. You should now see ex.Creator and ex.Subject metadata items. This metadata can now be used in display or browsing classifiers etc.
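The Name<NewName> notation used in metadata_fields can be read as a simple mapping from document property to stored metadata name. The Python sketch below parses the value set above into such a mapping; it illustrates the notation only and is not WordPlugin's own parser.

```python
def parse_metadata_fields(spec):
    """Map each document property to the metadata name it is saved under."""
    mapping = {}
    for item in spec.split(","):
        if "<" in item and item.endswith(">"):
            # "Author<Creator>" means: read Author, store it as Creator.
            source, target = item[:-1].split("<", 1)
        else:
            # A bare name is stored under the same name.
            source = target = item
        mapping[source] = target
    return mapping

fields = parse_metadata_fields("Title,Author<Creator>,Subject,Keywords<Subject>")
for source, target in fields.items():
    print(f"{source} -> ex.{target}")
```

Note that both Subject and Keywords end up as Subject metadata, so a document with both properties set will carry more than one ex.Subject value.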
Associated files: combining different versions of the same document together
This tutorial demonstrates how to link different versions of the same document together in Greenstone. As an example, two identical articles about Greenstone are used; one is in PDF format, the other in Word.
- Start a new collection called Associated Files Example, by selecting File → New. Enter an appropriate description for your collection.
- Copy the files pdf01.pdf and word03.doc provided in sample_files → Word_and_PDF → Documents into your new collection. Do this by dragging these files across from the filesystem view on the left of the Gather panel into the Collection view on the right.
- In the collection view, right-click on each file and select Rename, renaming them greenstone1.pdf and greenstone1.doc, respectively.
- In the Enrich panel, assign appropriate dc.Title and dc.Creator metadata to the documents. Since the contents are identical, you can select both documents and set metadata for them simultaneously.
Associating one document with another
- In Document Plugins, select the WordPlugin and press the <Configure Plugin...> button. In the resulting popup, scroll down to find the associate_ext option, and set this option to pdf. Now, for Word documents, Greenstone will look for documents with the exact same name but the PDF file extension. These PDFs will not be processed separately; instead, they will be associated with their equivalent Word documents. (Alternatively, you could make the PDF document the primary document, by setting the associate_ext option in the PDFPlugin to doc.)
- Build the collection. Notice that only one document was considered for processing and included in the collection. Since the PDF version of the document is an associated document, it is not processed.
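The pairing rule described above (same file name, different extension) can be sketched as follows. This is a simplified model for illustration, with the function and parameter names invented for the example; it is not how Greenstone implements associate_ext.

```python
import os

def group_associated(filenames, primary_ext="doc", associate_ext="pdf"):
    """Pair each primary document with a same-named file of the associated type."""
    names = set(filenames)
    result = {}
    for name in sorted(names):
        root, ext = os.path.splitext(name)
        if ext.lower() == "." + primary_ext:
            partner = root + "." + associate_ext
            # The partner is attached to the primary, not processed on its own.
            result[name] = [partner] if partner in names else []
    return result

grouped = group_associated(["greenstone1.doc", "greenstone1.pdf"])
print(grouped)
```

Only greenstone1.doc appears as a key, which mirrors the build output reporting a single processed document.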
Linking to associated documents
- Greenstone has internally associated the PDF version with the Word version of the document. However, with the default format statement, the end-user will have no idea that the PDF version exists. The collection built at this point (with default settings) only gives the user the choice of viewing either the Word version or the Greenstone-generated HTML version of the document. They are not given the option to view the PDF version. To allow users to view the PDF version of the document, change the default VList statement from this:
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
to:
<td valign="top">[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]</td>
Two things occur in this replacement. The main change is the switch from ex.srclink and ex.srcicon, which link to the primary source document (the Word document), to a hyperlinked icon for the document that Greenstone has associated as an equivalent document (the PDF version). The icon Greenstone chooses to show is based on the filename extension of the matching file it has found; in this case, the PDF icon. The second (more minor) change is to simplify the statement a bit. The original uses an {Or} statement to show a thumbnail version of the document, if Greenstone has one, in preference to the source icon. Since no thumbnails are generated in this collection, the {Or} combination has been eliminated in favour of going straight to the ex.equivDocIcon metadata item. To make the change, switch to the Format panel and edit the format statement for VList (All). Change:
<td valign="top">[link][icon][/link]</td>
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
<td valign="top">[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<i>([ex.Source])</i>}</td>
To:
<td valign="top">[link][icon][/link]</td>
<td valign="top">[ex.equivDocLink][ex.equivDocIcon][ex./equivDocLink]</td>
<td valign="top">[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[dc.Creator],: [sibling(All'\, '):dc.Creator]}</td>
Preview the collection. Note: When Greenstone encounters a file that matches the provided associate_ext value (pdf in our case), it sets the metadata value ex.equivDocIcon for that document to be the macro _iconXXX_, where XXX is whatever the filename extension is (so _iconpdf_ in our case). As long as there is an existing macro defined for that combination of the word icon and the filename extension, then a suitable icon will be displayed when the document appears in a VList. For pdf, the displayed icon is the one defined by _iconpdf_.
Exporting a collection to CD-ROM/DVD
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.87
Greenstone collections can be published on a self-installing CD-ROM/DVD that works on Windows.
- Launch the Greenstone Librarian Interface if it is not already running.
- Choose File → Write CD/DVD image.... In the resulting popup window, select the collection or collections that you wish to export by ticking their check boxes. You can optionally enter a name for the CD-ROM: this is the name that will appear in the menu when the CD-ROM is run. If a name is not entered, the default Greenstone Collections will be used. You can also specify whether the resulting CD-ROM will install files onto the host machine when used or not. Click <Write CD/DVD image> to start the export process. The necessary files for export are written to:
Greenstone → tmp → exported_xxx
where xxx will be similar to the name you have entered. If you didn't specify a name for the CD-ROM, then the folder name will be exported_collections. You need to use your own computer's software to write these on to CD-ROM. On Windows XP this ability is built into the operating system: assuming you have a CD-ROM or DVD writer, insert a blank disk into the drive and drag the contents of exported_xxx into the folder that represents the disk.
The result will be a self-installing Windows Greenstone CD-ROM or DVD, which starts the installation process as soon as it is placed in the drive.
A large collection of HTML files—Tudor
Sample files:
tudor.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
You will need the files in the sample_files → tudor folder.
- Invoke the Greenstone Librarian Interface (from the Windows Start menu) and start a new collection called tudor (use the File menu), based on the default -- New Collection --.
- In the Gather panel, open the tudor folder in sample_files.
- Drag englishhistory.net from the left-hand side to the right to include it in your tudor collection. (This material is from Marilee Hanson's Tudor England Collection at https://englishhistory.net/tudor/, distributed with her permission.)
- Switch to the Create panel and click <Build Collection>.
- When building has finished, preview the collection.
Extracting more metadata from the HTML
- The browsing facilities in this collection (Titles and Filenames) are based entirely on extracted metadata. Switch to the Enrich panel in the Librarian Interface and examine the metadata that has been extracted for some of the files.
- Many HTML documents contain metadata in <meta> tags in the <head> of the page. Open up the englishhistory.net → tudor → monarchs → boleyn.html file by navigating to it in the tree on the left hand side, and double clicking it. This will open it in a web browser. View the HTML source of the page (View → Source in Internet Explorer, Tools → Web Developer → Page Source in Mozilla). You will notice that this page has page_topic, content and author metadata.
- By default, HTMLPlugin only looks for Title metadata. Configure the plugin so that it looks for the other metadata too. Switch to the Design panel and select the Document Plugins section. Select the plugin HTMLPlugin line and click <Configure Plugin...>. A popup window appears. Switch on the metadata_fields option, and set the value to
Title,Author,Page_topic,Content
Click <OK>.
- Switch to the Create panel and rebuild the collection. Go back to the Enrich panel and look at the extracted metadata for some of the HTML files in englishhistory.net → tudor → monarchs. The new metadata should now be visible.
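For readers who want to check which <meta> tags a page carries without opening a browser, the following Python sketch pulls named fields out of an HTML head using the standard library. The sample page and its values are invented for the example; this mimics the idea behind metadata_fields but is not HTMLPlugin's implementation.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the content of <meta name="..."> tags for the wanted names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = {name.lower() for name in wanted}
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in self.wanted and attrs.get("content") is not None:
            self.found[name] = attrs["content"]

# Hypothetical sample page; the real boleyn.html has different values.
SAMPLE = """<html><head><title>Anne Boleyn</title>
<meta name="page_topic" content="example topic">
<meta name="author" content="example author">
</head><body></body></html>"""

extractor = MetaExtractor(["Author", "Page_topic", "Content"])
extractor.feed(SAMPLE)
print(extractor.found)
```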
Looking at different views of the files in the Gather and Enrich panels
- Switch to the Gather panel and on the right-hand side open englishhistory.net → tudor.
- Change the Show Files menu for the right-hand side from All Files to HTM & HTML. Notice the files displayed above are filtered accordingly, to show only files of this type.
- Change the Show Files menu to Images. Again, the files shown above alter.
- Now return the Show Files setting back to All Files, otherwise you may get confused later. Remember, if the Gather or Enrich panels do not seem to be showing all your files, this could be the problem.
Enhanced collection of HTML files—Tudor
We return to the Tudor collection and add metadata that expresses a subject hierarchy. Then we build a classifier that exploits it by allowing readers to browse the documents about Monarchs, Relatives, Citizens, and Others separately.
Adding hierarchically-structured metadata and a Hierarchy classifier
- Open up your tudor collection (the original version, not the webtudor version, in case you've already done that tutorial), switch to the Enrich panel and select the citizens folder (a subfolder of englishhistory.net → tudor). Set its dc.Subject and Keywords metadata to Tudor period|Citizens. The vertical bar ("|") is a hierarchy marker. Selecting a folder and adding metadata has the effect of setting this metadata value for all files contained in this folder, its subfolders, and so on. A popup alerts you to this fact. Click <OK> to close the popup.
- Repeat for the monarchs and relative folders, setting their dc.Subject and Keywords metadata to Tudor period|Monarchs and Tudor period|Relatives respectively. Note that the hierarchy appears in the Existing values for dc.Subject and Keywords area. If you don't want to see the popup each time you add folder level metadata, tick the Do not show this warning again checkbox; it won't be displayed again.
- Finally, select all remaining files (the ones that are not in the citizens, monarchs, or relative folders) by selecting the first and shift-clicking the last. Set their dc.Subject and Keywords metadata to Tudor period|Others and click outside the cell for the metadata to be assigned. This is done in a single operation (there is a short delay before it completes). When multiple files are selected in the left hand collection tree, all metadata values for all files are shown on the right hand side. Items that are common to all files are displayed in black (e.g. dc.Subject and Keywords), while others that pertain to only one or some of the files are displayed in grey (e.g. any extracted metadata). Metadata inherited from a parent folder is indicated by a folder icon to the left of the metadata name. Select one of the files in the relative folder to see this.
- Switch to the Design panel and select Browsing Classifiers from the left-hand list. Set the menu item for Select classifier to add to Hierarchy; then click <Add Classifier...>.
- A window pops up to control the classifier's options. Change the metadata to dc.Subject and Keywords and then click <OK>.
- For tidiness' sake, remove the classifier for Source metadata (included by default) from the list of currently assigned classifiers, because this adds little to the collection.
- Now switch to the Create panel, build the collection, and preview it. Choose the new Subjects link that appears in the navigation bar, and click the bookshelves to navigate around the four-entry hierarchy that you have created.
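The role of the "|" hierarchy marker can be illustrated with a short Python sketch that turns metadata values into a nested tree of bookshelves. The "_docs" key and the function name are conventions invented for this example; the Hierarchy classifier's internal data structures are not described in this tutorial.

```python
def build_hierarchy(doc_subjects):
    """Build a nested dict of bookshelves from 'A|B' style metadata values."""
    tree = {}
    for doc, subject in doc_subjects.items():
        node = tree
        # Each part between vertical bars is one level of the hierarchy.
        for part in subject.split("|"):
            node = node.setdefault(part, {})
        # "_docs" is a made-up key holding the documents on this bookshelf.
        node.setdefault("_docs", []).append(doc)
    return tree

tree = build_hierarchy({
    "boleyn.html": "Tudor period|Monarchs",
    "quizstuff.html": "Tudor period|Others",
})
print(tree)
```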
Adding a hierarchical phrase browser (PHIND)
Next we'll add an interactive hierarchical phrase browsing classifier to this collection.
- Switch to the Design panel and choose the Browsing Classifiers item from the left-hand list.
- Choose Phind from the Select classifier to add menu. Click <Add Classifier...>. A window pops up asking for configuration options: leave the values at their preset defaults (this will base the phrase index on the full text) and click <OK>.
- Build the collection again, preview it, and try out the new Phrases option in the navigation bar. An interesting PHIND search term for this collection is "king". Note that even though it is called a phrase browser, only single terms can be used as the starting point for browsing.
The Phind phrase browser is a Java applet. To be able to view applets in a browser, you will need a JRE installed and, from Java 7 onwards, will need to additionally add your Greenstone digital library home URL (http://localhost:8383 by default) to the Exception Site List via the Security tab of your Java Control Panel. We have found that installing web browsers before installing a JRE allows browsers to find your JRE and run applets. If you're installing browsers after the JRE has already been installed, then your browser should prompt you to install the JRE again when trying to view Java applets. For further information see http://java.com/en/download/help/enable_browser.xml on how to enable Java in a web browser and http://www.java.com/en/download/help/javaconsole.xml to locate the Java Control Panel for your operating system.
Partitioning the full-text index based on metadata values
Next we partition the full-text index into four separate pieces. To do this we first define four subcollections obtained by "filtering" the documents according to a criterion based on their dc.Subject and Keywords metadata. Then an index is assigned to each subcollection. This will enable users to restrict a search to a subset of the documents.
- Switch to the Design panel, and click Partition Indexes.
- Ensure that the Define Filters tab is selected (the default). Define a subcollection filter with name monarchs that matches against dc.Subject and Keywords, and type Monarchs as the regular expression to match with. Click <Add Filter>. This filter includes any file whose dc.Subject and Keywords metadata contains the word Monarchs.
- Define another filter, relatives, which matches dc.Subject and Keywords against the word Relatives. Define a third and fourth, citizens and others, which match it against the words Citizens and Others respectively.
- Having defined the subcollection filters, we partition the index into corresponding parts. Click the Assign Partitions tab. Select the citizens subcollection and click <Add Partition>. Next select monarchs, and click <Add Partition>. Repeat for the other two subcollections, so that you end up with four partitions, one based on each subcollection filter. The order they appear in the Assigned Subcollection Partitions list is the order they will appear in the drop down menu on the search page. You can change the order by using the <Move Up> and <Move Down> buttons.
- Build and preview the collection.
- The search page includes a pulldown menu that allows you to select one of these partitions for searching. For example, try searching the relatives partition for mary and then search the monarchs partition for the same thing.
- To allow users to search the collection as a whole as well as each subcollection individually, return to the Partition Indexes section of the Design panel and select the Assign Partitions tab. Select all four subcollections by either checking their boxes or pressing the Select All button, and click <Add Partition>.
- To ensure that the combined index appears first in the list on the reader's web page, use the <Move Up> button to get it to the top of the list here in the Design panel. Then build and preview the collection.
- Search for the term Mary again, as that is likely to be common in all five index partitions, and check that the numbers of words (not documents) add up.
- The text in the drop down box on the search page is based on the filters each partition was built on. To change the text that is displayed, go to the Search section of the Format panel. The single filter partitions have sensible default text, but the combined partition does not. Set the Display text for the combined partition to "all". Preview the collection.
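The filtering step can be pictured as a regular-expression test against each document's metadata. The Python sketch below models the four filters and the combined partition on two sample documents; the data structure and the name "all" for the combined partition follow this exercise, while the code itself is only an illustration of the idea.

```python
import re

# Filter name -> regular expression matched against dc.Subject and Keywords.
FILTERS = {
    "monarchs": "Monarchs",
    "relatives": "Relatives",
    "citizens": "Citizens",
    "others": "Others",
}

def partition(docs):
    """Assign each document to every subcollection whose filter matches it."""
    parts = {name: [] for name in FILTERS}
    for doc, subject in docs.items():
        for name, pattern in FILTERS.items():
            if re.search(pattern, subject):
                parts[name].append(doc)
    # The combined partition is the union of the four subcollections.
    parts["all"] = sorted({d for name in FILTERS for d in parts[name]})
    return parts

parts = partition({
    "boleyn.html": "Tudor period|Monarchs",
    "quizstuff.html": "Tudor period|Others",
})
print(parts)
```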
Controlling the building process
Finally we look at how the building process can be controlled. Developing a new collection usually involves numerous cycles of building, previewing, adjusting some enrich and design features, and so on. While prototyping, it is best to temporarily reduce the number of documents in the collection. This can be accomplished through the maxdocs parameter to the building process.
- Switch to the Create panel, select Import Options on the left and view the options that are then displayed to the right. Select maxdocs and set its numeric counter to 3. (When in GLI's Expert Mode, the maxdocs option for the import process is located under the Import Options of the Create panel.) Now build.
- Preview the newly rebuilt collection's Titles page. Previously this listed more than a dozen pages per letter of the alphabet, but now there are just three—the first three files encountered by the building process.
- Go back to the Create panel and turn off the maxdocs option. Rebuild the collection so that all the documents are included.
Formatting the HTML collection—Tudor
- Open up your tudor collection, go to the Format panel (by clicking on its tab) and select Format Features from the left-hand list. Leave the editing controls at their default value, so that Choose Feature displays All Features and VList is selected as the Affected Component. The text in the HTML Format String box reads as follows:
<td valign=top>[link][icon][/link]</td>
<td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]} [ex./srclink]</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
This displays something that looks like this:
| A discussion of question five from Tudor Quiz: Henry VIII
(quizstuff.html) |
for a particular document whose Title metadata is A discussion of question five from Tudor Quiz: Henry VIII and whose Source metadata is quizstuff.html. This format appears in the search results list, in the Titles list, and also when you get down to individual documents in the Subjects hierarchy. This is Greenstone's default format statement.
Greenstone's default format statement is complex because it is designed to produce something reasonable under almost any conditions, and also because for practical reasons it needs to be backwards compatible with legacy collections.
- Delete the contents of the HTML Format String box and replace it with this simpler version:
<td>[link][icon][/link]</td>
<td>[ex.Title]<br>
<i>([ex.Source])</i>
</td>
Preview the result (you don't need to build the collection, because changes to format statements take effect immediately). Look at some search results and at the Titles list. They are just the same as before! Under most circumstances this far simpler format statement is entirely equivalent to Greenstone's more complex default.
But there's a problem. Beside the bookshelves in the Subjects browser, beneath the subject appears a mysterious "()". What is printed for these bookshelves is governed by the same format statement, and though bookshelf nodes of the hierarchy have associated Title metadata—their title is the name of the metadata value associated with that bookshelf—they do not have ex.Source metadata, so it comes out blank.
- In the Format Features section of the Format panel, the Choose Feature menu (just above Affected Component menu) displays All Features. That implies that the same format is used for the search results, titles, and all nodes in the subject hierarchy—including internal nodes (that is, bookshelves). The Choose Feature menu can be used to restrict a format statement to a specific one of these lists. We will override this format statement for the hierarchical subject classifier. In the Choose Feature menu, scroll down to the item that says
CL2: Hierarchy -metadata dc.Subject and Keywords
and select it. This is the format statement that affects the second classifier (i.e., "CL2"), which is a Hierarchy classifier based on dc.Subject and Keywords metadata. Click <Add Format> to add this format statement to the collection. Edit the HTML Format String box below to read
<td>[link][icon][/link]</td>
<td>[ex.Title]</td>
- Preview the Subjects list in the collection. First, the offending "()" has disappeared from the bookshelves. Second, when you get down to a list of documents in the subject hierarchy, the filename does not appear beside the title, because ex.Source is not specified in the format statement and this format statement applies to all nodes in the subject classifier. Note that the search results and titles lists have not changed: they still display the filename underneath the title.
- Let's change the search results format so that dc.Subject and Keywords metadata is displayed here instead of the filename. In the Choose Feature menu (under Format Features on the Format panel), scroll down to the item Search and select it. Click <Add Format> to add this format statement to the collection. Change the HTML Format String box below to read:
<td>[link][icon][/link]</td>
<td>[ex.Title]<br>
[dc.Subject]
</td>
- To insert the [dc.Subject], position the cursor at the appropriate point and either type it in, or select it from the Insert Variable... drop down menu. This menu shows many of the things that you can put in square brackets in the format statement.
- Preview the collection. Documents in the search results list will be displayed like this:
| A discussion of question five from Tudor Quiz: Henry VIII
Tudor period|Others |
(The vertical bar appears because this dc.Subject and Keywords metadata is hierarchical metadata. Unfortunately there is no easy way to get at individual components of the hierarchy. For most metadata, such as title and author, this isn't a problem.)
- Finally, let's return to the Subjects hierarchy and learn how to modify the bookshelves. In the Choose Feature menu, re-select the item
CL2: Hierarchy -metadata dc.Subject and Keywords
Edit the HTML Format String box below to read
<td>[link][icon][/link]</td>
<td>{If}{[numleafdocs],<b>Bookshelf title:</b> [ex.Title],
<b>Title:</b> [ex.Title]}
</td>
Again, you can insert the items in square brackets by selecting them from the Insert Variable... drop down box.
The If statement tests the value of the variable numleafdocs. This variable is only set for internal nodes of the hierarchy, i.e. bookshelves, and gives the number of documents below that node. If it is set we take the first branch, otherwise we take the second. Commas are used to separate the branches. The curly brackets serve to indicate that the If is special—otherwise the word "If" itself would be output.
- Preview the collection and examine the subject hierarchy again to see the effect of your changes. Bookshelves should say Bookshelf title: and then the title, while documents will display Title: and the title. Note that the number of documents in the bookshelf is not displayed: we are using [numleafdocs] to test what kind of item in the list we are at, but we are not displaying it.
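The branching performed by the {If} format statement can be sketched in Python. This is only an illustration of the logic, not Greenstone's implementation: the function name render_node and the use of a dictionary to hold a node's metadata are assumptions made for the example.

```python
def render_node(metadata):
    """Mimic {If}{[numleafdocs],<bookshelf branch>,<document branch>}.

    `metadata` is a hypothetical dict of metadata values for one node
    of the classifier.
    """
    title = metadata.get("ex.Title", "")
    # numleafdocs is only set for internal nodes (bookshelves),
    # so its presence selects the first branch.
    if metadata.get("numleafdocs"):
        return "<b>Bookshelf title:</b> " + title
    # Otherwise the node is a document and the second branch is used.
    return "<b>Title:</b> " + title


print(render_node({"ex.Title": "Tudor period", "numleafdocs": "12"}))
print(render_node({"ex.Title": "Henry VIII"}))
```

As in the format statement, the value of numleafdocs is tested but never output.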
Section tagging for HTML documents
Devised for Greenstone version: 2.70w|3.06
Modified for Greenstone version: 2.87|3.08
- In a browser, visit the Greenstone demo collection and have a look at it. Browse to one of the documents. This collection is based on HTML files, but they appear structured in the collection. This is because these HTML files were tagged by hand into sections.
- Using a text editor (e.g. WordPad) open up one of the HTML files from the demo collection: Greenstone → collect → demo → import → fb33fe → fb33fe.htm. You will see some HTML comments which contain section information for Greenstone. They look like:
<!--
<Section>
<Description>
<Metadata name="Title">Farming snails 1: Learning about snails;
Building a pen; Food and shelter plants</Metadata>
</Description>
-->
<!--
</Section>
<Section>
<Description>
<Metadata name="Title">Dew and rain</Metadata>
</Description>
-->
When Greenstone encounters a <Section> tag in one of these comments, it will start a new subsection of the document. This will be closed when a </Section> tag is encountered. Metadata can also be added for each section—in this case, Title metadata has been added for each section. In the browser, find the Farming snails 1 document in the demo collection (through the Titles browser). Look at its table of contents and compare it to the <Section> tags in the HTML document.
- Add a new Section to this document. For example, let's add a new subsection to the Introduction chapter. In the text editor, add the highlighted text just after the tag for the Introduction section:
<!--
<Section>
<Description>
<Metadata name="Title">Introduction</Metadata>
</Description>
-->
<!--
<Section>
<Description>
<Metadata name="Title">Snails are good to eat.</Metadata>
</Description>
-->
Then just before the next section tag (What do you need to start?), add the highlighted section:
<!--
</Section>
-->
<!--
<Section>
<Description>
<Metadata name="Title">What do you need to start?</Metadata>
</Description>
-->
Save the edited file and close it. The effect of these changes is to make a new subsection inside the Introduction chapter.
- Open the Greenstone demo collection in the Librarian Interface. In the Document Plugins section of the Design panel, note that HTMLPlugin has the description_tags option set. This option is needed when <Section> tags are used in the source documents.
- Build and preview the collection. Look at the Farming snails 1 document again and check that your new section has been added.
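To check the nesting of hand-tagged sections before rebuilding, the comments can be scanned with a short script. The following Python sketch is not part of Greenstone and does not reproduce how HTMLPlugin parses files; the function name section_outline and the sample markup are assumptions for illustration. It tracks the depth of <Section> tags found inside HTML comments and lists each Title with its depth.

```python
import re


def section_outline(html):
    """Return (depth, title) pairs for <Section> tags found in HTML comments."""
    outline = []
    depth = 0
    # Only look inside HTML comments, where the section markup lives.
    for comment in re.findall(r"<!--(.*?)-->", html, flags=re.DOTALL):
        # Walk the tags in document order so the depth stays accurate.
        for match in re.finditer(
            r'<(/?)Section>|<Metadata name="Title">(.*?)</Metadata>',
            comment,
            flags=re.DOTALL,
        ):
            if match.group(2) is not None:
                # Collapse line breaks inside multi-line titles.
                outline.append((depth, " ".join(match.group(2).split())))
            elif match.group(1) == "/":
                depth -= 1
            else:
                depth += 1
    return outline


# Hypothetical, simplified markup in the style of the demo collection.
sample = """
<!--
<Section>
  <Description>
    <Metadata name="Title">Introduction</Metadata>
  </Description>
-->
<!--
<Section>
  <Description>
    <Metadata name="Title">Snails are good to eat.</Metadata>
  </Description>
-->
<!--
</Section>
-->
<!--
</Section>
<Section>
  <Description>
    <Metadata name="Title">What do you need to start?</Metadata>
  </Description>
-->
"""

for depth, title in section_outline(sample):
    print("  " * (depth - 1) + title)
```

A subsection shows up one level deeper than its parent; if a closing tag is missing, every later section is indented one level too far.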
Downloading files from the web
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
The Greenstone Librarian Interface's Download panel allows you to download individual files, parts of websites, and indeed whole websites, from the web.
- Start a new collection called webtudor, and base it on -- New Collection --.
- In a web browser, visit https://englishhistory.net, follow the link to The Tudors. You should be at the URL
This is where we started the downloading process to obtain the files you have been using for the tudor collection. You could do the same thing by copying this URL from the web browser, pasting it into the Download panel, and clicking the <Download> button. However, several megabytes will be downloaded, which might strain your network resources—or your patience! For a faster exercise we focus on a smaller section of the site.
- Go to the Download panel by clicking its tab. There are five download types listed on the left hand side. For this exercise, we only use the Web type. Make sure this is selected in the list. Enter this URL
into the Source URL box. There are several other options that govern how the download process proceeds. To see a description of an option, hover the mouse over it and a tooltip will appear. To copy just the citizens section of the website, switch on the Only files below URL option by checking its box and set the Download Depth option to 1. If you don't do this (or if you miss out the terminating "/" in the URL), the downloading process will follow links to other areas of the englishhistory.net website and grab those as well. Also switch on the Only files within site option to avoid downloading any items on the site pages that actually emanate from outside it (like google ads).
- If your computer is behind a firewall or proxy server, you will need to edit the proxy settings in the Librarian Interface. Click the <Configure Proxy...> button. Switch on the Use proxy connection? checkbox. Enter the proxy server address and port number in the HTTP Proxy Host: and Port: boxes. URLs that start with https, or URLs that resolve to https, will additionally need the HTTPS Proxy Host: and corresponding Port: filled in too, before web pages can be downloaded from there. Websites at https URLs often have a security certificate, but not always. For instance, https://englishhistory.net does not have one. To instruct GLI to nevertheless download pages from https URLs that don't have a security certificate, you'll also need to switch on the No certificate checking (effective on 'https' URLs) checkbox. Once you've finished configuring the proxy settings, click <OK> to close the dialog.
- Now click <Download>. If you have set proxy information in Preferences..., a popup will ask for your user name and password. If you're on Windows Vista or later, Windows may show a popup message asking whether you wish to block or unblock the download. In such a case, choose to unblock. With proxy settings turned on, it may take a short while before GLI starts downloading. Once the download has started, a progress bar appears in the lower half of the panel that reports on how the downloading process is doing.
More detailed information can be obtained by clicking <View Log>. The process can be paused and restarted as needed, or stopped altogether by clicking <Close>. Downloading can be a lengthy process involving multiple sites, and so Greenstone allows additional downloads to be queued up. When new URLs are pasted into the url box and <Download> clicked, a new progress bar is appended to those already present in the lower half of the panel. When the currently active download item completes, the next is started automatically.
- Downloaded files are stored in a top-level folder called Downloaded Files that appears on the left-hand side of the Gather panel. You may not need all the downloaded files, and you choose which you want by dragging selected files from this folder over into the collection area on the right-hand side, just as we did before when selecting data from the sample_files folder. In this example we will include everything that has been downloaded. Select the englishhistory.net folder within Downloaded Files and drag it across into the collection area.
- Switch to the Create panel to build and preview the collection. It is smaller than the previous collection because we included only the citizens files. However, these now represent the latest versions of the documents.
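The effect of the Only files below URL and Only files within site options, and the reason the terminating "/" matters, can be illustrated with a small Python sketch. It assumes that "below" is judged by comparing directory prefixes of the URL path; this is an assumption for illustration rather than a description of Greenstone's downloader. The function name is_below and the example.org addresses are hypothetical.

```python
from urllib.parse import urlparse


def is_below(source_url, link_url):
    """True if link_url is on the same host and under source_url's directory."""
    source, link = urlparse(source_url), urlparse(link_url)
    if source.netloc != link.netloc:
        # A different host: "Only files within site" would skip this link.
        return False
    # Everything up to the last "/" of the source path is its directory.
    prefix = source.path.rsplit("/", 1)[0] + "/"
    return link.path.startswith(prefix)


start = "https://example.org/tudor/citizens/"
print(is_below(start, "https://example.org/tudor/citizens/boleyn.html"))
print(is_below(start, "https://example.org/tudor/monarchs/henry8.html"))
# Without the trailing "/", the directory is taken to be /tudor/,
# so other areas of the site are treated as being below the start URL.
print(is_below("https://example.org/tudor/citizens",
               "https://example.org/tudor/monarchs/henry8.html"))
```

Under this assumption, omitting the final "/" makes the last path component look like a file name, which widens the download to its parent folder.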
Pointing to documents on the web
- Open up your tudor collection, and in the Gather panel inspect the files you dragged into it. The first folder is englishhistory.net, which opens up to reveal tudor, and so on. The files represent a complete sweep of the pages (and supporting images) that constitute the Tudor citizens section of the englishhistory.net web site. They were downloaded from the web in a way that preserved the structure of the original site. This allows any page's original URL to be reconstructed from the folder hierarchy.
- In the Design panel, select the Document Plugins section, then select the HTMLPlugin line and click <Configure Plugin...>. A popup window appears. Locate the file_is_url option (about halfway down the first block of items) and switch it on. Click <OK>. Setting this option on HTMLPlugin means that Greenstone sets an additional piece of metadata for each document called URL, which gives its original URL. It is important that the files gathered in the collection start with the web domain name (englishhistory.net in this case). The conversion process will not work if you dragged over a subfolder, for example the tudor folder, because this will set URL metadata to something like
http://tudor/citizens/...
rather than
https://englishhistory.net/tudor/citizens/...
If you had copied over a subfolder previously, delete it and make a fresh copy. Drag the folder in the right-hand side of the Gather panel on to the trash can in the lower right corner. Then obtain a fresh copy of the files by dragging across the englishhistory.net folder from the sample_files → tudor folder (or the Downloaded Files folder if you have done exercise Downloading files from the web) on the left-hand side.
- To make use of the new URL metadata, the icon link must be changed to serve up the original URL rather than the copy stored in the digital library. Go to the Format panel, select the Format Features section and edit the VList format statement by replacing
[link][icon][/link]
with
[weblink][webicon][/weblink]
- Switch to the Create panel and build and preview the collection. Note that the document icons have changed. Try clicking on boleyn.html. The collection behaves exactly as before, except that when you click a document icon your web browser retrieves the original document from the web (assuming it is still there by the time you do this exercise!). If you are working offline you will be unable to retrieve the document.
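The reason the gathered files must start with the web domain name can be seen in a short Python sketch of how a URL might be rebuilt from a folder path. It is an illustration only, not the code behind file_is_url; the function name url_from_path and the scheme parameter are assumptions.

```python
from pathlib import PurePosixPath


def url_from_path(relative_path, scheme="http"):
    """Rebuild a URL from a file's path relative to the collection's files.

    The first path component is taken to be the web domain name.
    """
    parts = PurePosixPath(relative_path).parts
    return scheme + "://" + "/".join(parts)


# Gathered with the domain folder at the top: the URL is reconstructed correctly.
print(url_from_path("englishhistory.net/tudor/citizens/boleyn.html", "https"))
# Gathered from the tudor subfolder: "tudor" is mistaken for the host name.
print(url_from_path("tudor/citizens/boleyn.html"))
```

When only a subfolder is dragged across, its name ends up in the host position and the resulting link cannot be resolved.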
Bibliographic collection
Sample files:
marc.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
This exercise looks at using fielded searching in a collection. Fielded searching is best used for metadata rich collections. Here we use bibliographic data in MARC format.
- Start a new collection called Papers Bibliography which will contain a collection of example MARC records of the working papers published at the Computer Science Department, Waikato University. Enter the requested information and base it on -- New Collection --.
- In the Gather panel, open the sample_files → marc folder, drag CMSwp-all.marc into the right-hand pane and drop it there. A popup window asks whether you want to add MARCPlugin to the collection to process this file. Click <Add Plugin>, because this plugin will be needed to process the MARC records.
- Now select Browsing Classifiers within the Design panel and remove the default classifier for Source metadata.
- In the Search Indexes section, remove the ex.Source index. In this collection all records are from the same file, so ex.Source metadata, which is set to the filename, is not particularly interesting or useful.
- Switch to the Create panel, build the collection, and preview it. Browse through the Titles and view a record or two. Try searching—for example, find items that include graphics.
- Back in the Librarian Interface, go to the Browsing Classifiers section of the Design panel. Select AZCompactList from the Select classifier to add drop down menu, and click <Add Classifier...>. In the popup window, select dc.Subject and Keywords as the metadata item. Click <OK>.
AZCompactList is like List, except that terms that appear multiple times in the hierarchy are automatically grouped together and a new node, shown as a bookshelf icon, is formed.
- Build the collection and preview the result.
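The grouping that AZCompactList performs can be pictured with the following Python sketch. It is a simplified illustration of the idea described above, not the classifier's actual code; the function name compact_list and the sample records are hypothetical.

```python
from collections import defaultdict


def compact_list(documents, field):
    """Group documents by a metadata value, in the spirit of AZCompactList.

    Returns a sorted list of (value, titles, is_bookshelf) entries.
    """
    groups = defaultdict(list)
    for doc in documents:
        for value in doc.get(field, []):
            groups[value].append(doc["title"])
    # A value shared by several documents becomes a bookshelf node;
    # a value used only once stays a plain entry.
    return [(value, sorted(titles), len(titles) > 1)
            for value, titles in sorted(groups.items())]


# Hypothetical records standing in for the MARC data.
records = [
    {"title": "Paper A", "subject": ["graphics"]},
    {"title": "Paper B", "subject": ["graphics", "compression"]},
    {"title": "Paper C", "subject": ["machine learning"]},
]

for value, titles, is_bookshelf in compact_list(records, "subject"):
    kind = "bookshelf" if is_bookshelf else "entry"
    print(kind, value, titles)
```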
Using fielded searching
- Now let's look at fielded searching. In the browser, go to the PREFERENCES page. You will notice that there is a Query style: option which enables you to switch between "normal" and "fielded" search. Change to fielded search now, press the <set preferences> button, and click on the Search button to go back to the Search page. The search form has changed to a fielded form.
- You can specify which search form types are available for a particular collection, and which one is the default, using the searchType format statement. In the Format panel, select Format Features from the left-hand list. Select the searchType format statement from the list of assigned formats, and set the contents to form. This will make only fielded searching available for this collection.
Search type options include form and plain. You can specify one or both separated by a comma. If both are specified, the first one is used as the default: this is the one that the user will see when they first enter the collection.
- Preview the collection again. Notice that the collection's home page no longer includes a query box. (This is because the search form is too big to fit here nicely.) To search, you have to click Search in the navigation bar. Note that the PREFERENCES page has changed so that the "normal" query style is no longer offered.
- Look at the search form in the collection. There are two fields that can be searched: text and titles. Add some more fields to search on by going back to the Librarian Interface.
- In the Design panel, go to the Search Indexes section. Add a new index based on dc.Subject and Keywords by clicking <New Index>, selecting dc.Subject and Keywords in the list of metadata, and clicking <Add Index>.
- Rebuild the collection and preview the results. Notice the extra field in the ... in field drop-down menus in the search form. You can do quite complicated queries by searching for words in different fields at the same time.
- To change the text that is displayed in the drop-down menus of the search form, you would go to the Search section of the Format panel. Here you can change the display text for the indexes.
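The rule for the searchType format statement (a comma-separated list in which the first entry is the default) can be expressed as a tiny Python sketch. The function name search_forms is hypothetical and the code only illustrates how the value is read, not how Greenstone processes it.

```python
def search_forms(search_type):
    """Return (default_form, available_forms) for a searchType value."""
    forms = [name.strip() for name in search_type.split(",") if name.strip()]
    # The first form listed is the one users see when they enter the collection.
    return forms[0], forms


print(search_forms("form"))        # only fielded searching is offered
print(search_forms("plain,form"))  # both offered, plain is the default
```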
Exploding the database
- Go to the Enrich panel and try to see the metadata. It doesn't appear! This is because the metadata is associated with records inside the file, not the file itself. Metadata file types, such as MARC, CDS/ISIS, BibTeX etc. can be imported into Greenstone but their metadata cannot be viewed in the Librarian Interface. To edit any metadata you need to go back to the program that created the file. Greenstone provides a way of exploding a metadata database so that each record appears as an individual document, with viewable and editable metadata. This process is irreversible: once this step has been done, the database is deleted and can no longer be used in its original program.
- In the Gather panel, you may notice that the MARC database has a different coloured icon to other files. A metadata database that can be exploded will be displayed with this green icon. Right-click on the file and choose Explode Metadata Database from the menu. A new window opens, containing options for the exploding process. A description of each option can be obtained by hovering the mouse over the option.
If it's not already on, turn on the metadata_set option by checking its box. This option indicates which metadata set to explode the metadata into. The default set is the "Exploded Metadata Set"—a metadata set which initially has no elements in it, but will receive a new element for each metadata field retrieved from the database.
- Click <Explode> to start the exploding process. This may take a short while, depending on the size of the database.
- Once exploding has finished, the MARC database file will have been deleted, and three folders created in its place. These folders contain an empty file for each record in the original database. The metadata for these records can be viewed and edited by switching to the Enrich panel.
- Because the MARC file is no longer present, and the collection contains empty (.nul) files, we need to change the list of plugins. In the Document Plugins section of the Design panel, remove MARCPlugin.
- Rebuild and preview the collection. You will notice that the Subjects classifier is empty, searching no longer returns any results, and the document display is useless. Although the Titles classifier was built on ex.Title, it still displays the correct titles, but in the Enrich panel you can see that the ex.Title metadata values are actually the filenames rather than the titles of the MARC records. This is because the default VList format uses the exp.Title metadata. In the Format Features section of the Format panel, select VList in the list of assigned format statements. The format statement looks like:
<td valign="top">[link][icon][/link]</td>
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
<td valign="top">[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
Since there is no dc.Title metadata and because exp.Title comes before ex.Title, the exploded titles will be displayed.
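The way {Or} picks the first metadata item that has a value can be sketched in Python. This illustrates the behaviour described above and is not Greenstone's formatting code; the function name first_available and the sample record are hypothetical.

```python
def first_available(metadata, names, fallback="Untitled"):
    """Return the first non-empty metadata value, like {Or}{[a],[b],...,text}."""
    for name in names:
        value = metadata.get(name)
        if value:
            return value
    # No listed metadata item has a value, so the literal text is used.
    return fallback


# A hypothetical exploded record: no dc.Title, exp.Title holds the real
# title and ex.Title holds the name of the empty .nul file.
record = {"exp.Title": "A working paper title", "ex.Title": "00001.nul"}

print(first_available(record, ["dc.Title", "exp.Title", "ex.Title"]))
print(first_available({}, ["dc.Title", "exp.Title", "ex.Title"]))
```

Because the items are tried in order, swapping exp.Title and ex.Title in the list would display the filename instead.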
Reformatting the collection to use the exploded metadata
The collection previously used extracted (ex.) metadata, but now it uses exploded (exp.) metadata. The Subjects classifier and search indexes were built on ex metadata, which is why they no longer work properly.
There is also no longer any text in the documents. Previously, MARCPlugin stored the raw record as the "text" of each record. Now that the metadata is in the Librarian Interface, there is no longer the concept of raw record, and so there is no text.
We need to modify the collection design to take note of these changes.
- In the Search Indexes section, change the Title index to use exp.Title: select the Title index in the Assigned Indexes list and click <Edit Index>. Deselect dc.Title and ex.Title in the list of metadata, and select exp.Title. Click <Replace Index>.
- Remove the dc.Subject and Keywords index by selecting it in the Assigned Indexes list and clicking <Remove Index>. Add an index on exp.Subject: click <New Index>, select exp.Subject in the metadata list, and click <Add Index>.
- The text index is no longer any use, so remove that index too.
- To enable combined searching across all indexes at once, click <New Index>, tick the Add combined searching over all assigned indexes (allfields) checkbox, and click <Add Index>. Move this to the top of the list using the <Move Up> button, so that it appears first in the drop down list. Click <Set Default Index> on the right so that it becomes the default field for searching.
- To explicitly use the exp.Title metadata, in the Browsing Classifiers section, change the dc.Title;ex.Title List to use exp.Title metadata. Double click the dc.Title;ex.Title List in the Assigned Classifiers list, and change the metadata option to use exp.Title. Click <OK>. Do the same thing for the Subject AZCompactList, changing dc.Subject and Keywords to exp.Subject.
- Rebuild and preview the collection. The classifiers should be back to normal and searching should now work.
- In the Format Features section of the Format panel, select VList in the list of assigned format statements.
- There is no dc metadata for this collection, so replace {Or}{[dc.Title],[exp.Title],[ex.dc.Title],[ex.Title],Untitled} with {Or}{[exp.Title],[ex.Title],Untitled}.
- There are no source or thumb icons, so remove the second line:
<td valign="top">[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
- The ex.Source metadata is set to the nul filename, so remove that from the display. Remove:
{If}{[ex.Source],<br><i>([ex.Source])</i>}
The resulting format statement looks like:
<td valign="top">[link][icon][/link]</td>
<td valign="top">[highlight]
{Or}{[exp.Title],[ex.Title],Untitled}
[/highlight]</td>
- Clear the DocumentHeading format statement by selecting it in the list of assigned format statements and deleting the contents in the HTML Format String. The record Title will be displayed as part of the DocumentText format, so we don't need it here.
- Next, edit the DocumentText format statement. Delete the contents and replace it with the following (which can be copied from sample_files → marc → format_tweaks → document_text.txt).
<table>
<tr><td>Title:</td><td>[exp.Title]</td></tr>
<tr><td>Subject:</td><td>[exp.Subject]</td></tr>
<tr><td>Publisher:</td><td>[exp.Publisher]</td></tr>
</table>
- The DETACH and NO HIGHLIGHTING buttons are not very useful for this collection, so let's get rid of them. Edit the DocumentButtons format statement to make it empty.
Press the <Preview Collection> button to preview the collection and see how the document display has improved.
CDS/ISIS collection
Sample files:
isis.zip
Devised for Greenstone version: 2.70w|3.06
Modified for Greenstone version: 2.87|3.08
This exercise is similar to the Bibliographic collection exercise, except that a CDS/ISIS database is used instead of a MARC database, and we do not explode the database.
- Start a new collection called ISIS Collection (base it on New Collection).
- Drag the files from sample_files → isis (excluding the format_tweaks folder and the README.txt file) into the collection.
- Build and preview the collection. The default indexes, classifiers and formats are not very useful for this data. There is no metadata searching, and the Titles classifier is completely empty. The filenames classifier is useless because all records come from the same file.
- In the Search Indexes section of the Design panel, remove the useless Source and Title indexes, and add new indexes for ex.Photographer^all, ex.Country^all and ex.Notes^all metadata. In the Search section of the Format panel, you can set the display text for these indexes to "photographer", "country" and "notes". CDS/ISIS metadata has subfields, and these are represented using ^.
- In the Browsing Classifiers section of the Design panel, remove the existing (useless) classifiers for dc.Title;ex.Title and ex.Source, and add a new List for ex.Photographer.
- Rebuild and preview the collection.
- In the Format Features section of the Format panel, change the VList format statement to display Photographer and Notes metadata. Change it to look like:
<td valign=top>[link][icon][/link]</td>
<td valign=top><b>[ex.Photographer^all]</b><br/>[ex.Notes^all]</td>
- Make fielded searching the default by changing the searchType format statement to form,plain (instead of plain,form).
ISISPlugin stores a nicely formatted version of the record as the document text, and this is what is displayed when we view a record. Let's tidy it up a little more.
- Remove the DETACH and NO HIGHLIGHTING buttons by setting the DocumentButtons format statement to empty.
- Remove the "Untitled" at the top of the document by setting the DocumentHeading format statement to empty.
- We'll link to the raw record, which is stored as ISISRawRecord metadata.
Edit the DocumentText format statement to look like the following. (This format can be copied from sample_files → isis → format_tweaks → document_text.txt.)
<p>[Text]</p>
{If}{_cgiargshowrecord_,
<p><b>CDS Record:</b><br/><tt>[ISISRawRecord]</tt></p>
<center><a href=\'_gwcgi_?e=_cgiarge_&a=d&d=_cgiargd_\'>Hide CDS Record</a></center>,
<center><a href=\'_gwcgi_?e=_cgiarge_&a=d&d=_cgiargd_&showrecord=1\'>Show CDS Record</a></center>
}
- Preview the collection.
Customization: macro files and stylesheets
Sample files:
custom.zip
Devised for Greenstone version: 2.70
Modified for Greenstone version: 2.87
The appearance of all pages produced by Greenstone is governed by macro files, which reside in the folder Greenstone → macros; images and CSS stylesheets reside in Greenstone → web → style.
A macro takes the form _macroname_ {macro value}. Macro names start and end with underscores (_), and the macro value is enclosed in curly brackets ({}). Macro values can be text or HTML, and can include other macros.
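The idea that a macro value can include other macros can be illustrated with a short Python sketch of recursive substitution. This is a simplified model, not Greenstone's macro engine: it ignores packages and parameters, and the function name expand, the depth limit and the sample macro values are assumptions made for the example.

```python
import re


def expand(text, macros, depth=0):
    """Replace each _name_ in text with its value, expanding nested macros."""
    if depth > 10:
        # Guard against macros that refer to themselves.
        return text

    def substitute(match):
        name = match.group(1)
        if name not in macros:
            # Leave unknown macro names untouched.
            return match.group(0)
        return expand(macros[name], macros, depth + 1)

    return re.sub(r"_([A-Za-z]+)_", substitute, text)


# Hypothetical macro definitions: a footer that includes another macro.
macros = {
    "footer": "_pagefooterextra_ <small>Copyright notice</small>",
    "pagefooterextra": "Collection generated by Me.",
}

print(expand("_footer_", macros))
```

Redefining the inner macro changes every page that uses the outer one, which is what makes macros convenient for site-wide customisation.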
Macros are grouped into packages, and different packages control the appearance of different pages. For example, the home, help, preferences, query, and document packages control the home, help, preferences, query, and document pages, respectively. Some macro files contain macros for just one package, for example, home.dm, query.dm, document.dm, while others contain macros for many packages. base.dm contains macros used globally, style.dm controls the common style of each page, and english.dm, french.dm and other language files contain the text fragments for the entire interface, in that language.
The output of the library program is a page of HTML which is viewed in a web browser. CSS (Cascading Style Sheets) are often used alongside HTML pages to control the formatting, such as layout, colour, font etc. The default Greenstone stylesheet is Greenstone → web → style → style.css.
In this exercise, we customize the macros, images and stylesheets to change the appearance of our library.
Collection specific customisation
Macros can be used to customize single collections by adding them to a file called extra.dm in the macros directory of a collection.
We use the Word and PDF collection (from exercise A collection of Word and PDF files) as the example for this exercise, but it can be done with any collection. Open up this collection (reports) in the Librarian Interface.
- Go to the Format panel, and select Collection Specific Macros from the left hand list. This section allows you to edit the collection's extra.dm macro file.
- First, we change the title of the About this collection section of the about page. Add the following text in the edit box (which can be copied from the file about_tweak.txt in the sample_files → custom folder):
package about
_textabout_ {
<div class="section">
<h3>Very Interesting Reports Collection.</h3>
_Global:collectionextra_
</div>
}
Preview the collection by pressing the <Preview Collection> button. The About page will have a new title underneath the search form.
- Next we add a footer to each page. Add the _footer_ macro to the end of the edit box (which can be copied from the file footer_tweak.txt in the sample_files → custom folder):
package Style
_footer_ {
_pagefooterextra_
<center><small>Copyright 2010 My Awesome Digital Library</small></center>
_endspacer__htmlfooter_
}
The <center> and <small> HTML tags center the text, and make it a smaller size than the rest of the page.
- Preview the changes in a web browser. Each page should now have the new text at the bottom.
- Putting text in the main _footer_ macro adds it to all pages of this collection. To add a footer just to a particular page, use _pagefooterextra_ in the appropriate package. For example, let's add some more text to the footer, this time just on the About page. Add the following text immediately after the line
package about:
_pagefooterextra_ {Collection generated by Me.}
Preview the About page in a web browser. The About page should now display the new text, while the other pages won't.
- Next we'll do some style customisations. Add the following text below the _footer_ macro (which can be copied from the file red_tweak.txt in the sample_files → custom folder)
_collectionspecificstyle_ {
<style type="text/css">
/*clear the use of a background image */
body.bgimage \{ background-image: none; \}
/* set the background color to pink */
body.bgimage \{ background: pink; \}
/* clear the background image for the navigation bar, and set its color to red */
div.navbar \{ background-image: none; background-color: red; \}
a.navlink \{ background-image: none; background-color: red; \}
/* clear the background image for the divider bars, and set their color to red */
div.divbar \{ background-image: none; background-color: red; \}
</style>
}
A line wrapped in /*...*/ is a comment and will be ignored. Preview the collection. The reports collection will now have a pink background, and the navigation bar and divider bars will be red. These changes will only affect this collection.
Any macros from the general macro files can be copied into a collection's extra.dm file and modified. Remember to include the package declaration to make sure that the macros get applied to the correct page(s).
The style modifications made above were minor. The collection still uses the majority of the standard style file. The style declarations in the _collectionspecificstyle_ macro get appended to the default ones. To completely change the appearance of a collection, we can use a new style sheet altogether.
- Add the following text (which can be copied from the file css_tweak.txt in the sample_files → custom folder) after the last modifications:
_cssheader_ {
<link rel="stylesheet" href="_httpcstyle_/style-blue.css" type="text/css"
title="Blue Style" charset="UTF-8">
}
Outside of the Librarian Interface, locate the collection folder Greenstone → collect → reports. Create a style folder inside this (if not already present), and copy the file sample_files → custom → style-blue.css into this folder. Preview the collection; the about page should look radically different. (If not, try restarting the Greenstone server and preview again.)
Changing the colour of the page title and page text
In the previous exercises we changed a single collection. Now we change all the pages in our Greenstone installation by modifying style and macro files outside the Librarian Interface. First, we format the page so that some other parts are blue. Preview any collection after each change to make sure that it has worked properly. On Windows, macro file changes may require a restart of the Greenstone local library server. Stylesheet changes may require a forced reload in the web browser.
Note, use any collection except the reports collection to preview the following changes. Because the reports collection has been modified to use its own custom stylesheet, changes to the main stylesheet won't have any effect on it.
- The majority of the style definitions reside in an external style file, Greenstone → web → style → style.css, and most style changes involve modifying that file. Open the style.css file in a text editor, e.g. WordPad (and save a .backup copy). Make the following modifications. You might want to preview after each one to see the effect. Change some of the colours:
Preview the collection. You may need to force the browser to reload the page to see the changes in effect, or else may need to restart the Greenstone server. Now text in the page body is a light green color (teal), and the font of the collection title has changed from black to blue.
(If a collection title image is used, you won't see the change on the collection title.)
- Let's switch the positions of the HOME, HELP and PREFERENCES buttons and the collection name or image.
- For div.pageinfo, set both float and text-align to left.
- For div.collectimage, set float and text-align to right.
The look of your library should now be substantially different. The HELP, HOME and PREFERENCES buttons are in the left upper corner whereas the collection title is switched to the right of the page. You will notice that the green boxes are now near the middle of the page. These are set in the style.dm file, and will be removed in Step 12.
- Now we will customize the default Greenstone header image and the background image. Two new images for this exercise can be found in sample_files → custom. Copy newbgimg.gif, newheadimg.gif from the custom folder into the Greenstone → web → images folder.
- Open the file Greenstone → macros → home.dm in a text editor. Find each occurrence of gsdlhead.gif in this file (there are two) and replace it with newheadimg.gif. (If you are using WordPad, you can use Edit → Find to search for the text.) Save home.dm and close the file.
- Open the file Greenstone → macros → style.dm with the text editor. Locate the following part of the file (this is part of the _cssheader_ macro):
<style type="text/css">
body.bgimage \{ background-image: url("_httpimg_/chalk.gif"); scroll repeat-y left top; \}
Use copy and paste on the body.bgimage line to make it look like this:
<style type="text/css">
/*body.bgimage \{ background-image: url("_httpimg_/chalk.gif"); scroll repeat-y left top; \}*/
body.bgimage \{ background-image: url("_httpimg_/newbgimg.gif"); scroll repeat-y left top; \}
Here we are changing the background image for the bgimage section of the body of the page to newbgimg.gif. Near the bottom of the _cssheader_ macro, you will see these lines, which set the two green boxes mentioned in Step 9:
p.bannertitle \{background-image: url("_httpimages_/banner_bg.png"); \}
p.collectiontitle \{background-image: url("_httpimages_/banner_bg.png"); \}
Since we don't want these to appear anymore, we will simply comment out these lines like this:
/*p.bannertitle \{background-image: url("_httpimages_/banner_bg.png"); \}
p.collectiontitle \{background-image: url("_httpimages_/banner_bg.png"); \}*/
Save style.dm and close the file.
- Preview the home page in a web browser. On Windows, if forcing the browser to reload the pages does not show the changes in effect, restart the Greenstone library server before reloading the pages in the browser. On unix systems, you may need to reload pages in order to see the changes take effect, as the browser may at first display cached versions of pages you'd already visited earlier. The header of the home page, and the background of every page of each collection (except reports, which uses a custom _cssheader_ macro) should now use the new graphics, and the green box background images are no longer present on the collection pages.
Make your own Greenstone home page
You can make radical changes to a page by changing the macro file completely. For example, here we use an alternative to the home page which we have prepared for you in advance and included in your Greenstone installation.
- Open the file Greenstone → etc → main.cfg in a text editor. Locate the macrofiles list:
# The list of display macro files used by this receptionist
macrofiles tip.dm style.dm base.dm query.dm help.dm pref.dm about.dm \
document.dm browse.dm status.dm authen.dm users.dm html.dm \
extlink.dm gsdl.dm extra.dm home.dm collect.dm docs.dm \
bsummary.dm gti.dm gli.dm nav_css.dm usability.dm \
...
Change the text home.dm to yourhome.dm. Save and close the file.
- Preview the newly structured home page in a web browser. (On Windows, force reload the browser or else restart the Greenstone library server before reloading the pages in the browser.) Look at the file macros/yourhome.dm in a text editor to see how these changes are expressed.
- Reverse this last change by changing yourhome.dm back to home.dm in the file Greenstone → etc → main.cfg. You may also like to reverse the other changes you have made.
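The edit to main.cfg and its later reversal can also be scripted. The following Python sketch is an illustration only and is not part of Greenstone: the function name swap_macro_file is hypothetical, the sample text is a shortened stand-in for the real macrofiles list, and the sketch assumes that macro file names are separated by whitespace as in the excerpt above.

```python
import re


def swap_macro_file(cfg_text, old_name, new_name):
    """Replace one macro file name in main.cfg text (hypothetical helper).

    Only whole, whitespace-delimited names are replaced, so a name such as
    exported_home.dm is left alone when old_name is home.dm.
    """
    pattern = r"(?<!\S)" + re.escape(old_name) + r"(?!\S)"
    return re.sub(pattern, lambda match: new_name, cfg_text)


# Shortened stand-in for the macrofiles list (not the full list).
sample = (
    "macrofiles tip.dm style.dm base.dm \\\n"
    "    extlink.dm gsdl.dm extra.dm home.dm collect.dm docs.dm \\\n"
)

changed = swap_macro_file(sample, "home.dm", "yourhome.dm")
restored = swap_macro_file(changed, "yourhome.dm", "home.dm")
```

Applying the function a second time with the names swapped restores the original text, which mirrors the manual reversal described in the last step.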
The final part of this exercise looks at how we determined which images needed replacing, and which macro files should be edited.
How to determine which images to replace (advanced)
- In step 10 of this exercise we replaced the default background (chalk.gif) and header (gsdlhead.gif) images with new ones. To do this we needed to change the image names in the macro files. How did we know which images we were replacing and which macro files to edit? This exercise shows you how to find out.
- To find out the names of the images to replace, go to the home page of your digital library in a browser. Right-click on the header image ("Greenstone digital library software") and select "Save picture as". A dialog will pop up and will display the image name: gsdlhead.gif (or newheadimg.gif if you are using the new header). Click Cancel to close the dialog; you don't need to save the images. Do the same for the background image by right-clicking on the green (or blue or orange) swirly bar to the left. This time choose "Save background as" to find the name: chalk.gif (or newbgimg.gif), then click Cancel. These instructions apply to Internet Explorer. Other browsers may have other options in the right-click menu. For example, Mozilla provides "View Image" and "View Background Image" options. Using these options will put the path to the image in the browser address box, and the name can be seen from this.
- Once you have identified the names of the images to be replaced, you need to find out where they occur in the macro files. To do this on Windows, search the macro files for the image names using the findstr program, which is run in a command prompt. Open a command prompt using Start → Programs → Accessories → Command Prompt, or Start → Run and enter cmd as the name of the program to run. If your Windows doesn't have a conventional Start menu, press the Windows key + R to launch the Windows Run dialog, then type cmd. You can type findstr /? to see a description of the program and its arguments. To search the macro files for gsdlhead.gif type
findstr /s /m /C:"gsdlhead.gif" "C:\Program Files\Greenstone\macros\*.dm"
*.dm means all files ending in .dm; /s tells findstr to search within subfolders, /m prints only the names of files that contain a match, and /C: treats the quoted text as a literal search string. A list of the matching macro files will be displayed. You will see that home.dm and exported_home.dm both contain gsdlhead.gif. home.dm is the one you want to edit; exported_home.dm is used for the home page when you export a collection to CD-ROM. On Linux systems, the equivalent command to run in a terminal would be:
fgrep -rl "gsdlhead.gif" /full/path/to/your/greenstone/macros/
Do the same thing for chalk.gif:
findstr /s /m /C:"chalk.gif" "C:\Program Files\Greenstone\macros\*.dm"
base.dm and style.dm are the only files that mention this image. Close the command prompt.
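The same search can be done on any operating system with a short script. The Python sketch below is only an illustration of what the findstr and fgrep commands above do: the function name find_macro_files is hypothetical, and the macros folder path you pass in depends on where Greenstone is installed.

```python
import os


def find_macro_files(macros_dir, needle):
    """Return sorted paths of .dm files under macros_dir that contain needle.

    Like findstr /s /m, this walks subfolders and reports file names only.
    """
    matches = []
    for folder, _subfolders, files in os.walk(macros_dir):
        for name in files:
            if not name.endswith(".dm"):
                continue  # only macro files are of interest
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                if needle in handle.read():
                    matches.append(path)
    return sorted(matches)
```

For example, calling find_macro_files with your macros folder and "gsdlhead.gif" should list the same files that the command-line search reports.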
Looking at a multimedia collection
Sample files:
beatles.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
- Copy the entire folder
sample_files → beatles → advbeat_large
(with all its contents) into your Greenstone collect folder. If you have installed Greenstone in the usual place, this is
My Computer → Local Disk (C:) → Users → <Username> → Greenstone → collect
where <Username> is the username under which Greenstone is installed. Put advbeat_large in there.
- On Windows, if the Greenstone Digital Library Local Library Server is already running, restart it by clicking the world icon on the task bar and then pressing Restart Library. On Linux and Mac, just do a forced reload/refresh of the web browser (e.g. by pressing Shift and the refresh button in Firefox). If the Local Library Server hasn't been started yet, start it up first by selecting Greenstone Digital Library from the Start menu on Windows, or by running ./gs-server.sh on Linux and Mac.
- Explore the Beatles collection. Note how the Browse button divides the material into seven different types. Within each category, the documents have appropriate icons. Some documents have an audio icon: when you click these you hear the music (assuming your computer is set up with appropriate player software). Others have an image thumbnail: when you click these you see the images.
- Look at the Titles browser. Each title has a bookshelf that may include several related items. For example, Hey Jude has a MIDI file, lyrics, and a discography item.
- Observe the low quality of the metadata. For example, the five items under A Hard Day's Night (under "H" in the Titles browser) have different variants as their titles. The collection would have been easier to organize had the metadata been cleaned up manually first, but that would be a big job. Only a tiny amount of metadata was added by hand—fewer than ten items. The original metadata was left untouched and Greenstone facilities were used to clean it up automatically. (You will find in Building a multimedia collection that this is possible but tricky.)
- In the file browser, take a look at the files that make up the collection, in the
sample_files → beatles → advbeat_large → import
folder. What a mess! There are over 450 files under seven top-level sub-folders. Organization is minimal, reflecting the different times and ways the files were gathered. For example, html_lyrics and discography are excerpts of web sites, and images contains various images in JPEG format. For each type, drill down through the hierarchy and look at a sample document.
Building a multimedia collection
We will proceed to reconstruct from scratch the Beatles collection that you have just looked at. We develop the collection using a small subset of the material, purely to speed up the repeated rebuilding that is involved.
- Start a new collection (File → New...) called small beatles, basing it on the default -- New Collection --. (Basing it on the existing Advanced Beatles collection would make your life far easier, but we want you to learn how to build it from scratch!)
- Copy the files and folders provided in
sample_files → beatles → advbeat_small
into your new collection. Do this by opening up advbeat_small, selecting the eight items within it (from discography to beatles_midi.zip), and dragging them across. Because some of these files are in MP3 and MARC formats you will be asked whether to include MP3Plugin and MARCPlugin in your collection. Click <Add Plugin>. A window may pop up explaining that the import documents contain css files, which none of Greenstone's plugins are expected to process directly. CSS files normally belong to a web page and we don't need to process them directly. Click the <OK> button.
- Change to the Enrich panel and browse around the files. There is no metadata yet. Recall that you can double-click files to view them. (There are no MIDI files in the collection: these require more advanced customisation because there is no MIDI plugin. We will deal with them later.)
- Change to the Create panel and build the collection.
- Preview the result.
Manually correcting metadata
- You might want to correct some of the metadata—for example, the atrocious misspelling in the titles "MAGICAL MISTERY TOUR." These documents are in the discography section, with filenames that contain the same misspelling. Locate one of them in the Enrich panel. Notice that the extracted metadata element ex.Title is now filled in, and misspelt. You cannot correct this element, for it is extracted from the file and will be re-extracted every time the collection is re-built.
- Instead, add dc.Title metadata for these two files: "Magical Mystery Tour." In the Enrich panel, open the discography folder and drill down to the individual files. Set the dc.Title value for the two offending items.
- Build the collection again, and preview it.
Extracted metadata is unreliable, but it is very cheap! On the other hand, manually assigned metadata is reliable, but expensive. The previous section of this exercise has shown how to aim for the best of both worlds by using extracted metadata but correcting it when it is wrong.
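The reason the corrected title takes effect is the {Or}{[dc.Title],[exp.Title],[ex.Title],Untitled} construct used in the format statements of this exercise: it displays the first metadata value that exists. The Python sketch below imitates that fallback under the assumption that a missing or empty value counts as absent; the function name first_available is hypothetical and this is not Greenstone code.

```python
def first_available(metadata, fields, default="Untitled"):
    """Return the first non-empty value among fields, else the default."""
    for field in fields:
        value = metadata.get(field, "").strip()
        if value:
            return value
    return default


title_fields = ["dc.Title", "exp.Title", "ex.Title"]

# Before the manual fix only the extracted (misspelt) title exists.
before = {"ex.Title": "MAGICAL MISTERY TOUR"}
# After the fix the assigned dc.Title is found first.
after = {"ex.Title": "MAGICAL MISTERY TOUR", "dc.Title": "Magical Mystery Tour"}

shown_before = first_available(before, title_fields)
shown_after = first_available(after, title_fields)
```

The extracted value is never edited; it is simply outranked by the manually assigned one.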
Browsing by media type
- First let's remove the List classifier for filenames, which isn't very useful, and replace it with a browsing structure that groups documents by category (discography, lyrics, audio etc.). Categories are defined by manually assigned metadata.
- Change to the Enrich panel, select the folder discography and set its dc.Format metadata value to "Discography". Setting this value at the folder level means that all files within the folder inherit it.
- Repeat the process. Assign "Lyrics" to the html_lyrics folder, "Images" to images, "MARC" to marc, "Audio" to mp3, "Tablature" to tablature_txt, and "Supplementary" to wordpdf.
- Switch to the Design panel and select the Browsing Classifiers section.
- Delete the ex.Source classifier (the second one).
- Add a List classifier and select dc.Format as the metadata field. Click the bookshelf_type and select always in the drop-down list. Click the partition_type_within_level check box and choose none from the drop-down list. Click the sort_leaf_nodes_using checkbox, and select ex.Title in the drop-down list: this will make the classifier display documents in alphabetical order of title. Specify browse as the buttonname.
Build the collection again and preview it.
Note how we assigned dc.Format metadata to all documents in the collection with a minimum of labour. We did this by capitalizing on the folder structure of the original information. Even though we complained earlier about how messy this folder structure is, you can still take advantage of it when assigning metadata.
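Folder-level inheritance can be pictured as a lookup on the top-level folder of each file. The Python sketch below is a simplified model, not how the Librarian Interface stores metadata: the function name inherited_format and the example file path are hypothetical, while the folder-to-value table repeats the assignments made in the steps above.

```python
from pathlib import PurePosixPath

# Folder names and dc.Format values as assigned in the Enrich panel.
FOLDER_FORMATS = {
    "discography": "Discography",
    "html_lyrics": "Lyrics",
    "images": "Images",
    "marc": "MARC",
    "mp3": "Audio",
    "tablature_txt": "Tablature",
    "wordpdf": "Supplementary",
}


def inherited_format(relative_path):
    """Return the dc.Format a file inherits from its top-level folder."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return None
    return FOLDER_FORMATS.get(parts[0])


# Hypothetical file path inside the import folder.
example = inherited_format("mp3/album/track01.mp3")
```

Seven assignments at folder level cover every file beneath those folders, which is why so little manual work was needed.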
Suppressing dummy text
- Alongside the Audio files there is an MP3 icon, which plays the audio when you click it, and also a text document that contains some dummy text. Image files also have dummy documents. These dummy documents aren't supposed to be seen, but to suppress them you have to fiddle with a format statement.
- Change to the Format panel and select the Format Features section.
- Ensure that VList is selected, and make the changes that are highlighted below. You need to insert five lines into the first line, and delete the second line. (Note, the changes are available in a text file, see below.) Change:
<td valign=top>[link][icon][/link]</td>
<td valign=top>[ex.srclink]{Or}{[ex.thumbicon],[ex.srcicon]}[ex./srclink]</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
to this:
<td valign=top>
{If}{[dc.Format] eq 'Audio',
[srclink][srcicon][/srclink],
{If}{[dc.Format] eq 'Images',
[srclink][thumbicon][/srclink],
{If}{[dc.Format] eq 'Supplementary',
[srclink][srcicon][/srclink] [link][icon][/link],[link][icon][/link]}}}</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
To make this easier for you we have prepared a plain text file that contains the new text. In WordPad open the following file:
sample_files → beatles → format_tweaks → audio_tweak.txt
(Be sure to use WordPad rather than Notepad, because Notepad does not display the line breaks correctly.) Place it in the copy buffer by highlighting the text in WordPad and selecting Edit → Copy. Now move back to the Librarian Interface, highlight all the text that makes up the current VList format statement, and use Edit → Paste (ctrl-v) to transform the old statement to the new one.
Preview the result. You may need to click the browser's <Reload> button to force it to re-load the page.
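The nested {If} statements in the new first table cell read more easily as an if/else chain. The Python sketch below restates that branching only; the function name icon_cell is hypothetical and the returned strings are just labels for the [srcicon], [thumbicon] and [icon] items in the format statement.

```python
def icon_cell(dc_format):
    """List the icons the first table cell shows for a dc.Format value."""
    if dc_format == "Audio":
        return ["srcicon"]            # source icon only, no dummy text icon
    if dc_format == "Images":
        return ["thumbicon"]          # thumbnail only, no dummy text icon
    if dc_format == "Supplementary":
        return ["srcicon", "icon"]    # source file and Greenstone document
    return ["icon"]                   # everything else: document icon


audio_icons = icon_cell("Audio")
lyrics_icons = icon_cell("Lyrics")
```

Only the final branch and the Supplementary branch keep the ordinary document icon, which is how the dummy text documents for audio and image files disappear from the list.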
- While we're at it, let's remove the source filename from where it appears after each document.
- In the VList format feature, delete the {If}{[ex.Source],<br><i>([ex.Source])</i>} clause at the end of the last line of the statement below:
<td valign=top>
{If}{[dc.Format] eq 'Audio',
[srclink][srcicon][/srclink],
{If}{[dc.Format] eq 'Images',
[srclink][thumbicon][/srclink],
{If}{[dc.Format] eq 'Supplementary',
[srclink][srcicon][/srclink] [link][icon][/link],
[link][icon][/link]}}}</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]{If}{[ex.Source],<br><i>([ex.Source])</i>}</td>
Preview the result (you don't need to rebuild the collection).
Using AZCompactList rather than List
- There are sometimes several documents with the same title. For example, All My Loving appears both as lyrics and tablature (under ALL MY LOVING). The Titles browser might be improved by grouping these together under a bookshelf icon. This is a job for an AZCompactList. In a previous tutorial we showed how to use the bookshelf_type option in List classifier to group documents with the same metadata value (dc.Format in that case) in one bookshelf. Here we use AZCompactList instead.
- Change to the Design panel and select the Browsing Classifiers section.
- Remove the dc.Title;Title classifier (at the top).
- Add an AZCompactList classifier, and enter dc.Title,ex.Title as its metadata.
- Finish by pressing <OK>.
- Move the new classifier to the top of the list (<Move Up> button).
Build the collection again and preview it. Both items for All My Loving now appear under the same bookshelf. However, many entries haven't been amalgamated because of non-uniform titles: for example A Hard Day's Night appears as several different variants. We will learn below how to amalgamate these.
Making bookshelves show how many items they contain
- Make the bookshelves show how many documents they contain by inserting a line in the VList format statement in the Format Features section of the Format panel. The added line is the <td>{If}{[numleafdocs],([numleafdocs])}</td> cell in the statement below. The complete format statement can be copied from sample_files → beatles → format_tweaks → show_num_docs.txt.
<td valign=top>
{If}{[dc.Format] eq 'Audio',
[srclink][srcicon][/srclink],
{If}{[dc.Format] eq 'Images',
[srclink][thumbicon][/srclink],
{If}{[dc.Format] eq 'Supplementary',
[srclink][srcicon][/srclink] [link][icon][/link],
[link][icon][/link]}}}</td>
<td>{If}{[numleafdocs],([numleafdocs])}</td>
<td valign=top>[highlight]
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
[/highlight]</td>
Preview the result (you don't need to build the collection). Bookshelves in the titles and browse classifiers should show how many documents they contain.
Adding a Phind phrase browser
- In the Browsing Classifiers section on the Design panel, add a Phind classifier. Leave the settings at their defaults: this generates a phrase browsing classifier that sources its phrases from Title and text.
Build the collection again and preview it. Select the new Phrases option from the navigation bar. Enter a single word in the text box, such as band. The phrase browser will present you with phrases found in the collection containing the search term. This can provide a useful way of browsing a very large collection. Note that even though it is called a phrase browser, only single terms can be used as the starting point for browsing.
Branding the collection with an image
- To complete the collection, let's give it a new image for the top left corner of the page. Go to the General section of the Format panel. Use the browse button of URL to 'about page' image: to select the following image:
sample_files → beatles → advbeat_large → images → beatlesmm.png
Preview the collection, and make sure the new image appears.
Using UnknownPlugin
In this section we incorporate the MIDI files. Greenstone has no MIDI plugin (yet). But that doesn't mean you can't use MIDI files!
- UnknownPlugin is a useful generic plugin. It knows nothing about any given format but can be tailored to process particular document types (like MIDI) based on their filename extension, and to set basic metadata. In the Document Plugins section of the Design panel:
- add UnknownPlugin;
- activate its process_extension field and set it to "mid" to make it recognize files with extension .mid;
- set file_format to "MIDI" and mime_type to "audio/midi".
In this collection, all MIDI files are contained in the file beatles_midi.zip. ZIPPlugin (already in the list of default plugins) is used to unpack the files and pass them down the list of plugins until they reach UnknownPlugin.
- Build the collection and preview it. Unfortunately, the MIDI files don't appear as Audio under the browse button. That's because they haven't been assigned dc.Format metadata.
- Back in the Enrich panel, click on the file beatles_midi.zip and assign its dc.Format value to "Audio"—do this by clicking on "Audio" in the Existing values for dc.Format list. All files extracted from the Zip file inherit its settings.
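Conceptually, the UnknownPlugin configuration above is an extension test plus a small set of fixed metadata values. The Python sketch below models that idea and is not the plugin's implementation: the function name describe_file, the example file names and the MimeType key are assumptions, and the sketch assumes an exact match on the extension. FileFormat is the metadata name tested later in this exercise.

```python
import os

# The three settings entered for UnknownPlugin in the Design panel.
PLUGIN_SETTINGS = {
    "process_extension": "mid",
    "file_format": "MIDI",
    "mime_type": "audio/midi",
}


def describe_file(filename, settings=PLUGIN_SETTINGS):
    """Return basic metadata if the extension matches, otherwise None."""
    extension = os.path.splitext(filename)[1].lstrip(".")
    if extension != settings["process_extension"]:
        return None  # left for some other plugin to handle
    return {
        "FileFormat": settings["file_format"],
        "MimeType": settings["mime_type"],  # key name is an assumption
        "Source": filename,
    }


# Hypothetical file names for illustration.
midi_info = describe_file("example_song.mid")
other_info = describe_file("example_song.mp3")
```

Note that nothing here sets dc.Format, which is why that value has to be assigned separately on beatles_midi.zip.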
Cleaning up a title browser using regular expressions
We now clean up the Titles browser.
- We are going to use the removesuffix classifier option. The aim is to amalgamate variants of titles by stripping away extraneous text. For example, we would like to treat "ANTHOLOGY 1", "ANTHOLOGY 2" and "ANTHOLOGY 3" the same for grouping purposes. To achieve this:
Build the collection and preview the result. Observe how many more times similar titles have been amalgamated under the same bookshelf. Test your understanding of regular expressions by trying to rationalize the amalgamations. (Note: [[:punct:]] stands for any punctuation character.) The icons beside the Word and PDF documents are not the correct ones, but that will be fixed in the next format statement.
One powerful use of regular expressions in the exercise was to clean up the Titles browser. Perhaps the best way of doing this would be to have proper title metadata. The metadata extracted from HTML files is messy and inconsistent, and this was reflected in the original Titles browser. Defining proper title metadata would be simple but rather laborious. Instead, we have opted to use regular expressions in the AZCompactList classifier to clean up the title metadata. This is difficult to understand, and a bit fiddly to do, but if you can cope with its idiosyncrasies it provides a quick way to clean up the extracted metadata and avoid having to enter a large amount of metadata.
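To see how stripping a suffix merges variant titles, consider the following Python sketch. It is a rough analogy only: the pattern is a hypothetical example (a trailing run of spaces, digits and punctuation), not the expression used in the exercise, and Python's string.punctuation stands in for [[:punct:]] because Python's re module does not support POSIX character classes.

```python
import re
import string
from collections import defaultdict

# Hypothetical suffix pattern: trailing spaces, digits and ASCII punctuation.
SUFFIX = re.compile(r"[\s\d" + re.escape(string.punctuation) + r"]+$")


def grouping_key(title):
    """Strip the suffix so that variant titles share one key."""
    return SUFFIX.sub("", title)


def shelve(titles):
    """Group titles into bookshelves keyed by their stripped form."""
    shelves = defaultdict(list)
    for title in titles:
        shelves[grouping_key(title)].append(title)
    return dict(shelves)


shelves = shelve(["ANTHOLOGY 1", "ANTHOLOGY 2", "ANTHOLOGY 3"])
```

The three ANTHOLOGY titles end up on a single shelf because they are identical once the trailing number is removed; the stored metadata itself is unchanged.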
Using non-standard macro files
To put the finishing touches to our collection, we add some decorative features.
- Close the collection in the Librarian Interface (File → Close).
- Using your file browser outside Greenstone, locate the folder
sample_files → beatles → advbeat_large
- Open up another file browser, and locate the small beatles collection in your Greenstone installation:
Greenstone → collect → smallbea
smallbea is the folder name generated by Greenstone for this collection. You can determine what the folder name is for a collection by looking at the title bar of the Librarian Interface: the folder name is displayed in brackets after the collection name.
- Using the file browser, copy the images and macros folders from the advbeat_large folder into the smallbea folder. (It's OK to overwrite the existing images folder: the image in it is included in the folder being copied.) The images folder includes some useful icons, and the macros folder defines some macro names that use these images. To see the macro definitions, open the collection in the Librarian Interface (File → Open...) and view the Collection Specific Macros section in the Format panel.
Using different icons for different media types
- Open the collection in GLI again and update your VList format statement (in Format Features on the Format panel) to be the following. You can copy this text from the file sample_files → beatles → format_tweaks → multi_icons.txt.
<td valign=top>
{If}{[numleafdocs],[link][icon][/link]}
{If}{[dc.Format] eq 'Lyrics',[link]_iconlyrics_[/link]}
{If}{[dc.Format] eq 'Discography',[link]_icondisc_[/link]}
{If}{[dc.Format] eq 'Tablature',[link]_icontab_[/link]}
{If}{[dc.Format] eq 'MARC',[link]_iconmarc_[/link]}
{If}{[dc.Format] eq 'Images',[srclink][thumbicon][/srclink]}
{If}{[dc.Format] eq 'Supplementary',[srclink][srcicon][/srclink]}
{If}{[dc.Format] eq 'Audio',[srclink]{If}{[FileFormat] eq 'MIDI',_iconmidi_,_iconmp3_}[/srclink]}
</td>
<td>
{If}{[numleafdocs],([numleafdocs])}
</td>
<td valign=top>
[highlight]
{Or}{[dc.Title],[Title],Untitled}
[/highlight]
</td>
- Preview your collection as before. Now different icons are used for discography, lyrics, tablature, and MARC metadata. Even MP3 and MIDI audio file types are distinguished. If you let the mouse hover over one of these images a "tool tip" appears explaining what file type the icon represents in the current interface language (note: extra.dm only defines English and French).
Changing the collection's background image
- Go to the Collection Specific Macros section in the Format panel.
- The content is fairly brief, specifying only what needs to be overridden from the default behaviour for this collection. Near the top you should see:
_collectionspecificstyle_ {
<style>
body.bgimage \{ background-image: url("_httpcimages_/beat_margin.gif"); \}
\#page \{ margin-left: 120px; \}
</style>
}
Replace the text beat_margin.gif with tile.jpg. This line relates to the background image used. The new image tile.jpg was in the images folder that was copied across previously.
- Preview the collection's home page. The page background is now the new graphic. Other features can be altered by editing the macros: for example, the headers and footers used on each page, and the highlighting style used for search terms (specify a different colour, use bold etc.).
Building a full-size version of the collection
- To finish, let's now build a larger version of the collection. To do this:
- Close the current collection (File → Close).
- Start a new collection called large beatles (File → New...).
- Base this new collection on small beatles.
- Copy the content of sample_files → beatles → advbeat_large → import into this newly formed collection. Since there are considerably more files in this set of documents the copy will take longer.
- Build the collection and preview the result. (If you want the collection to have an icon, you will have to add it from the Format panel.)
Adding an image collage browser
- Switch to the Design panel and select the Browsing Classifiers section. Pull down the Select classifier to add menu and select Collage. Click <Add Classifier...>. There is no need to customize the options, so click <OK> at the bottom of the resulting popup.
- Now change to the Create panel and build and preview the collection. Try out the collage browsing classifier. You can click on any image during the collage display and the image will be opened up.
Scanned image collection
Sample files:
niupepa.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
Here we build a small replica of Niupepa, the Maori Newspaper collection, using five newspapers taken from two newspaper series. It allows full text searching and browsing by title and date. When a newspaper is viewed, a preview image and its corresponding plain text are presented side by side, with a "go to page" navigation feature at the top of the page.
The collection involves a mixture of plugins, classifiers, and format statements. The bulk of the work is done by PagedImagePlugin, a plugin designed precisely for the kind of data we have in this example. For each document, an "item" file is prepared that specifies a list of image files that constitute the document, tagged with their page number and (optionally) accompanied by a text file containing the machine-readable version of the image, which is used for full text searching. Three newspapers in our collection (all from the series "Te Whetu o Te Tau") have text representations, and two (from "Te Waka o Te Iwi") have images only. Item files can also specify metadata. In our example the newspaper series is recorded as ex.Title and its date of publication as ex.Date. Issue ex.Volume and ex.Number metadata is also recorded, where appropriate. This metadata is extracted as part of the building process.
- Start a new collection called Paged Images and fill out the fields with appropriate information: it is a collection sourced from an excerpt of Niupepa documents.
- In the Gather panel, open the sample_files → niupepa → sample_items folder and drag the two subfolders into your collection on the right-hand side. A popup window asks whether you want to add PagedImagePlugin to the collection: click <Add Plugin>, because this plugin will be needed to process the item files.
PagedImagePlugin will process the item files, creating a document for each one with a separate section for each page listed. Thumbnail and screen-resolution sized images of each page image will be generated.
- Go to the Create panel, build the collection and preview the result. Search for "waka" and view one of the titles listed (all three appear as Te Whetu o Te Tau). Browse by Titles and view one of the Te Waka o Te Iwi newspapers. Note that only the Te Whetu o Te Tau newspapers have text; Te Waka o Te Iwi papers don't.
This collection was built with Greenstone's default settings. You can locate items of interest, but the information is less clearly and attractively presented than in the full Niupepa collection.
Grouping documents by series title and displaying dates within each group
Under Titles, documents from the same series are repeated without any distinguishing features such as date, volume or number. It would be better to group them by series title and display other information within each group. This can be accomplished using the -bookshelf_type option to the List classifier, and tuning the classifier's format statement.
- In the Design panel, under the Browsing Classifiers section, delete the List classifier for ex.Source. This classifier is not much use.
- Select the classifier for dc.Title;ex.Title and click <Configure Classifier...>. Set bookshelf_type to always. This will create a bookshelf for each Title in the collection. Note, setting this option to duplicate_only will only create a bookshelf when more than one document shares a Title.
- Build the collection, and preview the Titles list.
- Now we change the format statement for Titles to display more information about the documents. In the Format Features section of the Format panel, select the dc.Title;ex.Title classifier (CL1) in the Choose Feature list, and VList in the Affected Component list. Click <Add Format> to add this format statement to your collection.
Delete the contents of the HTML Format String box, and add the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → titles_tweak.txt.)
<td valign="top">[link][icon][/link]</td>
<td valign="top">
{If}{[numleafdocs],[ex.Title] ([numleafdocs]),
Volume [ex.Volume] Number [ex.Number] Date [format:ex.Date]}
</td>
- Refresh in the web browser to view the new Titles list. As a consequence of using the bookshelf_type option of the List classifier, bookshelf icons appear when titles are browsed. This revised format statement has the effect of specifying in brackets how many items are contained within a bookshelf. It works by exploiting the fact that only bookshelf icons define [numleafdocs] metadata. For document nodes, Title is not displayed. Instead, Volume, Number and Date information are displayed.
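The branching on [numleafdocs] can be restated as ordinary code. The Python sketch below is an illustration of the logic only: the function name titles_label is hypothetical, the volume and number values are made up, and the date is shown unformatted because the output of [format:ex.Date] is not reproduced here.

```python
def titles_label(node):
    """Build the Titles list text for a bookshelf or a document node."""
    if node.get("numleafdocs"):
        # Only bookshelf nodes carry numleafdocs.
        return "{} ({})".format(node["ex.Title"], node["numleafdocs"])
    # Document nodes show volume, number and date instead of the title.
    return "Volume {} Number {} Date {}".format(
        node.get("ex.Volume", ""),
        node.get("ex.Number", ""),
        node.get("ex.Date", ""),
    )


bookshelf = {"ex.Title": "Te Whetu o Te Tau", "numleafdocs": 3}
# Hypothetical volume and number values for one newspaper issue.
issue = {"ex.Title": "Te Whetu o Te Tau", "ex.Volume": "1",
         "ex.Number": "2", "ex.Date": "18580601"}

shelf_text = titles_label(bookshelf)
issue_text = titles_label(issue)
```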
Browsing documents by Date.
- Back in the Design panel, under the Browsing Classifiers section, add a DateList classifier, leaving its metadata option set to ex.Date.
- Build the collection, and preview the Dates list.
- The Dates list groups documents by date. Greenstone's internal date format is YYYYMMDD, for example 18580601, and this is crucial for the DateList classifier to correctly parse date metadata and generate an ordered date list. However, the date has been made to look nice by adding a [format:] macro to Date metadata in the format statement.
- In the Format Features section of the Format panel, select All Features in the Choose Feature list, and DateList in the Affected Component list. Click <Add Format> to add this format statement to your collection. Replace the last line
<td>{Or}{[format:dc.Date],[format:exp.Date],[format:ex.Date]}</td>
with
<td>{Or}{[dc.Date],[exp.Date],[ex.Date]}</td>
Refresh in the web browser to view the new Dates list. The dates are now shown in internal format.
- Change the format statement back to reinstate the nicely formatted dates.
This can be done by selecting DateList in the assigned format statements panel and clicking <Reset to Default>.
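The YYYYMMDD format matters because such strings sort chronologically as plain text, while the displayed form is only a presentation layer. The Python sketch below illustrates both points; the function names are hypothetical, the second date is made up, and the "day month year" layout is an assumption rather than the exact output of Greenstone's [format:] macro.

```python
from datetime import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def parse_internal_date(value):
    """Parse a YYYYMMDD string such as 18580601 into a date object."""
    return datetime.strptime(value, "%Y%m%d").date()


def display_date(value):
    """Render an internal date in a readable day-month-year layout."""
    day = parse_internal_date(value)
    return "{} {} {}".format(day.day, MONTHS[day.month - 1], day.year)


# 18580601 is the example from the text; 18571201 is hypothetical.
internal = ["18580601", "18571201"]
in_order = sorted(internal)            # plain string sort is chronological
readable = [display_date(d) for d in in_order]
```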
Displaying scanned images and suppressing dummy text
When you reach a newspaper, only its associated text is displayed. When either of the Te Waka o Te Iwi newspapers is accessed, the document view presents the message "This document has no text." No scanned image information (screen-view resolution or otherwise) is shown, even though it has been computed and stored with the document. This can be fixed by a format statement that modifies the default behaviour for DocumentText.
- In the Format Features section of the Format panel, select the DocumentText format statement. The default format string displays the document's plain text, which, if there is none, is set to "This document has no text." Change this to the following text. (This format statement can be copied and pasted from the file sample_files → niupepa → formats → doc_tweak.txt)
<table><tr>
<td valign=top>[srclink][screenicon][/srclink]</td>
<td valign=top>[Text]</td>
</tr></table>
Including [screenicon] has the effect of embedding the screen-sized image generated by switching the screenview option on in PagedImagePlugin. It is hyperlinked to the original image by the construct [srclink]...[/srclink]. This is a large image but it may be scaled by your browser.
This modification will display the screenview image, but does nothing about the dummy text "This document has no text.", which will still be displayed. To get rid of this, edit the DocumentText format statement again and replace
<td valign=top>[Text]</td>
with
{If}{[NoText],,<td valign=top>[Text]</td>}
- Preview the collection and view one of the Te Waka o Te Iwi documents. The line "This document has no text." should now be gone.
Searching at page level
- The newspaper documents are split into sections, one per page. For large documents, it is useful to be able to search on sections rather than documents. This allows users to more easily locate the relevant information in the document.
- Go to the Search Indexes section of the Design panel. Remove the ex.Source index and check the section checkbox to build the indexes on section level as well as document level. Make section level the default by selecting its Default radio button.
- Set the display text used for the level drop-down menu by going to the Search section on the Format panel. Set the document level text to "newspaper", and the section level text to "page".
- Build and preview the collection. Compare searching at "newspaper" level with searching at "page" level. A useful search term for this collection is "aroha".
Tidying up search results
You will notice that when searching for individual pages, a thumbnail of the newspaper image is displayed in the search results. For text pages like this, these are not very useful. Let's tell PagedImagePlugin not to generate thumbnails.
- In the Design panel, under the Document Plugins section, select PagedImagePlugin from the Assigned Plugins list and click <Configure Plugin...>. Switch on the create_thumbnail option and set its value to false.
- Rebuild and preview the collection, doing a search at page level.
Search results at newspaper level display the original filename. Let's remove that also.
- Go to the Format Features section of the Format panel in the Librarian Interface, choose All Features in the Choose Feature list, and select the VList format statement from the list of assigned format statements. Remove the following from the last line of the format string:
{If}{[ex.Source],<br><i>([ex.Source])</i>}
Preview the collection.
You might notice that newspaper level search results only display the newspaper Title, and not any volume information, while page level search results only show a large scan of the newspaper page, the Title of the page (the page number), and not the Title of the newspaper. We'll modify the format statement to show Volume and Number information, and for page results, the newspaper title as well as the page number.
- In the Format Features section, select Search in Choose Feature, and VList in Affected Component. Click <Add Format> to add this format to the collection. The previous changes modified VList, so they will apply to all VLists that don't have specific format statements. These next changes are made to SearchVList so will only apply to search results. The extracted Title for the current section is specified as [ex.Title] while the Title for the parent section is [parent:ex.Title]. Since the same SearchVList format statement is used when searching both whole newspapers and newspaper pages, we need to make sure it works in both cases. Set the format statement to the following text (it can be copied and pasted from the file sample_files → niupepa → formats → search_tweak.txt):
<td valign="top">[link][icon][/link]</td>
<td valign="top">
{If}{[parent:ex.Title],[parent:ex.Title] Volume [parent:ex.Volume] Number [parent:ex.Number]: Page [ex.Title],
[ex.Title] Volume [ex.Volume] Number [ex.Number]}
<br/><i>({Or}{[format:parent:ex.Date],[format:ex.Date],undated})</i>
</td>
Preview the search results. Items display newspaper Title, Volume, Number and Date, and pages also display the page number.
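The branching in the SearchVList format statement can be easier to follow when written out as ordinary code. The Python sketch below is only an illustration of the logic, not how Greenstone evaluates format statements; the dictionaries stand in for section metadata, and the sample values are made up.

```python
def search_result_label(section, parent=None):
    """Mimic the {If} in the SearchVList format statement.

    'section' holds the metadata of the matched item; 'parent' holds the
    metadata of its parent section, or None when a whole newspaper matched.
    """
    if parent and parent.get("Title"):
        # Page-level hit: newspaper details come from the parent section,
        # and the section's own Title is the page number.
        return (f"{parent['Title']} Volume {parent['Volume']} "
                f"Number {parent['Number']}: Page {section['Title']}")
    # Newspaper-level hit: everything comes from the item itself.
    return (f"{section['Title']} Volume {section['Volume']} "
            f"Number {section['Number']}")

# Hypothetical sample metadata, for illustration only.
newspaper = {"Title": "Te Whetu", "Volume": "1", "Number": "2"}
page = {"Title": "5"}

print(search_result_label(newspaper))
print(search_result_label(page, parent=newspaper))
```

The same pattern of "use the parent's value if there is one, otherwise use my own" is what {Or}{[format:parent:ex.Date],[format:ex.Date],undated} does for the date.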
The collection you have just built involves a fairly complex document structure. There are two series of newspapers, Te Waka and Te Whetu.
In the Te Waka series there are two actual newspapers, Volume 1 Numbers 1 and 2. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 4 pages, numbered 5, 6, 7, 8. The page numbers increase consecutively through each volume, despite the fact that the volume is divided into different Numbers. Each page in the Te Waka series is represented by a single file, a GIF image of the page.
The Te Whetu series has three actual newspapers, Volume 1 Numbers 1, 2, and 3. Number 1 has 4 pages, numbered 1, 2, 3, 4; Number 2 has 5 pages, numbered 5, 6, 7, 8, 9; Number 3 has 5 pages, numbered 10, 11, 12, 13, 14. Again the page numbers increase consecutively through each volume. Each page in this series is represented by two files, a GIF image of the page and a text file containing the OCR’d text that appears on it.
The key to this structure is in the respective .item files. Here is a synopsis of the information they contain:
(9-1-1) Te Waka Volume 1 Number 1
p.1 gif
p.2 gif
p.3 gif
p.4 gif
(9-1-2) Te Waka Volume 1 Number 2
p.5 gif
p.6 gif
p.7 gif
p.8 gif
(10-1-1) Te Whetu Volume 1 Number 1
p.1 gif text
p.2 gif text
p.3 gif text
p.4 gif text
(10-1-2) Te Whetu Volume 1 Number 2
p.5 gif text
…
p.9 gif text
(10-1-3) Te Whetu Volume 1 Number 3
p.10 gif text
…
p.14 gif text
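The synopsis above can also be written down as a small data structure. The following Python sketch uses only the numbers given in the synopsis; the tuple layout and function names are invented for illustration. It checks that page numbers really do run consecutively through each volume.

```python
from collections import defaultdict

# (series, volume, number, first page, last page, has OCR text),
# copied from the synopsis of the .item files.
ISSUES = [
    ("Te Waka", 1, 1, 1, 4, False),
    ("Te Waka", 1, 2, 5, 8, False),
    ("Te Whetu", 1, 1, 1, 4, True),
    ("Te Whetu", 1, 2, 5, 9, True),
    ("Te Whetu", 1, 3, 10, 14, True),
]

def pages_per_volume(issues):
    """Collect the page numbers of every issue, grouped by (series, volume)."""
    volumes = defaultdict(list)
    for series, volume, _number, first, last, _has_text in sorted(issues):
        volumes[(series, volume)].extend(range(first, last + 1))
    return dict(volumes)

def is_consecutive(pages):
    """True when the pages are exactly 1, 2, 3, ... with no gaps or repeats."""
    return pages == list(range(1, len(pages) + 1))

for key, pages in pages_per_volume(ISSUES).items():
    print(key, len(pages), "pages, consecutive:", is_consecutive(pages))
```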
Advanced scanned image collection
In this exercise we build upon the collection created in the Scanned image collection exercise. We add a new newspaper by creating an item file for it, add further newspapers that use the extended XML item file format, and modify the formatting.
Adding another newspaper to the collection
Another newspaper has been scanned and OCRed, but has no item file. We will add this newspaper into the collection, and create an item file for it.
- In the Librarian Interface, open up the Paged Image collection that was created in exercise Scanned image collection if it is not already open (File → Open...).
- In the Gather panel, add the folder sample_files → niupepa → new_papers → 12 to your collection. Inside the 12 folder you can see that there are 4 images and 4 text files.
- Create an item file for the new newspaper. Have a look at an existing item file to see the format. Start up a text editor (e.g. WordPad) to open a new document. Add some metadata. The Title for this newspaper is "Te Haeata 1859-1862". The Volume is 3, Number is 6, and the Date is "18610902". (Greenstone's date format is yyyymmdd.) Metadata must be added in the form:
<Metadata name>Metadata value
For this document, the metadata looks like:
<Title>Te Haeata 1859-1862
<Date>18610902
<Volume>3
<Number>6
- For each page, add a line in the file in the following format:
pagenum:imagefile:textfile
For example, the first page entry would look like
1:images/12_3_6_1.gif:text/12_3_6_1.txt
Note that if there is no text file, you can leave that space blank. You need to add a line for each page in the document. Make sure you increment the page number as well as the image number for each line. (The full text for this file can be copied from sample_files → niupepa → formats → 12_3_6.item.)
- Save the file using Filename 12_3_6.item, and save as a plain text document. (If you are using Windows, make sure the file doesn't accidentally end up getting saved as 12_3_6.item.txt.) Back in the Gather panel of the Librarian Interface, locate the new file in the Workspace tree, and drag it into the collection, adding it to the 12 folder.
- Build the collection and preview. Check that your new document has been added.
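If you have many newspapers to add, typing item files by hand becomes tedious. The Python sketch below generates the text-based item file described above. It assumes that every page follows the same naming pattern as the first page entry (images/12_3_6_1.gif, text/12_3_6_1.txt, and so on); the function name and parameters are made up for this example, so compare the result with the supplied 12_3_6.item before relying on it.

```python
def build_item_file(metadata, prefix, page_count, with_text=True):
    """Return the contents of a simple text-based item file.

    metadata   -- dict of document-level metadata, e.g. {"Title": "..."}
    prefix     -- shared file name stem, e.g. "12_3_6" (assumed pattern)
    page_count -- number of pages in the newspaper
    with_text  -- False leaves the text file slot blank
    """
    lines = [f"<{name}>{value}" for name, value in metadata.items()]
    for page in range(1, page_count + 1):
        image = f"images/{prefix}_{page}.gif"
        text = f"text/{prefix}_{page}.txt" if with_text else ""
        # One line per page, in the form pagenum:imagefile:textfile
        lines.append(f"{page}:{image}:{text}")
    return "\n".join(lines) + "\n"

item_text = build_item_file(
    {"Title": "Te Haeata 1859-1862", "Date": "18610902",
     "Volume": "3", "Number": "6"},
    prefix="12_3_6",
    page_count=4,
)
print(item_text)
```

Writing item_text to a file named 12_3_6.item as plain text gives a file in the same format as the one created by hand in the steps above.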
XML based item file
There are two styles of item files. The first, which was used in the previous section, uses a simple text based format, and consists of a list of metadata for the document, and a list of pages. This format allows specification of document level metadata, and a single list of pages.
The second style is an extended format, and uses XML. It allows a hierarchy of pages, and metadata specification at the page level as well as at the document level. In this section, we add in two newspapers which use XML-based item files.
- In the Gather panel, add the folder sample_files → niupepa → new_papers → xml (you need to add the xml folder, not the 23 folder) to your collection.
- Open up the file xml → 23 → 23__2.item and have a look at the XML. This is Number 2 of the newspaper titled Matariki 1881. The contents of this document have been grouped into two sections: Supplementary Material, which contains an Abstract, and Newspaper Pages, which contains the page images (and OCR text).
- Build and preview the collection. The xml style items have been included, but the document display for these items is not very nice.
Using process_exp to control document processing
- Paged documents can be presented with a hierarchical table of contents, or with next and previous page arrows, and a "go to page" box (like we have done so far). The display type is specified by the documenttype (hierarchy|paged) option to PagedImagePlugin. The next and previous arrows suit the linear sequence documents, while the table of contents suits the hierarchically organised document. Ordinarily, a Greenstone collection would have one plugin per document type, and all documents of that type get the same processing. In this case, we want to treat the XML-based item files differently from the text-based item files. We can achieve this by adding two PagedImagePlugin plugins to the collection, and configuring them differently.
- Go to the Document Plugins section of the Design panel, and add a new PagedImagePlugin plugin. Enable the create_screenview option, set the documenttype option to hierarchy and set the process_exp option to xml.*\.item$ and click OK.
- Move this PagedImagePlugin plugin above the original one in the Assigned Plugins list.
- The XML based newspapers have been grouped into a folder called xml. This enables us to process these files differently, by utilizing the process_exp option which all plugins support. The first PagedImagePlugin in the list looks for item files underneath the xml folder. These documents will be processed as 'hierarchical' documents. Item files that don't match the process expression (i.e. aren't underneath the xml folder) will be passed onto the second PagedImagePlugin, and these are treated as 'paged' documents.
Rebuild and preview the collection. Compare the document display for a paged document e.g. Te Waka o Te Iwi, Vol. 1, No. 1 with a hierarchical document, e.g. Matariki 1881, No. 1.
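To check which files a process_exp would pick up, you can try the regular expression outside Greenstone. Greenstone evaluates process_exp in Perl, but for a simple pattern like this one Python's re module behaves the same way. The file paths below are hypothetical examples, not the exact paths Greenstone passes to the plugin.

```python
import re

# The process_exp given to the first PagedImagePlugin.
PROCESS_EXP = re.compile(r"xml.*\.item$")

def first_plugin_accepts(path):
    """True when the path matches the process_exp of the 'hierarchy' plugin."""
    return PROCESS_EXP.search(path) is not None

# Hypothetical paths, for illustration only.
paths = [
    "import/xml/23/23__2.item",   # under the xml folder: hierarchy plugin
    "import/12/12_3_6.item",      # not under xml: falls to the paged plugin
    "import/xml/23/readme.txt",   # not an item file: no match
]
for path in paths:
    print(path, "->", first_plugin_accepts(path))
```

Note that the expression only requires the letters "xml" to appear somewhere before ".item", so an item file inside a folder such as "xmlnotes" would match too. If that matters in your collection, use a stricter expression.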
Switching between images and text
We can modify the document display to switch between the text version and the screenview and full size versions. We do this using a combination of format statements and macro files.
- First of all we will add a macro file to the collection. Close the collection in the Librarian Interface. In a file browser outside of Greenstone, locate the Paged Image collection in your Greenstone installation: Greenstone → collect → pagedima. Also in a file browser, locate the file sample_files → niupepa → macros → extra.dm. Copy this file and paste it into the macros folder inside the pagedima collection.
- Back in the Librarian Interface, open up the collection again, and go to the Format Features section of the Format panel.
- Select AllowExtendedOptions in the Choose Feature list, and click <Add Format>. Tick the Enabled checkbox. This gives us more control over the layout of the page—in this case, we want to replace the standard DETACH and NO HIGHLIGHTING buttons with buttons that switch between images and text.
- Select the DocumentHeading format item and set it to the following text (which can be copied from sample_files → niupepa → formats → adv_doc_heading.txt).
<div class="heading_title">{Or}{[parent(Top):ex.Title],[ex.Title]}</div>
<div class="buttons" id="toc_buttons">
{If}{[srcicon],_document:viewfullsize_}
{If}{[screenicon],_document:viewpreview_}
{If}{[NoText] ne '1',_document:viewtext_}
</div>
<div class="toc">[DocTOC]</div>
{Or}{[parent(Top):ex.Title],[ex.Title]} outputs the newspaper Title metadata. This is only stored at the top level of the document, so if we are at a subsection, we need to get it from the top ([parent(Top):ex.Title]). Note that we can't just use [parent:ex.Title] as this retrieves the Title from the immediate parent node, which may not be the top node of the document.
_document:viewpreview_, _document:viewfullsize_, _document:viewtext_ are macros defined in extra.dm which output buttons for preview, fullsize and text versions, respectively. We choose which buttons to display based on what metadata and text the document has. (Note: you can view the macros by going to the Collection Specific Macros section of the Format panel.)
[DocTOC] is the document table of contents or "go to page" navigation element. Since we are using extended options, we need to explicitly specify this for it to appear in the page. The different pieces are surrounded by <div> elements, so that the appropriate styling information can be used.
- Select the DocumentText format statement and set it to the following text (which can be copied from sample_files → niupepa → formats → adv_doc_text.txt):
{If}{_cgiargp_ eq 'fullsize',[srcicon],
{If}{_cgiargp_ eq 'preview',[screenicon],
{If}{[NoText] ne '1',[Text],[screenicon]}}}
This format statement changes the display based on the "p" argument (_cgiargp_). This is not used normally for document display, so we can use it here to switch between full size image ([srcicon]), preview size image ([screenicon]) and text ([Text]) versions of each page.
- Preview the collection. View some of the documents; once you have reached a newspaper page, you should get fullsize, preview and text options.
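The nested {If} statements in the DocumentText format statement amount to a three-way choice. The Python sketch below spells out that decision; it is only a model of the logic, and the function and argument names are invented for this example.

```python
def document_view(p_arg, has_text):
    """Return which element the DocumentText format statement would show.

    p_arg    -- value of the "p" argument (_cgiargp_), possibly empty
    has_text -- True when [NoText] is not '1', i.e. the page has OCR text
    """
    if p_arg == "fullsize":
        return "[srcicon]"      # full size image
    if p_arg == "preview":
        return "[screenicon]"   # screen-view image
    # No (or any other) "p" argument: prefer the text, and fall back to
    # the screen-view image for pages that have none.
    return "[Text]" if has_text else "[screenicon]"

for p_arg in ("fullsize", "preview", ""):
    for has_text in (True, False):
        print(repr(p_arg), has_text, "->", document_view(p_arg, has_text))
```

This makes the fallback visible: a Te Waka page, which has no text, shows its screen-view image even when no "p" argument is given.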
Open Archives Initiative (OAI) collection
Sample files:
oai.zip
Devised for Greenstone version: 2.60|3.06
Modified for Greenstone version: 2.87|3.08
This exercise explores service-level interoperability using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). So that you can do this on a stand-alone computer, we do not actually connect to the external server that is acting as the data provider. Instead we have provided an appropriate set of files that take the form of XML records produced by the OAI-PMH protocol.
One of Greenstone's documented example collections is sourced over OAI. This exercise takes you through the steps necessary to reconstruct it. You may wish to take a look at the documented example collection OAI demo now to see what this exercise will build.
- Start a new collection called OAI Service Provider. Fill out the fields with appropriate information.
- In the Gather panel, locate the folder sample_files → oai → sample_small → oai. Drag this folder into the collection and drop it there.
- During the copy operation, a popup window may appear asking whether to add OAIPlugin to the list of plug-ins used in the collection, because the Librarian Interface has not found an existing plug-in that can handle this file type. Press the <Add Plugin> button to include it. The files for this collection consist of a set of images (in JCDLPICS → srcdocs) and a set of OAI records (in JCDLPICS) which contain metadata for the images.
When files are copied across like this, the Librarian Interface studies each one and uses its filename extension to check whether the collection contains a corresponding plug-in. No plug-in in the list is capable of processing the OAI file records that are copied across (they have the file extension .oai), so the Librarian Interface prompts you to add the appropriate plug-in.
Sometimes there is more than one plug-in that could process a file—for example, the .xml extension is used for many different XML formats. The popup window, therefore, offers a choice of all possible plug-ins that matched. It is normally easy to determine the correct choice. If you wish, you can ignore the prompt (click <Don't Add Plugin>), because plug-ins can be added later, in the Document Plugins section of the Design panel.
- You will need to specify which document the OAI metadata values should be attached to. In the Design panel, select the Document Plugins section, then select the OAIPlugin and click <Configure Plugin...>. Locate the document_field option in the popup window and type ex.dc.Identifier (it may not be available in the drop-down list until after building). Click <OK>. Finally, you may want to remove the EmbeddedMetadataPlugin to speed up building (since it's not going to extract metadata relevant to this tutorial anyway).
- You also need to configure the image plug-in. Select the ImagePlugin line in the Document Plugins section and click <Configure Plugin...>. In the resulting popup window locate the screenviewsize option, switch it on, and type the number 300 in the box beside it to create a screen-view image of 300 pixels. Click <OK>.
- Now switch to the Create panel and build and preview the collection.
OAIPlugin will process the OAI records, and assign metadata to the images, which are processed by ImagePlugin.
Like other collections we have built by relying on Greenstone defaults, the end result is passable but can be improved. The next steps refine the collection using the metadata harvested by OAI-PMH into the .oai files.
- In the Browsing Classifiers section of the Design panel, delete the two List classifiers (dc.Title;ex.Title and ex.Source).
- Add an AZCompactList classifier based on ex.dc.Subject metadata. Configure it with Subjects as the buttonname.
- Now add an AZCompactList classifier based on ex.dc.Description metadata. In its configuration panel set mingroup to 2, mincompact to 1, maxcompact to 10 and buttonname to Captions. Setting mingroup to 2 will mean that two or more documents with the same description will be grouped into a bookshelf; the default mingroup of 1 means that every document will get a bookshelf. mincompact and maxcompact control how many documents are grouped into each section of the horizontal A-Z list. In this case, each group can have as few as one document, and no more than ten.
- In the Search Indexes section of the Design panel, delete all indexes and add a new one based on ex.dc.Description metadata. Set the Display text for the ex.dc.Description index by going to the Search section in the Format panel and changing its label to "_labelDescription_". Using a macro for an index name means that it will display in the correct language (assuming that the macro has been translated). You can check Greenstone → macros → english.dm to see which macros are available.
- Build the collection and preview it.
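The effect of the mingroup option can be illustrated with a few lines of Python. This is a simplified model, not the AZCompactList implementation: it only covers the bookshelf grouping described above, ignores mincompact and maxcompact, and uses invented document identifiers and captions.

```python
from collections import defaultdict

def group_into_bookshelves(documents, mingroup=1):
    """Group (doc_id, description) pairs the way mingroup is described.

    Returns (bookshelves, singles): descriptions shared by at least
    'mingroup' documents become bookshelves; the rest are listed singly.
    """
    by_description = defaultdict(list)
    for doc_id, description in documents:
        by_description[description].append(doc_id)

    bookshelves = {}
    singles = []
    for description, doc_ids in by_description.items():
        if len(doc_ids) >= mingroup:
            bookshelves[description] = doc_ids
        else:
            singles.extend(doc_ids)
    return bookshelves, singles

# Hypothetical documents, for illustration only.
docs = [("img1", "Conference dinner"), ("img2", "Conference dinner"),
        ("img3", "Poster session")]

print(group_into_bookshelves(docs, mingroup=2))
print(group_into_bookshelves(docs, mingroup=1))
```

With mingroup set to 2, only the shared caption becomes a bookshelf; with the default of 1, the single "Poster session" image gets a bookshelf of its own as well.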
Tweaking the presentation with format statements
- In the Format panel, select Format Features. First replace the VList format statement with the following (which can be copied from the file vlist_tweak.txt in the sample_files → oai → format_tweaks folder).
<td>
{If}{[numleafdocs],[link][icon][/link],[link][thumbicon][/link]}
</td>
<td valign=middle>
{If}{[numleafdocs],[Title],<i>[ex.dc.Description]</i>}
</td>
This format statement customizes the appearance of vertical lists such as the search results and captions lists to show a thumbnail icon followed by Description metadata.
- Next, select DocumentHeading from the Format Features list and change its format statement to:
<h3>[ex.dc.Subject]</h3>
The document heading appears above the DETACH and NO HIGHLIGHTING buttons when you get to a document in the collection. By default DocumentHeading displays the document's ex.Title metadata. In this particular set of OAI exported records, titles are filenames of JPEG images, and the filenames are particularly uninformative (for example, 01dla14). You can see them in the Enrich panel if you select an image in oai → JCDLPICS → srcdocs and check its ex.Source and ex.Title metadata. The above format statement displays ex.dc.Subject metadata instead.
- Finally, you will have noticed that where the document itself should appear, you see only "This document has no text.". To rectify this, select DocumentText in the Choose Feature pull-down list and use the following as its format statement (this text is in doctxt_tweak.txt in the format_tweaks folder mentioned earlier):
<center><table width=_pagewidth_ border=1>
<tr><td colspan=2 align=center>
<a href=[ex.dc.OrigURL]>[screenicon]</a></td></tr>
<tr><td>Caption:</td><td> <i>[ex.dc.Description]</i> <br>
(<a href=[ex.dc.OrigURL]>original [ImageWidth]x[ImageHeight] [ImageType] available</a>)
</td></tr>
<tr><td>Subject:</td><td> [ex.dc.Subject]</td></tr>
<tr><td>Publisher:</td><td> [ex.dc.Publisher]</td></tr>
<tr><td>Rights:</td><td> [ex.dc.Rights]</td></tr>
</table></center>
This format statement alters how the document view is presented. It includes a screen-sized version of the image that hyperlinks back to the original larger version available on the web. (Unfortunately, the original versions of the images in this sample collection are no longer available on the web. If you want the link to lead to the local copy of the full size image, then use [ex.srclink]...[/ex.srclink] in place of <a href=[ex.dc.OrigURL]>...</a>.) Image property information extracted from the image, such as width, height and type, is also displayed as a consequence of using the above format statement.
- Format statements are processed by the runtime system, so the collection does not need to be rebuilt for these changes to take effect. Click <Preview Collection> to see the changes.
Setting up your Greenstone OAI Server
Greenstone 2 collections are not enabled for OAI out of the box. To make a collection available for serving up over OAI, some minor adjustments need to be made first. This tutorial will look at how to make an existing collection available over OAI and test its accessibility by getting it validated against the Open Archives validator.
- Use a text editor to open the file etc/oai.cfg located in your Greenstone installation folder. The oai.cfg configuration file contains properties that control the behaviour and features of your Greenstone OAI server. The basic properties to edit in order to get your collection served by the inbuilt OAI server are the repositoryName, repositoryID and oaicollection. Look up these properties in the file. For repositoryName and repositoryID, type in some values that make sense for your digital library. For example:
repositoryName "Greenstone"
repositoryID "greenstone"
- For this tutorial, we'll make the backdrop collection created in the simple image tutorial available over OAI. Therefore, add this collection's name to the end of the oaicollection property:
oaicollection demo documented-examples/oai-e backdrop
If you have a great many documents and do not want the OAI server to return all of them in one go, you could set the resumeafter property in the oai.cfg file to something lower than the default value of 250. For example:
resumeafter 50
- If you're on Windows, it's best to be using the Apache web server. So if you're using the Local Library Server, stop the web server by exiting the little white dialog (the Greenstone Server Interface). Use a file browser to go into your Greenstone installation directory and rename the server.exe there to server.not to disable it. Now re-launch the Greenstone Server from the Start menu, so that this time, the included Apache web server will be used instead, launching its own little white dialog.
- You are now ready to visit your oaiserver home page to check that it's all looking good. Start up the Greenstone Server by going to Windows Start → All Programs → Greenstone 2.87 → Greenstone Server. Press the Enter Library button and you will end up on your Digital Library home page as usual. Adjust the URL so that instead of the library.cgi suffix, it says oaiserver.cgi. The page that loads now will contain an error message (badVerb) saying that you've provided an illegal OAI verb. This is because the OAI specification requires you to provide more instruction in the URL as to what you want. The specification defines verbs and possible arguments to them. A basic verb is Identify, which requests the OAI server to return some information about the OAI repository that it's serving. Adjust the URL once more by suffixing ?verb=Identify, so that your URL now looks like:
http://<domain>/greenstone/cgi-bin/oaiserver.cgi?verb=Identify
Visiting this page now gives some information about your Greenstone OAI repository.
- Although the data transmitted over OAI is in the form of XML, Greenstone uses a stylesheet to transform that XML response into the user-friendly, structured web page that you see when you perform the Identify request. This allows Identify and other verbs in the OAI specification to be shown in the main Greenstone OAI Server pages as link buttons. You can see these verbs represented in the main Greenstone oaiserver.cgi (or oaiserver.cgi?verb=Identify) page as a row of links, starting with "Identify", at the top and at the lower end of the page. Clicking on a link will execute that verb as a request and return the response from your Greenstone OAI server as a structured web page. Try clicking on all the links.
- OAI defines a concept called a Set. In Greenstone, the OAI Set concept is mapped to the practical Greenstone collection. The link to the ListSets verb will therefore request the Greenstone OAI server to list all the collections that have been enabled for OAI. Click on the ListSets link and have a look. The response page for the ListSets verb will show you that your backdrop collection (created in the Simple image collection tutorial) is one of the collections available over OAI in your Greenstone repository.
- You will see a couple of buttons next to each collection (or Set) listed here. The first is Identifiers and the second Records. Click on the Identifiers button for the backdrop Set. This will list all the IDs of the documents contained in your OAI collection. The IDs look similar to Greenstone's internal document IDs, but with an additional prefix (oai:<repositoryID>:<setname>, where repositoryID was set by you in the oai.cfg configuration file, and setname is the name of the collection).
- Click the browser Back button to get back to the ListSets page and press the Records button located next to the backdrop collection. If you had specified some Dublin Core (dc) metadata for each of the images in the backdrop collection, then the page that loads will display this information for each document in the collection (Set). Greenstone's OAI at present supports 3 metadata formats, as is explained in the instructive comments in the oai.cfg file. Of these three, the OAI standard for Dublin Core, oai_dc, is the one pertinent to this tutorial. If your collection specifies metadata for a different metadata set format, you can use the oai.cfg file to tell Greenstone how to map the metadata fields of your chosen metadata set format into the Dublin Core metadata set supported by the Greenstone OAI server (or one of the other metadata sets it supports). Look in the oai.cfg file again and scroll down to the section on oaimapping, which will explain and provide examples for how to specify such mappings from your metadata format to one that Greenstone's OAI server uses. For instance, the demo collection comes enabled for OAI upon installation, and specifies some mappings from its DLS metadata format to OAI DC. Its dls.Title metadata is mapped to oai_dc.title using the following line in the oai.cfg configuration file (note the use of case):
oaimapping dls.Title oai_dc.title
Because the backdrop collection uses DC metadata already, no mapping is required.
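If you want to double-check what you have entered in oai.cfg, a short script can list the settings. The Python sketch below is not part of Greenstone and makes some assumptions about the file: one property per line, values separated by spaces, optional double quotes around values, and comment lines starting with "#". It reads the example lines used in this exercise rather than your real file.

```python
import shlex

SAMPLE_CFG = '''
# example lines taken from this exercise
repositoryName "Greenstone"
repositoryID "greenstone"
oaicollection demo documented-examples/oai-e backdrop
resumeafter 50
oaimapping dls.Title oai_dc.title
'''

def parse_oai_cfg(text):
    """Return (settings, mappings) from oai.cfg-style text.

    settings -- dict of property name to list of values
    mappings -- dict of source metadata field to OAI field (oaimapping lines)
    """
    settings = {}
    mappings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        key, *values = shlex.split(line)  # shlex removes the double quotes
        if key == "oaimapping" and len(values) == 2:
            source, target = values
            mappings[source] = target
        else:
            settings.setdefault(key, []).extend(values)
    return settings, mappings

settings, mappings = parse_oai_cfg(SAMPLE_CFG)
print("Collections served over OAI:", settings["oaicollection"])
print("Metadata mappings:", mappings)
```

To inspect your own installation, read the contents of etc/oai.cfg into a string and pass it to parse_oai_cfg instead of SAMPLE_CFG.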
Validating the Greenstone OAI server
In this section, you'll be testing that you've set up your Greenstone OAI server correctly so that it's accessible over OAI. For this part of the exercise, you need to be on a networked computer and your host computer needs to be visible to the outside world. (That is, when you provide the full name of your computer, someone else in the world should be able to find that computer by typing its URL into their browser's address field.)
We'll be using an external OAI client to access our up-and-running Greenstone OAI server. It's not just any OAI client either, but an OAI Server validator.
- You will want to be running the included Apache web server. So if you're on Windows and using the Local Library Server, quit it and rename the server.exe application in your Greenstone installation folder to server.not. Then use the Start menu shortcut to the Greenstone Server once more, to now launch the Apache web server.
- For this exercise, we will be visiting the Open Archives Validator, for which your OAI server needs to provide a valid email address. In a text editor, open up your Greenstone installation's etc/oai.cfg file and set the value of the maintainer field to your email address. Note that by default, your Greenstone installation will make the demo collection available over OAI. This collection has been set up with a dummy (and invalid) email address for the creator and maintainer fields in the collection's collect.cfg file. You will need to open up collect/demo/etc/collect.cfg and clear the email values for the creator and maintainer properties (or else set these to a valid email address). Otherwise the Open Archives validator will resort to using the demo collection's default dummy email to send the initial validation results to. Alternatively, you can simply remove the demo collection from the oai.cfg file's oaicollection property, which will stop the demo collection being made available over OAI. Note also that, if you wish to specify contact emails at a collection level, you will need to edit your Greenstone installation's collect/<collection-name>/etc/collect.cfg file for those collections and set the creator and maintainer fields to the desired email address.
- If your collection contains document items for which you have not assigned any (Dublin Core, dc) metadata, the OAI validation can fail because it is dependent on having Metadata Formats listed even on a per record (per document) basis. Therefore, if your document has no dc metadata assigned, Greenstone won't know what OAI-supported metadata format is used by that document in order to list it. In practice, this means that you either have to assign one or more dc.* metadata to each document in your OAI collection, or you will have to set up an oaimapping in the oai.cfg file to map existing metadata of whichever format to dc.* metadata. For instance, if you created an image collection without assigning any metadata and are happy to use the Title or Source metadata that Greenstone extracted for each image (ex.Title, ex.SourceFile) as the image document's "title", you could map either of these metadata to dc.Title in the file oai.cfg. To do so, you'd open up oai.cfg in an editor, go down to the section specifying the oaimapping properties and add a new line:
oaimapping Title oai_dc.title
(Or: oaimapping SourceFile oai_dc.title). This step will not be necessary for the backdrop collection if you had assigned any dc.* metadata for each image in the collection.
Note: If the demo collection that comes with a Greenstone installation is not built, it will either need to be built before submitting your OAI server for inspection by the Open Archives validator, or you will need to adjust the oai.cfg file once more by removing the mention of demo from the oaicollection property. This is because the demo collection is mentioned as being set up for OAI in the oai.cfg file. However, if this collection is unbuilt, it will not be accessible to the OAI validator and so your oaiserver may fail tests due to this oversight.
- If you are working with legacy collections (built before Greenstone version 2.85) you may have to rebuild them if you plan to make them available over OAI and be compliant with the Open Archives validator. Rebuilding old collections will recalculate the earliest datestamp value for the repository. This calculation is different from Greenstone 2.85 onwards.
- Next you will need to set up your Greenstone server to be accessible from outside, so that external OAI clients can access it. Go to the File → Settings menu of your Greenstone server interface dialog and check the Allow External Connections option and also check the Get local IP and resolve to a name option (or the Get local IP option) as its address resolution method.
- Press the button in the Greenstone Server Interface dialog that says Enter Library (or it may say Restart Library). Your Digital Library home page will open up in a browser tab. Adjust this URL to have a suffix of oaiserver.cgi in place of the terminating library.cgi, then copy the resulting URL and visit http://www.openarchives.org/Register/ValidateSite.
- The Open Archives Validator page will request the URL to your Greenstone OAI server. Paste the URL you have in your copy buffer into the field provided for this, and press the Validate baseURL button to start running the tests. You will be told to check the adminEmail address you provided to continue the remaining tests and to get the validation report. If the validator does not recognise the URL, make sure you have given the full domain of your host machine rather than just the host name. If that URL is still not accepted, visit the oaiserver.cgi?verb=Identify page again and check this works. If it doesn't, it may be that your machine is not set up to be accessible to outside networks. Check your proxy settings, make sure you've set up port forwarding and that your firewall is not interfering.
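The URL adjustment described in the steps above can be sketched in a few lines of Python. This is only an illustration and not part of Greenstone: the function names and the example host and port are assumptions, while the library.cgi to oaiserver.cgi swap and the verb=Identify check come from the steps above.

```python
from urllib.parse import urlsplit, urlunsplit


def oai_base_url(library_url: str) -> str:
    """Replace a trailing library.cgi with oaiserver.cgi."""
    parts = urlsplit(library_url)
    if not parts.path.endswith("library.cgi"):
        raise ValueError("expected a URL whose path ends in library.cgi")
    new_path = parts.path[: -len("library.cgi")] + "oaiserver.cgi"
    # Drop any query string or fragment so only the bare base URL remains.
    return urlunsplit((parts.scheme, parts.netloc, new_path, "", ""))


def identify_url(base_url: str) -> str:
    """Build the Identify request used to check that the OAI server responds."""
    return base_url + "?verb=Identify"


# Hypothetical host and port; substitute the full domain of your own machine.
base = oai_base_url("http://myhost.example.org:8282/greenstone/cgi-bin/library.cgi")
print(base)
print(identify_url(base))
```

Opening the second printed URL in a browser is a quick way to confirm that the server answers before handing the first URL to the validator.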
Downloading over OAI
GLI can serve as an OAI client application: it can connect to a remote OAI server and retrieve metadata, even download documents. The tutorial Open Archives Initiative (OAI) collection did not obtain the data from an external OAI-PMH server. This missing step is accomplished either by running a command-line program or by using the Download panel in the Librarian Interface. This exercise explains how you would do this using both methods. In the previous exercise, we set up the Greenstone server to serve the Simple image collection (backdrop) over OAI. In this tutorial, we will use GLI to connect to that OAI server and download OAI metadata for the Simple image collection and even download its documents. The principle is the same if you wish to connect to other OAI servers.
Downloading using the Librarian Interface
- Launch GLI. This should launch the Greenstone server as well, if this is not already running, so that the OAI server is also up and running.
- In GLI, go to the Download panel. To the left, choose OAI as the Download Setting.
- On the right, set the Source URL field to contain the URL to your Greenstone OAI server. It would be of the form
http://<hostname:portnumber>/greenstone/cgi-bin/oaiserver.cgi
Make sure that you can generally access this URL from your browser.
Visit the library home page, as this will load the greenstone collections, so that any associated files like images or pdf documents become accessible for download. (Without visiting the library home page, the collections would not be loaded and the images from the Simple Images collection, that we will be downloading below alongside the oai files, will not be available for download.)
- If the server is not running on localhost and your computer is behind a firewall or proxy server, you may need to edit the proxy settings in the Librarian Interface. Click the <Configure Proxy...> button. Switch on the Use proxy connection? checkbox. Enter the proxy server address and port number in the HTTP Proxy Host: and Port: boxes. Click <OK> to get back to the OAI section of the Download panel.
- If at this stage you were to press the <Server Information> button (in the central row of buttons), a dialog would pop up with basic details about the OAI server. At the end, it displays the names of the sets available via that OAI server. In our example, backdrop (the Simple Image collection) would be listed as one of the setNames. Press <close> to close the Server Information dialog.
- Tick the Metadata prefix checkbox as well as the Restrict to set checkbox. For the latter, type backdrop for the set name. Then tick Get document. Also tick Only include file types and include jpg in the list of comma separated values for it so that it becomes
jpg,doc,pdf,ppt
Next, tick Max records and set it to 10. There will be 9 images in the collection, so we don't really need to set the Max records value, but this is a helpful feature that you can use when downloading from an OAI server.
- Finally, click <Download>, located beside the Server Information button. If you have set proxy information in Preferences..., a popup will ask for your user name and password. Once the download has started, a progress bar appears in the lower half of the panel that reports on how the downloading process is doing. GLI will download oai metadata and, because we have ticked the Get document checkbox, it will also be retrieving actual documents, but not more than 10, because of the limit of 10 that we've placed on the number of records to download.
- After a while, it will have finished downloading. Change to the Gather panel, and on the left-hand side, open up the Downloaded Files folder. This is where Greenstone stores files you downloaded using the Download panel. In this case, it will contain a folder in which the oai metadata files and images that you've just downloaded from your own Greenstone OAI server are stored. These files can then be added to a collection, as will be covered later in this tutorial.
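To make the Download panel settings above more concrete, here is a hedged Python sketch of the kind of OAI-PMH request they correspond to. The ListRecords verb and the metadataPrefix and set parameters are standard OAI-PMH, but the exact requests GLI sends are not described in this tutorial, so treat the mapping as an assumption. The helper names are hypothetical, and the case-insensitive extension check is also an assumption.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_name=None):
    """Build an OAI-PMH ListRecords request, optionally restricted to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_name:
        params["set"] = set_name
    return base_url + "?" + urlencode(params)


def has_allowed_type(filename, allowed="jpg,doc,pdf,ppt"):
    """Mimic an 'Only include file types' filter on a comma-separated list."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return extension in {e.strip().lower() for e in allowed.split(",")}


# Hypothetical server address; use your own oaiserver.cgi URL.
print(list_records_url("http://myhost.example.org:8282/greenstone/cgi-bin/oaiserver.cgi",
                       set_name="backdrop"))
print(has_allowed_type("sunset.jpg"), has_allowed_type("notes.txt"))
```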
Downloading using the command line
For command line downloading to work, your computer must have a direct connection to the Internet—being behind a firewall may interfere with the ability to download the information. You will need to use the Librarian Interface for downloading if you are behind a firewall.
- Close the Librarian Interface.
- Start up the Greenstone server application.
- If you're on Windows, open a DOS window to access the command-line prompt. This facility should be located somewhere within your Start → Programs menu, but details vary between different Windows systems. If you cannot locate it, select Start → Run, enter cmd in the popup window that appears and hit Enter. If you're on Linux or Mac, open a terminal.
- Before you start, you must set up your Greenstone environment in the terminal. In the DOS window or terminal, move to the home directory where you installed Greenstone. This is accomplished by something like:
cd C:\Program Files\Greenstone
- Type:
setup.bat
to set up the ability to run Greenstone command-line programs. On Linux/Mac, you would run source setup.bash.
GLI uses a perl script, downloadfrom.pl, to do the downloading. This can be run on the command line, outside of GLI.
The downloadfrom.pl script can download using several different protocols. These are specified using the -download_mode option. To see the available options for download mode, run perl -S downloadfrom.pl -h. This shows that the current options are: Web, MediaWiki, OAI, Z3950, SRW. For OAI downloading, use -download_mode OAI.
To see the options for downloading using the OAI mode, you can run perl -S downloadinfo.pl OAIDownload. The options are the same as you can see in the GLI OAI download panel.
- We'll use the set and max_records OAI Download options to download a maximum of 5 OAI records from the backdrop collection at your Greenstone's OAI server, which was made available over OAI as a set in the previous tutorial:
perl -S downloadfrom.pl -download_mode OAI -url http://<hostname:portnumber>/greenstone/cgi-bin/oaiserver.cgi -set backdrop -max_records 5
The OAI records will be downloaded into the folder where the downloadfrom.pl script is run from. To change this, use the -cache_dir full-path-to-folder option and set its value to the full path of the destination folder you choose. (If you wanted to download the documents along with the records, then you would additionally pass in the -get_doc flag to the above command as well as the -get_doc_exts flag followed by a comma-separated list of file extensions like "jpg,pdf".)
perl -S downloadfrom.pl -download_mode OAI -url http://<hostname:portnumber>/greenstone/cgi-bin/oaiserver.cgi -set backdrop -max_records 5 -get_doc -get_doc_exts "jpg,pdf"
You can import the downloaded documents into a new Greenstone collection and build them in the usual manner.
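If you want to script the download, the command above can be assembled programmatically. The following Python sketch only uses the downloadfrom.pl options mentioned in this exercise (-download_mode, -url, -set, -max_records, -get_doc, -get_doc_exts and -cache_dir). The function and parameter names are hypothetical, and the sketch assumes the Greenstone environment has already been set up in the shell that runs it.

```python
import shlex


def build_download_command(url, set_name=None, max_records=None,
                           doc_exts=None, cache_dir=None):
    """Return the downloadfrom.pl invocation as an argument list."""
    cmd = ["perl", "-S", "downloadfrom.pl", "-download_mode", "OAI", "-url", url]
    if set_name:
        cmd += ["-set", set_name]
    if max_records is not None:
        cmd += ["-max_records", str(max_records)]
    if doc_exts:
        # -get_doc requests the documents; -get_doc_exts limits their file types.
        cmd += ["-get_doc", "-get_doc_exts", ",".join(doc_exts)]
    if cache_dir:
        cmd += ["-cache_dir", cache_dir]
    return cmd


cmd = build_download_command(
    "http://<hostname:portnumber>/greenstone/cgi-bin/oaiserver.cgi",
    set_name="backdrop", max_records=5, doc_exts=["jpg", "pdf"])
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems when the destination folder contains spaces.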
Building the downloaded documents in GLI
- If you used GLI to download documents over OAI, as seen in the first part of the tutorial, you can find the downloaded items in the Downloaded Files folder in the filesystem view on the left side of the Gather panel. If you used the command line to download documents, the downloaded files will be stored wherever you ran the downloadfrom.pl script from.
- Open GLI, locate the files you downloaded over OAI and drag and drop these into a new Greenstone collection called OAI Collection. Because there are *.oai files among those downloaded, GLI will offer to add the OAIPlugin.
- Go to the Design panel, and configure the OAIPlugin by ticking its no_cover_image option. Generally, Greenstone will look for any images that have an identical name to the primary document being processed and will associate the image with the document as being the document's cover image. Because the OAI files and the image documents downloaded over OAI have matching names, each image would get treated as the cover image for its associated OAI file. We don't want that behaviour here, so we turn on the no_cover_image option. This allows the OAIPlugin to attach the metadata of each OAI file to its associated image (treating the image as the primary document, instead of as a cover image), just as intended. Note that this time, we don't configure the OAIPlugin's document_field option to ex.dc.Identifier, because the OAI files that have been downloaded over OAI have the associated image's document identifier stored in the (ex.)gi.Sourcedoc metadata field. You can see this if you open up any of the downloaded OAI files in a text editor. The (ex.)gi.Sourcedoc field is consulted by default when the Greenstone building process tries to identify what source document to attach the metadata in each OAI file to.
- Switch to the Create panel and press the build button. During this stage, the OAIPlugin will extract the metadata in the oai files and attach it to the associated jpg files of the downloaded backdrop collection. You can see this once the collection has been built by switching to the Enrich panel. If you click on an oai file, no metadata is set for it. However, if you then click on a jpg file and scroll down, there will be metadata names that start with ex.dc. This refers to Greenstone-extracted Dublin Core metadata. ex.dc.Description and ex.dc.Title will be set to the values you assigned to the images in the tutorial A simple image collection. Greenstone will also have added ex.dc.Identifier metadata, which is the source URL for this image.
- If you wish, you can now set up this collection in a manner similar to how the backdrop collection was set up in A simple image collection. Don't forget to copy in any specific format statements, adjust them to use the ex.dc metadata instead of dc metadata, then rebuild and preview the collection.
Use METS as Greenstone's Internal Representation
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.87
- In the Greenstone Librarian Interface, open up one of your existing collections, for example the Small HTML Collection collection.
To be able to substitute GreenstoneMETSPlugin for GreenstoneXMLPlugin you need to be in Expert mode.
- Click File → Preferences... → Mode and change to Expert mode.
- Switch to the Design panel and select Document Plugins. Remove GreenstoneXMLPlugin from the list of plug-ins and add GreenstoneMETSPlugin, with the default configuration options. Move this plugin to where GreenstoneXMLPlugin was (just below ZipPlugin).
- Now change to the Create panel, locate the options for the import process and set saveas to GreenstoneMETS. Import options are not available unless you are in Expert mode.
- Rebuild the collection.
- In your file browser, locate the archives folder for the collection you are working with (in Greenstone → collect → <collname> → archives). For each document in the collection, Greenstone has generated two files: docmets.xml, the core METS description, and doctxt.xml, a supporting file. (Note: unless you are connected to the Internet you may be unable to view doctxt.xml in your web browser, because it refers to a remote resource.) Depending on the source documents there may be additional files, such as the images used within a web page. One of METS' many features is the ability to reference information in external XML files. Greenstone uses this to tie the content of the document, which is stored in the external XML file doctxt.xml, to its hierarchical structure, which is described in the core METS file docmets.xml.
Moving a collection from DSpace to Greenstone
Sample files:
dspace.zip
Devised for Greenstone version: 2.60
Modified for Greenstone version: 2.87
- Start a new collection called StoneD and fill out its fields appropriately.
- In the Design panel add DSpacePlugin. Leave the plugin options at their defaults and press <OK>.
- Using the up arrow, move the position of DSpacePlugin to the top of the list (above GreenstoneXMLPlugin).
- In the Gather panel, locate the folder sample_files → dspace. It contains five example items exported from a DSpace institutional repository. Copy them into your collection by dragging them over to the right-hand side of the panel. Cancel out of any dialog offering to add plugins.
- Build the collection and preview it to see the basic defaults exhibited by a DSpace collection.
If you browse by Titles, you will find 7 documents listed, though only 5 items were exported from DSpace. Two of the original items had alternative forms in their directory folder. The DSpace plug-in options control what happens in such situations: the default is to treat them as separate Greenstone documents.
Below we use a plug-in option (first_inorder_ext) to fuse the alternative forms together. This option has the effect of treating documents with the same filename but different extensions as a single entity within a collection. One of the files is viewed as the primary document—it is indexed, and metadata is extracted from it if possible—while the others are handled as "associated files."
The first_inorder_ext option takes as its argument a list of file extensions (separated by commas): the first one in the list that matches becomes the primary document.
- Back in the Design panel's Document Plugins section, select DSpacePlugin and click <Configure Plugin...>. Switch on its configuration option first_inorder_ext. Set its value to "pdf,doc,rtf".
- Build and preview the collection.
There are now only 5 documents, because only one version of each document has been included—the primary version.
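The effect of first_inorder_ext can be illustrated with a small Python sketch. This is not DSpacePlugin's actual code: the function name is hypothetical, the file names are made up, and the sketch assumes all files sit in one folder and that files whose extensions are not in the list stay separate documents.

```python
from collections import defaultdict
from pathlib import PurePath


def fuse_alternatives(filenames, first_inorder_ext="pdf,doc,rtf"):
    """Map each primary document to the list of its associated files."""
    order = [ext.strip().lower() for ext in first_inorder_ext.split(",")]
    groups = defaultdict(list)
    for name in filenames:
        # Files with the same name but different extensions form one group.
        groups[PurePath(name).stem].append(name)

    result = {}
    for names in groups.values():
        by_ext = {PurePath(n).suffix.lstrip(".").lower(): n for n in names}
        # The first extension in the list that matches becomes the primary.
        primary = next((by_ext[ext] for ext in order if ext in by_ext), None)
        if primary is None:
            for n in names:
                result[n] = []
        else:
            result[primary] = [n for n in names if n != primary]
    return result


print(fuse_alternatives(["report.pdf", "report.doc", "thesis.rtf", "notes.txt"]))
```

In this made-up example report.doc becomes an associated file of report.pdf, because pdf comes before doc in the option's value.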
Adding indexing and browsing capabilities to match DSpace's
The DSpace exported files contain Dublin Core metadata for title and author (amongst other things).
- In the Design panel, select Search Indexes. Delete the ex.Source index, and add one for ex.dc.Contributor. Rename the ex.dc.Contributor index by going to the Search section in the Format panel. Select this index and change its value to "_labelCreator_". Using a macro for an index name means that it will display in the correct language (assuming that the macro has been translated). You can check Greenstone → macros → english.dm to see which macros are available.
- Go back to the Design panel, select Browsing Classifiers. Select the ex.Source List classifier and click <Configure Classifier...>. Change the metadata option to ex.dc.Contributor. Activate the bookshelf_type option and set its value to always. If not already active, activate the partition_type_within_level option. Then set it to none. Finally, activate buttonname and set this to contributors. Click <OK> to close the dialog.
- Now select the Format Features section of the Format panel, and select the VList format statement in the list of assigned format statements. Add the following text before the final </td>:
{If}{[ex.equivlink],<br>Also available as:[ex.equivlink]}
- Also, let's add a format statement for the classifier based on ex.dc.Contributor metadata. In the Choose Feature menu (under Format Features on the Format panel), select the item that starts with:
CL2: List -metadata ex.dc.Contributor
- Leave VList as the Affected Component and click <Add Format>.
Edit the text in the HTML Format String box. Replace
{Or}{[dc.Title],[exp.Title],[ex.Title],Untitled}
with
{If}{[numleafdocs],([numleafdocs]) [ex.Title],[ex.dc.Title]}
This will display the number of documents for each bookshelf in the Contributors classifier.
- Build the collection once again and preview it.
There are still only 5 documents, but against some of the entries appears the line "Also available as:" followed by icons that link to the alternative representations.
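For readers unfamiliar with format statements, the following Python sketch mimics how the {Or} and {If} statements used above choose what to display. It is an approximation written for this tutorial, not Greenstone's formatting engine, and the function names and sample metadata values are invented.

```python
def format_or(metadata, candidates, fallback):
    """Like {Or}{[a],[b],...,text}: use the first metadata value that is set."""
    for name in candidates:
        value = metadata.get(name)
        if value:
            return value
    return fallback


def contributor_entry(metadata):
    """Like {If}{[numleafdocs],([numleafdocs]) [ex.Title],[ex.dc.Title]}."""
    if metadata.get("numleafdocs"):
        # Bookshelf node: show the number of documents and the bookshelf title.
        return "({}) {}".format(metadata["numleafdocs"], metadata.get("ex.Title", ""))
    # Ordinary document: show its Dublin Core title.
    return metadata.get("ex.dc.Title", "")


print(format_or({"ex.Title": "Some title"},
                ["dc.Title", "exp.Title", "ex.Title"], "Untitled"))
print(contributor_entry({"numleafdocs": "3", "ex.Title": "A. Contributor"}))
print(contributor_entry({"ex.dc.Title": "A sample document"}))
```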
Moving a collection from Greenstone to DSpace
In this exercise you export a Greenstone collection in a form suitable for DSpace. It is possible to do this from the Librarian Interface's File menu, which contains an item called Export..., that allows you to export collections in different forms. However, to gain a deeper understanding of Greenstone, we perform the work by invoking a program from the Windows command-line prompt. This requires some technical skill; if you are not used to working in the command-line environment we recommend that you skip this exercise.
Using Greenstone from the command line
- If you're on Windows, open a DOS window to access the command-line prompt. This facility should be located somewhere within your Start → Programs menu, but details vary between different Windows systems. If you cannot locate it and you are running Windows XP, select Start → Run and enter cmd in the popup window that appears. In either Windows Vista or Windows 7, click the Start button and type cmd in the search box at the bottom of the Start menu. If you're on a Unix system or on a Mac, open a terminal.
- In the DOS window or unix terminal, move to the home directory where you installed Greenstone. On Windows, this is accomplished by something like:
cd C:\Program Files\Greenstone
- Type:
setup
to set up the ability to run Greenstone command-line programs. On a Linux or Mac machine, you would similarly open a terminal, change directory into your Greenstone installation's top-level folder and type:
source ./setup.bash
- Change directory into the folder containing the StoneD collection you built in the last exercise.
cd collect\stoned
- Run the following command to export the collection using the DSpace import/export format:
perl -S export.pl -saveas DSpace -removeold stoned
Exporting in Greenstone is an additive process. If you ran the export.pl command once again, the new files exported would be added—with different folder names—to those already in the export folder. For the kind of explorations we are conducting we might re-run the command several times. The -removeold option deletes files that have previously been exported.
- This command has created a new subfolder, collect → stoned → export. Use the file browser to explore it. In it are the files needed to ingest this set of documents into DSpace.
You could equally well run the export.pl command on a different Greenstone collection and transfer the output to a DSpace installation by using DSpace's batch-import facility.
Editing metadata sets
Devised for Greenstone version: 2.70w|3.06
Modified for Greenstone version: 2.87|3.08
GEMS (Greenstone Editor for Metadata Sets) can be used to modify existing metadata sets or create new ones. GEMS is launched from the Librarian Interface when you want to create a new metadata set, or edit an existing one. In this exercise, we run GEMS outside of the Librarian Interface.
Running GEMS
- Start the Greenstone Editor for Metadata Sets (GEMS)
Start → All Programs → Greenstone-2.87 → Metadata Set Editor (GEMS)
(If you're on Linux, use a terminal to run the gli/gems.sh start-up script.)
- GEMS starts up with no metadata set loaded. You can start a new set, or open an existing one, from the File menu.
Creating a new metadata set
- In this exercise, we will create a new metadata set. In order to save time, we will base it on an existing one: Development Library Subset. From the File menu, select File → New.... A popup window appears: New Metadata Set. Fill in the fields. Use "My Metadata Set" for the Metadata set title:, "my" for the Metadata set namespace:, and select "Development Library Subset Example Metadata" from the Base this metadata set on: drop down list. Click <OK>.
- The new metadata set will be displayed. The left hand side lists the elements (and sub-elements, if any) for the set, and the right hand side displays the set or element attributes. Since the new set was based on the Development Library Subset metadata set, it already contains all the elements from that set.
Adding a new element to a metadata set
- Right click on My Metadata Set in the left hand tree (or in the blank space in the left hand side) and choose Add Element from the menu that appears. In the popup window, type "Category" for the new element name, and click <OK>. The new element will appear in the list.
- In the right hand side, the default attributes will appear for the new element. "Label" and "definition" are used in the Librarian Interface when displaying metadata elements and their descriptions (the "definition" is shown as additional text for the element). These attributes can be set in multiple languages.
- Save the new metadata set with File → Save, then close GEMS with File → Exit.
Building and searching with different indexers
Sample files:
demo.zip
Devised for Greenstone version: 2.70w|3.06
Modified for Greenstone version: 2.87|3.08
Greenstone supports three indexers: MG, MGPP and Lucene.
MG is the original indexer used by Greenstone, which is described in the book "Managing Gigabytes". It does section level indexing and compression of the source documents. MG is implemented in C.
MGPP is a re-implementation of MG that provides word-level indexes and enables proximity, phrase and field searching. MGPP is implemented in C++ and is the default indexer for new collections.
Lucene (http://lucene.apache.org/) is a Java-based, full-featured text indexing and searching system developed by Apache. It provides a similar range of search functionality to MGPP with the addition of single-character wildcards and range searching. It was added to Greenstone to facilitate incremental collection building, which MG and MGPP can't provide.
Build with Lucene
- Start a new collection (File → New...) called Demo Lucene, base it on the Greenstone demo (demo) collection, and fill out its fields appropriately.
- In the Gather panel, select Documents in Greenstone Collections, then select and open up Greenstone demo (demo). This will display the documents in the Greenstone demo collection. Drag all 11 folders in the demo folder into the new collection.
If you haven't installed the Greenstone demo (demo) collection yet, you can download the demo.zip file from the link above, unzip it and put it into the collect folder in your Greenstone installation.
- Go to the Enrich panel and look at the metadata that is associated with each directory. Then go to the Search Indexes section in the Design panel. You'll see that the MGPP indexer is in use because the Greenstone Demo collection, which this collection is based on, uses the MGPP indexer.
- Click the Change... button at the right top corner of the panel. A new window will pop up for selecting the Indexers. After selecting an indexer, a brief description will appear in the box below. Select Lucene and click OK. Please note that the Assigned Indexes section may have changed accordingly.
- Build and preview the collection.
Search with Lucene
- Lucene provides single letter and multiple letter wildcards and range searching. The query syntax could be quite complicated (for more information please see http://www.lucenetutorial.com/lucene-query-syntax.html). Here we will learn how to use the wildcards while constructing queries.
- * is a multiple letter wildcard. To perform a multiple letter wildcard search, append * to the end of the query term. For example, econom* will search for words like econometrics, economist, economical, economy, which have the common part econom but different word endings.
- To perform a single letter wildcard search, use ? instead. For example, searching for economi?? will only match words that have exactly two letters after economi, such as economist, economics, and economies.
- Please note that stopwords are used by default with the Lucene indexer, so searching for words like the will match 0 documents. This is explained in a message on the search page, which states that such words are too common and were ignored.
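The difference between the two wildcards can be tried out away from Greenstone with Python's fnmatch module. This is only an analogy: fnmatch implements shell-style patterns rather than Lucene's query parser, but * and ? behave the same way for the examples above. The word list is taken from those examples.

```python
from fnmatch import fnmatchcase

words = ["econometrics", "economist", "economical",
         "economy", "economics", "economies"]


def matching(pattern, vocabulary):
    """Return the words that match a pattern containing * or ? wildcards."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]


# * stands for any number of letters (including none).
print(matching("econom*", words))
# Each ? stands for exactly one letter.
print(matching("economi??", words))
```

The first pattern matches all six words, while the second matches only economist, economics and economies.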
Build with MGPP
- Start a new collection called Greenstone Demo MGPP and also base it on the Greenstone demo (demo).
- In the Gather panel, drag all 11 folders from Documents in Greenstone Collections → Greenstone demo (demo) into the new collection.
- In the Search Indexes section of the Design panel, you will notice that the active indexer is MGPP, since this is the default. (If not, you'd click the Change... button, select MGPP and click OK, in which case the Assigned Indexes section and its options may change accordingly.)
- There are three options at the bottom of the panel — Stem, Casefold and Accent fold. Notice that all three are enabled. Once an option is enabled, it will also appear in the collection's PREFERENCES page and can be turned on or off from there.
- In the Indexing Levels section, also select section, if it isn't already.
- Build and preview the collection.
Search with MGPP
- MGPP supports stemming, casefolding and accentfolding. By default, searching in collections built with the MGPP indexer is set to whole word must match and ignore case differences. So searching for econom will return 0 documents. Searching for fao and FAO returns the same result: 85 word counts and 11 matched documents. Go to the PREFERENCES page by clicking the PREFERENCES button at the top right corner. You can see that the Word endings: option is set to whole word must match and the Case differences: option is set to ignore case differences.
- Sometimes we may want to ignore word endings while searching so as to match different variations of the term. Go to the PREFERENCES page and change the Word endings: option from whole word must match to ignore word endings. Click the set preferences button. Click Search. This time try searching for econom again. 9 documents are found. Please note that word endings are determined according to the third-party stemming tables incorporated in Greenstone, not by the user. Thus the searches may not do precisely what is expected, especially when cultural variations or dialects are concerned. Besides, not all languages support stemming; only English and French have stemming at the moment. Go to the PREFERENCES page and change back to whole word must match to avoid confusion later on. Click the set preferences button.
- Sometimes we may want to search for the exact term, that is, differentiate upper case from lower case. Back in the PREFERENCES page, change the Case differences: option from ignore case differences to upper/lower case must match. Click the set preferences button. Click Search. Now try searching for fao and FAO respectively. Do you notice the difference in the results? Go back to the PREFERENCES page and change the Case differences: option back to ignore case differences to avoid confusion later on. Click the set preferences button.
Use search mode hotkeys with query term
MGPP has several hotkeys for setting the search modes for a query term. These hotkeys explicitly set the Word endings: option and the Case differences: option for the query being constructed.
- #s and #u are hotkeys for the Word endings: option. Appending #s to a query term will specifically enable the ignore word endings function. For example, try searching for econom#s. 9 documents are found, which is the same as in the previous section. Remember that we have set it back to whole word must match. This means using hotkeys will override the current preference settings.
- Appending #u to a query term will explicitly set the current search to whole word must match. Note that using hotkeys will only affect that query term. That is, hotkeys are used per term. For example, if a query expression contains more than one term, some terms can have hotkeys and others not, and the hotkeys can be different for different terms. This provides a fine-grained control of the query, whereas changing settings in the PREFERENCES page will affect the query as a whole.
- Hotkeys #i and #c control the case sensitivity. Appending #i to a query term will explicitly set the search to ignore case differences (i.e. case insensitive).
- In contrast, appending #c will specifically turn off the casefolding, that is, upper/lower case must match. For example, searching for fao#c returns 0 documents.
- Finally, the hotkeys can also be used in combination. For example, you can append #uc to a query term so as to match the whole term (without stemming) and in its exact form (differentiating upper case from lower case). For example, try searching for econom#si and compare against the results when searching for econom#sc and for Econom#sc. The first search is case insensitive and the last two searches are both case sensitive.
A quick reference of the search mode hotkeys in MGPP
Word endings:
#s ignore word endings
#u whole word must match
Case differences:
#i ignore case differences
#c upper/lower case must match
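As a way of checking your understanding of the hotkeys, here is a small Python sketch that splits a query term into the word and the search modes it requests. It is a hypothetical illustration, not MGPP's own parser. The defaults correspond to the default preferences described above (whole word must match, ignore case differences).

```python
def parse_hotkeys(term, stem_default=False, casefold_default=True):
    """Return (word, ignore_word_endings, ignore_case) for one query term."""
    word, _, flags = term.partition("#")
    stem, casefold = stem_default, casefold_default
    for flag in flags:
        if flag == "s":      # ignore word endings
            stem = True
        elif flag == "u":    # whole word must match
            stem = False
        elif flag == "i":    # ignore case differences
            casefold = True
        elif flag == "c":    # upper/lower case must match
            casefold = False
        else:
            raise ValueError("unknown hotkey: #" + flag)
    return word, stem, casefold


for term in ["econom#s", "fao#c", "econom#sc", "fao"]:
    print(term, "->", parse_hotkeys(term))
```

Because the flags are read per term, each term in a longer query can carry its own settings, which is the fine-grained control described above.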
Incremental building of a collection
Collections built with the Lucene indexer support incremental addition, updates, and deletion of documents and metadata. By default, the import and build processes delete old index files in the index directory and intermediate files in the archives directory. With incremental building, the import and build processes will keep the old files and only process the new or modified ones.
Incremental import can be done with any collection, but incremental modification of the indexes can only be done for collections that use the Lucene indexer.
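The idea of processing only new or modified files can be illustrated with a short Python sketch. This shows the general technique only: how Greenstone itself records and detects changes is not described in this tutorial, so the fingerprints (which could be modification times or checksums), the file names and the function name are all assumptions.

```python
def plan_incremental(previous, current):
    """Compare two {filename: fingerprint} snapshots of the import folder.

    Returns the files that are new, changed, or gone since the last build.
    """
    added = sorted(f for f in current if f not in previous)
    modified = sorted(f for f in current
                      if f in previous and current[f] != previous[f])
    deleted = sorted(f for f in previous if f not in current)
    return added, modified, deleted


last_build = {"a.html": "v1", "b.html": "v1", "c.html": "v1"}
now = {"a.html": "v1", "b.html": "v2", "d.html": "v1"}
print(plan_incremental(last_build, now))
```

Only the added and modified files would need to be imported again, and only the deleted ones removed from the index; everything else is left untouched.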
The first part of this tutorial looks at using The Depositor for incremental building. The Depositor only supports addition of new documents and associated metadata. If you want to delete or modify existing documents and their metadata, you will need to use GLI or command line building.
The Depositor
The Depositor is Greenstone’s runtime support for institutional repositories. It provides the collection building work flow through a web interface.
The Depositor only works with the Web library server, not the local library server. Greenstone users belonging to the all-collections-editor user group have access to The Depositor.
Enabling The Depositor
For Windows users, first make sure that you are using a Web Server (e.g. Apache) instead of the Local Library Server. The binary installation of Greenstone will install Apache, but by default the Local Library Server will be used. To switch to using Apache, rename the GSDLHOME → server.exe file to something else. Then re-run the Greenstone Server from the Start → Greenstone Server menu and press its Enter Library button. (On Unix systems, run ./gs2-server.sh from Greenstone 2's installed location to start up the Greenstone server.)
Note: You might need to set permissions for the GSDLHOME → tmp and GSDLHOME → collect or GSDLHOME → collect → your_accessible_collection directory.
In Greenstone, The Depositor is disabled by default. To enable it, edit the file GSDLHOME → etc → main.cfg. Look for the "depositor" line, and change disabled to enabled. Then save and close the file.
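If you would rather script this edit than make it by hand, a minimal Python sketch follows. It assumes the setting sits on a single line that starts with the word depositor followed by its state; check that assumption against your own main.cfg first. The function name enable_feature and the sample text are hypothetical.

```python
def enable_feature(cfg_text, feature):
    """Return cfg_text with 'disabled' changed to 'enabled' on the feature's line.

    Assumption: the setting is one line that starts with the feature name
    followed by whitespace and its state, e.g. 'depositor disabled'.
    """
    out = []
    for line in cfg_text.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == feature and words[1] == "disabled":
            line = line.replace("disabled", "enabled", 1)
        out.append(line)
    return "\n".join(out) + "\n"


sample = "status enabled\ndepositor disabled\n"  # made-up excerpt, not a real main.cfg
print(enable_feature(sample, "depositor"), end="")
```

To apply it to a real file, read main.cfg into a string, pass it through the function, and write the result back, keeping a backup copy of the original.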
Setting a user group
Use of The Depositor involves an authentication step. A user will need a Greenstone account which belongs to an appropriate user group. The all-collections-editor user group gives access to edit any collection, while the ***-collection-editor group gives a user access to edit the *** collection, where *** is the collection's short name (or directory name). By default, the admin account is a member of the all-collections-editor group.
The Greenstone admin pages are used to add new users and modify their group settings. Admin pages may have been enabled when you installed Greenstone. If not, they can be activated by finding the "status" line in the main.cfg file and changing disabled to enabled.
- To access the administration pages, go to your Greenstone home page when the Greenstone server is running and click the Administration Page (below the list of collections). To see the list of users, click the list users link on the left under the User management section. You will need to sign in. You can use the admin account, or any other account which has been added to the administrator group. If you didn't set up the admin pages when you installed Greenstone, then a default admin account will be created with password "admin". Please change this immediately.
- Let's modify the groups for the demo user. This user was added for the authentication demonstration collection to allow restricted access to some of the documents. If this user doesn't exist for you, create a new user by clicking on the add a new user link under the User management section on the left. Give it the name "demo" and password "demo". Click submit. Back in the Administration Pages, click the list users link and the new user "demo" should be listed there now.
- We'll give this user access to modify the Demo Lucene collection that we will be using for this tutorial. If you have given the collection the title "Demo Lucene", then its short name is likely to be "demoluce". You can check this in GLI: Open the Demo Lucene collection, go to Format->General, and look for the collection folder item. Here we assume demoluce.
- In the list users page, at the end of each user entry there are two links: edit and delete. Click edit on the demo user account, and you will be shown more detailed information about the demo user. Add demoluce-collection-editor at the end of the groups line, using a comma to separate group entries, so that the groups field now contains: demo,demoluce-collection-editor. (Note: if your Lucene collection's short name is not demoluce, then replace demoluce with the appropriate name in ***-collection-editor.)
- Click submit. Click the Greenstone home link on the left side and return to the Greenstone home page.
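The groups field edited in the steps above is a comma-separated list, and the per-collection group name is the collection's short name followed by -collection-editor. The Python sketch below restates that convention; both function names are hypothetical, and the admin pages remain the place where the change is actually made.

```python
def collection_editor_group(short_name):
    """Build the per-collection editor group name for a collection's short name."""
    return f"{short_name}-collection-editor"


def add_group(groups_field, new_group):
    """Append new_group to a comma-separated groups field, avoiding duplicates."""
    groups = [g.strip() for g in groups_field.split(",") if g.strip()]
    if new_group not in groups:
        groups.append(new_group)
    return ",".join(groups)


print(add_group("demo", collection_editor_group("demoluce")))
# prints: demo,demoluce-collection-editor
```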
Use the Depositor to do incremental addition
- On the Greenstone library home page, click The Depositor button. You will see a drop-down selection list of all the available collections. Select Demo Lucene from the list and sign in with the demo account.
- The next page asks you to fill in the metadata fields — Title, Organization, Subject, Keyword and Language. These metadata fields are from the Development Library Subset (DLS) metadata set, which is the metadata set used in the Demo Lucene collection. In order to ensure the new document will be displayed in the classifiers, we will next specify these metadata for the new document.
The default metadata fields that would be displayed here for a new collection are the Title, Creator and Description from the Dublin Core Metadata Set. You can customize which metadata fields are required for items added through The Depositor in the Depositor Metadata section on the Format panel in the Greenstone Librarian Interface.
We are going to deposit this file: sample_files → demo_NewFiles → r9006e.htm. Double click r9006e.htm and have a look at its content. Type the following in the Title field:
Selected guidelines for the management of records and archives: a RAMP reader (r9006e)
(Note: you can copy this and the following metadata values across from sample_files → demo_NewFiles → r9006e-metadata.txt.)
In the Organization field, type UNESCO
In the Subject field, type:
Communication, Information and Documentation|Records and Archives Management Programme (RAMP) of UNESCO, Archive Management
In the Keyword field, type:
manage records and archives
Finally in the Language field, type: English
- Click the Select File button. Click the Choose File button and select sample_files → demo_NewFiles → new → r9006e.htm, click the Confirmation button and check the document has been uploaded successfully.
- Click the Deposit Item button and wait for the process to finish. You will see the message "Collection built successfully." if the collection has been built successfully, or error messages if something has gone wrong.
- Click View collection to preview the newly built collection and check that the newly added document is displayed correctly. For example, in the organizations classifier you should find a new bookshelf named UNESCO, which contains the new document.
Batch addition with the Depositor
- The Depositor also supports batch addition of new documents. This is achieved by zipping up the new documents (together with their metadata files) and depositing the zip file. Please note that the collection must have ZIPPlugin in order to process the uploaded zip file; otherwise you'll first need to add ZIPPlugin through the Librarian Interface.
- Go to the Greenstone home page and click The Depositor button. Select Demo Lucene from the list and log in if asked to do so again.
- Leave the metadata fields blank, because the zip file we are adding contains metadata.xml files which specify these metadata values. Click the Select File button, select sample_files → demo_NewFiles → new_files.zip, which contains two new HTML documents along with their associated images and metadata.xml files. Click Confirmation and then the Deposit Item button.
- After the building is finished, click View collection to preview the collection. On the collection's home page, it says the collection now contains 14 documents. Check the titles classifier to see that the new documents Above and beyond and Utilization and construction of pit silos have been added successfully.
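If you want to prepare your own zip file for batch deposit, any zip tool will do. As one option, here is a minimal Python sketch using the standard zipfile module. The function name zip_for_deposit, the folder name new_docs and the placeholder file contents are all hypothetical; the zip file should be written outside the folder being zipped so that it does not include itself.

```python
import zipfile
from pathlib import Path


def zip_for_deposit(source_dir, zip_path):
    """Zip every file under source_dir, keeping paths relative to source_dir."""
    source_dir = Path(source_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                arcname = path.relative_to(source_dir).as_posix()
                archive.write(path, arcname)
                names.append(arcname)
    return names


if __name__ == "__main__":
    import tempfile

    # Build a throwaway folder with placeholder files, then zip it.
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp) / "new_docs"  # hypothetical folder name
        (docs / "doc1").mkdir(parents=True)
        (docs / "doc1" / "doc1.htm").write_text("<html><body>Example</body></html>")
        (docs / "doc1" / "metadata.xml").write_text("placeholder for this document's metadata")
        print(zip_for_deposit(docs, Path(tmp) / "new_docs.zip"))
```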
A major benefit of using The Depositor is that the user can upload documents and metadata remotely, without having to have Greenstone installed at the client end.
The Depositor is a tool for remote data input, allowing you to also deposit items to collections built with the MG or MGPP indexers. The difference is that the MG and MGPP indexers need to rebuild the entire index after adding a new item, while the Lucene indexer incrementally adds the new document to the existing index.
Incrementally building a collection using the command line
To allow you to quickly try out and experiment with our tutorial exercises, we tend to keep the number of sample files small. Every time you rebuild these collections, for simplicity, the default settings used in Greenstone mean that the previous version built is removed in its entirety. We refer to this as a full-rebuild. When building larger collections, this is inefficient.
Greenstone also has the ability to rebuild collections incrementally: this means the previous version of the collection is retained and only the changes detected need to be incorporated. There are, however, quite a few aspects to incremental building to control. This is the focus of this tutorial exercise.
To gain the best level of understanding, this tutorial builds collections using the command line.
- In GLI, create a new collection called Incremental With Manifests and base it on the Greenstone demo collection. The short name of this collection will become incremen, and this will be the name of the collection's folder on the file system.
- Use GLI's Workspace view to navigate to this tutorial's sample files folder, incr_build. It will contain a folder named import. Open this. In GLI's Gather panel, drag and drop the 3 subfolders into your new collection. (You can also carry out this step using a file browser to copy the contents of the incr_build\import sample files folder into collect\incremen\import.) Go to the Design panel and select Search Indexes. Press the Change... button in the top right to change the indexer in use to Lucene.
- Do not build the collection in GLI. We'll be building and rebuilding manually, from the command-line terminal. So close GLI. You can choose to run the Greenstone server at any stage, however.
- In a text editor, open your incremen collection's collect.cfg file located in collect\incremen\etc. Change the OIDtype setting's value to full_filename, which means the identifiers generated and used by Greenstone for this collection's documents will be based on their full filenames (their filename appended to any containing directories relative to the collection's import folder). For any collection that you want to incrementally rebuild, make sure that it was similarly built with the OIDtype set to full_filename. A collection that is built with this setting will allow us to refer to the files by name in the <Filename> elements of any manifest file that we use to incrementally rebuild it. These <Filename> elements will then identify which files are to be indexed if newly added, and which are to be re-indexed, as should happen if a document or its metadata has been edited. (For specifying which files are to be deleted, the document identifier will be used instead of the filename.)
- Since this is the first time we're building our collection, we're going to do a complete build, and we'll use the command line to do so. Open a terminal. To open a terminal in Windows, press the Windows key + R and type cmd in the Run dialog that displays. To open a terminal on a Mac machine, click on menu Go → Utilities → Terminal. Use the terminal to cd into your Greenstone installation folder. For instance, if you have your Greenstone installed on Windows as "Greenstone" within your account folder at C:\Users\me, then type the following in your terminal and hit Enter:
cd C:\Users\me\Greenstone
On Linux or Macs, the general command is the same, but the installed location would be different and the slashes go the other way. For example, if installed in /Users/me/Greenstone3, you'd type the following and hit Enter:
cd /Users/me/Greenstone3
Now you're ready to set up the Greenstone environment in your terminal. On Windows, type the following into your terminal and hit Enter again:
setup.bat
On Linux and Mac:
source ./setup.bash
When using a terminal, you'll need to hit Enter after each command in order to execute the command you just finished typing. We won't repeat this instruction any more. Just remember to hit Enter after every complete command entered into a terminal.
With the terminal now operating within your Greenstone installation folder, and with the Greenstone environment now set up and ready, type the following command to do a complete build of your new collection. Although the command contains the word "rebuild", since this is the first time the collection is being built, it will just build it.
perl -S full-rebuild.pl incremen
Preview the collection. If the Greenstone server is not running (as would happen if you had closed GLI and didn't start the standalone Greenstone Server Interface application), then run it from the Start Menu on Windows now. You could also run the Greenstone server by running the gs2-server.bat script in the terminal if using Windows, or by running the gs2-server.sh script from a Linux/Mac terminal.
When previewing, try searching for "kouprey" and you should get results, as this term occurs in the document b18ase.
For the rest of this tutorial exercise, leave open the terminal in which you have set up your Greenstone environment. We'll be using it throughout.
Incrementally adding some additional new documents to a collection
- If you want, you can use GLI to drag and drop the fb33fe, fb34fe and wb34te folders, located in the incr_build/more-files subfolder of sample files, into your collection.
Alternatively, you can use a File Browser to copy the folders fb33fe, fb34fe and wb34te, located in the incr_build/more-files sample files subfolder, into your collection's import folder at collect\incremen\import.
The above step will only have gathered 3 new documents into your collection. However, since the changes have not been built, previewing at this stage will make no difference.
- We want to build just the newly added documents into the collection if possible, instead of rebuilding everything. This time, instead of running full-rebuild, we'll be running the incremental-import and incremental-buildcol scripts to perform the two phases of a Greenstone build operation incrementally, these being the import and buildcol phases. Incremental building allows us to (re)build just what is necessary, rather than everything.
Since we know exactly which files have been added and thus which files need to be built, we can write a manifest file specifying this. The manifest files used by the Greenstone incremental building process are just XML files that can be created and edited in a plain text editor, and which indicate which files need to be (re)processed by a Greenstone incremental build operation.
We've already prepared the manifest files we'll be using in this tutorial exercise for you. Use a File Browser to copy the manifests subfolder from the incr_build sample files into your incremen collection folder that's located inside your Greenstone installation directory (at collect\incremen).
In a text editor, open the add-new-files.xml manifest file found in the newly copied manifests subfolder. Inspect the contents of this manifest file. It should contain:
<?xml version="1.0" encoding="UTF-8"?>
<Manifest>
<Index>
<Filename>fb33fe/fb33fe.htm</Filename>
<Filename>fb34fe/fb34fe.htm</Filename>
<Filename>wb34te/wb34te.htm</Filename>
</Index>
</Manifest>
The above lists the 3 main documents to be added/indexed by Greenstone (hence the keyword <Index>). Since these documents are located inside their own subfolders when copied into the import folder, the manifest file also indicates the relative folder structure of these documents, e.g. "fb33fe/fb33fe.htm" shows that the fb33fe.htm HTML document is located in the folder fb33fe. Only the main documents to be added are listed, not the associated image files also found at the same folder level, as Greenstone will track down all the image files referred to by the main html documents to be indexed and will process them as files associated with the html.
- Return to the terminal you had left open. We can finally run the commands for the incremental build operation.
Use the terminal to first run the incremental import stage:
perl -S incremental-import.pl -manifest manifests/add-new-files.xml incremen
Once that finishes running, start off the incremental buildcol stage of the build process:
perl -S incremental-buildcol.pl -activate incremen
The incremental import command specifies the manifest file that Greenstone is to consult in order to work out which files should be processed and how (Indexed, Deleted or Reindexed). By the buildcol stage, those files are ready for further incremental processing by the buildcol script. The -activate flag to the incremental buildcol script tells Greenstone to (re-)activate the updated collection if the Greenstone server is running.
- Preview the collection either by running the Greenstone Server Interface application, if it isn't already running, or by starting the Greenstone server from the command line with the command:
gsicontrol.bat web-start
(To stop the Greenstone server at any point, use the command gsicontrol.bat web-stop. To stop-and-start it, you'd use gsicontrol.bat web-restart. On Linux/Mac, use the equivalent script gsicontrol.sh for each command, e.g. ./gsicontrol.sh web-start.)
When the server is running, preview your library home page, located by default at http://localhost:8282/greenstone/cgi-bin/library.cgi. Visit the Incremental with Manifests collection and click on the Titles browser. There should be 3 additional documents now, and you should be able to search for terms that occur in them. For example, searching for "groundnuts" should return results, since this term occurs in the newly added document fb33fe.
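A manifest like add-new-files.xml can also be generated rather than typed by hand, which helps when many files are added at once. The following Python sketch writes the same <Manifest>, <Index> and <Filename> layout shown in this exercise. The function name index_manifest is hypothetical, and the folder/folder.htm naming is simply how these sample files happen to be arranged.

```python
from xml.sax.saxutils import escape


def index_manifest(filenames):
    """Return manifest text asking Greenstone to index the given files.

    Filenames are relative to the collection's import folder and use
    forward slashes, as in add-new-files.xml.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<Manifest>", "<Index>"]
    lines += [f"<Filename>{escape(name)}</Filename>" for name in filenames]
    lines += ["</Index>", "</Manifest>"]
    return "\n".join(lines) + "\n"


new_folders = ["fb33fe", "fb34fe", "wb34te"]
# In these sample files each folder's main document is <folder>/<folder>.htm.
print(index_manifest([f"{name}/{name}.htm" for name in new_folders]), end="")
```

Only the main documents are listed, for the reason given above: Greenstone finds the associated image files itself.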
Incrementally deleting some documents from a collection
- Inspect the delete-some-files.xml manifest file (located in your incremen collection folder's manifests subfolder). It contains:
<?xml version="1.0" encoding="UTF-8"?>
<Manifest>
<Delete>
<OID>b18ase-b18ase_htm</OID>
<OID>fb33fe-fb33fe_htm</OID>
</Delete>
</Manifest>
As per the above manifest file, the operation to be performed by an incremental build is a <Delete> operation on two documents. For the delete operation, the documents are not indicated by the <Filename> XML element, but by the <OID> element which specifies the object identifier. We need to use the OID here because we're telling Greenstone precisely what the identifiers of the documents are that we wish to have removed from our collection. The identifiers of every built document in a Greenstone collection are specified in the Identifier field of the document's doc.xml file located in the collection's archives folder. The doc.xml file is the Greenstone-specific XML format in which Greenstone stores documents already imported.
For instance, to find the identifier of the b18ase.htm document in your built collection, open up collect\incremen\archives\b18ase-b.dir\doc.xml in a text editor. Then scroll down, looking for a piece of Greenstone extracted metadata labelled Identifier, which is the OID for this document:
<Metadata name="Identifier">b18ase-b18ase_htm</Metadata>
The above value for the document identifier is what's used in the delete-some-files.xml manifest file to refer to this document. This document is one of two that are to be deleted as per the manifest file. Make sure to close the doc.xml file if you have it open.
- So then, let's first physically remove these two documents from our collection, so that the contents of the import folder match what the manifest specifies: use a file browser to remove the folders b18ase and fb33fe from the collection's import folder.
- Finally, let's incrementally rebuild the collection, specifying the manifest file that Greenstone should use this time to carry out the incremental build operation. As before, there are two steps.
First run the modified incremental import command:
perl -S incremental-import.pl -manifest manifests/delete-some-files.xml incremen
When that has finished running, run the same incremental buildcol command as before (it doesn't change):
perl -S incremental-buildcol.pl -activate incremen
- When it has finished, preview the collection once more and check that the 2 documents have been removed. They should not turn up in the browse classifiers, nor in search results. For example, search for "kouprey" again. This time no documents should match the query, since the term occurs only in document b18ase, which has now been removed.
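Looking up the Identifier in doc.xml by eye, as described in this exercise, can also be done programmatically. Here is a minimal Python sketch that returns the text of the first Metadata element whose name attribute is Identifier. The function name find_identifier is hypothetical, and the sample string is a made-up, heavily trimmed stand-in for a real doc.xml, whose surrounding structure may differ.

```python
import xml.etree.ElementTree as ET


def find_identifier(doc_xml):
    """Return the text of the first <Metadata name="Identifier"> element, or None."""
    root = ET.fromstring(doc_xml)
    for meta in root.iter("Metadata"):
        if meta.get("name") == "Identifier":
            return meta.text
    return None


# Made-up, heavily trimmed stand-in for a doc.xml file; a real one holds far more.
sample = """<Archive>
  <Section>
    <Description>
      <Metadata name="Identifier">b18ase-b18ase_htm</Metadata>
    </Description>
  </Section>
</Archive>"""
print(find_identifier(sample))
# prints: b18ase-b18ase_htm
```

For a real file, pass in the file's contents read as bytes, for example from collect\incremen\archives\b18ase-b.dir\doc.xml.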
Editing a document's text and metadata, and then incrementally rebuilding the collection
- Inspect the mod-text-and-meta.xml manifest file (located in incremen/manifests) in a text editor. It should contain:
<?xml version="1.0" encoding="UTF-8"?>
<Manifest>
<Reindex>
<Filename>fb34fe/fb34fe.htm</Filename>
<Filename>b20cre/b20cre.htm</Filename>
</Reindex>
</Manifest>
Note the <Reindex> used this time. It indicates which documents that are already in the collection are to be re-processed when the collection is incrementally rebuilt as per this manifest file.
- Open up the file fb34fe/fb34fe.htm of your incremen collection's import folder in a text editor and add, remove or change some text nested anywhere in between the HTML tags within the <BODY> tag. Be careful not to partially modify HTML element names or HTML entities (entities start with an ampersand, &, and end with a semi-colon, ;), as doing so can make your text contents invalid HTML.
Save and close the edited file.
- Start up GLI. Open the incremen collection and go to the Enrich panel. Add or modify dc.Title metadata for the b20cre document. Do not accidentally build the collection using GLI.
- Quit GLI.
In the above two steps, we've modified the text contents of document fb34fe and the metadata associated with b20cre. Our mod-text-and-meta.xml manifest file already indicates that these two files are to be reindexed, so we can go ahead and incrementally rebuild the collection with this manifest file.
- Run the incremental rebuild operation to re-process just these two files. To do so, pass the mod-text-and-meta.xml manifest file this time.
First run:
perl -S incremental-import.pl -manifest manifests/mod-text-and-meta.xml incremen
Followed by:
perl -S incremental-buildcol.pl -activate incremen
- Preview the collection once more. Check that the 2 documents contain your edits: try searching for any additional words you added. Also check that the dc.Title metadata you modified can now be searched and appears as the title for the b20cre document in the Titles browsing classifier.
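Each exercise so far has run the same pair of commands, with only the manifest changing. If you repeat this often, the pair could be wrapped in a small script. The Python sketch below is one way to do that. The function names are hypothetical, and it assumes it is started from a terminal in which the Greenstone environment has already been set up and whose current directory lets the relative manifest path resolve, just as when the commands are typed by hand.

```python
import subprocess


def incremental_commands(collection, manifest=None):
    """Return the import and buildcol command lines as argument lists."""
    import_cmd = ["perl", "-S", "incremental-import.pl"]
    if manifest is not None:
        import_cmd += ["-manifest", manifest]
    import_cmd.append(collection)
    buildcol_cmd = ["perl", "-S", "incremental-buildcol.pl", "-activate", collection]
    return [import_cmd, buildcol_cmd]


def run_incremental_build(collection, manifest=None):
    """Run import, then buildcol; stop at the first command that fails."""
    for command in incremental_commands(collection, manifest):
        subprocess.run(command, check=True)


# Show the commands without running them.
for command in incremental_commands("incremen", "manifests/mod-text-and-meta.xml"):
    print(" ".join(command))
```

Calling run_incremental_build("incremen", "manifests/mod-text-and-meta.xml") would then execute both stages in order.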
In this tutorial, we looked at cutting down the amount of time spent on rebuilding a collection by manually controlling the rebuild operation so that it processes only what has changed. We do so by means of a manifest that specifies exactly which files need to be rebuilt and how (whether they need to be Indexed, Deleted or Reindexed). Greenstone also has an automatic incremental rebuild feature, sparing you the need to specify a manifest file in the import phase. Omitting the manifest argument in the above exercises activates this behaviour. However, this is typically slower, because Greenstone then needs to scan the entire import folder and compare it with the information in the archives folder to determine what has changed.
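To see why automatic detection costs more than a manifest, consider what any such comparison has to do: look at every file, not just the changed ones. The Python sketch below illustrates the general idea only and is not Greenstone's actual mechanism. The function name classify_changes and the snapshot dictionaries are hypothetical, and it reports file paths throughout, whereas a real delete manifest uses OIDs.

```python
def classify_changes(previous, current):
    """Compare two {relative_path: fingerprint} snapshots of an import folder.

    The fingerprint can be anything that changes when the file does, such
    as a modification time or a content hash.
    """
    added = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    modified = sorted(
        path for path in current
        if path in previous and current[path] != previous[path]
    )
    return {"Index": added, "Reindex": modified, "Delete": deleted}


before = {"b18ase/b18ase.htm": 1, "fb34fe/fb34fe.htm": 1}
after = {"fb34fe/fb34fe.htm": 2, "wb34te/wb34te.htm": 1}
print(classify_changes(before, after))
```

Both snapshots cover the whole folder, so the work grows with the size of the collection rather than with the size of the change.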
Now repeat all the above exercises in the same sequence, but with a new collection called autoincr, also based on the Demo collection. This time, don't pass in the manifest file as an argument to the incremental-import.pl script. After each incremental build, preview your autoincr collection to check that the Browsing classifiers contain the expected documents and that searching returns the expected results.
Incrementally indexing automatically
Just as there is the command full-rebuild.pl to completely build a collection from scratch, there is also the command incremental-rebuild.pl. The final exercise you have just completed could equally have been achieved by running:
perl -S incremental-rebuild.pl autoincr
For every collection, the import phase can be run incrementally (either using a manifest file or automatically). However, whether the buildcol phase can be incremental depends on the indexer in use. Lucene and Solr indexers support incremental indexing, but the MG and MGPP indexers do not. A warning is issued if you attempt to run the buildcol phase incrementally when the chosen indexer does not support this.
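The indexer rule above can be folded into a script that picks which rebuild command to run. The Python sketch below is one possible policy, not something Greenstone prescribes: it falls back to full-rebuild.pl for MG and MGPP, whereas you could equally run the import phase incrementally and accept the warning at the buildcol phase. The function name rebuild_command is hypothetical.

```python
# Indexer names as given in the paragraph above, lower-cased for comparison.
INCREMENTAL_INDEXERS = {"lucene", "solr"}
NON_INCREMENTAL_INDEXERS = {"mg", "mgpp"}


def rebuild_command(collection, indexer):
    """Pick incremental-rebuild.pl or full-rebuild.pl according to the indexer."""
    name = indexer.lower()
    if name in INCREMENTAL_INDEXERS:
        script = "incremental-rebuild.pl"
    elif name in NON_INCREMENTAL_INDEXERS:
        script = "full-rebuild.pl"
    else:
        raise ValueError(f"unknown indexer: {indexer}")
    return ["perl", "-S", script, collection]


print(" ".join(rebuild_command("autoincr", "Lucene")))
# prints: perl -S incremental-rebuild.pl autoincr
```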