Greenstone tutorial exercise
A large collection of HTML files—Tudor
You will need the files in the sample_files → tudor folder.
- Invoke the Greenstone Librarian Interface (from the Windows Start menu) and start a new collection called tudor (use the File menu), based on the default -- New Collection --.
- In the Gather panel, open the tudor folder in sample_files.
- Drag englishhistory.net from the left-hand side to the right to include it in your tudor collection. (This material is from Marilee Hanson's Tudor England Collection at https://englishhistory.net/tudor/, distributed with her permission.)
- Switch to the Create panel and click <Build Collection>.
- When building has finished, preview the collection.
Extracting more metadata from the HTML
- The browsing facilities in this collection (titles and filenames) are based entirely on extracted metadata. Switch to the Enrich panel in the Librarian Interface and examine the metadata that has been extracted for some of the files.
- Many HTML documents contain metadata in <meta> tags in the <head> of the page. Open the englishhistory.net → tudor → monarchs → boleyn.html file by navigating to it in the tree on the left-hand side and double-clicking it. This will open it in a web browser. View the HTML source of the page (View → Source in Internet Explorer, Tools → Web Developer → Page Source in Mozilla, or Ctrl+U in Microsoft Edge). You will notice that this page has page_topic, content and author metadata.
- By default, HTMLPlugin only looks for Title metadata. Configure the plugin so that it looks for the other metadata too. Switch to the Design panel and select the Document Plugins section. Select the plugin HTMLPlugin line and click <Configure Plugin...>. A popup window appears. Switch on the metadata_fields option, and set the value to
- Switch to the Create panel and rebuild the collection. Go back to the Enrich panel and look at the extracted metadata for some of the HTML files in englishhistory.net → tudor → monarchs. The new metadata should now be visible.
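To see what name/content pairs in <meta> tags look like to a program, here is a minimal Python sketch that reads them from an HTML string. It is not how HTMLPlugin itself works; it only illustrates the idea of extracting metadata from the <head> of a page. The sample page and its values are invented for illustration, and the names MetaTagCollector and extract_meta are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


# Hypothetical page; the metadata values are made up for this example.
SAMPLE_HTML = """
<html>
<head>
<title>Example page</title>
<meta name="page_topic" content="example topic">
<meta name="content" content="example description">
<meta name="author" content="Example Author">
</head>
<body></body>
</html>
"""


def extract_meta(html_text):
    """Return a dict mapping each meta name to its content value."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.metadata


if __name__ == "__main__":
    for name, value in extract_meta(SAMPLE_HTML).items():
        print(f"{name}: {value}")
```

Running the same function on the text of a real page, such as boleyn.html, would list whatever name/content pairs that page declares.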
Looking at different views of the files in the Gather and Enrich panels
- Switch to the Gather panel and on the right-hand side open englishhistory.net → tudor.
- Change the Show Files menu for the right-hand side from All Files to HTM & HTML. Notice the files displayed above are filtered accordingly, to show only files of this type.
- Change the Show Files menu to Images. Again, the files shown above alter.
- Now set the Show Files menu back to All Files; otherwise you may get confused later. Remember, if the Gather or Enrich panels do not seem to be showing all your files, this could be the problem.
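The effect of the Show Files menu can be pictured as a filter on file extensions. The Python sketch below is only an illustration of that idea, not the Librarian Interface's own code: the extension lists, especially for Images, are assumptions, and the FILTERS table and show_files function are hypothetical names.

```python
from pathlib import PurePath

# Assumed extension groups; Greenstone's actual filter definitions may differ.
FILTERS = {
    "All Files": None,  # None means no filtering at all
    "HTM & HTML": {".htm", ".html"},
    "Images": {".gif", ".jpg", ".jpeg", ".png"},
}


def show_files(filenames, setting):
    """Return the filenames that the given Show Files setting would display."""
    extensions = FILTERS[setting]
    if extensions is None:
        return list(filenames)
    # Compare extensions case-insensitively, so INDEX.HTM also matches.
    return [f for f in filenames if PurePath(f).suffix.lower() in extensions]


if __name__ == "__main__":
    # Hypothetical folder contents.
    names = ["boleyn.html", "index.htm", "portrait.jpg", "notes.txt"]
    for setting in FILTERS:
        print(setting, "->", show_files(names, setting))
```

With a filter other than All Files selected, files such as notes.txt are still in the folder; they are simply not displayed, which is why a forgotten filter can make files appear to be missing.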