Greenstone tutorial exercise
Advanced scanned image collection
In this exercise we build upon the collection created in the Scanned image collection exercise. We add a new newspaper by creating an item file for it, add a new newspaper using the extended XML item file format, and modify the formatting.
Adding another newspaper to the collection
Another newspaper has been scanned and OCRed, but has no item file. We will add this newspaper into the collection, and create an item file for it.
- In the Librarian Interface, open up the Paged Image collection that was created in exercise Scanned image collection if it is not already open (File → Open...).
- In the Gather panel, add the folder sample_files → niupepa → new_papers → 12 to your collection. Inside the 12 folder you can see that there are 4 images and 4 text files.
- Create an item file for the new newspaper. Have a look at an existing item file to see the format. Start up a text editor (e.g. WordPad) and open a new document. Add some metadata. The Title for this newspaper is "Te Haeata 1859-1862". The Volume is 3, Number is 6, and the Date is "18610902". (Greenstone's date format is yyyymmdd.) Metadata must be added in the form:
<Metadata name>Metadata value
For this document, the metadata looks like:
<Title>Te Haeata 1859-1862
<Volume>3
<Number>6
<Date>18610902
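- For each page, add a line in the file in the following format:
pagenum:imagefile:textfile
For example, the first page entry would consist of the page number 1, the name of the first image file, and the name of its text file, separated by colons.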
Note that if there is no text file, you can leave that space blank. You need to add a line for each page in the document. Make sure you increment the page number as well as the image number for each line. (The full text for this file can be copied from sample_files → niupepa → formats → 12_3_6.item.)
- Save the file using Filename 12_3_6.item, and save as a plain text document. (If you are using Windows, make sure the file doesn't accidentally end up getting saved as 12_3_6.item.txt.) Back in the Gather panel of the Librarian Interface, locate the new file in the Workspace tree, and drag it into the collection, adding it into the 12 folder.
Build the collection and preview. Check that your new document has been added.
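If you prefer to generate the item file rather than type it by hand, the steps above can be sketched in Python. This is only an illustration, not part of Greenstone: the function name build_item_file is hypothetical, the page lines are assumed to take the colon-separated form of page number, image file, text file, and the image and text file names are placeholders that you must replace with the names actually found in the 12 folder.

```python
from datetime import datetime


def build_item_file(metadata, pages):
    """Return the text of a simple (non-XML) item file.

    metadata: dict mapping metadata name to value (document level).
    pages: list of (image_file, text_file) tuples in page order;
           text_file may be "" when a page has no text file.
    """
    # Greenstone dates are yyyymmdd: require 8 digits and a real calendar date.
    date = metadata.get("Date")
    if date is not None:
        if len(date) != 8 or not date.isdigit():
            raise ValueError(f"Date must be yyyymmdd, got {date!r}")
        datetime.strptime(date, "%Y%m%d")  # raises ValueError for e.g. month 13

    # One "<Metadata name>Metadata value" line per metadata item.
    lines = [f"<{name}>{value}" for name, value in metadata.items()]
    # One line per page; the page number starts at 1 and increases by one.
    for pagenum, (image_file, text_file) in enumerate(pages, start=1):
        lines.append(f"{pagenum}:{image_file}:{text_file}")
    return "\n".join(lines) + "\n"


metadata = {
    "Title": "Te Haeata 1859-1862",
    "Volume": "3",
    "Number": "6",
    "Date": "18610902",
}
# Placeholder file names (assumption): use the real names from the 12 folder.
pages = [(f"page_{n}.gif", f"page_{n}.txt") for n in range(1, 5)]

item_text = build_item_file(metadata, pages)
print(item_text)
```

To save the result, write item_text to a plain text file named 12_3_6.item inside the 12 folder, for example with open("12_3_6.item", "w", encoding="utf-8"). Writing the file from a script also avoids the risk of an editor adding a .txt extension.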
XML based item file
There are two styles of item files. The first, which was used in the previous section, uses a simple text based format, and consists of a list of metadata for the document, and a list of pages. This format allows specification of document level metadata, and a single list of pages.
The second style is an extended format, and uses XML. It allows a hierarchy of pages, and metadata specification at the page level as well as at the document level. In this section, we add in two newspapers which use XML-based item files.
- In the Gather panel, add the folder sample_files → niupepa → new_papers → xml (you need to add the xml folder, not the 23 folder) to your collection.
- Open up the file xml → 23 → 23__2.item and have a look at the XML. This is Number 2 of the newspaper titled Matariki 1881. The contents of this document have been grouped into two sections: Supplementary Material, which contains an Abstract, and Newspaper Pages, which contains the page images (and OCR text).
Build and preview the collection. Check that the XML-style items have been included.
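To get a feel for how an XML item file nests sections, pages, and metadata, it can help to print the file as an indented outline. The following Python sketch does this for any XML document, so it does not depend on the exact item file schema. The embedded sample is hypothetical: its element names, attribute names, and file names are assumptions made for illustration, so look at 23__2.item for the real ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample (assumption): element, attribute, and file names here
# are illustrative only and may differ from the real 23__2.item file.
SAMPLE = """\
<PagedDocument>
  <Metadata name="Title">Matariki 1881</Metadata>
  <PageGroup>
    <Metadata name="Title">Supplementary Material</Metadata>
    <Page pagenum="1" txtfile="abstract.txt"/>
  </PageGroup>
  <PageGroup>
    <Metadata name="Title">Newspaper Pages</Metadata>
    <Page pagenum="1" imgfile="page_1.gif" txtfile="page_1.txt"/>
  </PageGroup>
</PagedDocument>
"""


def outline(element, depth=0):
    """Return indented lines describing an element and its descendants."""
    text = (element.text or "").strip()
    attrs = " ".join(f'{key}="{value}"' for key, value in element.attrib.items())
    # Join the tag, its attributes, and its text, skipping empty parts.
    label = " ".join(part for part in (element.tag, attrs, text) if part)
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
outline_lines = outline(root)
print("\n".join(outline_lines))
```

To inspect the real file instead of the sample, replace ET.fromstring(SAMPLE) with ET.parse("23__2.item").getroot(), adjusting the path to wherever the xml folder was copied.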