Greenstone tutorial exercise
Downloading over OAI
GLI can serve as an OAI client application: it can connect to a remote OAI server, retrieve metadata, and even download documents. The tutorial Open Archives Initiative (OAI) collection did not obtain its data from an external OAI-PMH server. That missing step is accomplished either by running a command-line program or by using the Download panel in the Librarian Interface, and this exercise explains both methods. In the previous exercise, we set up the Greenstone server to serve your Greenstone 3 collections over OAI. In this exercise, we will connect to that OAI server and download the OAI metadata for the Simple image collection, along with its documents. The principle is the same if you wish to connect to other OAI servers.
Downloading using the Librarian Interface
- Quit any running Greenstone applications. Launch GLI. This should launch the Greenstone server as well, so that the OAI server is also up and running.
- In GLI, go to the Download panel. To the left, choose OAI as the Download Setting.
- On the right, set the Source URL field to the URL of your Greenstone OAI server. It would be of the form http://<hostname:portnumber>/greenstone3/oaiserver
(If you set up your Greenstone 3 server to operate over https, then adjust the above URL to use https as its prefix and the associated https port number instead.) Make sure that you can access this URL from your browser.
Visit the library home page, as this will load the Greenstone collections so that any associated files, like images or PDF documents, become accessible for download. (Without visiting the library home page, the collections would not be loaded, and the images from the Simple image collection that we will be downloading below alongside the OAI files would not be available for download.)
- If the server is not running on localhost and your computer is behind a firewall or proxy server, you may need to edit the proxy settings in the Librarian Interface. Click the <Configure Proxy...> button. Switch on the Use proxy connection? checkbox. Enter the proxy server address and port number in the HTTP Proxy Host: and Port: boxes. Further, if you set up your Greenstone to run over https (or more generally, if you will be downloading from https URLs), tick the box labelled "No certificate checking for HTTPS downloads". Click <OK> to get back to the OAI section of the Download panel.
- If at this stage you press the <Server Information> button (in the central row of buttons), a dialog will pop up with basic details about the OAI server. At the end, it will display the names of the sets available via that OAI server. A setSpec and a setName property will be defined for each available set. In our example, backdrop (the Simple image collection) would be listed as one of the setNames, with its setSpec also being backdrop. Press <Close> to close the Server Information dialog.
- Tick the Metadata prefix checkbox as well as the Restrict to set checkbox. For the latter, type the setSpec value, backdrop. Then tick Get document. Also tick Only include file types and make sure that jpg appears in its list of comma-separated values.
Next, tick Max records and set it to 10. There are only 9 images in the collection, so we don't really need to set the Max records value here, but it is a helpful feature when downloading from a larger OAI server.
- Finally, click <Download>, located beside the Server Information button. If you have set proxy information in Preferences..., a popup will ask for your user name and password. Once the download has started, a progress bar appears in the lower half of the panel and reports on how the download is progressing. GLI will download the OAI metadata and, because we have ticked the Get document checkbox, it will also retrieve the actual documents, up to the limit of 10 records that we set.
- After a while, the download will finish. Change to the Gather panel and, on the left-hand side, open up the Downloaded Files folder. This is where Greenstone stores files you downloaded using the Download panel. In this case, it will contain a folder in which the OAI metadata files and images that you've just downloaded from your own Greenstone OAI server are stored. These files can then be added to a collection, as will be covered later in this tutorial.
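To make the Server Information and Download steps less of a black box, here is a minimal Python sketch of the kind of requests an OAI-PMH client sends. It is not GLI's own code. The verbs (Identify, ListSets, ListRecords) and the setSpec and setName elements come from the OAI-PMH protocol; the base URL with localhost:8383, the oai_dc metadata prefix, the helper names, and the hand-written sample response are assumptions for illustration only.

```python
"""Illustrative OAI-PMH client helpers; not GLI's implementation."""
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical base URL; substitute your own host and port.
BASE_URL = "http://localhost:8383/greenstone3/oaiserver"


def oai_request_url(base_url, verb, **args):
    """Build a request URL such as <base_url>?verb=ListSets."""
    params = {"verb": verb}
    params.update(args)
    return base_url + "?" + urlencode(params)


def parse_sets(list_sets_xml):
    """Return (setSpec, setName) pairs found in a ListSets response."""
    root = ET.fromstring(list_sets_xml)
    pairs = []
    for set_el in root.iter(OAI_NS + "set"):
        spec = set_el.findtext(OAI_NS + "setSpec")
        name = set_el.findtext(OAI_NS + "setName")
        pairs.append((spec, name))
    return pairs


# Hand-written sample, trimmed to the elements parsed above.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>backdrop</setSpec><setName>backdrop</setName></set>
  </ListSets>
</OAI-PMH>"""

if __name__ == "__main__":
    # Roughly what Server Information asks for.
    print(oai_request_url(BASE_URL, "Identify"))
    print(oai_request_url(BASE_URL, "ListSets"))
    # Roughly what a download restricted to the backdrop set asks for.
    print(oai_request_url(BASE_URL, "ListRecords",
                          metadataPrefix="oai_dc", set="backdrop"))
    print(parse_sets(SAMPLE_LIST_SETS))
```

Pasting one of the printed URLs into a browser is a quick way to confirm that the OAI server is reachable before starting a download in GLI.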
Downloading using the command line
For command-line downloading to work, your computer must have a direct connection to the Internet; being behind a firewall may interfere with the ability to download the information. If you are behind a firewall, you will need to use the Librarian Interface for downloading.
- Close the Librarian Interface.
- Start up the Greenstone server application.
Visit the library home page to load the Greenstone collections, including any associated files, as explained above.
- If you're on Windows, open a DOS window to access the command-line prompt. This facility should be located somewhere within your Start → Programs menu, but details vary between different Windows systems. If you cannot locate it, select Start → Run, enter cmd in the popup window that appears and hit Enter. If you're on Linux or Mac, open a terminal.
- Before you start, you must set up your Greenstone environment in the terminal. In the DOS window or terminal, move to the home directory where you installed Greenstone. This is accomplished by something like:
cd C:\Program Files\Greenstone
Then run the Greenstone setup script to set up the ability to run Greenstone command-line programs. On Linux/Mac, you would run source gs3-setup.sh.
- If you set up your Greenstone to run over https or intend to use the command line to download from any URLs that begin with https instead of http, then you will further need to edit your Greenstone 3 installation's gs2build/bin/linux/wgetrc file as follows. Open the file in a text editor and change the line that says:
#check_certificate = off
to
check_certificate = off
Removing the hash sign at the start of this line turns it from a comment into an active setting. Save the edited file and close it. With this change, downloading from https URLs will also succeed when download commands are run from the command line.
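If you would rather script this edit than make it by hand, the change amounts to a one-line text substitution. The Python sketch below shows the idea on an in-memory string; the function name and the sample wgetrc fragment are hypothetical, and you would still need to read the text from, and write it back to, your own installation's wgetrc file.

```python
import re

# Matches the commented-out setting on a line of its own.
# [ \t]* is used instead of \s* so the match never swallows a newline.
_COMMENTED = re.compile(
    r"^#[ \t]*(check_certificate[ \t]*=[ \t]*off)[ \t]*$", re.MULTILINE
)


def enable_no_cert_check(wgetrc_text):
    """Uncomment '#check_certificate = off', leaving all other lines alone."""
    return _COMMENTED.sub(r"\1", wgetrc_text)


# Hypothetical fragment standing in for the real file's contents.
SAMPLE = "# sample wgetrc fragment\n#check_certificate = off\ntries = 3\n"

if __name__ == "__main__":
    print(enable_no_cert_check(SAMPLE))
```

Running the function on text where the line is already active leaves it unchanged, so applying it twice does no harm.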
GLI uses a perl script, downloadfrom.pl, to do the downloading. This can be run on the command line, outside of GLI.
The downloadfrom.pl script can download using several different protocols. These are specified using the -download_mode option. To see the available options for download mode, run perl -S downloadfrom.pl -h. This shows that the current options are: Web, MediaWiki, OAI, Z3950, SRW. For OAI downloading, use -download_mode OAI.
To see the options for downloading using the OAI mode, you can run perl -S downloadinfo.pl OAIDownload. The options are the same as you can see in the GLI OAI download panel.
- We'll use the set and max_records OAI Download options to limit the number of OAI records downloaded from the backdrop collection at your Greenstone's OAI server again:
perl -S downloadfrom.pl -download_mode OAI -url http://<hostname:portnumber>/greenstone3/oaiserver -set backdrop -max_records 15
The OAI records will be downloaded into the folder where the downloadfrom.pl script is run from. To change this, use the -cache_dir full-path-to-folder option and set its value to the full path of the destination folder you choose. (If you wanted to download the documents along with the records, then you would additionally pass in the -get_doc flag to the above command as well as the -get_doc_exts flag followed by a comma-separated list of file extensions like "jpg,pdf".)
perl -S downloadfrom.pl -download_mode OAI -url http://<hostname:portnumber>/greenstone3/oaiserver -set backdrop -max_records 15 -get_doc -get_doc_exts "jpg,pdf" -cache_dir "<type-full-path-to-a-download-folder>"
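When the same download needs to be repeated for several sets or servers, it can help to assemble the command programmatically. The Python sketch below builds the argument list for the commands shown above, using only the options this exercise mentions; the function and parameter names are hypothetical, and the localhost:8383 URL and /tmp/oai-download folder are placeholders for your own values.

```python
def build_oai_download_command(url, set_spec=None, max_records=None,
                               doc_exts=None, cache_dir=None):
    """Return the downloadfrom.pl command as a list of arguments."""
    cmd = ["perl", "-S", "downloadfrom.pl",
           "-download_mode", "OAI", "-url", url]
    if set_spec:
        cmd += ["-set", set_spec]
    if max_records is not None:
        cmd += ["-max_records", str(max_records)]
    if doc_exts:
        # Ask for the documents too, restricted to these file extensions.
        cmd += ["-get_doc", "-get_doc_exts", ",".join(doc_exts)]
    if cache_dir:
        cmd += ["-cache_dir", cache_dir]
    return cmd


if __name__ == "__main__":
    # Placeholder URL and folder; substitute your own values.
    command = build_oai_download_command(
        "http://localhost:8383/greenstone3/oaiserver",
        set_spec="backdrop",
        max_records=15,
        doc_exts=["jpg", "pdf"],
        cache_dir="/tmp/oai-download",
    )
    print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from a terminal in which the Greenstone environment has already been set up as described earlier.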
You can import the downloaded documents into a new Greenstone collection and build them in the usual manner.
Building the downloaded documents in GLI
- If you used GLI to download documents over OAI, as seen in the first part of the tutorial, you can find the downloaded items in the Downloaded Files folder in the filesystem view on the left side of the Gather panel. If you used the command line to download documents, the downloaded files will be stored wherever you ran the downloadfrom.pl script from.
- Open GLI, locate the files you downloaded over OAI and drag and drop these into a new Greenstone collection called OAI Collection.
- Go to the Design panel, and configure the OAIPlugin by ticking its no_cover_image option. Generally, Greenstone will look for any image that has the same name as the primary document being processed and will associate that image with the document as its cover image. Because the OAI files and the image documents downloaded over OAI have matching names, each image would get treated as the cover image for its associated OAI file. We don't want that behaviour here, so we turn on the no_cover_image option. This allows the OAIPlugin to attach the metadata of each OAI file to its associated image (treating the image as the primary document, instead of as a cover image), just as intended.
Note that this time, we don't configure the OAIPlugin's document_field option to ex.dc.Identifier, because the OAI files that have been downloaded over OAI have the associated image's document identifier stored in the (ex.)gi.Sourcedoc metadata field. You can see this if you open up any of the downloaded OAI files in a text editor. The (ex.)gi.Sourcedoc field is consulted by default when the Greenstone building process tries to identify which source document to attach the metadata in each OAI file to.
- Switch to the Create panel and press the build button. During this stage, the OAIPlugin will extract the metadata in the OAI files and attach it to the associated jpg files of the downloaded backdrop collection. Once the collection has been built, you can see this in the Enrich panel: click on an OAI file and you will find that no metadata is set for it. However, if you then click on a jpg file and scroll down, there will be metadata names that start with ex.dc. This refers to Greenstone-extracted Dublin Core metadata. ex.dc.Description and ex.dc.Title will be set to the values you had assigned the images in the tutorial A simple image collection. Greenstone will also have added ex.dc.Identifier, which is the source URL for the image.
- If you wish, you can now set up this collection in a manner similar to how the backdrop collection was set up in A simple image collection. Don't forget to copy in any specific format statements, adjust them to use the ex.dc metadata instead of dc metadata, then rebuild and preview the collection.
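Since the building step above relies on each OAI file and its image having matching names, a quick check of the download folder before building can catch incomplete downloads. The Python sketch below is one way to do that; it assumes the metadata files end in .oai and the images in .jpg, and the function name is hypothetical.

```python
from pathlib import Path


def pair_oai_with_images(folder, oai_ext=".oai", image_ext=".jpg"):
    """Map each OAI file name to the image sharing its base name, or None."""
    folder = Path(folder)
    # Index images by base name (file name without its extension).
    images = {p.stem: p for p in folder.rglob("*" + image_ext)}
    return {p.name: images.get(p.stem)
            for p in sorted(folder.rglob("*" + oai_ext))}


if __name__ == "__main__":
    import sys
    # Pass the download folder as the first argument; defaults to ".".
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for oai_name, image in pair_oai_with_images(target).items():
        print(oai_name, "->", image if image else "no matching image")
```

Any OAI file reported without a matching image would end up in the collection as metadata with no source document to attach to.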