Greenstone tutorial exercise
Downloading files from the web
The Greenstone Librarian Interface's Download panel allows you to download individual files, parts of websites, and indeed whole websites, from the web.
- Start a new collection called webtudor, and base it on -- New Collection --.
- In a web browser, visit https://englishhistory.net and follow the link to The Tudors. You should be at the URL
This is where we started the downloading process to obtain the files you have been using for the tudor collection. You could do the same thing by copying this URL from the web browser, pasting it into the Download panel, and clicking the <Download> button. However, several megabytes will be downloaded, which might strain your network resources—or your patience! For a faster exercise we focus on a smaller section of the site.
- Go to the Download panel by clicking its tab. There are five download types listed on the left-hand side. For this exercise, we only use the Web type. Make sure this is selected in the list. Enter this URL
into the Source URL box. Several other options govern how the download proceeds. To see a description of an option, hover the mouse over it and a tooltip will appear. To copy just the citizens section of the website, switch on the Only files below URL option by checking its box, and set the Download Depth option to 1. If you don't do this (or if you omit the terminating "/" in the URL), the download will follow links to other areas of the englishhistory.net website and grab those as well. Also switch on the Only files within site option to avoid downloading items on the site's pages that actually come from outside it (like Google ads).
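The effect of the Only files below URL and Only files within site options, and the reason the terminating "/" matters, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not GLI's actual implementation, the function name in_scope and its parameters are hypothetical, and the example.org addresses are placeholders rather than the tutorial's URL. It assumes that "below URL" means "inside the directory part of the source URL", i.e. everything up to and including the last "/".

```python
from urllib.parse import urljoin, urlparse


def in_scope(source_url, link, below_only=True, same_site_only=True):
    """Return True if `link`, found on the page at `source_url`, would be kept.

    Hypothetical helper: an assumption about how the two options behave,
    not code taken from Greenstone.
    """
    source = urlparse(source_url)
    # Resolve relative links against the source URL before comparing.
    target = urlparse(urljoin(source_url, link))

    # "Only files within site": reject anything served from another host.
    if same_site_only and target.netloc != source.netloc:
        return False

    # "Only files below URL": keep only paths inside the source directory.
    if below_only:
        # Directory part = everything up to and including the last "/".
        base_dir = source.path[: source.path.rfind("/") + 1]
        if not target.path.startswith(base_dir):
            return False

    return True


with_slash = "https://example.org/history/citizens/"
without_slash = "https://example.org/history/citizens"
other_area = "https://example.org/history/monarchs/index.html"

print(in_scope(with_slash, "more.html"))      # a page inside the section
print(in_scope(with_slash, other_area))       # another area of the same site
print(in_scope(without_slash, other_area))    # same link, "/" left off
print(in_scope(with_slash, "https://ads.example.com/banner.js"))
```

Without the terminating "/", the directory part of the source URL becomes /history/ rather than /history/citizens/, so under this assumption links into other areas of the site count as "below" the URL and are followed.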
- If your computer is behind a firewall or proxy server, you will need to edit the proxy settings in the Librarian Interface. Click the <Configure Proxy...> button. Switch on the Use proxy connection? checkbox. Enter the proxy server address and port number in the HTTP Proxy Host: and Port: boxes. URLs that start with https, or URLs that resolve to https, will additionally need the HTTPS Proxy Host: and corresponding Port: filled in before web pages can be downloaded from there. Websites at https URLs often have a security certificate, but not always. For instance, https://englishhistory.net does not have one. To instruct GLI to nevertheless download pages from https URLs that don't have a security certificate, you'll also need to switch on the No certificate checking for HTTPS downloads checkbox. Once you've finished configuring the proxy settings, click <OK> to close the dialog.
- Now click <Download>. If you have set proxy information in Preferences..., a popup will ask for your user name and password. If you're on Windows Vista or later, Windows may show a popup message asking whether you wish to block or unblock the download. In such a case, choose to unblock. With proxy settings turned on, it may take a short while before GLI starts downloading. Once the download has started, a progress bar appears in the lower half of the panel and reports the progress of the download.
More detailed information can be obtained by clicking <View Log>. The process can be stopped altogether by clicking <Close>. Downloading can be a lengthy process involving multiple sites, so Greenstone allows additional downloads to be queued up. When a new URL is pasted into the Source URL box and <Download> is clicked, a new progress bar is appended to those already present in the lower half of the panel. When the currently active download item completes, the next is started automatically.
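The queueing behaviour described above (new downloads are appended, and the next one starts when the active one completes) amounts to a simple first-in, first-out queue. The following Python sketch illustrates the idea only; the DownloadQueue class and its method names are hypothetical and say nothing about how GLI is written internally, and no real downloading takes place.

```python
from collections import deque


class DownloadQueue:
    """Hypothetical model of queued downloads: one active item, the rest waiting."""

    def __init__(self):
        self.pending = deque()   # URLs waiting their turn, oldest first
        self.active = None       # URL currently being downloaded, if any
        self.finished = []       # URLs whose download has completed

    def add(self, url):
        """Queue a URL; start it immediately if nothing else is running."""
        self.pending.append(url)
        if self.active is None:
            self._start_next()

    def complete_active(self):
        """Mark the active download as done and start the next one, if any."""
        if self.active is not None:
            self.finished.append(self.active)
        self._start_next()

    def _start_next(self):
        # Take the oldest waiting URL, or go idle if the queue is empty.
        self.active = self.pending.popleft() if self.pending else None


queue = DownloadQueue()
queue.add("https://example.org/first/")    # placeholder URLs
queue.add("https://example.org/second/")
print(queue.active)        # the first URL is downloading, the second waits
queue.complete_active()
print(queue.active)        # the second URL has started automatically
```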
- Downloaded files are stored in a top-level folder called Downloaded Files that appears on the left-hand side of the Gather panel. You may not need all the downloaded files; you choose the ones you want by dragging selected files from this folder into the collection area on the right-hand side, just as we did before when selecting data from the sample_files folder. In this example we will include everything that has been downloaded. Select the englishhistory.net folder within Downloaded Files and drag it across into the collection area. Once you've dropped the folder into the collection area, you may see popup dialogs, one for each file extension that is not recognised by GLI. Either keep clicking <OK> to confirm each unrecognised filetype, or tick the checkbox in the popup to avoid seeing the same message again.
- Switch to the Create panel to build and preview the collection. It is smaller than the previous collection because we included only the citizens files. However, these now represent the latest versions of the documents.