Greenstone tutorial exercise

Sample files: Word_and_PDF.zip
Devised for Greenstone version: 3.09
Modified for Greenstone version: 3.11

Enhanced PDF handling

Prior to Greenstone 3.09, Greenstone shipped with a plugin called PDFPlugin, which converted PDF files to HTML using the third-party software pdftohtml.pl. PDFPlugin allowed users to view PDF documents even if they did not have PDF software installed. Unfortunately, the formatting of the resulting HTML files was sometimes poor. Earlier versions of this tutorial gave instructions on extra PDFPlugin options for producing a nicer version for display. However, the older pdftohtml process could not cope with much newer versions of PDF unless PDFPlugin's pdfbox_conversion option was switched on.

Starting with Greenstone 3.09, some of the older PDF processing functionality has been restructured into PDFv1Plugin, while the pdfbox_conversion option has moved into PDFv2Plugin. PDFv2Plugin also makes use of the third-party software xpdf-tools, which copes better with newer PDFs, so the pdfbox_conversion option no longer needs to be activated when dealing with them. PDFv2Plugin comes with several new preconfigured settings that produce output files in html, text, image, or image and text formats, which can better reflect the appearance of an input PDF document's pages. Behind the scenes, PDFv2Plugin is configured to use either the third-party xpdf-tools or pdfbox software for each output setting.

From Greenstone 3.09 onwards, PDFv2Plugin is added to a new collection's Document Plugins pipeline by default, in place of the now defunct PDFPlugin. In any instance where you prefer the original PDFPlugin's HTML output for a PDF, you can use PDFv1Plugin instead, as it retains this functionality.

In the Librarian Interface, start a new collection called "PDF collection" and base it on -- New Collection --.
In the Gather panel, drag just the PDF documents from sample_files → Word_and_PDF → Documents into the new collection. Also drag in the PDF documents from sample_files → Word_and_PDF → difficult_pdf.
In the Document Plugins section of the Design panel, you should find PDFv2Plugin in the plugins list (in place of the deprecated PDFPlugin that would have been present in the plugins list in older versions of Greenstone).
Go to the Create panel and build the collection. Examine the output from the build process.
If you had built the same collection with PDFv1Plugin instead of PDFv2Plugin, the build output would inform you that one of the documents could not be processed and you'd have seen the following building messages: "The file pdf05-notext.pdf was recognised but could not be processed by any plugin.", and "3 documents were processed and included in the collection. 1 was rejected".
However, since you built the collection of 4 PDFs with PDFv2Plugin, you will notice that all 4 documents could be processed.

Preview the collection and view the documents. Inspect pdf01 and pdf03 first. A table of contents is provided on the right. Clicking on a page in the table of contents will scroll to that page. Another way of navigating can be found on the left, where individual pages are listed vertically by page number, and clicking the "plus" box next to a page will expand its contents. The PDFs have been sectionalised into groups of 10 pages, each group further containing a section for each individual page. If a PDF contains 10 or fewer pages, there won't be two levels of sectionalising, just one.
If you visit a given page and try to select and copy the text, you can. These pages are not simply images of the PDF's pages (like screenshots of a PDF page), but HTML pages that combine an image of the background of each PDF page with the actual text of that page superimposed. The superimposed text is what makes the text selectable.
If you return to GLI's Design panel and double-click on PDFv2Plugin in Document Plugins, you will see that the convert_to option is set to paged_pretty_html. This is the default PDF convert_to type, and it produces the kind of sectionalised HTML pages, consisting of background images and superimposed text, that you see with pdf01 and pdf03.
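The two-level sectioning described above can be pictured with a small Python sketch. This is only an illustration of the grouping rule as stated in this exercise, not Greenstone's own code: the function name sectionalise, the group_size parameter and the exact group boundaries are assumptions.

```python
def sectionalise(page_count, group_size=10):
    """Group page numbers the way the exercise describes (illustrative only).

    Returns a flat list of pages when the document has group_size pages or
    fewer (one level), otherwise a list of page groups (two levels).
    """
    pages = list(range(1, page_count + 1))
    if page_count <= group_size:
        # Short document: a single level, one section per page.
        return pages
    # Longer document: groups of up to group_size pages, each holding its pages.
    return [pages[i:i + group_size] for i in range(0, page_count, group_size)]


if __name__ == "__main__":
    print(sectionalise(3))
    print(sectionalise(23))
```

Under these assumptions, a 3-page PDF yields one flat level of page sections, while a 23-page PDF yields three groups (two of 10 pages and one of 3).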

Next preview pdf05-notext.pdf. This is also similarly sectionalised, but the text is not selectable. That's because the original PDF file pdf05-notext.pdf contained no text, only images of text.

Now preview pdf06-weirdchars.pdf. Although it is also sectionalised, its contents look very strange. The reason for this will become apparent if you open the original document by double-clicking pdf06-weirdchars.pdf in GLI's Gather panel. In the open PDF, select as much of the text on its first page as possible, then copy that text and paste it into a text editor. You should see strange characters. This is why Greenstone's PDFv2Plugin wasn't able to extract legible text either.
Although Greenstone has processed all 4 documents, pdf06-weirdchars.pdf can be made to look better.

Using image format

PDF documents can be converted to a series of images, one per page. This uses the bundled ImageMagick and Ghostscript.

In the Document Plugins section, configure PDFv2Plugin. Set the convert_to option to one of the image types, e.g. pagedimg_jpg.

Build the collection and preview. All PDF documents have been processed again, still divided into a series of page sections, but this time one image per page. Images from the document are now displayed instead of the extracted text. That means there's no selectable text for any of the 4 documents this time. The table of contents on the right now displays a horizontal scroller containing thumbnails of each page. pdf06-weirdchars.pdf displays nicely now.

Using process_exp to control document processing (advanced)

Processing all of the PDF documents using an image type may not give the best result for your collection. The images will look nice, but as no text is extracted, searching the full text will not be available for these documents. The best solution would be to process most of the PDF files as HTML, and only use the image format where HTML doesn't work.

We achieve this by putting the problem files into a separate folder, and adding another PDFv2Plugin plugin with different options.

Go to the Gather panel. Make a new folder called "notext": right click in the collection panel and select New folder from the menu. Change the Folder Name to "notext", and click <OK>.
Note: To see the right click context menu on the Mac, you would hold down the Ctrl key while pressing the (right) mouse button. If attempting to right click on the Mac does not produce any context menu, go into your Mac's Apple menu → System Preferences → Mouse and then tick the Secondary click box and then try right clicking the document in GLI as described.
Move pdf06-weirdchars.pdf (which has problems with html) and also pdf05-notext.pdf (which has no extractable text) into this folder by drag and drop. We will set up the plugins so that PDF files in this notext folder are processed differently to the other PDF files.

Switch to the Document Plugins section of the Design panel. Add a second instance of PDFv2Plugin by selecting PDFv2Plugin from the Select plugin to add: drop-down list, and clicking <Add Plugin...>. This plugin will come after the first PDFv2Plugin instance, so we configure it to process PDF documents as sectionalised HTML by leaving the convert_to option on the default, paged_pretty_html. Click <OK>.

Configure the first PDF plugin, and set the process_exp option to "notext.*\.pdf".

The two PDF plugins should have options like the following:

plugin PDFv2Plugin -convert_to pagedimg_jpg -process_exp "notext.*\.pdf"
plugin PDFv2Plugin -convert_to paged_pretty_html

The pagedimg_jpg instance must come earlier in the list than the paged_pretty_html instance. The process_exp of the first PDFv2Plugin makes it process any PDF files in the notext directory. The second PDFv2Plugin will process any PDF files that are not processed by the first one.
Note that all plugins have the process_exp option, and this can be used to customize which documents are processed by which plugin.
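To see why the order of the two plugins matters, here is a minimal Python sketch of "first matching plugin wins". It is not Greenstone's implementation (Greenstone is written in Perl). The plugin labels, the function choose_plugin, the example paths, and the assumption that the second plugin's default process_exp behaves like \.pdf$ are all hypothetical; the sketch also assumes the expression is tested, unanchored, against the file's path.

```python
import re

# Hypothetical stand-ins for the two plugin instances, in pipeline order.
# Each entry is (label, process_exp); the first matching entry claims the file.
PLUGINS = [
    ("PDFv2Plugin -convert_to pagedimg_jpg", r"notext.*\.pdf"),
    # Assumed default expression for illustration; the real default may differ.
    ("PDFv2Plugin -convert_to paged_pretty_html", r"\.pdf$"),
]


def choose_plugin(path):
    """Return the label of the first plugin whose process_exp matches path."""
    for label, process_exp in PLUGINS:
        if re.search(process_exp, path):
            return label
    return None  # no plugin in this sketch recognises the file


if __name__ == "__main__":
    for path in ("import/notext/pdf06-weirdchars.pdf", "import/pdf01.pdf"):
        print(path, "->", choose_plugin(path))
```

If the two entries were swapped, the general \.pdf$ expression would claim every PDF first and the image conversion would never be used. Note also that an unanchored expression such as notext.*\.pdf matches anywhere in the path, so in this sketch a file named pdf05-notext.pdf would match even outside the notext folder.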

Build and preview the collection. All PDF documents should look relatively nice. Try searching this collection. You will be able to search for the PDFs that were converted to HTML (try e.g. "bibliography"), but not the ones that were converted to images (try searching for "FAO" or "METS").

Customising the table of contents section heading display

In the table of contents (on the right), a section number and section title are displayed by default. For documents like these where the section titles are the same as the section numbers, this doesn't make much sense, as you end up with headings like "1 1". We can hide the section number from the display by adding some CSS style information.

Click on the display format statement in the Format Features list. Add the following to the start of the content:

<gsf:template name="additionalHeaderContent-collection">
  <style>span.tocSectionNumber { display: none; }</style>
</gsf:template>

Note that if you'd rather hide the title instead, you can use span.tocSectionTitle in the above CSS code instead of span.tocSectionNumber.

Opening PDF files with query terms highlighted

Next we'll customize the search format statement to highlight the query terms in a PDF file when it is opened from the search result list. This requires Acrobat Reader version 7.0 or higher. It has been confirmed to work with Firefox browsers on Windows, Linux and Mac systems, but is known not to work with Safari browsers at present. Other browsers are as yet untested and may or may not support the PDF query term highlighting syntax used in this exercise.

To highlight the query terms in a PDF document, we need to pass them into the PDF file by appending #search="query" to the end of the document link. We need to create the link ourselves rather than using <gsf:link type="source"/> in the format statement.
PDFv2Plugin saves each PDF file in a unique directory for that document, and we can use

<gsf:metadata name="httpPath" type="collection"/>/index/assoc/<gsf:metadata name="archivedir"/>/<gsf:metadata name="srclinkFile"/>

to refer to the PDF source file. The search terms can be found in the "query" cgi parameter, which you can access using <gsf:cgi-param name="query"/>.
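As a rough illustration of the link being assembled, the following Python sketch joins the same pieces into one URL. The function name pdf_search_link and every example value (collection path, archive directory, file name) are hypothetical; in Greenstone these values come from the metadata and cgi parameter named above, and the sketch does no escaping of the query.

```python
def pdf_search_link(http_path, archivedir, srclink_file, query):
    """Compose a link to the source PDF with a #search="..." fragment."""
    # Same layout as the format statement: httpPath/index/assoc/archivedir/file
    base = f"{http_path}/index/assoc/{archivedir}/{srclink_file}"
    # The fragment asks the PDF viewer to search for the query terms.
    return f'{base}#search="{query}"'


if __name__ == "__main__":
    # All values below are made up for illustration.
    print(pdf_search_link(
        "/greenstone3/sites/localsite/collect/pdfcoll",
        "HASH0123.dir",
        "pdf01.pdf",
        "bibliography",
    ))
```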

Select search in Format Features for editing. We need to test whether the file is a PDF file before linking to it, using a test on whether the Greenstone extracted FileFormat metadata is PDF. For PDF files, we now generate the link explicitly.
The resulting format statement is:

<td valign="top">
  <gsf:link type="document">
    <gsf:icon type="document"/>
  </gsf:link>
</td>
<td valign="top">
  <gsf:switch>
    <gsf:metadata name="FileFormat"/>
    <gsf:when test="equals" test-value="PDF">
      <a><xsl:attribute name="href"><gsf:metadata name="httpPath" type="collection"/>/index/assoc/<gsf:metadata name="archivedir"/>/<gsf:metadata name="srclinkFile"/>#search=&quot;<gsf:cgi-param name="query"/>&quot;</xsl:attribute>
        <gsf:choose-metadata>
          <gsf:metadata name="thumbicon"/>
          <gsf:metadata name="srcicon"/>
        </gsf:choose-metadata>
      </a>
    </gsf:when>
    <gsf:otherwise>
      <gsf:link type="source">
        <gsf:choose-metadata>
          <gsf:metadata name="thumbicon"/>
          <gsf:metadata name="srcicon"/>
        </gsf:choose-metadata>
      </gsf:link>
    </gsf:otherwise>
  </gsf:switch>
</td>
<td valign="top"> ...

When a PDF icon is clicked in the search results, Acrobat will open the file with its search window showing and the query terms highlighted.

For example, preview the collection and try searching for bibliography. Click on a PDF icon in the search results. The PDF will open on the page containing the first instance of the word bibliography, with the word highlighted.

Copyright © 2005-2019 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”