Greenstone tutorial exercise

Back to wiki
Back to index
Prerequisite: A collection of Word and PDF files
Sample files: pdfbox.zip
Devised for Greenstone version: 2.85|3.06
Modified for Greenstone version: 2.86|3.07

Processing newer versions of PDF with PDFBox

By default the PDFPlugin can process PDF versions 1.4 and older. The PDFBox extension for Greenstone is included in a Greenstone 3 binary and allows text from more recent PDF files to be extracted. The extension uses PDFBox, an open-source PDF conversion tool. This tutorial will cover how to switch on its functionality in the Greenstone Librarian Interface to process text from newer versions of PDF.

Before you can use the extension, make sure that all instances of GLI, the Greenstone Librarian interface, are closed.

Note that if you were running GLI through a console, you will want to start up a fresh console, then run the setup script again to set up the Greenstone environment once more, which will this time take the presence of the PDFBox extension into account. To run the setup script, your console needs to be pointing to your Greenstone installation directory. From here, you would run setup.bat if you're on Windows, or source ./setup.bash if you're on Linux.

Launch GLI once more, in the manner you're accustomed to. On Windows, the easiest way is the shortcut to GLI available through the Windows Start menu.

Create a new collection called newpdfs and drag and drop the PDF file in sample_files → pdfbox into here. The version of this PDF file is newer than what PDFPlugin can handle by default, but with the PDFBox extension installed, this file can now be processed. Also drag in the older PDF sample_files → Word_and_PDF → Documents → pdf03.pdf into the collection.

Now that you've installed the PDFBox extension, this will be available as an option in the plugin's configuration dialog. To turn on the PDFBox extension, go to the Design panel, select Document Plugins from the left, and on the right double click the PDFPlugin (alternatively, select this plugin and click the <Configure Plugin...> below) to open the dialog to configure this plugin. In the Configure Plugin... dialog, scroll down to the section AutoLoadConverters and select the checkbox next to the pdfbox_conversion option. Click OK to close the dialog, switch to the Create panel and build your collection. This time, PDF files will be processed by PDFBox which will extract their text.
Try this feature out on a collection of recent PDF files, by configuring its PDFPlugin with the pdfbox_conversion option turned on.

Copyright © 2005-2015 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”