Greenstone tutorial exercise

Prerequisite: A collection of Word and PDF files
Sample files: pdfbox.zip
Devised for Greenstone version: 2.85|3.06
Modified for Greenstone version: 2.86|3.08

Processing newer versions of PDF with PDFBox

By default, PDFPlugin can process only PDF versions 1.4 and older. The PDFBox extension for Greenstone, which is included in Greenstone 3 binaries, allows text to be extracted from more recent PDF files. The extension uses PDFBox, an open-source PDF conversion tool. This tutorial covers how to switch on the extension in the Greenstone Librarian Interface (GLI) so that text from newer versions of PDF can be processed.
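To judge in advance which files are likely to need the PDFBox extension, you can look at the version a PDF declares in its header, which has the form %PDF-M.N near the start of the file. The following is a minimal sketch in Python, not part of Greenstone; the function names pdf_header_version and needs_pdfbox are made up for illustration. It is only a rough check, since a PDF may also override its header version inside the document itself.

```python
import re
from typing import Optional, Tuple

# A PDF file declares its version in a header such as "%PDF-1.7".
_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from the %PDF-M.N header, or None if not found.

    Hypothetical helper for illustration; it only inspects the first 1024
    bytes, because some files carry a few stray bytes before the header.
    """
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_pdfbox(data: bytes) -> bool:
    """True when the declared version is newer than 1.4, the newest
    version this tutorial says PDFPlugin handles by default."""
    version = pdf_header_version(data)
    return version is not None and version > (1, 4)
```

To use it, read the first kilobyte of a file in binary mode (for example, open(path, "rb").read(1024)) and pass the bytes to needs_pdfbox.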

  1. Launch GLI in the manner you're accustomed to. On Windows, the easiest way is to use the GLI shortcut in the Windows Start menu.

  2. Create a new collection called newpdfs and drag the PDF file in sample_files → pdfbox into it. This PDF file is of a newer version than PDFPlugin can handle by default, but with the PDFBox extension it can be processed. Also drag the older PDF sample_files → Word_and_PDF → Documents → pdf03.pdf into the collection.

  3. Because the PDFBox extension (which works with PDFPlugin) now comes installed with Greenstone 3, it is available as an option in the plugin's configuration dialog. To turn on the PDFBox extension, go to the Design panel, select Document Plugins on the left, and on the right double-click PDFPlugin (alternatively, select this plugin and click the <Configure Plugin...> button below) to open the dialog for configuring this plugin. In the Configure Plugin... dialog, scroll down to the AutoLoadConverters section and select the checkbox next to the pdfbox_conversion option. Click OK to close the dialog, switch to the Create panel and build your collection. This time, the PDF files will be processed by PDFBox, which extracts their text.

    Try this feature out on a collection of recent PDF files, by configuring its PDFPlugin with the pdfbox_conversion option turned on.
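If you want to confirm outside GLI that the option was saved, you could inspect the collection's configuration file. The sketch below is an illustration only. It assumes a Greenstone 3 style collectionConfig.xml in which each plugin is a plugin element with option children whose name attribute carries the option (here -pdfbox_conversion). That layout, the sample XML and the helper name pdfbox_enabled are assumptions for this example, so compare them against your own collection's file before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified shape of a Greenstone 3 collectionConfig.xml.
# A real file contains many more elements; only the plugin list matters here.
SAMPLE_CONFIG = """\
<CollectionConfig>
  <import>
    <pluginList>
      <plugin name="PDFPlugin">
        <option name="-pdfbox_conversion"/>
      </plugin>
      <plugin name="WordPlugin"/>
    </pluginList>
  </import>
</CollectionConfig>
"""


def pdfbox_enabled(config_xml: str) -> bool:
    """Return True if PDFPlugin has the pdfbox_conversion option set.

    Hypothetical helper: it relies on the assumed layout shown above.
    """
    root = ET.fromstring(config_xml)
    for plugin in root.iter("plugin"):
        if plugin.get("name") != "PDFPlugin":
            continue
        for option in plugin.iter("option"):
            # Accept the option name with or without its leading dash.
            if option.get("name", "").lstrip("-") == "pdfbox_conversion":
                return True
    return False
```

Calling pdfbox_enabled on the text of your configuration file should return True once the checkbox has been selected and the collection saved, provided the file follows the assumed layout.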


Copyright © 2005-2016 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”