Greenstone tutorial exercise
Processing newer versions of PDF with PDFBox
By default, the PDFPlugin can process PDF versions 1.4 and older. The PDFBox extension for Greenstone is included in the Greenstone 3 binary and allows text to be extracted from more recent PDF files. The extension uses PDFBox, an open-source PDF conversion tool. This tutorial covers how to switch the extension on in the Greenstone Librarian Interface (GLI) so that text from newer versions of PDF can be processed.
- Launch GLI in the manner you're accustomed to. On Windows, the easiest way is the shortcut to GLI available through the Windows Start menu.
- Create a new collection called newpdfs and drag and drop the PDF file in sample_files → pdfbox into it. The version of this PDF file is newer than what PDFPlugin can handle by default, but with the PDFBox extension switched on, the file can be processed. Also drag the older PDF sample_files → Word_and_PDF → Documents → pdf03.pdf into the collection.
- Since the PDFBox extension (which works with the PDFPlugin) now comes installed with Greenstone 3, it is available as an option in the plugin's configuration dialog. To turn on the PDFBox extension, go to the Design panel, select Document Plugins from the left, and on the right double-click the PDFPlugin (alternatively, select this plugin and click the <Configure Plugin...> button below) to open the dialog for configuring this plugin. In the Configure Plugin... dialog, scroll down to the AutoLoadConverters section and select the checkbox next to the pdfbox_conversion option. Click OK to close the dialog, switch to the Create panel and build your collection. The PDF files will now be processed by PDFBox, which will extract their text.
- Try this feature out on a collection of recent PDF files by configuring its PDFPlugin with the pdfbox_conversion option turned on.
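To see which files in a folder are likely to need the pdfbox_conversion option, a small Python sketch along the following lines can read each file's header version. It is not part of Greenstone: the function names (pdf_header_version, needs_pdfbox, list_newer_pdfs) are hypothetical, and the 1.4 threshold is simply the PDFPlugin limit stated in this tutorial. The sketch relies only on the fact that a PDF file begins with a header of the form %PDF-M.N. A document may also declare a later version inside its catalog, which this check does not read, so treat the result as a hint rather than a definitive answer.

```python
import re
from pathlib import Path

# A PDF file starts with a header such as "%PDF-1.7".
# We look for it within the first 1024 bytes of the file.
_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(path):
    """Return (major, minor) from the PDF header, or None if no header is found."""
    with open(path, "rb") as f:
        head = f.read(1024)
    match = _HEADER.search(head)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_pdfbox(path, max_default=(1, 4)):
    """True if the header version is newer than max_default.

    max_default is an assumption taken from this tutorial: the default
    PDFPlugin conversion handles PDF 1.4 and older.
    """
    version = pdf_header_version(path)
    return version is not None and version > max_default


def list_newer_pdfs(folder):
    """Return a sorted list of .pdf files under folder whose header version exceeds 1.4."""
    candidates = (
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return sorted(p for p in candidates if needs_pdfbox(p))
```

For example, calling list_newer_pdfs on the folder you intend to drag into GLI returns the paths of the files whose header reports a version above 1.4. If the list is non-empty, turning on pdfbox_conversion for that collection is probably worthwhile.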