Greenstone tutorial exercise
Processing newer versions of PDF with PDFBox
By default the PDFPlugin can process PDF versions 1.4 and older. The PDFBox extension for Greenstone allows text from more recent PDF files to be extracted. The extension uses PDFBox, an open-source PDF conversion tool. This tutorial will cover how to install the PDFBox extension for Greenstone and how to switch on its functionality in the Greenstone Librarian Interface to process text from newer versions of PDF.
- The wiki release notes for the Greenstone binary you downloaded contain the download link to the PDFBox extension that works with that binary. If you want to try the most up-to-date version of the extension instead, copy the link http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/pdf-box-java.zip and paste it into the address bar of a browser window. If you are on Windows, download the zip archive from the page that loads. If you are working on a *nix machine, you might prefer to download the compressed tar file of the same extension by copying and pasting the link http://trac.greenstone.org/browser/gs2-extensions/pdf-box/trunk/pdf-box-java.tar.gz into your browser.
- Move the downloaded file into your Greenstone installation's ext folder.
- You will now need to decompress the file you downloaded in this location. To do so on Windows XP, right-click on the file, choose Extract All... and go through the Extraction wizard. On Windows Vista and 7, double-clicking on the zip file will open an Explorer window showing you its contents. Click on an empty part inside that window and choose Extract All... to extract its contents. On Linux, to decompress the tar.gz file, run the command:
tar -xvzf <tar file name>
If all goes well, you will have a folder called pdf-box inside your Greenstone installation's ext folder.
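On Linux, the move, extract and check steps above can be sketched as a small shell function. This is a minimal sketch, not part of Greenstone: the function name install_pdfbox_ext and both of its arguments are made up for illustration, and it assumes you have already downloaded the tar.gz archive and that your installation has an ext folder.

```shell
# Sketch only: install an already-downloaded PDFBox extension archive.
# install_pdfbox_ext is a hypothetical helper; both arguments are
# placeholders that you supply yourself.
install_pdfbox_ext() {
    local archive="$1"    # the downloaded archive, e.g. pdf-box-java.tar.gz
    local gs_home="$2"    # your Greenstone installation directory
    local ext_dir="$gs_home/ext"

    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    [ -d "$ext_dir" ] || { echo "no ext folder in: $gs_home" >&2; return 1; }

    # Move the archive into the ext folder and unpack it there.
    # (-v is left out so the file listing does not clutter the output.)
    mv "$archive" "$ext_dir/" || return 1
    ( cd "$ext_dir" && tar -xzf "$(basename "$archive")" ) || return 1

    # The tutorial expects a folder called pdf-box to appear.
    if [ -d "$ext_dir/pdf-box" ]; then
        echo "pdf-box extension unpacked in $ext_dir"
    else
        echo "pdf-box folder missing after extraction" >&2
        return 1
    fi
}

# Example call (both paths are placeholders):
#   install_pdfbox_ext ~/Downloads/pdf-box-java.tar.gz /path/to/greenstone
```

The final check only confirms that the pdf-box folder exists; it says nothing about whether the extension matches your Greenstone version.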
- Before you can use the extension, make sure that all instances of GLI, the Greenstone Librarian Interface, are closed.
Note that if you were running GLI through a console, you will need to start a fresh console and run the setup script again. This sets up the Greenstone environment once more, this time taking the presence of the PDFBox extension into account. To run the setup script, the console's current directory must be your Greenstone installation directory. From there, run setup.bat if you're on Windows, or source ./setup.bash if you're on Linux.
- Launch GLI once more, in the manner you're accustomed to. On Windows, the easiest way is the shortcut to GLI available through the Windows Start menu.
- Create a new collection called newpdfs and drag and drop the PDF file in sample_files → pdfbox into it. The version of this PDF file is newer than what PDFPlugin can handle by default, but with the PDFBox extension installed, this file can now be processed. Also drag the older PDF sample_files → Word_and_PDF → Documents → pdf03.pdf into the collection.
- Now that you've installed the PDFBox extension, it will be available as an option in the PDFPlugin's configuration dialog. To turn on the PDFBox extension, go to the Design panel, select Document Plugins from the left, and on the right double-click the PDFPlugin (alternatively, select this plugin and click the <Configure Plugin...> button below) to open the dialog for configuring this plugin. In the dialog, scroll down to the section AutoLoadConverters and select the checkbox next to the pdfbox_conversion option. Click OK to close the dialog, switch to the Create panel and build your collection. This time, the PDF files will be processed by PDFBox, which will extract their text. Try this feature out on a collection of recent PDF files, by configuring its PDFPlugin with the pdfbox_conversion option turned on.
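When gathering recent PDF files to try this on, it can help to check which of them declare a version newer than 1.4. The following is a minimal sketch, not part of Greenstone: the helper names pdf_header_version and needs_pdfbox are made up for illustration, and it assumes the version header (%PDF-M.N) sits at the very start of the file, as it normally does. A PDF can also state a different version inside the document itself, so treat the result as a rough guide only.

```shell
# Sketch only: report the PDF version declared in a file's header.
# Both function names are hypothetical helpers, not Greenstone commands.
pdf_header_version() {
    # The header looks like %PDF-1.7 and occupies the first 8 bytes.
    head -c 8 "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}

# Succeeds (status 0) when the declared version is newer than 1.4,
# fails with status 1 when it is 1.4 or older,
# and with status 2 when no recognisable header is found.
needs_pdfbox() {
    local version major minor
    version="$(pdf_header_version "$1")"
    [ -n "$version" ] || return 2
    major="${version%.*}"
    minor="${version#*.}"
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -gt 4 ]; }
}

# Example: list the PDFs in the current folder that declare a newer version.
#   for f in *.pdf; do needs_pdfbox "$f" && echo "$f"; done
```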