Greenstone tutorial exercise
Enhanced PDF handling
Greenstone converts PDF files to HTML using third-party software: pdftohtml.pl. This lets users view these documents even if they don't have the PDF software installed. Unfortunately, sometimes the formatting of the resulting HTML files is not so good. This exercise explores some extra options to the PDF plugin which may produce a nicer version for display.
- In the Librarian Interface, start a new collection called "PDF collection" and base it on -- New Collection --.
- In the Gather panel, drag just the PDF documents from sample_files → Word_and_PDF → Documents into the new collection. Also drag in the PDF documents from sample_files → Word_and_PDF → difficult_pdf.
- Go to the Create panel and build the collection. Examine the output from the build process. You will notice that one of the documents could not be processed. The following messages are shown: "The file pdf05-notext.pdf was recognised but could not be processed by any plugin.", and "3 documents were processed and included in the collection. 1 was rejected".
- Preview the collection and view the documents. pdf05-notext.pdf does not appear as it could not be processed. pdf06-weirdchars.pdf was processed but looks very strange. The other PDF documents appear as one long document, with no sections.
Modes in the Librarian Interface
The Librarian Interface can operate in different modes. The default mode is Librarian mode. We can use Expert mode to work out why the PDF file could not be processed.
- Use the Preferences... item on the File menu, Mode tab, to switch to Expert mode and then build the collection again. The Create panel looks different in Expert mode because it gives more options: locate the <Build Collection> button, near the bottom of the window, and click it. Now a message appears saying that the file could not be processed, and why. Amongst all the output, we get the following message: "Error: PDF contains no extractable text. Could not convert pdf05-notext.pdf to HTML format". pdftohtml.pl cannot convert a PDF file to HTML if the PDF file has no extractable text.
- We recommend that you switch back to Librarian mode for subsequent exercises, to avoid confusion.
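The failure reported in Expert mode comes down to the PDF having no extractable text. As a rough pre-check outside Greenstone, the sketch below dumps a PDF's text and tests whether anything other than whitespace came out. It assumes the separate `pdftotext` command-line tool is installed; the function names `dump_pdf_text` and `has_extractable_text` are hypothetical, and this is not how Greenstone itself makes the decision.

```python
import subprocess


def has_extractable_text(extracted):
    """Return True if the dumped text contains any non-whitespace character."""
    # str.strip() also removes the form feeds that separate pages in the dump.
    return bool(extracted.strip())


def dump_pdf_text(pdf_path):
    """Run `pdftotext <file> -` and return what it writes to standard output.

    Assumption: pdftotext is available on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `has_extractable_text(dump_pdf_text("pdf05-notext.pdf"))` on a machine with `pdftotext` installed gives a quick indication of whether a file is likely to be a candidate for the image-based conversion instead.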
Splitting PDFs into sections
- In the Document Plugins section of the Design panel, configure PDFPlugin. Switch on the use_sections option. In the Search Indexes section, ensure that both the section and document boxes are checked. This will build the indexes on both the section level and the document level.
Build and preview the collection. View the text versions of some of the PDF documents.
Note that these are now split into a series of pages, and two ways of jumping between pages are provided. On the left, individual pages are listed vertically by page number, and clicking the "plus" box next to a page expands its contents. On the right, a box with a horizontal scroller lets you scroll to the page you wish to view.
Note that pdf05-notext.pdf is still not processed.
Using image format
- If conversion to HTML doesn't produce the result you'd like, PDF documents can be converted to a series of images, one per page. This requires ImageMagick and Ghostscript to be installed.
- In the Document Plugins section, configure PDFPlugin. Set the convert_to option to one of the image types, e.g. pagedimg_jpg. Switch off the use_sections option, as it is not used with image conversion.
Build the collection and preview.
All PDF documents (including pdf05-notext.pdf) have been processed and divided into sections.
Images from the document are now displayed instead of the extracted text. Both pdf05-notext.pdf and pdf06-weirdchars.pdf display nicely now.
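To give a feel for what "one image per page" involves, here is a minimal sketch that builds a Ghostscript command line for rendering each page of a PDF to a JPEG file. It is an illustration only: the function name `ghostscript_jpeg_command`, the output naming pattern and the 150 dpi resolution are assumptions, and the command Greenstone actually runs for pagedimg_jpg may differ.

```python
from pathlib import Path


def ghostscript_jpeg_command(pdf_path, out_dir, dpi=150):
    """Build a Ghostscript command that renders every page to its own JPEG."""
    stem = Path(pdf_path).stem
    # %03d is expanded by Ghostscript to the page number (001, 002, ...).
    out_pattern = str(Path(out_dir) / f"{stem}-%03d.jpg")
    return [
        "gs",
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=jpeg",
        f"-r{dpi}",
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]
```

On a machine where Ghostscript is installed as `gs`, the returned list can be passed to `subprocess.run(cmd, check=True)`. Because each page becomes a picture, no text comes out of this step.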
Using process_exp to control document processing (advanced)
- Processing all of the PDF documents using an image type may not give the best result for your collection. The images will look nice, but as no text is extracted, searching the full text will not be available for these documents. The best solution would be to process most of the PDF files as HTML, and only use the image format where HTML doesn't work.
- We achieve this by putting the problem files into a separate folder, and adding another PDFPlugin plugin with different options.
- Go to the Gather panel. Make a new folder called "notext": right click in the collection panel and select New folder from the menu. Change the Folder Name to "notext", and click <OK>. Move the two PDF files that have problems with HTML (pdf05-notext.pdf and pdf06-weirdchars.pdf) into this folder by drag and drop. We will set up the plugins so that PDF files in this notext folder are processed differently to the other PDF files.
- Switch to the Document Plugins section of the Design panel. Add a second PDF plugin by selecting PDFPlugin from the Select plugin to add: drop-down list, and clicking <Add Plugin...>. This plugin will come after the first PDF plugin, so we configure it to process PDF documents as HTML. Set the convert_to option to html, and switch on the use_sections option. Click <OK>.
- Configure the first PDF plugin, and set the process_exp option to "notext.*\.pdf".
- The two PDF plugins should have options like the following:
plugin PDFPlugin -convert_to pagedimg_jpg -process_exp "notext.*\.pdf"
plugin PDFPlugin -convert_to html -use_sections
The pagedimg_jpg version must come earlier in the list than the html version. The process_exp for the first PDFPlugin will process any PDF files in the notext directory. The second PDFPlugin will process any PDF files that are not processed by the first one. Note that all plugins have the process_exp option, and this can be used to customize which documents are processed by which plugin.
- Build and preview the collection. All PDF documents should look relatively nice. Try searching this collection. You will be able to search for the PDFs that were converted to HTML (try e.g. "bibliography"), but not the ones that were converted to images (try searching for "FAO" or "METS").
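The way plugin order and process_exp work together in this exercise can be sketched in a few lines of Python. This is a simplified model, not Greenstone's code: it assumes each file path is tested against the plugins in list order and goes to the first one whose expression matches. The `\.pdf$` pattern for the second plugin stands in for its default expression and is an assumption, as are the labels and the file name report.pdf.

```python
import re

# Ordered as in the Document Plugins list; the first matching entry wins.
# The labels are descriptive only, not Greenstone identifiers.
PLUGINS = [
    ("PDFPlugin (pagedimg_jpg)", re.compile(r"notext.*\.pdf")),
    ("PDFPlugin (html)", re.compile(r"\.pdf$", re.IGNORECASE)),
]


def choose_plugin(path):
    """Return the label of the first plugin whose expression matches path."""
    for label, pattern in PLUGINS:
        if pattern.search(path):
            return label
    return None  # no plugin in this sketch accepts the file


for path in [
    "notext/pdf05-notext.pdf",
    "notext/pdf06-weirdchars.pdf",
    "report.pdf",  # hypothetical file outside the notext folder
]:
    print(path, "->", choose_plugin(path))
```

Two things stand out in this model. Swapping the two entries would send every PDF to the html plugin, which is why the image version has to come first. Also, the expression is unanchored, so it matches any path that contains "notext" followed later by ".pdf", including a file named pdf05-notext.pdf that sits outside the notext folder.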