Greenstone tutorial exercise
Using the UnknownConverterPlugin to make unsupported document formats searchable
This is an advanced tutorial. It assumes not only that you are familiar with most of what was covered in the preceding tutorials, but also that you're comfortable downloading and installing software from the web and have a little familiarity with image editing software.
The UnknownConverterPlugin builds on the idea of the UnknownPlugin, in that it can be configured to handle documents of unknown format and file extension. It can also be made to handle documents with known file extensions in a custom manner.
The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own PC that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder, you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. It will launch the commandline conversion tool with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.
An example scenario would be if your collection contained djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available that can convert from djvu to one of the text based formats that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu files in your Greenstone collection are now searchable.
Working with DjVu documents in Greenstone
DjVu (pronounced like the French phrase déjà vu) is a document format suited for archiving digital documents. DjVuLibre, which provides open source tools for processing DjVu documents, describes DjVu as
"a web-centric format and software platform for distributing documents and images. DjVu can advantageously replace PDF, PS, TIFF, JPEG, and GIF for distributing scanned documents, digital documents, or high-resolution pictures. DjVu content downloads faster, displays and renders faster, looks nicer on a screen, and consume less client resources than competing formats. DjVu images display instantly and can be smoothly zoomed and panned with no lengthy re-rendering. DjVu is used by hundreds of academic, commercial, governmental, and non-commercial web sites around the world."
In this part of the tutorial we'll see how to get Greenstone to not just include a collection's DjVu documents, but make them searchable too. There are several tools out there to convert a DjVu document into text or HTML. For instance, Linux users can install the ocrodjvu package and use its djvu2hocr tool to extract the text content in HTML format. Janusz S. Bien, a Greenstone user on the mailing list, has recommended it as being of possible use to Greenstone users, as it's a front-end to OCR programs. In this tutorial, however, we'll look at using djvutxt which is part of the DjVuLibre suite of tools and which is also available for other operating systems like Windows.
Extracting the text from DjVu documents with DjVuLibre's djvutxt
- Start up GLI and create a new collection called DjVu Collection.
- Visit the 'DjVu-Digital vs. "Super Hero" PDF' page. The page compares a PDF sample document to its equivalent DjVu version and provides download links for both. Download their sample DjVu document into your DjVu Collection's import folder at Greenstone → web → sites → localsite → collect → djvucoll → import.
- Back in GLI, in the Collection view of the Gather pane, right click and select Refresh folder view. You should now see your new document "superhero.djvu" ready to be built.
- Head over to the Create pane and build the collection. The document isn't recognised. You can press Preview to confirm that there's nothing much to look at in this collection. If you were to search through the Design pane's Document Plugins for a "DjVuPlugin", you wouldn't find one, because Greenstone hasn't got one. Greenstone knows about a lot of common formats, but there are a great many formats that people like to work with which Greenstone knows nothing about and for which Greenstone developers have not created a custom plugin.
You've already learnt about the UnknownPlugin in the Multimedia tutorial and know that it can be configured to process document formats for which Greenstone has no custom plugin. However, UnknownPlugin cannot index textual document formats that are unknown to Greenstone to make them searchable upon building, because it doesn't know anything about their internal structure and consequently doesn't know how to extract their text content.
This is where the UnknownConverterPlugin comes in. It builds on the idea of the UnknownPlugin, allowing you to work with document formats unknown to Greenstone. But it offers the additional advantage of being able to extract the text of the unknown document, depending on an important proviso: that you have a software tool installed on your machine, one that can be run readily from the commandline, which can perform the process of converting the unknown document format into text or HTML (or a series of images). If the tool can convert the document to text or HTML, Greenstone can proceed as usual to index the content to make it searchable on previewing.
- So in order to process the "superhero.djvu" document in our collection, such that its text content gets indexed for searching, we need to do a number of things: find out if there's a free djvu to text conversion tool out there, work out how to run it from the commandline and finally configure the UnknownConverterPlugin to automatically run this commandline tool for us, so Greenstone can take care of the rest. We're in luck, because among the DjVu related tools that DjVuLibre provides is one called "djvutxt" that can perform the text extraction for us. DjVuLibre is available for Windows, Mac and Linux.
- The next step is to find out how to run DjVuLibre's djvutxt conversion tool from the commandline. The general format of the command is:
djvutxt input.djvu output.txt
Open a DOS prompt on Windows or a terminal on Mac/Linux and experiment to see what it takes to convert the superhero.djvu file in your Greenstone installation's web/sites/localsite/collect/djvucoll/import folder. You may have to invoke djvutxt using its full filepath, in which case on Windows the command would look like:
C:\PATH\TO\YOUR\djvutxt C:\PATH\TO\YOUR\GS\web\sites\localsite\collect\djvucoll\import\superhero.djvu C:\PATH\TO\YOUR\GS\superhero.txt
while on Unix systems the command would look like:
/PATH/TO/YOUR/djvutxt /PATH/TO/YOUR/GS/web/sites/localsite/collect/djvucoll/import/superhero.djvu /PATH/TO/YOUR/GS/superhero.txt
Once you have the command working, inspect the output file. You should see mostly legible text in it. Only when you've been able to successfully complete this step should you proceed to the next steps.
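The manual check described above can also be scripted. The following is a minimal sketch for Mac/Linux shells, assuming only that djvutxt takes an input path and an output path as in the general command format shown earlier. The function name convert_djvu and the DJVUTXT variable are invented for this example and are not part of DjVuLibre or Greenstone.

```shell
# Converter to run. Set DJVUTXT to the full filepath if djvutxt is not on your PATH.
# (DJVUTXT is a made-up variable name used only in this sketch.)
: "${DJVUTXT:=djvutxt}"

# convert_djvu INPUT OUTPUT
# Runs the converter and confirms that a non-empty output file was written.
convert_djvu() {
    if ! command -v "$DJVUTXT" >/dev/null 2>&1; then
        echo "cannot find $DJVUTXT; try giving its full filepath" >&2
        return 127
    fi
    # Same shape as the general command: djvutxt input.djvu output.txt
    "$DJVUTXT" "$1" "$2" || return 1
    if [ ! -s "$2" ]; then
        echo "no text was written to $2" >&2
        return 2
    fi
    echo "text written to $2"
}
```

Calling convert_djvu superhero.djvu superhero.txt from the folder containing the document should then either report that text was written or say why it was not. You still need to open the output file yourself to judge whether the text is legible.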
Processing DjVu documents with the UnknownConverterPlugin
- Now that you know how to run the djvutxt conversion tool from the commandline, open up the DjVu Collection in GLI. Go into the Design pane's Document Plugins section and add a new UnknownConverterPlugin instance. (There's already one in the Document Plugin pipeline, but it is not set up for processing djvu files.) Press <Configure Plugin...> and set up the plugin as follows:
- set its convert_to field to text
- set its mime_type field to image/vnd.djvu, which is one of the mime types for the DjVu format
- set its process_extension to djvu
- Finally, copy the full djvutxt command you ran from the commandline and paste it into the UnknownConverterPlugin Configuration dialog's exec_cmd field. Keep the full path to the djvutxt binary, but replace the entire input filepath with the literal string %%INPUT_FILE and replace the output filepath with the literal string %%OUTPUT. Doing so means that when you build the collection, Greenstone will replace %%INPUT_FILE with each DjVu document in your collection that it needs to process, and will replace %%OUTPUT with the expected text output file of each document upon conversion by djvutxt.
If you have any spaces in any filepaths in your exec_cmd, make sure to always nest that entire filepath in escaped double quotes (\"), so Greenstone can preserve the spaces in it.
If any filepaths, other than %%INPUT_FILE and %%OUTPUT, are within your Greenstone installation, you can use %%GSDLHOME, %%GSDL3SRCHOME and %%GSDL3HOME (the latter for Greenstone 3's web folder) as placeholders and write out your filepaths relative to these. For instance, if your DjVuLibre is installed in your Greenstone's ext subfolder, then you would start the filepath to djvutxt with %%GSDL3SRCHOME/ext.
The value for your exec_cmd field may look something like the following, if you have DjVuLibre installed in C:\Program Files. Note the escaped double quotes bookending the path to djvutxt, to protect spaces in its filepath:
\"C:\Program Files\DjVuLibre\djvutxt\" %%INPUT_FILE %%OUTPUT
- Having sufficiently configured the UnknownConverterPlugin, click on the OK button to close the plugin's Configuration dialog. Move to the Create pane and build the collection. Your document has now been recognised. What's more, if you preview it and search for the term "Interoperability", a term that occurs in our collection's superhero.djvu document, you should now get a search result linking to that document. So Greenstone has successfully indexed the document's text, thanks to DjVuLibre's djvutxt tool extracting the text which got fed into the rest of Greenstone's building pipeline.
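To make the role of the %%INPUT_FILE and %%OUTPUT placeholders in the exec_cmd field more concrete, the substitution can be mimicked in a few lines of shell. This is only an illustration of the idea and not Greenstone's own code. The function names replace_all and expand_exec_cmd are made up for this sketch, and the file paths in the usage line are hypothetical.

```shell
# replace_all STRING SEARCH REPLACEMENT
# Prints STRING with every literal occurrence of SEARCH replaced by REPLACEMENT.
replace_all() {
    rest=$1
    out=
    while :; do
        case $rest in
            *"$2"*)
                # keep the text before the first match, then add the replacement
                out=$out${rest%%"$2"*}$3
                # carry on with the text after that match
                rest=${rest#*"$2"}
                ;;
            *)
                break
                ;;
        esac
    done
    printf '%s\n' "$out$rest"
}

# expand_exec_cmd TEMPLATE INPUT_PATH OUTPUT_PATH
# Fills both placeholders of an exec_cmd style template for one document.
expand_exec_cmd() {
    cmd=$(replace_all "$1" '%%INPUT_FILE' "$2")
    replace_all "$cmd" '%%OUTPUT' "$3"
}
```

For example, expand_exec_cmd 'djvutxt %%INPUT_FILE %%OUTPUT' /tmp/import/superhero.djvu /tmp/out/superhero.txt prints djvutxt /tmp/import/superhero.djvu /tmp/out/superhero.txt, which has the same shape as the command you ran by hand earlier.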
Associating an icon with DjVu documents in Greenstone
- When previewing the search result, you may notice that there's no proper icon for the document superhero.djvu. The Greenstone extracted text variant of the document has an icon, a plain text one. However, the superhero.djvu has the "unknown document format" icon, the one with the question mark on it. We can change this.
- Go back to the Design pane to configure your UnknownConverterPlugin once more. This time, enable the srcicon field and set its value to icondjvu. This is a macro name we're just inventing, though we follow the existing Greenstone convention for naming document icon macros, which is of the form "icon<file-extension>". Click OK to close the UnknownConverterPlugin configuration dialog. Quit GLI, since there's a little more work to do.
- Greenstone doesn't have an icon for DjVu documents, since it doesn't know about the format. If you search the web for a DjVu icon, you will probably find the Wikipedia page for the format. Save one of their DjVu icon images. Then open the image in Windows Paint, GIMP or another image editor, and use the application's scaling feature to scale the image's height or width (whichever is greater) to anywhere between 26 and 32 pixels. Save the scaled image as a GIF file with the name "idjvu.gif", storing it in your Greenstone installation's web/interfaces/default/images folder.
- Greenstone knows nothing about the icondjvu macro we defined as the value for UnknownConverterPlugin's srcicon field, so we have to teach Greenstone about this new macro. Use a text editor to open your Greenstone 3's web/sites/localsite/siteConfig.xml file. Locate the line:
<replace macro="_iconunknown_" scope="metadata" text="&lt;img src='interfaces/default/images/iunknown.gif' border='0'/&gt;" resolve="false"/>
Add a similar line above or below it and adjust it to say:
<replace macro="_icondjvu_" scope="metadata" text="&lt;img src='interfaces/default/images/idjvu.gif' border='0'/&gt;" resolve="false"/>
Save the file. This associates the icon image we want to appear for the djvu document with the macro we defined for the srcicon field in UnknownConverterPlugin's configuration.
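Before restarting GLI, it can help to confirm that both halves of the icon setup are in place. The following is a minimal sketch for Mac/Linux shells that relies only on the file locations named above. The function name check_icon_setup is invented for this example, and the sketch only checks that the icon file and the macro name are present, not that the XML is valid.

```shell
# check_icon_setup GS3_DIR EXTENSION
# GS3_DIR is the top-level folder of your Greenstone 3 installation.
# Returns 0 only if i<EXTENSION>.gif exists and siteConfig.xml mentions
# the matching _icon<EXTENSION>_ macro.
check_icon_setup() {
    gs3=$1
    ext=$2
    icon="$gs3/web/interfaces/default/images/i$ext.gif"
    config="$gs3/web/sites/localsite/siteConfig.xml"
    missing=0
    if [ ! -f "$icon" ]; then
        echo "missing icon file: $icon" >&2
        missing=1
    fi
    if ! grep -q "macro=\"_icon${ext}_\"" "$config" 2>/dev/null; then
        echo "no _icon${ext}_ macro found in $config" >&2
        missing=1
    fi
    return "$missing"
}
```

Running check_icon_setup /PATH/TO/YOUR/GS djvu prints nothing and succeeds when both the idjvu.gif file and the _icondjvu_ macro line are found.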
- Restart GLI, which will restart the Greenstone server and reload the siteConfig.xml you have just edited. Rebuild the DjVu Collection and preview it. This time, when you browse and search the collection, you should see the djvu icon appearing in place of the unknown icon for your DjVu document.
- Having designed your collection to handle DjVu documents, you can now add any other documents, including more DjVu documents. Greenstone should now be able to index the text content of DjVu documents in the collection to make them searchable, in all instances where text can be successfully extracted from them by djvutxt.
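Since searchability depends on djvutxt being able to extract text from each document, you may want to find out in advance which DjVu files in an import folder yield no text. The following is a rough sketch for Mac/Linux shells. The function name report_textless and the DJVUTXT variable are invented for this example. Treating output without any letters or digits as "no text" is an assumption of this sketch, not documented djvutxt behaviour.

```shell
# Converter to run. Set DJVUTXT to the full filepath if djvutxt is not on your PATH.
# (DJVUTXT is a made-up variable name used only in this sketch.)
: "${DJVUTXT:=djvutxt}"

# report_textless DIR
# Prints the path of every .djvu file under DIR for which the converter
# fails or produces output containing no letters or digits.
report_textless() {
    tmp_out=$(mktemp) || return 1
    find "$1" -type f -name '*.djvu' | while IFS= read -r doc; do
        if ! "$DJVUTXT" "$doc" "$tmp_out" </dev/null 2>/dev/null ||
           ! grep -q '[[:alnum:]]' "$tmp_out"; then
            printf '%s\n' "$doc"
        fi
        # empty the scratch file before the next document
        : > "$tmp_out"
    done
    rm -f "$tmp_out"
}
```

Running report_textless on your collection's import folder lists the documents that are unlikely to become searchable. An empty listing suggests that text was extracted from every DjVu file found.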