Greenstone tutorial exercise
Using the UnknownConverterPlugin to make unsupported document formats searchable
This is an advanced tutorial: it assumes not only that you're familiar with most of what the preceding tutorials covered, but also that you're comfortable downloading and installing software from the web, and have a little familiarity with image editing software.
The UnknownConverterPlugin builds on the idea of the UnknownPlugin, in that it can be configured to handle documents of unknown format and file extension. It can also be made to handle documents with known file extensions in a custom manner.
The UnknownConverterPlugin extends the UnknownPlugin's abilities by letting you launch a tool you have installed on your own PC that can be run from the commandline to convert from the "unknown" file format to either text, html or gif/jpg/png images, or a folder of these. If you know how to launch this tool from the commandline to do the conversion, then you would configure the UnknownConverterPlugin by supplying the file format (file extension) of the documents it should process, the expected output file format (text, html or paged images), and the tool's conversion command that the UnknownConverterPlugin should launch to perform the conversion. In place of the input file and the output file or folder, you provide placeholders in the command to run. Once configured, the UnknownConverterPlugin will be used during building to process documents that match the specified file format. It will launch the commandline conversion tool with the command provided, and the expected output files as specified can then be processed by Greenstone in the usual manner.
An example scenario would be if your collection contained djvu files, for which Greenstone provides no custom plugin. However, there's a free commandline tool available that can convert from djvu to one of the text based formats that Greenstone can process, text or html. So in this case, you could try using the UnknownConverterPlugin with the commandline tool on djvu files that you've gathered. The result should be that the djvu files in your Greenstone collection are now searchable.
Working with DjVu documents in Greenstone
DjVu (pronounced like the French phrase déjà vu) is a document format suited for archiving digital documents. DjVuLibre, which provides open source tools for processing DjVu documents, describes DjVu as
"a web-centric format and software platform for distributing documents and images. DjVu can advantageously replace PDF, PS, TIFF, JPEG, and GIF for distributing scanned documents, digital documents, or high-resolution pictures. DjVu content downloads faster, displays and renders faster, looks nicer on a screen, and consume less client resources than competing formats. DjVu images display instantly and can be smoothly zoomed and panned with no lengthy re-rendering. DjVu is used by hundreds of academic, commercial, governmental, and non-commercial web sites around the world."
In this part of the tutorial we'll see how to get Greenstone to not just include a collection's DjVu documents, but make them searchable too. There are several tools out there to convert a DjVu document into text or HTML. For instance, Linux users can install the ocrodjvu package and use its djvu2hocr tool to extract the text content in HTML format. Janusz S. Bien, a Greenstone user on the mailing list, has recommended it as being of possible use to Greenstone users, as it's a front-end to OCR programs. In this tutorial, however, we'll look at using djvutxt which is part of the DjVuLibre suite of tools and which is also available for other operating systems like Windows.
Extracting the text from DjVu documents with DjVuLibre's djvutxt
- Start up GLI and create a new collection called DjVu Collection.
- Visit the 'DjVu-Digital vs. "Super Hero" PDF' page. The page compares a PDF sample document to its equivalent DjVu version and provides download links for both. Download their sample DjVu document (originally here) into your DjVu Collection's import folder at Greenstone → web → sites → localsite → collect → djvucoll → import. If you're offline, you can also get this file from sample_files → djvu → superhero.djvu.
- Back in GLI, in the Collection view of the Gather pane, right click and select Refresh folder view. You should now see your new document "superhero.djvu" ready to be built.
- Head over to the Create pane and build the collection. The document isn't recognised. You can press Preview to confirm that there's nothing much to look at in this collection. If you were to search through the Design pane's Document Plugins for a "DjVuPlugin", you wouldn't find one, because Greenstone hasn't got one. Greenstone knows about a lot of common formats, but there are a great many formats that different people like to work with that Greenstone knows nothing about and for which Greenstone developers have not created a custom plugin.
You've already learnt about the UnknownPlugin in the Multimedia tutorial and know that it can be configured to process document formats for which Greenstone has no custom plugin. However, UnknownPlugin cannot index textual document formats that are unknown to Greenstone to make them searchable upon building, because it doesn't know anything about their internal structure and consequently doesn't know how to extract their text content.
This is where the UnknownConverterPlugin comes in. It builds on the idea of the UnknownPlugin, allowing you to work with document formats unknown to Greenstone. But it offers the additional advantage of being able to extract the text of the unknown document, depending on an important proviso: that you have a software tool installed on your machine, one that can be run readily from the commandline, which can perform the process of converting the unknown document format into text or HTML (or a series of images). If the tool can convert the document to text or HTML, Greenstone can proceed as usual to index the content to make it searchable on previewing.
- So in order to process the "superhero.djvu" document in our collection, such that its text content gets indexed for searching, we need to do a number of things: find out if there's a free djvu to text conversion tool out there, work out how to run it from the commandline and finally configure the UnknownConverterPlugin to automatically run this commandline tool for us, so Greenstone can take care of the rest. We're in luck, because among the DjVu related tools that DjVuLibre provides is one called "djvutxt" that can perform the text extraction for us. DjVuLibre is available for Windows, Mac and Linux:
- DjVuLibre provides binary installers for Windows and Mac. Grab the one for your operating system and install it somewhere sensible: somewhere you have permissions to install and run it from. On Windows, running the installer in the regular manner requires you to have admin permissions. If you don't have admin rights, you can run the installer as follows (instructions taken from this superuser exchange) to install DjVuLibre in a non-admin location. Use a text editor to create a file called nonadmin.bat (beware the file doesn't end up with an additional .txt extension when saving it). Copy and paste, or carefully type, the following text into the file, then save and close it:
cmd /min /C "set __COMPAT_LAYER=RUNASINVOKER && start "" %1"
Next, open up a File Explorer and drag and drop the DjVuLibre setup executable icon onto the icon of the new nonadmin.bat file, to run setup in a way that bypasses the admin privileges usually required for a successful installation. When installing, you'll now finally be allowed to choose a custom install directory, instead of the installer choosing an off-limits admin location like C:\Program Files (x86) for you. So make sure to choose a location in your User area as install directory. Upon successful installation, you're given the option to launch DjVuLibre's DjView tool, which will open the DjVuLibre manual (in djvu format). In the left pane of DjView, you can see a listing of the various tools DjVuLibre is comprised of, and read up on them. You can also read about djvutxt or the other DjVu tools that DjVuLibre provides in their documentation page, but for this tutorial, we'll just be using their djvutxt tool.
- To install the DjVu binary on the Mac: double click on the downloaded dmg file, open the loaded dmg's contents in Finder, then copy the DjView.app into your Mac's Applications folder. You can achieve the same through the command line with a command similar to the following line, which instructs the OS to copy across the DjView.app in your opened dmg file (of the djvu installer) into the Mac Applications folder:
cp -r /Volumes/DjVuLibre-3.5.28+DjView-4.12-universal-2/DjView.app /Volumes/Macintosh\ HD/Applications/.
- Some Linux machines may even come pre-installed with DjVuLibre. If not, you can use a package manager to install it for you, or compile it up easily from source in the usual Unix manner, as explained below.
- If you're on a Unix (Linux or Mac) system where you don't have the permissions needed to install DjVuLibre, yet you do have the requisite compiler installed (gcc, g++ on Linux, Xcode on Mac), as well as having "autoconf" installed on the Mac, then you can compile it up from source code as follows: download the source tarball and untar this in a user location. Open a terminal and change directory into the untarred DjVuLibre source folder. Then run the following three commands in sequence, adjusting the prefix flag to start with the full path to your Greenstone installation:
./configure --prefix=/PATH/TO/YOUR/GS/djvulibre
make
make install
If compiling was successful, the djvulibre binaries will have been generated inside your Greenstone installation's new djvulibre/bin folder, at /PATH/TO/YOUR/GS/djvulibre/bin. For this tutorial, the most important of these djvulibre binaries is djvutxt, which will now be located at /PATH/TO/YOUR/GS/djvulibre/bin/djvutxt on your Unix system.
- The next step is to find out how to run DjVuLibre's djvutxt conversion tool from the commandline. The general format of the command is
djvutxt input.djvu output.txt
Open a DOS prompt on Windows (to do so, hold down the Windows key, located between the Ctrl and Alt keys on your keyboard, and press the letter r to launch the Windows Run dialog, then type cmd into it and hit Enter), or a terminal on Mac/Linux, and experiment to see what it takes to convert your Greenstone installation's web/sites/localsite/collect/DjVuColl/import/superhero.djvu file. You may have to invoke djvutxt using its full filepath, in which case on Windows the command would look like:
C:\PATH\TO\YOUR\djvutxt C:\PATH\TO\YOUR\GS\web\sites\localsite\collect\DjVuColl\import\superhero.djvu C:\PATH\TO\YOUR\GS\superhero.txt
while on Unix systems the command would look like:
/PATH/TO/YOUR/djvutxt /PATH/TO/YOUR/GS/web/sites/localsite/collect/DjVuColl/import/superhero.djvu /PATH/TO/YOUR/GS/superhero.txt
If you're on a Mac and had installed DjView.app into your Mac Applications folder, then the command you run in the Mac Terminal would look something like:
/Volumes/Macintosh\ HD/Applications/DjView.app/Contents/MacOS/djvutxt /PATH/TO/YOUR/GS/web/sites/localsite/collect/DjVuColl/import/superhero.djvu /PATH/TO/YOUR/GS/superhero.txt
If you compiled up djvulibre from source as described above, djvutxt will be in /PATH/TO/YOUR/GS/djvulibre/bin/djvutxt. Once you have the command working, inspect the output file. You should see mostly legible text in it. Only when you've been able to successfully complete this step should you proceed to the next steps.
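If you'd like to script this check rather than type the command by hand, the following is a minimal Python sketch of the same step. It only relies on the `djvutxt input.djvu output.txt` form shown above; the helper names (`find_djvutxt`, `build_command`, `convert`) and the example directory are hypothetical and not part of Greenstone or DjVuLibre.

```python
import shutil
import subprocess
from pathlib import Path


def find_djvutxt(extra_dirs=()):
    """Return the path to a djvutxt executable, or None if none is found.

    Looks on the system PATH first, then in any extra directories supplied.
    """
    found = shutil.which("djvutxt")
    if found:
        return found
    for directory in extra_dirs:
        found = shutil.which("djvutxt", path=str(directory))
        if found:
            return found
    return None


def build_command(djvutxt, input_file, output_file):
    """Build the argument list for: djvutxt input.djvu output.txt"""
    return [str(djvutxt), str(input_file), str(output_file)]


def convert(djvutxt, input_file, output_file):
    """Run djvutxt and report whether it produced a non-empty text file."""
    subprocess.run(build_command(djvutxt, input_file, output_file), check=True)
    out = Path(output_file)
    return out.is_file() and out.stat().st_size > 0


if __name__ == "__main__":
    # Hypothetical locations: replace these with the paths on your own machine.
    tool = find_djvutxt(["/PATH/TO/YOUR/GS/djvulibre/bin"])
    if tool is None:
        print("djvutxt not found; use its full filepath instead")
    else:
        ok = convert(tool, "superhero.djvu", "superhero.txt")
        print("text extracted" if ok else "no text was produced")
```

Passing the command as a list of arguments means filepaths containing spaces need no extra quoting here, unlike when typing the command into a terminal.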
Processing DjVu documents with the UnknownConverterPlugin
- Now that you know how to run the djvutxt conversion tool from the commandline, open up the DjVu Collection in GLI. Go into the Design pane's Document Plugins section and add a new UnknownConverterPlugin instance. (There's already one in the Document Plugin pipeline, but it is not set up for processing djvu files.) Press <Configure Plugin...> and set up the plugin as follows:
- set its convert_to field to text
- set its mime_type field to image/vnd.djvu, which is one of the mime types for the DjVu format
- set its process_extension to djvu
- Finally, copy the full djvutxt command you ran from the commandline and paste it into the UnknownConverterPlugin Configuration dialog's exec_cmd field. Keep the full path to the djvutxt binary, but replace the entire input filepath with the literal string %%INPUT_FILE and replace the output filepath with the literal string %%OUTPUT. Doing so means that when you build the collection, Greenstone will replace %%INPUT_FILE with each DjVu document in your collection that it needs to process, and will replace %%OUTPUT with the expected text output file of each document upon conversion by djvutxt.
If you have any spaces in any filepaths in your exec_cmd, make sure to always nest that entire filepath in escaped double quotes (\"), so Greenstone can preserve the spaces in it. If any filepaths other than %%INPUT_FILE and %%OUTPUT are within your Greenstone installation, you can use %%GSDLHOME, %%GSDL3SRCHOME and %%GSDL3HOME (the latter for Greenstone 3's web folder) as placeholders and write out your filepaths relative to these. For instance, if your DjVuLibre is installed in your Greenstone's ext subfolder, then you would start the filepath to djvutxt with %%GSDL3SRCHOME/ext. The value for your exec_cmd field may look something like the following, if you have DjVuLibre installed in C:\Program Files. Note the escaped double quotes bookending the path to djvutxt, to protect spaces in its filepath:
\"C:\Program Files\DjVuLibre\djvutxt\" %%INPUT_FILE %%OUTPUT
On Unix systems, adjust the command you ran in the command line to now leave out any backslash protecting spaces in the command's filepaths, but ensure you have escaped double-quotes around such filepaths containing spaces. For example,
/Volumes/Macintosh\ HD/Applications/DjView.app/Contents/MacOS/djvutxt /PATH/TO/YOUR/GS/web/sites/localsite/collect/DjVuColl/import/superhero.djvu /PATH/TO/YOUR/GS/superhero.txt
becomes
\"/Volumes/Macintosh HD/Applications/DjView.app/Contents/MacOS/djvutxt\" %%INPUT_FILE %%OUTPUT
- Having sufficiently configured the UnknownConverterPlugin, click on the OK button to close the plugin's Configuration dialog. Move to the Create pane and build the collection. Your document has now been recognised. What's more, if you preview it and search for the term "Interoperability", a term that occurs in our collection's superhero.djvu document, you should now get a search result linking to that document. So Greenstone has successfully indexed the document's text, thanks to DjVuLibre's djvutxt tool extracting the text which got fed into the rest of Greenstone's building pipeline.
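To make the role of the exec_cmd placeholders more concrete, here is a small Python sketch of the kind of substitution described above. It is purely illustrative: `expand_exec_cmd` is a hypothetical helper, the paths are made up, and the way Greenstone itself parses and runs exec_cmd is not shown here and may well differ.

```python
import shlex


def expand_exec_cmd(exec_cmd, input_file, output):
    """Illustrative only: turn an exec_cmd template into an argument list.

    Assumption: this mimics the effect the tutorial describes for
    %%INPUT_FILE and %%OUTPUT; it is not Greenstone's own code.
    """
    # In the exec_cmd field, double quotes are written escaped (\").
    template = exec_cmd.replace('\\"', '"')
    # Split first, so a quoted tool path containing spaces stays one argument.
    args = shlex.split(template)
    # Then fill in the placeholders, so filepaths with spaces stay intact too.
    return [
        arg.replace("%%INPUT_FILE", input_file).replace("%%OUTPUT", output)
        for arg in args
    ]


if __name__ == "__main__":
    # Hypothetical tool location and document paths, for illustration only.
    cmd = r'\"/opt/my tools/djvutxt\" %%INPUT_FILE %%OUTPUT'
    print(expand_exec_cmd(cmd, "/tmp/import/superhero.djvu", "/tmp/superhero.txt"))
```

The sketch shows why the escaped double quotes matter: without them, a tool path containing a space would be split into two separate arguments.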
Associating an icon with DjVu documents in Greenstone
- When previewing the search result, you may notice that there's no proper icon for the document superhero.djvu. The Greenstone extracted text variant of the document has an icon, a plain text one. However, the superhero.djvu has the "unknown document format" icon, the one with the question mark on it. We can change this.
- Go back to the Design pane to configure your UnknownConverterPlugin once more. This time, enable the srcicon field and set its value to icondjvu. This is a macroname we're just inventing, though we're following existing Greenstone convention in naming document icon macros, in that it's of the form "icon<file-extension>". Click OK to close the UnknownConverterPlugin configuration dialog. Quit GLI, since there's a little more work to do.
- Greenstone doesn't have an icon for DjVu documents, since it doesn't know about the format. If you Google for the djvu icon, you'd probably find the Wikipedia page for it. Save one of their DjVu icon images. Then open the image in Windows Paint or GIMP or another image editor, and use the application's scaling feature to scale the image's height or width (whichever is greater) to anywhere between 26 and 32 pixels. Save the scaled image as a GIF file with the name "idjvu.gif", storing it in your Greenstone installation's web/interfaces/default/images folder. You can also use free online image resizing websites to carry out this step. If you're working offline, you can get a resized and ready copy of the idjvu.gif file from sample_files → djvu → idjvu.gif. Put it in your Greenstone 3 installation's web/interfaces/default/images folder.
- Greenstone knows nothing about the icondjvu macro we defined as the value for UnknownConverterPlugin's srcicon field, so we have to teach Greenstone about this new macro. Use a text editor to open your Greenstone 3's web/sites/localsite/siteConfig.xml file. Locate the line
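If your image editor asks for both dimensions when scaling, the arithmetic is simple enough to sketch. The Python helper below (`scaled_size` is a hypothetical name) computes a size whose larger side equals a chosen target in the 26 to 32 pixel range while keeping the aspect ratio; it does not touch any image file.

```python
def scaled_size(width, height, target=32):
    """Scale so the larger side equals `target`, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    larger = max(width, height)
    # Never let the smaller side collapse to zero pixels.
    new_width = max(1, round(width * target / larger))
    new_height = max(1, round(height * target / larger))
    return new_width, new_height


if __name__ == "__main__":
    # A 200x100 source image scaled so its larger side is 32 pixels.
    print(scaled_size(200, 100))  # (32, 16)
```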
<replace macro="_iconunknown_" scope="metadata" text="&lt;img src='interfaces/default/images/iunknown.gif' border='0'/&gt;" resolve="false"/>
Add a similar line above or below it and adjust it to say:
<replace macro="_icondjvu_" scope="metadata" text="&lt;img src='interfaces/default/images/idjvu.gif' border='0'/&gt;" resolve="false"/>
Save the file. The above has now associated the icon image we want appearing for the djvu document with the macro we defined for the srcicon field in UnknownConverterPlugin's configuration.
- Restart GLI, which will restart the Greenstone server, reloading the siteConfig.xml you have just edited. Rebuild the DjVu Collection and preview it. This time, when you browse the collection, you should see the djvu icon appearing in place of the unknown icon for your DjVu document.
- Having designed your collection to handle DjVu documents, you can now add any other documents, including more DjVu documents. Greenstone should now be able to index the text content of DjVu documents in the collection to make them searchable, in all instances where text can be successfully extracted from them by djvutxt. Make the search format statement look like the one below (you can copy it from sample_files → djvu → formats → format_tweaks.txt), then try searching:
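Before restarting GLI, it can be worth confirming that the edited file is still well-formed XML and that the new macro is present. The Python sketch below does this on a small sample; the `siteConfig` root element name and the `icon_macros` helper are assumptions made for illustration. Note that inside an XML attribute value the opening angle bracket of the img tag has to be written as the entity &lt; (and the closing one is conventionally written as &gt;), otherwise the file will not parse.

```python
import xml.etree.ElementTree as ET

# A cut-down stand-in for siteConfig.xml; the root element name is assumed.
SAMPLE = """<siteConfig>
  <replace macro="_iconunknown_" scope="metadata" text="&lt;img src='interfaces/default/images/iunknown.gif' border='0'/&gt;" resolve="false"/>
  <replace macro="_icondjvu_" scope="metadata" text="&lt;img src='interfaces/default/images/idjvu.gif' border='0'/&gt;" resolve="false"/>
</siteConfig>"""


def icon_macros(root):
    """Map each <replace> element's macro name to the text it expands to."""
    return {el.get("macro"): el.get("text") for el in root.iter("replace")}


if __name__ == "__main__":
    # ET.fromstring raises ParseError if the XML is not well-formed.
    macros = icon_macros(ET.fromstring(SAMPLE))
    print("_icondjvu_" in macros)
    print(macros["_icondjvu_"])
    # For the real file, something like this should work instead:
    #   root = ET.parse("/PATH/TO/YOUR/GS/web/sites/localsite/siteConfig.xml").getroot()
```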
<gsf:template match="documentNode">
  <td valign="top">
    <gsf:link type="document">
      <gsf:icon type="document"/>
    </gsf:link>
  </td>
  <td valign="top">
    <gsf:link type="source">
      <gsf:choose-metadata>
        <gsf:metadata name="thumbicon"/>
        <gsf:metadata name="srcicon"/>
      </gsf:choose-metadata>
    </gsf:link>
  </td>
  <td>
    <gsf:link type="document">
      <xsl:call-template name="choose-title"/>
    </gsf:link>
    <gsf:switch>
      <gsf:metadata name="equivDocLink"/>
      <gsf:when test="exists">
        Also available as: <gsf:metadata name="equivDocLink"/><gsf:metadata name="equivDocIcon"/><gsf:metadata name="/equivDocLink"/>
      </gsf:when>
    </gsf:switch>
  </td>
</gsf:template>