An external preprocessor can modify documents before they are indexed. This makes it possible to convert binary data to text, or to generate or extract metadata (from images, for example) for indexing, so that searches find the documents concerned more reliably. You can define as many preprocessors as you require.
Documents of any MIME type can be associated with a preprocessor. This can be done by means of the indexing section in the system configuration. Any suitable program can be used as an external preprocessor. Optionally, arguments can be passed to such a program.
The preprocessor program receives the document to be indexed via stdin from the Search Server. The document passed to the preprocessor is a serialized XML document. The preprocessor modifies it in the desired way and returns it to the Search Server via stdout. The Search Server then indexes the modified document. An example:
Original data:
<ses-indexDoc docId="2148" collection="cm-contents" mimeType="application/vnd.ms-excel">
  <title encoding="plain">Ein Beispiel mit Excel-Daten</title>
  <keyword encoding="plain">Beispiel</keyword>
  <blob encoding="stream" mimeType="application/vnd.ms-excel">
    /Fiona_671/instance/default/tmp/externalPreprocessor/1.dat
  </blob>
</ses-indexDoc>
Modified data:
<ses-indexDoc docId="2148" collection="cm-contents" mimeType="application/vnd.ms-excel">
  <title encoding="plain">Excel-Daten als Text</title>
  <keyword encoding="plain">Beispiel</keyword>
  <blob encoding="stream" mimeType="text/plain">
    /Fiona_671/instance/default/tmp/text_data.dat
  </blob>
</ses-indexDoc>
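The stdin/stdout exchange described above can be sketched as a small filter. This is a minimal sketch, not the product's reference implementation. It uses Python instead of Tcl, which is an assumption permitted only by the statement that any suitable program can serve as a preprocessor. The function names preprocess and main are hypothetical, and the whitespace normalization merely stands in for a real modification. It is also assumed that the Search Server accepts the serialized root element without an XML declaration and that stdin and stdout use the same character encoding.

```python
import re
import sys
import xml.etree.ElementTree as ET


def preprocess(xml_text: str) -> str:
    """Return the index document with whitespace in plain fields normalized."""
    root = ET.fromstring(xml_text)
    for field in root:
        # Only plain-encoded fields carry their value directly in the element.
        if field.get("encoding") == "plain" and field.text:
            field.text = re.sub(r"\s+", " ", field.text).strip()
    # Serialize without an XML declaration or DOCTYPE (assumption: the
    # Search Server accepts the bare root element, as in the examples).
    return ET.tostring(root, encoding="unicode")


def main() -> None:
    """Filter entry point: XML document in on stdin, modified XML out on stdout."""
    sys.stdout.write(preprocess(sys.stdin.read()))
```

When saved as a script, the file would end with a call to main() so that it runs as a filter. Fields with an encoding other than plain are passed through unchanged here.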
The XML document contains the fields to be indexed (the names of the XML elements) as well as their values (the content of the XML elements). A field value may either be contained directly in the element's content (encoding: plain) or it may have been encoded. The encoding can be determined by means of the encoding attribute of the field element. Its value can be one of:
plain: The field value is the content of the XML element.
base64: The field value can be determined by base64-decoding the content of the XML element.
stream: The field value is contained in the file whose path is specified in the content of the XML element.
For all encodings except plain, the MIME type of the document is provided as the value of the mimeType attribute of the field element. If the MIME type is changed during preprocessing, the mimeType attribute must be set to the MIME type of the resulting field value. If the encoding is not plain, a field value will only be indexed if its MIME type matches text/*. In other words: if a preprocessor produces base64-encoded or streamed field values, it must set their MIME type to a text type.
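The three encodings and the text/* rule can be illustrated with a few helper functions. This is a sketch under stated assumptions, written in Python purely for illustration. The names read_field_value, will_be_indexed and store_text_as_base64 are hypothetical. It is further assumed that a missing encoding attribute means plain, that whitespace around a stream path is not significant, and that text values are UTF-8.

```python
import base64
import xml.etree.ElementTree as ET


def read_field_value(field: ET.Element) -> bytes:
    """Return the raw value of a field according to its encoding attribute."""
    # Assumption: a field without an encoding attribute is treated as plain.
    encoding = field.get("encoding", "plain")
    content = field.text or ""
    if encoding == "plain":
        return content.encode("utf-8")
    if encoding == "base64":
        return base64.b64decode(content)
    if encoding == "stream":
        # Assumption: whitespace around the path is not significant.
        with open(content.strip(), "rb") as handle:
            return handle.read()
    raise ValueError(f"unknown encoding: {encoding!r}")


def will_be_indexed(field: ET.Element) -> bool:
    """Apply the rule from the text: non-plain values need a text/* MIME type."""
    if field.get("encoding", "plain") == "plain":
        return True
    return field.get("mimeType", "").startswith("text/")


def store_text_as_base64(field: ET.Element, text: str) -> None:
    """Replace the field value with converted text and adjust its attributes."""
    field.text = base64.b64encode(text.encode("utf-8")).decode("ascii")
    field.set("encoding", "base64")
    field.set("mimeType", "text/plain")
```

A preprocessor would typically call read_field_value on the blob element, convert the bytes to text with whatever tool fits the source format, and then store the result with a text MIME type so that the value is actually indexed.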
The preprocessor to be used, the MIME types to which it is applied, and the arguments to be passed to it can be specified in the indexing.xml configuration file. The corresponding section might look like this, for example:
...
<contentPreprocessors type="list">
  <preprocessor>
    <mimeTypes type="list">
      <mimeType>application/pdf</mimeType>
    </mimeTypes>
    <processor type="external">
      bin/tclsh
    </processor>
    <processorArguments type="list">
      <argument>/Fiona_671/instance/default/script/custom/pdf2TxtWrapper.tcl</argument>
    </processorArguments>
  </preprocessor>
  ...
</contentPreprocessors>
...
Here, the Tcl interpreter was specified as the preprocessor program to use. The name of the script to be executed is passed to this program as an argument in the processorArguments element. Since the script cannot be loaded during server startup, it should not be placed into the serverCmds or clientCmds directory.
The following sample script, pdf2TxtWrapper.tcl, demonstrates how a PDF document, which is contained as the blob field in the XML document, can be read and converted to text. Please note that no preprocessor is required for the Search Server to index PDF documents.
# Libraries
package require dom
package require base64
proc safeInterp {args} {}
source [file join [file dirname [info script]]\
    ../../../share/script/common/clientCmds/util.tcl]

# Read Data
set xmlRequest [read stdin]

# Parse XML
set docNode [::dom::DOMImplementation parse $xmlRequest]
set rootNode [::dom::document cget $docNode -documentElement]

# Select and handle element "blob"
set blobElement [lindex [::dom::selectNode $rootNode descendant::blob] 0]
array set attributes [array get [$blobElement cget -attributes]]
set blobTextNode [$blobElement cget -firstChild]
if {$blobTextNode ne ""} {
    set value [$blobTextNode cget -nodeValue]
    if {$value ne ""} {
        switch $attributes(encoding) {
            plain {
                # shouldn't happen with pdf
                set blob $value
            }
            base64 {
                set blob [::base64::decode $value]
            }
            stream {
                set blobFile $value
            }
        }
        set deletePdfFile 0
        if {![info exists blobFile]} {
            set blobFile "/tmp/convert_me_[pid].pdf"
            writeFile $blobFile $blob
            set deletePdfFile 1
        }
        set textFile "/tmp/converted_[pid].txt"
        # convert using ps2ascii
        if {![catch { exec ps2ascii $blobFile $textFile }]} {
            # modify the dom tree
            $blobTextNode configure -nodeValue $textFile
            ::dom::element setAttribute $blobElement mimeType "text/plain"
            ::dom::element setAttribute $blobElement encoding stream
        }
        if {$deletePdfFile} {
            file delete -force $blobFile
        }
    }
}
set xmlToReturn [string trimright [::dom::DOMImplementation serialize $docNode] "\n"]
set lines [split $xmlToReturn "\n"]
if {[string match "<!D*" [lindex $lines 1]]} {
    set xmlToReturn [join [lreplace $lines 1 1] "\n"]
}
# return the (modified) xml data
puts -nonewline $xmlToReturn
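Before a preprocessor is registered in the configuration, it can be useful to try it outside the Search Server. The following sketch pipes a sample document through a command and reports fields that would not be indexed under the text/* rule. It only imitates the stdin/stdout contract described in this section and is not how the Search Server itself invokes the program. The names run_preprocessor, unindexable_fields and IDENTITY_COMMAND are hypothetical, and the identity command is just a stand-in for a real preprocessor.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def run_preprocessor(command: list[str], xml_text: str) -> str:
    """Run the command with the document on stdin and return its stdout."""
    result = subprocess.run(
        command, input=xml_text, capture_output=True, text=True, check=True
    )
    return result.stdout


def unindexable_fields(xml_text: str) -> list[str]:
    """Return names of non-plain fields whose MIME type is not text/*."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return [
        field.tag
        for field in root
        if field.get("encoding", "plain") != "plain"
        and not field.get("mimeType", "").startswith("text/")
    ]


# Stand-in preprocessor for illustration: copies stdin to stdout unchanged.
IDENTITY_COMMAND = [sys.executable, "-c",
                    "import sys; sys.stdout.write(sys.stdin.read())"]
```

To check the Tcl sample, the command list would instead name the interpreter and the script path from the configuration section. An empty result from unindexable_fields indicates that every encoded field carries a text MIME type.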