Controlling PDF Processing

Titania Delivery automatically extracts full text and metadata from PDF files. Some aspects of this extraction can be controlled using special metadata values.

Titania Delivery uses a software library from the Apache Tika project to extract text from PDF files.

Some PDF files contain strings of text in which each character is duplicated. By default, these strings are extracted with the duplicated characters intact, for example "WWiirree" for "Wire". The duplicated strings are indexed for searching, which can reduce search accuracy. They can also appear as text fragments in search results.
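One way to check whether a file is affected is to scan its extracted text for doubled tokens. The following Python sketch is only an illustration and is not part of Titania Delivery; the function names are hypothetical, and the heuristic assumes every character is repeated exactly twice, so it will also flag rare legitimate tokens such as "aabb".

```python
import re


def looks_doubled(token: str) -> bool:
    """Return True if the token consists entirely of adjacent character pairs."""
    # Require at least two pairs so ordinary words like "oo" are not flagged.
    if len(token) < 4 or len(token) % 2 != 0:
        return False
    return all(token[i] == token[i + 1] for i in range(0, len(token), 2))


def collapse_doubled(token: str) -> str:
    """Keep one character from each pair of a doubled token; leave others as-is."""
    return token[::2] if looks_doubled(token) else token


def find_doubled_tokens(text: str) -> list[str]:
    """List the word tokens in the text that look like doubled strings."""
    return [t for t in re.findall(r"\w+", text) if looks_doubled(t)]
```

For example, `find_doubled_tokens` applied to the text of a search result fragment returns the tokens that look doubled, and `collapse_doubled` shows what each one was probably meant to be.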

If you notice these symptoms, you can apply the following metadata either on the item or at the project level.

_td.PDFsuppressDuplicateText

If set to "true", duplicate characters are suppressed during extraction. The default value is "false".

Note: Enabling this behavior may cause some legitimate strings to be omitted from the extracted text. It may also increase processing time.

This metadata can be set at either the project or the file level. If it is set at both levels, the file-level value takes precedence.
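The precedence rule can be sketched as a simple lookup. This is a minimal Python illustration, not Titania Delivery's implementation; the function name and the use of plain dictionaries to hold project and file metadata are assumptions made for the example.

```python
# Metadata name taken from this topic; everything else here is hypothetical.
KEY = "_td.PDFsuppressDuplicateText"


def effective_value(project_meta: dict, file_meta: dict,
                    key: str = KEY, default: str = "false") -> str:
    """Resolve a setting: file level first, then project level, then the default."""
    if key in file_meta:
        return file_meta[key]
    if key in project_meta:
        return project_meta[key]
    return default


# A file-level "false" overrides a project-level "true".
project = {KEY: "true"}
print(effective_value(project, {KEY: "false"}))
# A file with no value of its own inherits the project setting.
print(effective_value(project, {}))
```

Under this model, a project can enable suppression for all of its PDF files while individual files opt out by setting the value to "false".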

Important: After modifying PDF processing metadata, all affected files must be re-processed for the changes to take effect.

Titania Delivery Administrator's Guide
