Controlling PDF Processing
Titania Delivery uses a software library from the Apache Tika project to extract text from PDF files.
Some PDF files contain text strings whose characters are duplicated. By default, these strings are extracted with the duplicated characters intact, for example "WWiirree" instead of "Wire". Because the extracted strings are indexed for searching, they can reduce search accuracy. They can also appear as text fragments in search results.
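To check whether extracted text shows this symptom, a simple heuristic can flag words in which every character appears twice in a row. The sketch below is only an illustration: the function names are hypothetical, it is not part of Titania Delivery or Apache Tika, and it assumes you already have the extracted text as a plain string.

```python
def looks_doubled(token: str) -> bool:
    """Return True if every character in the token appears twice in a row."""
    # Very short or odd-length tokens cannot be fully doubled words.
    if len(token) < 4 or len(token) % 2 != 0:
        return False
    # Compare each character at an even index with its right-hand neighbor.
    return all(token[i] == token[i + 1] for i in range(0, len(token), 2))


def find_doubled_tokens(text: str) -> list[str]:
    """Return the whitespace-separated tokens that look doubled."""
    return [token for token in text.split() if looks_doubled(token)]
```

For example, `find_doubled_tokens("Replace the WWiirree harness")` flags only "WWiirree". Ordinary words with a double letter, such as "book", are not flagged because not every character is doubled.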
If you notice these symptoms, you can apply the following metadata either on the item or at the project level.
_td.PDFsuppressDuplicateText
-
If set to "true", duplicate characters are suppressed during extraction. The default is "false".
Note: Enabling this behavior may cause some legitimate strings to be omitted from the extracted text. It may also increase processing time.
This metadata can be set at either the project or the file level. If it is set at both levels, the file-level value takes precedence.
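The precedence rule can be sketched as a small lookup. This is a minimal illustration, not Titania Delivery's implementation: the function name and the use of plain dictionaries to hold metadata are assumptions, as is the exact string comparison with "true".

```python
KEY = "_td.PDFsuppressDuplicateText"


def resolve_suppress_duplicates(project_meta: dict, file_meta: dict) -> bool:
    """Return the effective setting, preferring file-level metadata."""
    # Check the file level first so that it overrides the project level.
    for meta in (file_meta, project_meta):
        if KEY in meta:
            # Assumption: only the exact string "true" enables suppression.
            return meta[KEY] == "true"
    # Neither level sets the key, so fall back to the documented default.
    return False
```

With `{KEY: "true"}` at the project level and `{KEY: "false"}` at the file level, the function returns `False` for that file, while other files in the project that do not set the key resolve to `True`.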