Extracting Metadata from Non-XML Files

Many non-XML document formats, such as PDF and Microsoft Office files, can carry embedded metadata. Titania Delivery can extract this metadata and store it as metadata on the corresponding content.

There is a wide variety of metadata schemas in use across non-XML content types. For metadata to be most useful in Titania Delivery, it should follow the same schema across all files. For this reason, Titania Delivery allows you to specify a mapping file that maps the metadata keys embedded in various file types to the metadata keys used within the system. A mapping file can be added to the project itself or to any document type associated with the project. In either case, it must be placed at /HARP-META/mappings.conf. When metadata is extracted from a piece of content, all mapping files available to the project and its associated doctypes are used.

In addition to using embedded properties as metadata, Titania Delivery will also use any title metadata as the object's title for purposes of searching.

You can determine the metadata keys present on a given file by looking at its Details tab.

Metadata Mapping Config File Format

The HARP-META/mappings.conf file uses a simple text format that describes mappings between embedded metadata names and Titania Delivery metadata names, grouped by file type. For example:

# Settings for all applicable file formats
[all]
dc:Title=title

# Settings for PDF files
[pdf]
cp:subject=Description
# This line will look in the metadata for the first key it finds,
# and use that value for 'author'.
dc:author|meta:author|pdf:author|Author|dc:creator|producer=author

If a metadata key is not represented in the settings file, it will not be copied into Titania Delivery.

There are three types of lines allowed in the mappings file.

Comments
Lines beginning with a pound sign (#) are treated as comments.
Groups
Lines that begin and end with square brackets ([]) are groups. The contents of the square brackets are either the file extension or the MIME type of the files to which the following mappings apply. For example, [pdf] rules will apply to any file with a .pdf extension, while [application/pdf] will match any content with that MIME type. The special group [all] can be used for mappings that apply to all non-XML content.
Mappings
All other lines are considered mapping lines. A mapping line consists of:
  1. The metadata key or keys from the source file. Multiple metadata keys can be combined with a pipe (|) character. If multiple keys are specified, the system will take the first metadata value it finds for one of the keys, scanning from left to right, and store it in Titania Delivery.
  2. An equal sign, =.
  3. The Titania Delivery metadata name to use for the file metadata key.

Only the source metadata key is required. If the equals sign and the Titania Delivery metadata name are omitted, the source key is used as-is as the Titania Delivery metadata name. For example:

[all]
title
description
dc:author=author

These rules will look for title and description metadata keys and store them with those names in Titania Delivery, while dc:author metadata will be stored as author.
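The three line types described above can be read with a short parser. The following Python sketch is an illustration only, not Titania Delivery's implementation: the function name parse_mappings and the returned data structure are hypothetical. It treats rules that appear before any group as belonging to [all], and it assumes that a bare rule with several piped keys is stored under the first key's name, which the format description does not specify. It also makes no special case for the [delimiters] group.

```python
def parse_mappings(text):
    """Parse mappings.conf-style text into {group: [(source_keys, target_name), ...]}."""
    groups = {}
    current = "all"  # rules before any group header are treated as [all]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            # Group header: a file extension, a MIME type, or the special name "all".
            current = line[1:-1].strip()
            groups.setdefault(current, [])
            continue
        # Mapping line: "key1|key2=target", or a bare "key" with no target.
        source, sep, target = line.partition("=")
        keys = [k.strip() for k in source.split("|") if k.strip()]
        if not keys:
            continue
        target = target.strip() if sep else ""
        # Assumption: with no target name, the first source key is used as-is.
        groups.setdefault(current, []).append((keys, target or keys[0]))
    return groups


example = """\
# Settings for all applicable file formats
[all]
title
description
dc:author=author

[pdf]
dc:author|meta:author|Author=author
"""

rules = parse_mappings(example)
for group, group_rules in rules.items():
    print(group, group_rules)
```

Each rule keeps its source keys in their original left-to-right order, so a later step can pick the first key that is present in a file.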

If multiple keys are mapped to the same Titania Delivery metadata name, the unique values of those keys will be combined and stored as distinct values in the Titania Delivery metadata. For example:

description=description
dc:description=description

If a file contains both description and dc:description entries, and their values are different, both values will be stored in the description metadata entry in Titania Delivery. To force the system to choose one or the other, combine them with a pipe (|) in a single rule, for example:

dc:description|description=description
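The difference between separate rules and a single piped rule can be sketched in Python. This is a minimal illustration under stated assumptions, not the product's code: apply_rules is a hypothetical helper, each rule is a (source_keys, target_name) pair, embedded metadata is modeled as a plain dictionary of single string values, and a key counts as "found" simply when it exists in that dictionary.

```python
def apply_rules(rules, embedded):
    """Apply (source_keys, target_name) rules to a dict of embedded metadata."""
    result = {}
    for keys, target in rules:
        # Within one rule, take the first key that is present, scanning left to right.
        value = next((embedded[k] for k in keys if k in embedded), None)
        if value is None:
            continue  # none of the rule's keys exist in this file
        values = result.setdefault(target, [])
        if value not in values:  # store only unique values per target name
            values.append(value)
    return result


embedded = {"description": "User guide", "dc:description": "Guide for users"}

# Two rules with the same target: both distinct values are kept.
separate = [(["description"], "description"), (["dc:description"], "description")]
# One piped rule: only the first key found contributes a value.
combined = [(["dc:description", "description"], "description")]

print(apply_rules(separate, embedded))
print(apply_rules(combined, embedded))
```

With the separate rules, the description entry holds both values. With the piped rule, it holds only the dc:description value, because that key is listed first.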

Specifying Delimiters for Multiple Values

Some systems make it difficult or impossible to create embedded properties with multiple values. In such cases, the values can be concatenated into a single string, separated by a delimiter. Delimiters are specified using the special group [delimiters].

In the [delimiters] group, the key to each entry is the Titania Delivery metadata name, and the value is the delimiter to use. The default entry, if any, will be applied to all metadata without a specific rule. If there is no default entry, then only those metadata fields with explicit rules will have their values split.

The delimiter value should be the exact character or string of characters used to separate values. The following special rules apply:

  • A value of \t, \n, or \s indicates tab characters, newline characters, or all whitespace characters, respectively.
  • A value beginning and ending with a forward slash character (/) indicates that the text between the slashes is a regular expression.

Here is an example delimiter configuration:

[delimiters]
default=,
platform|audience=;
country=\n
author=/[;, ]/

This example specifies the following rules:

  • The default delimiter, applied to all metadata extracted from embedded properties that has no more specific rule, is a comma.
  • The delimiter for platform and audience metadata is a semicolon.
  • The country metadata will be split at newlines.
  • The author metadata will be split using a regular expression matching on semicolons, commas, or spaces.
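The delimiter rules above can be sketched in Python as follows. This is an illustrative sketch, not the actual implementation: make_splitter and split_values are hypothetical names, and the sample metadata values are invented. It assumes that \s treats a run of whitespace as one separator and that surrounding whitespace and empty pieces are discarded after splitting, neither of which the description above specifies.

```python
import re


def make_splitter(delimiter):
    """Return a function that splits a string on the configured delimiter."""
    if len(delimiter) > 2 and delimiter.startswith("/") and delimiter.endswith("/"):
        pattern = re.compile(delimiter[1:-1])  # /.../ is a regular expression
    elif delimiter == r"\t":
        pattern = re.compile(r"\t")  # tab characters
    elif delimiter == r"\n":
        pattern = re.compile(r"\n")  # newline characters
    elif delimiter == r"\s":
        pattern = re.compile(r"\s+")  # assumption: a whitespace run is one separator
    else:
        pattern = re.compile(re.escape(delimiter))  # exact character or string

    def split(value):
        # Assumption: trim each piece and drop empty ones.
        return [part.strip() for part in pattern.split(value) if part.strip()]

    return split


def split_values(metadata, delimiters):
    """Split each value with its own delimiter, falling back to the default entry."""
    lookup = {}
    for names, delimiter in delimiters.items():
        for name in names.split("|"):  # "platform|audience" covers both names
            lookup[name.strip()] = delimiter
    default = lookup.pop("default", None)
    result = {}
    for name, value in metadata.items():
        delimiter = lookup.get(name, default)
        # Without a specific rule or a default entry, the value is left unsplit.
        result[name] = make_splitter(delimiter)(value) if delimiter else [value]
    return result


# The [delimiters] example from above, written as a dictionary.
delimiters = {"default": ",", "platform|audience": ";", "country": r"\n", "author": "/[;, ]/"}
metadata = {
    "keywords": "install, upgrade",
    "platform": "linux;windows",
    "country": "US\nCA",
    "author": "ann;bob, carol",
}
print(split_values(metadata, delimiters))
```

If the default entry is removed from the dictionary, keywords stays a single value, which matches the rule that only fields with explicit entries are split when no default exists.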

Applicable File Types

The HARP-META/mappings.conf file is supported for the following file formats:

  • PDF
  • Microsoft Office
  • RTF
  • MP3
  • MP4
  • PNG
  • JPG
  • TIFF
  • DWG
  • EPUB

Default Mappings

Titania Delivery includes a built-in pseudo-doctype, called Non-XML Metadata Mappings, that carries the default mappings file. This doctype is added to new projects automatically. The contents of the default mapping file are as follows.

# This file defines mappings from embedded metadata in file formats like PDF,
# Microsoft Office, and some graphic formats, to the metadata keys to use
# for those values in Titania Delivery.
#
# Each rule consists of an embedded_name=td_name pair. If the embedded name
# and the TD name are identical, the TD name is optional. For example, to
# map the embedded 'title' metadata to the name 'title' in TD, the rule
# can simply be coded as "title".
#
# If multiple embedded metadata names contribute to a multi-valued TD metadata
# entry, provide multiple rules. For example:
#
#     a=z
#     b=z
#
# Will result in the TD metadata 'z' having both the values from 'a' and 'b'.
#
# To have a piece of embedded metadata copied into multiple TD metadata
# entries, again, provide multiple rules. For example:
#
#     a=x
#     a=y
#     a=z
#
# Will cause the embedded metadata 'a' to be copied to x, y, and z in TD.
#
# To populate a piece of metadata from any one of several metadata, but not
# all of them, specify each embedded metadata key in the same rule, separated
# by a pipe (|) character. For example:
#
#     a|b|c=d
#
# - Will use 'a' if it is present, ignoring 'b' and 'c'
# - Will use 'b' if it is present and 'a' is not, ignoring 'c'
# - Will use 'c' if neither 'a' nor 'b' is present.
#
# Rules for specific formats can be grouped beneath a [type] line, where the
# type is either the file extension or MIME type of the file type in question.
# Rules occurring before any groups, or occurring in the [all] group, apply to
# all non-XML file types.

# These mappings apply for all file types
[all]

title|Title|dc:title=title
subject|Subject|dc:subject|cp:subject=subject
Revision-Number|cp:revision=revision
author|Author|Last-Author|meta:author|meta:last-author=author
Creation-Date|created|meta:creation-date|dcterms:created=created
Last-Modified|modified|dcterms:modified=modified

# Duplicate language for both 'lang' and 'locale' metadata.
language|Content-Language=lang
language|Content-Language=locale