Extracting Metadata from Non-XML Files
Non-XML content types use a wide variety of metadata schemas. For metadata to be most useful in Titania Delivery, it should follow the same schema across all files. For this reason, Titania Delivery allows you to specify a mapping file that maps the metadata keys embedded in various file types to the metadata keys used within the system. These mapping files can be added to the project itself or to document types that are associated with the project. Whether the mapping file exists in a project or a document type, it must be placed at /HARP-META/mappings.conf. When extracting metadata from a piece of content, all mapping files available to the project and its associated doctypes will be used.
In addition to using embedded properties as metadata, Titania Delivery will also use any title metadata as the object's title for purposes of searching.
You can determine the metadata keys present on a given file by looking at its Details tab.
Metadata Mapping Config File Format
The HARP-META/mappings.conf file format is a simple text format describing mappings between embedded metadata names and Titania Delivery metadata names, grouped by file type. For example:
    # Settings for all applicable file formats
    [all]
    dc:Title=title
    # Settings for PDF files
    [pdf]
    cp:subject=Description
    # This line will look in the metadata for the first key it finds,
    # and use that value for 'author'.
    dc:author|meta:author|pdf:author|Author|dc:creator|producer=author
If a metadata key is not represented in the settings file, it will not be copied into Titania Delivery.
There are three types of lines allowed in the mappings file.
- Comments: Lines beginning with a pound sign (#) are treated as comments.
- Groups: Lines that begin and end with square brackets ([]) are groups. The content of the square brackets is either the file extension or the MIME type of the files to which the following mappings apply. For example, [pdf] rules will apply to any file with a .pdf extension, while [application/pdf] will match any content with that MIME type. The special group [all] can be used for mappings that apply to all non-XML content.
- Mappings: All other lines are considered mapping lines. A mapping line consists of:
  - The metadata key or keys from the source file. Multiple metadata keys can be combined with a pipe (|) character. If multiple keys are specified, the system will take the first metadata value it finds for one of the keys, scanning from left to right, and store it in Titania Delivery.
  - An equal sign, =.
  - The Titania Delivery metadata name to use for the file metadata key.
  Only the metadata key is required. If no mapping is specified, the metadata name will be used as-is. For example:
      [all]
      title
      description
      dc:author=author
  These rules will look for title and description metadata keys and store them with those names in Titania Delivery, while dc:author metadata will be stored as author.
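The three line types described above can be illustrated with a small parser. This is a minimal sketch, not Titania Delivery's implementation: the function names parse_mappings and rules_for are hypothetical, and it assumes that blank lines are ignored, that group matching is case-insensitive, and that a piped rule without a target uses its first key as the name. It does not give the special [delimiters] group any special treatment.

```python
import os


def parse_mappings(text):
    """Parse mappings.conf-style text into {group: [(source_keys, target), ...]}."""
    groups = {"all": []}
    current = "all"  # rules that appear before any group header count as [all]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment, or blank line (assumed to be ignorable)
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            groups.setdefault(current, [])
            continue
        source, _, target = line.partition("=")
        keys = [k.strip() for k in source.split("|") if k.strip()]
        if not keys:
            continue  # assumption: a line with no source key is skipped
        # With no target, the metadata name is used as-is. For a piped rule
        # the first key is used here, which is an assumption.
        groups[current].append((keys, target.strip() or keys[0]))
    return groups


def rules_for(groups, filename, mime_type):
    """Collect the [all] rules plus every group matching the extension or MIME type."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    wanted = {mime_type.lower()}
    if ext:
        wanted.add(ext)
    selected = list(groups.get("all", []))
    for name, rules in groups.items():
        if name != "all" and name.lower() in wanted:
            selected.extend(rules)
    return selected
```

Calling rules_for(parse_mappings(text), "manual.pdf", "application/pdf") would return the [all] rules followed by any [pdf] and [application/pdf] rules. The order in which the real system applies overlapping groups is not documented here, so the ordering in this sketch is only illustrative.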
If multiple keys are mapped to the same Titania Delivery metadata name, the unique values of those keys will be combined and stored as distinct values in the Titania Delivery metadata. For example:
    description=description
    dc:description=description
If a file contains both description and dc:description entries, and their values are different, both values will be stored in the description metadata entry in Titania Delivery. To force the system to choose one or the other, combine them with |, e.g.
    dc:description|description=description
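The difference between separate rules and a piped rule can be sketched as follows. This is an illustration under stated assumptions rather than the system's actual code: apply_rules is a hypothetical helper, rules are represented as (source_keys, target) pairs, and embedded metadata is a plain dictionary of single values.

```python
def apply_rules(rules, embedded):
    """Map embedded metadata to target names.

    rules    -- list of (source_keys, target) pairs, in file order
    embedded -- dict of embedded metadata key -> value
    """
    result = {}
    for source_keys, target in rules:
        for key in source_keys:
            if key in embedded:
                values = result.setdefault(target, [])
                if embedded[key] not in values:
                    values.append(embedded[key])  # keep only unique values
                break  # piped keys: the first key found wins
    return result


embedded = {"description": "Short text", "dc:description": "Long text"}

# Two separate rules: both distinct values are stored.
separate = [(["description"], "description"), (["dc:description"], "description")]
print(apply_rules(separate, embedded))

# One piped rule: only the first key found contributes.
piped = [(["dc:description", "description"], "description")]
print(apply_rules(piped, embedded))
```

Keys that no rule mentions never appear in the result, which matches the statement that unmapped keys are not copied into Titania Delivery.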
Specifying Delimiters for Multiple Values
Some systems make it difficult or impossible to create embedded properties with multiple values. In such cases, multiple values can be specified with all values concatenated together into a single value, separated by a delimiter. Delimiters are specified using the special group [delimiters].
In the [delimiters] group, the key to each entry is the Titania Delivery metadata name, and the value is the delimiter to use. The default entry, if any, will be applied to all metadata without a specific rule. If there is no default entry, then only those metadata fields with explicit rules will have their values split.
The delimiter value should be the exact character or string of characters used to separate values. The following special rules apply:
- A value of \t, \n, or \s indicates tab characters, newline characters, or all whitespace characters, respectively.
- A value beginning and ending with a forward slash character (/) indicates that the value is a Regular Expression.
Here is an example delimiter configuration:
    [delimiters]
    default=,
    platform|audience=;
    country=\n
    author=/[;, ]/
This example specifies the following rules:
- The default delimiter, applied to all metadata properties extracted from embedded properties, is a comma.
- The delimiter for platform and audience metadata is a semicolon.
- The country metadata will be split at newlines.
- The author metadata will be split using a regular expression matching on semicolons, commas, or spaces.
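The delimiter rules above can be sketched as a small splitting function. Several details here are assumptions and not documented behavior: split_values is a hypothetical helper, the piped platform|audience entry is taken to have been expanded into one dictionary entry per name, \s is treated as splitting on runs of whitespace, regular expressions use Python's syntax, and surrounding whitespace and empty pieces are dropped.

```python
import re


def split_values(name, value, delimiters):
    """Split one embedded value into a list using the rule for `name`.

    delimiters -- dict of metadata name -> delimiter spec, with an
                  optional "default" entry
    """
    spec = delimiters.get(name, delimiters.get("default"))
    if spec is None:
        return [value]  # no explicit rule and no default: leave unsplit
    if spec == r"\t":
        parts = value.split("\t")
    elif spec == r"\n":
        parts = value.split("\n")
    elif spec == r"\s":
        parts = re.split(r"\s+", value)
    elif len(spec) > 2 and spec.startswith("/") and spec.endswith("/"):
        parts = re.split(spec[1:-1], value)  # /.../ marks a regular expression
    else:
        parts = value.split(spec)  # literal character or string
    # Assumption: trim each piece and drop empty ones.
    return [p.strip() for p in parts if p.strip()]


delimiters = {
    "default": ",",
    "platform": ";",
    "audience": ";",
    "country": r"\n",
    "author": "/[;, ]/",
}
print(split_values("author", "Ann; Bob,Cy", delimiters))
print(split_values("country", "US\nCA", delimiters))
```

With the example configuration, any name without its own entry, such as a hypothetical keywords field, falls back to the comma in the default entry.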
Applicable File Types
The HARP-META/mappings.conf file is supported for the following file formats:
- Microsoft Office
- RTF
- MP3
- MP4
- PNG
- JPG
- TIFF
- DWG
- EPUB
Default Mappings
Titania Delivery includes a built-in pseudo-doctype, called Non-XML Metadata Mappings, that carries the default mappings file. This doctype is added to new projects by default. The default mapping file's contents are as follows.
    # This file defines mappings from embedded metadata in file formats like PDF,
    # Microsoft Office, and some graphic formats, to the metadata keys to use
    # for those values in Titania Delivery.
    #
    # Each rule consists of an embedded_name=td_name pair. If the embedded name
    # and the TD name are identical, the TD name is optional. For example, to
    # map the embedded 'title' metadata to the name 'title' in TD, the rule
    # can simply be coded as "title".
    #
    # If multiple embedded metadata names contribute to a multi-valued TD metadata
    # entry, provide multiple rules. For example:
    #
    # a=z
    # b=z
    #
    # Will result in the TD metadata 'z' having both the values from 'a' and 'b'.
    #
    # To have a piece of embedded metadata copied into multiple TD metadata
    # entries, again, provide multiple rules. For example:
    #
    # a=x
    # a=y
    # a=z
    #
    # Will cause the embedded metadata 'a' to be copied to x, y, and z in TD.
    #
    # To populate a piece of metadata from any one of several metadata, but not
    # all of them, specify each embedded metadata key in the same rule, separated
    # by a pipe (|) character. For example:
    #
    # a|b|c=d
    #
    # - Will use 'a' if it is present, ignoring 'b' and 'c'
    # - Will use 'b' if it is present and 'a' is not, ignoring 'c'
    # - Will use 'c' if neither 'a' nor 'b' are present.
    #
    # Rules for specific formats can be grouped beneath a [type] line, where the
    # type is either the file extension or MIME type of the file type in question.
    # Rules occurring before any groups, or occurring in the [all] group, apply to
    # all non-XML file types.
    [all]
    # These mappings apply for all file types
    title|Title|dc:title=title
    subject|Subject|dc:subject|cp:subject=subject
    Revision-Number|cp:revision=revision
    author|Author|Last-Author|meta:author|meta:last-author=author
    Creation-Date|created|meta:creation-date|dcterms:created=created
    Last-Modified|modified|dcterms:modified=modified
    # Duplicate language for both 'lang' and 'locale' metadata.
    language|Content-Language=lang
    language|Content-Language=locale