UNCLASSIFIED

Mostly-Automatic Construction of a Palantir Knowledge Base with NetOwl Extractor
Matthew C. Lowry
Command, Control, Communications and Intelligence Division
Defence Science and Technology Organisation
DSTO-GD-0748
ABSTRACT
A knowledge base describing entities, events, and links between them is a valuable tool
for intelligence analysis. However constructing a knowledge base from unstructured
source material is a labour intensive process. This leads to the desire for a process to
automatically construct a knowledge base from unstructured source material. NetOwl
Extractor is an information extraction system that processes unstructured text
documents to extract structured information. Palantir is a knowledge base system that
enables structured information to be combined into a single knowledge base and
effectively exploited by analysts. This report describes the result of an experiment to
combine these two systems; specifically to translate the output of NetOwl Extractor
into a form that Palantir can ingest into its knowledge base. It was found that although
the translation process was straightforward, the knowledge base obtained was of a
poor quality and questionable utility for an intelligence analyst.

RELEASE LIMITATION
Approved for public release

Published by
Command, Control, Communications and Intelligence Division
DSTO Defence Science and Technology Organisation
PO Box 1500
Edinburgh South Australia 5111 Australia
Telephone: 1300 DEFENCE
Fax: (08) 7389 6567
© Commonwealth of Australia 2013
AR-015-636
June 2013

APPROVED FOR PUBLIC RELEASE

Executive Summary
A knowledge base describing entities (people, places, organisations, etc.), events, and
links between them is a valuable tool for intelligence analysis. However for analysts to
manually construct a knowledge base from unstructured (i.e. free text) source material
is a labour-intensive process. When an analyst's immediate goal is to produce
reporting on a topic of interest, any effort spent in the construction of a knowledge
base actually detracts from their efficiency in achieving their immediate task. So while
the existence of a knowledge base with clear, accurate, and topical knowledge is useful
to the analyst and will increase their efficiency in the long term, the effort to construct
the knowledge base is seen as a distraction. This leads to the desire for intelligence
processing systems that can automatically construct a knowledge base from available
source material.
To date commercial offerings have focussed on individual tasks required for an
automatic knowledge base construction process. Two particular commercial tools that
are relevant are NetOwl Extractor from SRA International, Inc. and Palantir from
Palantir Technologies, Inc. NetOwl Extractor is an information extraction system that
processes unstructured text documents to extract structured information; in particular
information about entities, events, and the links between them. However NetOwl
Extractor works on the level of individual documents, and has no knowledge base
capability. Palantir is a knowledge base system that enables information that is in a
structured form (databases, spreadsheets, etc.) to be combined into a single knowledge
base and effectively exploited by analysts. However Palantir does not have any
mechanism to automatically process unstructured textual source material.
This report describes the result of an experiment to combine these two systems. The
goal was to examine the potential for a fully automated process to generate a
knowledge base from a large corpus of unstructured text documents. It was found that
although the process was straightforward, the knowledge base obtained was of a poor
quality and questionable utility for an intelligence analyst. Also it was found that the
architecture and design of the Palantir system was not optimised to efficiently support
consumption of large amounts of output from information extraction systems such as
NetOwl Extractor.

Contents
1. Introduction
2. Procedure
2.1 Trial Corpus and its Preparation
2.2 Processing with NetOwl Extractor
2.3 Generation of DocXML Files
2.3.1 Mapping Entities
2.3.2 Mapping Events
2.3.3 Mapping Links
2.4 Importation of DocXML and Entity Resolution in Palantir
3. Observations
3.1 Poor Quality Knowledge Base
3.1.1 Source Document Artifacts Confusing Information Extraction
3.1.2 Erroneous Named Entity Recognition
3.1.3 Erroneous Named Entity Resolution
3.2 Performance Issues
4. Conclusion

1. Introduction
A knowledge base describing entities (people, places, organisations, etc.), events, and links
between them is a valuable tool for intelligence analysis. However for analysts to
manually construct a knowledge base from unstructured (i.e. free text) source material is a
labour-intensive process. When an analyst's immediate goal is to produce reporting on a
topic of interest, any effort spent in the construction of a knowledge base actually detracts
from their efficiency in achieving their immediate task. So while the existence of a
knowledge base with clear, accurate, and topical knowledge is useful to the analyst and
would increase their efficiency in the long term, the construction of the knowledge base is
seen as a distraction. This leads to the desire for intelligence processing systems that can
automatically construct a knowledge base from available source material.
To date commercial offerings have focussed on individual tasks required for an automatic
knowledge base construction process. Two particular commercial tools that are relevant
are NetOwl Extractor from SRA International, Inc. and Palantir from Palantir
Technologies, Inc. NetOwl Extractor is an information extraction system that processes
unstructured text documents to extract structured information; in particular information
about entities, events, and the links between them. However NetOwl Extractor works on
the level of individual documents, and has no knowledge base capability. Palantir is a
knowledge base system that enables information that is in a structured form (databases,
spreadsheets, etc.) to be combined into a single knowledge base and effectively exploited
by analysts. However Palantir does not have any mechanism to automatically process
unstructured textual source material.
This report describes the result of an experiment to combine these two systems. The goal
was to examine the potential for a fully automated process to generate a knowledge base
from a large corpus of unstructured text documents. However the process is described as
mostly-automatic because one aspect of the process was not automated for the sake of
expediency. This is explained in more detail in Section 2.4.
The core activity in the experiment was writing Java code that translates from the output
of NetOwl Extractor to a format suitable for passing to Palantir. NetOwl is capable of
generating a range of output formats, including a compact XML format that describes all
information extracted by the tool from a document. This format can be readily processed
using the XML parsing facilities in the standard Java runtime environment.
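As a rough illustration of the kind of parsing involved, the sketch below reads entity elements from an XML string using only the standard Java runtime. The element name "entity" and the attribute name "class" are hypothetical placeholders chosen for the example; they are not NetOwl Extractor's actual output schema, which is not reproduced in this report.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class ExtractorXmlSketch {

    // Returns "class|text" for each <entity> element in the XML.
    // HYPOTHETICAL schema: the "entity" element and "class" attribute are
    // placeholders, not NetOwl's actual output format.
    static List<String> readEntities(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("entity");
            List<String> entities = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                entities.add(element.getAttribute("class") + "|"
                        + element.getTextContent().trim());
            }
            return entities;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse extractor output", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<document>"
                + "<entity class=\"entity:person\">ALI ABDULLAH SALEH</entity>"
                + "<entity class=\"entity:organisation:military\">YEMENI ARMY</entity>"
                + "</document>";
        readEntities(xml).forEach(System.out::println);
    }
}
```

A DOM parser is used here for brevity; for very large output files a streaming parser from the same standard library may be a better fit.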
Palantir includes a facility for importing information into its knowledge base in a format
called DocXML. This is essentially an XML schema that is specialised for combining the
content of a text document together with information that can be inferred from the
document. The extensive Java API provided by Palantir includes convenience classes for
constructing DocXML document object models and writing DocXML files.
The fundamental issue in translating from the output of NetOwl Extractor to Palantir's
DocXML format was the different data models the two tools use. The way NetOwl
expresses the information it extracts from a document can be characterised as a full
entity-relationship model; that is, entities, events, and links between entities or events are
all first-class elements of the data model. However the data model in Palantir's knowledge
base can be characterised as a simple entity model: entities and events are first-class
elements of the data model, while links between entities and events are second-class
elements. The consequences of this difference, and an approach to resolving it, are
discussed in the following section.

2. Procedure
The process for mostly-automatic construction of a Palantir knowledge base that was
explored is summarised as follows:
1. Select and preprocess a trial corpus.
2. Process the corpus documents with NetOwl Extractor.
3. Translate the output of NetOwl Extractor to Palantir's DocXML format.
4. Load the generated DocXML into a Palantir knowledge base.
The details of these steps are described below.

2.1 Trial Corpus and its Preparation


The corpus used for the experiment was a collection of documents dating from early 2002
that were disseminated by the Foreign Broadcast Information Service (FBIS; then a
component of the United States Central Intelligence Agency, now known as the Open
Source Centre and a component of the United States Office of the Director of National
Intelligence). The documents consist primarily of translations of non-English language
print and World Wide Web media articles, and synopses of non-English language radio
and television broadcasts, from around the world. The text content is written in upper-case
letters, and is generally well-formed English prose but exhibits occasional grammatical or
spelling mistakes.
Due to the transmission and storage history of the corpus, the physical files containing the
documents were limited to approximately 8 kilobytes in size. Documents larger than this
size had been split into multiple physical files. Before submission to NetOwl Extractor the
split documents were reconstructed into single physical files to enable submission to
NetOwl as single units for processing. Without this reconstruction step, the potential for
intra-document resolution of entity mentions would be reduced, and hence the quality of
the information extraction would be reduced.
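The reconstruction step can be sketched as follows. The report does not say how the parts of a split document were identified, so the document identifier and sequence number used here are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SplitDocumentJoiner {

    // One physical file. HYPOTHETICAL fields: the report does not describe
    // how parts of a split document were labelled or ordered.
    static final class Part {
        final String documentId;
        final int sequence;
        final String text;

        Part(String documentId, int sequence, String text) {
            this.documentId = documentId;
            this.sequence = sequence;
            this.text = text;
        }
    }

    // Concatenates the parts of each document in sequence order.
    static Map<String, String> join(List<Part> parts) {
        List<Part> sorted = new ArrayList<>(parts);
        sorted.sort(Comparator.comparing((Part p) -> p.documentId)
                .thenComparingInt(p -> p.sequence));
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        for (Part part : sorted) {
            buffers.computeIfAbsent(part.documentId, id -> new StringBuilder())
                    .append(part.text);
        }
        Map<String, String> documents = new LinkedHashMap<>();
        buffers.forEach((id, text) -> documents.put(id, text.toString()));
        return documents;
    }

    public static void main(String[] args) {
        List<Part> parts = Arrays.asList(
                new Part("DOC-1", 2, "SECOND HALF."),
                new Part("DOC-1", 1, "FIRST HALF. "));
        System.out.println(join(parts).get("DOC-1"));
    }
}
```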
Another preprocessing step was removal of metadata that had been stored within the
document text as header and footer sections. If left in place NetOwl Extractor would
attempt to treat this metadata as data, potentially confusing its information extraction
routines and reducing the quality of its output.

2.2 Processing with NetOwl Extractor


The document corpus was processed using NetOwl Extractor version 6.5.1 augmented
with the optional Link and Event version 2.4.0.1 rule base.
The processing was performed using the linkandevent-plain predefined configuration
parameter preset. This preset causes the tool to treat the input data as plain text, which is
appropriate for the trial corpus, and apply the link recognition and event recognition
subtasks in addition to the standard entity mention and equivalence recognition. The
output generated by the preset is the xml-full format.
Note that for the purposes of the experiment, NetOwl Extractor was used "out-of-the-box".
That is, only the general-purpose semantic rules provided with the software were
used. In a typical deployment of the software, the semantic rules the software uses
to extract information from text would be specialised to suit the source documents and the
analysis to be performed on the extracted information. However, optimising the
performance of information extraction was not a concern for this experiment, so no
effort was made to customise NetOwl for the context of the experiment.

2.3 Generation of DocXML Files


The primary challenge is how to achieve the mapping of the result set from NetOwl
Extractor, expressed in that tool's ontology, to a corresponding description of objects in a
Palantir knowledge base ontology. Once the details of this are established, the mechanics
are straightforward: parse the XML produced by NetOwl, construct a DocXML DOM
using the convenience classes provided by the Palantir developer API, and use the
convenience methods in those DOM classes to create DocXML files. These files can then be
easily ingested into Palantir using the Palantir Workspace application.
In the case of this experiment, there was no predefined Palantir ontology that was to be the
target of the mapping. So a simple Palantir ontology was constructed to directly match the
ontology used by the NetOwl Extractor rule base. The NetOwl ontology was sliced at
level 2, with object types created in the Palantir ontology to match.
This slicing of the NetOwl ontology at level 2 for the purposes of the mapping was
purely for convenience. The information encoded in level 3 of the NetOwl Extractor
ontology class hierarchy is retained by mapping the level 3 term to a property value. For
example, an instance of class entity:organisation:military in NetOwl results is mapped to a
Palantir object of type Organisation and the object is given a property Organisation Type
= Military.
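The slicing described above can be sketched as a small function over the colon-separated class name. The naming rule used here (capitalising the level 2 term for the object type, appending "Type" for the property name, and capitalising the level 3 term for the value) is an assumption that happens to reproduce the example in the text; the experiment's actual naming scheme is not documented here.

```java
final class OntologyClassMapper {

    // Result of slicing a NetOwl class name at level 2.
    static final class Mapping {
        final String objectType;
        final String propertyType;   // null when there is no level 3 term
        final String propertyValue;  // null when there is no level 3 term

        Mapping(String objectType, String propertyType, String propertyValue) {
            this.objectType = objectType;
            this.propertyType = propertyType;
            this.propertyValue = propertyValue;
        }
    }

    // Maps e.g. "entity:organisation:military" to object type "Organisation"
    // with property "Organisation Type" = "Military" (assumed naming rule).
    static Mapping map(String netOwlClass) {
        String[] levels = netOwlClass.split(":");
        if (levels.length < 2) {
            throw new IllegalArgumentException("No level 2 term in: " + netOwlClass);
        }
        String objectType = titleCase(levels[1]);
        if (levels.length < 3) {
            return new Mapping(objectType, null, null);
        }
        return new Mapping(objectType, objectType + " Type", titleCase(levels[2]));
    }

    // Turns "attack_target" into "Attack Target".
    static String titleCase(String term) {
        StringBuilder result = new StringBuilder();
        for (String word : term.split("_")) {
            if (word.isEmpty()) {
                continue;
            }
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(Character.toUpperCase(word.charAt(0)))
                    .append(word.substring(1));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Mapping m = map("entity:organisation:military");
        System.out.println(m.objectType + ": " + m.propertyType + " = " + m.propertyValue);
    }
}
```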
The practical issue with this approach is that in a DocXML file all pieces of information for
addition to the knowledge base must appeal to some portion of the text of the document as
the source of that information. For a property derived from the NetOwl ontological class,
there is no obvious source within the document text. However in the DocXML schema it is
valid to specify a text reference that starts at character index 0 and has a length of 0
characters. The Palantir system will accept DocXML containing such text references, and
the behaviour of the client in dealing with these null text references is sensible. If the
user requests the source of the property, they are simply shown the document text but no
part of the document text is highlighted.
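The idea of a null text reference can be illustrated with a minimal value class. This is a stand-alone sketch with a hypothetical class name; it does not use the Palantir API's own DocXML classes.

```java
final class TextRef {

    // Zero-length reference at index 0, used when no text supports a value.
    static final TextRef NULL = new TextRef(0, 0);

    final int start;
    final int length;

    TextRef(int start, int length) {
        if (start < 0 || length < 0) {
            throw new IllegalArgumentException("start and length must be non-negative");
        }
        this.start = start;
        this.length = length;
    }

    boolean isNull() {
        return start == 0 && length == 0;
    }

    // Text a client would highlight as the source; empty for the null reference.
    String highlighted(String documentText) {
        return documentText.substring(start, start + length);
    }

    public static void main(String[] args) {
        String text = "ALICE MET BOB";
        System.out.println("[" + new TextRef(10, 3).highlighted(text) + "]");
        System.out.println("[" + TextRef.NULL.highlighted(text) + "]");
    }
}
```

Because the null reference selects an empty substring, a viewer built this way would show the document with nothing highlighted, which matches the client behaviour described above.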
Within this general approach to mapping, there are some differences in how the classes of
entities, events, and links identified by NetOwl Extractor need to be handled. This is
described below.

2.3.1 Mapping Entities


There is a difference in the way the concept of entity is used by NetOwl Extractor and
Palantir, which necessitates limiting the entity types that will be mapped from the NetOwl
results. In NetOwl output, anything that might be referred to by an event or link instance
must be an entity. Thus NetOwl Extractor will identify things like dates and times,
quantities of money, and even unitless numbers as date, currency or numeric
entities. Doing so allows them to be referred to by events; e.g. giving the date and amount
of a financial transaction between two organisations. However the Palantir system is
designed to have details like dates and quantities stored as properties of objects. So in the
mapping only the subset of NetOwl entity types that are sensible to have as entity objects
in the Palantir knowledge base are mapped. For this experiment people, places, and
organisations were mapped over.
For entity types that are to be mapped over, there needs to be a chosen threshold of
minimum information content below which an instance is considered too
information-poor to be of value in a knowledge base. The threshold chosen for this experiment was at
least one name mention. Entities that are only mentioned by description (e.g. a man on the
back of a donkey, or a group of three militants) are not mapped over.
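The two filtering rules above (restrict to selected entity types, and require at least one name mention) can be sketched as a single predicate. The class name entity:place and the mention categories are assumptions for illustration; only entity:person and entity:organisation appear verbatim elsewhere in this report.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EntityFilter {

    enum MentionKind { NAME, DESCRIPTION, PRONOUN }

    static final class Mention {
        final MentionKind kind;
        final String text;

        Mention(MentionKind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Level 2 classes mapped to Palantir objects; "entity:place" is an
    // assumed class name, not confirmed by the report.
    static final Set<String> MAPPED_TYPES = new HashSet<>(Arrays.asList(
            "entity:person", "entity:place", "entity:organisation"));

    // True if the entity is of a mapped type and has at least one name mention.
    static boolean shouldMap(String netOwlClass, List<Mention> mentions) {
        String[] levels = netOwlClass.split(":");
        if (levels.length < 2 || !MAPPED_TYPES.contains(levels[0] + ":" + levels[1])) {
            return false;
        }
        for (Mention mention : mentions) {
            if (mention.kind == MentionKind.NAME) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Mention> described = Arrays.asList(
                new Mention(MentionKind.DESCRIPTION, "A MAN ON THE BACK OF A DONKEY"));
        System.out.println(shouldMap("entity:person", described));
    }
}
```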
To map the information regarding mentions of an entity, a property on the Palantir entity
types must be chosen to receive the information. This is easily achieved for name mentions
- the mention is mapped to a value of a name property with a text reference corresponding
to the mention text identified by NetOwl Extractor. If NetOwl also identified descriptive
mentions of the entity, then the description can be mapped to a value of a suitable
property type with a text reference corresponding to the descriptive mention identified by
NetOwl. For example, for descriptive mentions of an organisation entity, a property such
as Organisation Description would be suitable.
In contrast, the pronoun mentions for people are not mapped over. Having a property on
entities in the knowledge base for pronouns that have been used to refer to an entity has
little analytical utility for users of the knowledge base. While the gender-specificity of English
third-person pronouns does carry information, NetOwl Extractor infers a gender attribute
of person entities so this information is not lost.
Any attributes of an entity inferred by NetOwl are also mapped across as properties on the
corresponding Palantir object. Doing so raises the same issue as noted above: because
NetOwl has inferred the attribute, there is no segment of text in the document that
corresponds directly to the attribute value. Again the solution is to use a null text
reference that identifies a zero-length segment starting at character index 0 of the
document as the source of the attribute value.
2.3.2 Mapping Events


The issues involved in mapping event instances identified by NetOwl Extractor are similar
to the issues discussed previously with entities. However there are some differences in
how the issues are best dealt with.
As with entities, the mapping from NetOwl Extractor ontology class to Palantir ontology
object type is done at level 2 of the NetOwl ontology hierarchy. The level 3 term of the
class of an event instance is mapped to a property value with a null text reference.
The segments of text that NetOwl Extractor identifies as mentions of an event are
generally descriptive in nature, so they can be handled in the same way as descriptive
mentions of an entity. The event mention is mapped to a value of a description property
with a text reference corresponding to the mention text identified by NetOwl.
There is also the issue of deciding whether for a given event instance, NetOwl Extractor
has identified sufficient information for it to be worthwhile translating the instance to the
knowledge base. This is similar to the issue faced with entity mapping discussed
previously. In the case of event mapping, there are two aspects to consider.
The first aspect to consider is whether an important attribute of the event has been
identified. For example, the default Link and Event rule base will recognise any
occurrence of the word "attack" as an instance of the event class
event:conflict:attack_target. However the text will not necessarily identify who is the
attacker or target involved. Even if the text does specify the attacker and target, NetOwl
may fail to recognise this information.
Secondly, when an attribute of an event is a reference to an entity (e.g. the attacker or
target in the case of an attack event), there is the consideration of whether that entity has
been identified to a sufficient level. The assessment of this will depend on the purpose for
which the knowledge base is being constructed. Continuing the example, for some
purposes the recognition of an attack event is not useful unless the attacker is recognised
and identified by name. But in other situations, it may be that an attack event where the
attacker is only identified by description (e.g. a man on the back of a donkey) is still a useful
addition to the knowledge base.
When an attribute of an event is a reference to a named entity, then that entity will have
been mapped to an entity object in the DocXML translation. Hence the appropriate
translation of the attribute of the event that is a reference to that named entity is a link in
the Palantir knowledge base, where the event is the parent of the link and the entity is the
child of the link. When an attribute of an event is a reference to an entity that is only
mentioned by description, the appropriate translation is to make a property on the event
object with the value of the property being the description of the entity.
To facilitate flexibility with regard to these issues, the approach taken was to develop a
simple file format that allowed specification of the choices made. This allows the
translation process to be tailored to the circumstances and purpose of the knowledge base
being generated. An example is shown below:
event:conflict:attack_target
dsto.c3id.ia.object.ConflictEvent
dsto.c3id.ia.property.Description
dsto.c3id.ia.property.EventType   Attack Target
target   = reqnamed / dsto.c3id.ia.link.Target   /
attacker = req      / dsto.c3id.ia.link.Attacker / dsto.c3id.ia.property.Attacker
weapon   = opt      /                            / dsto.c3id.ia.property.Weapon
time     = opt      /                            / dsto.c3id.ia.property.Time
place    = opt      / dsto.c3id.ia.link.Place    / dsto.c3id.ia.property.Place

Figure 1    Example configuration for translating events

The translation code makes use of this format as follows:

The first line specifies the NetOwl ontology event type that the translation details
following apply to.

The second line specifies the Palantir ontology object type that the NetOwl type is
mapped to.

The third line specifies the Palantir ontology property type that is used for
mapping the descriptive mentions of the event instance from NetOwl's ontology to
property values in the Palantir ontology.

The fourth line specifies the property type that is used to capture the third level of
the NetOwl ontology hierarchy as a property value rather than a subtype.

For example, consider a document containing the text "Yesterday a group of tribal militants
mounted an attack on Yemeni President Ali Abdullah Saleh." Assume that NetOwl has
recognised this sentence as describing an event:conflict:attack_target event, and specifically
identified the phrase "mounted an attack" as the mention of the event. In this case, the
translation based on the specification in Figure 1 would result in a Palantir event object of
type dsto.c3id.ia.object.ConflictEvent with two property values. There would be a
property of type dsto.c3id.ia.property.EventType with a value of "Attack Target", and a
property of type dsto.c3id.ia.property.Description with a value of "mounted an attack".
In Figure 1 the fifth and subsequent lines describe how attributes of the event identified by
NetOwl are to be mapped into the Palantir ontology. Each line contains four tokens, with
the following meaning.

The first token is the name of the attribute in NetOwl's ontology.

The second token specifies the level of information required for this attribute
before the event instance recognised by NetOwl will be mapped over to the
Palantir knowledge base. The keyword reqnamed means the attribute is required
and must refer to an entity that has been identified by name, req means the
attribute is required but may refer to an entity that has only been identified by
description, while opt means the attribute is optional and its absence does not
result in the event instance being discarded for being too information-poor.

The third and fourth tokens specify, respectively, the link type to use when the
attribute refers to an entity that has been mapped into the Palantir knowledge base
as an object, and the property type to use when the attribute refers to an entity that
has not been mapped into the Palantir knowledge base.
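The four tokens described above can be read with a few lines of string handling. The sketch below assumes each attribute line has the form "name = level / link type / property type", where either type may be left empty; the class and field names are hypothetical and this is not the code used in the experiment.

```java
import java.util.Locale;

final class EventAttributeRule {

    enum Requirement { REQNAMED, REQ, OPT }

    final String attribute;
    final Requirement requirement;
    final String linkType;      // null when the line gives no link type
    final String propertyType;  // null when the line gives no property type

    EventAttributeRule(String attribute, Requirement requirement,
                       String linkType, String propertyType) {
        this.attribute = attribute;
        this.requirement = requirement;
        this.linkType = linkType;
        this.propertyType = propertyType;
    }

    // Parses one attribute line, e.g.
    // "attacker = req / dsto.c3id.ia.link.Attacker / dsto.c3id.ia.property.Attacker".
    static EventAttributeRule parse(String line) {
        String[] sides = line.split("=", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("Missing '=' in: " + line);
        }
        // Limit -1 keeps trailing empty tokens such as an omitted property type.
        String[] tokens = sides[1].split("/", -1);
        if (tokens.length != 3) {
            throw new IllegalArgumentException("Expected three '/'-separated tokens in: " + line);
        }
        Requirement requirement =
                Requirement.valueOf(tokens[0].trim().toUpperCase(Locale.ROOT));
        return new EventAttributeRule(sides[0].trim(), requirement,
                emptyToNull(tokens[1]), emptyToNull(tokens[2]));
    }

    private static String emptyToNull(String token) {
        String trimmed = token.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    public static void main(String[] args) {
        EventAttributeRule rule = parse("weapon = opt / / dsto.c3id.ia.property.Weapon");
        System.out.println(rule.attribute + " " + rule.requirement
                + " link=" + rule.linkType + " property=" + rule.propertyType);
    }
}
```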

Continuing the example started above, assume that NetOwl has identified the phrase "a
group of tribal militants" as an entity mentioned by description, and the attacker attribute of
the event refers to this entity. Also assume that "Ali Abdullah Saleh" has been identified as
a person mentioned by name, and the target attribute of the event refers to this entity.
Further assume that "Yesterday" has been identified as a temporal entity, and the time
attribute of the event refers to this entity. Following the specification in Figure 1, the
translation will be as follows. There will be a link of type dsto.c3id.ia.link.Target
created from the event object to the entity object corresponding to the name mention of
"Ali Abdullah Saleh". There will be a property of type dsto.c3id.ia.property.Attacker
that contains the value "a group of tribal militants", and a property of type
dsto.c3id.ia.property.Time that contains the value "Yesterday".
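The worked example can be traced with a small sketch of the decision logic. It assumes the rules of Figure 1 are read as listed in figure1Rules() below (in particular, that target has no property type and that weapon and time have no link type), and it represents translated statements as plain strings rather than DocXML. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class EventTranslationSketch {

    enum Requirement { REQNAMED, REQ, OPT }

    static final class Rule {
        final Requirement requirement;
        final String linkType;      // null when no link type is configured
        final String propertyType;  // null when no property type is configured

        Rule(Requirement requirement, String linkType, String propertyType) {
            this.requirement = requirement;
            this.linkType = linkType;
            this.propertyType = propertyType;
        }
    }

    // An attribute value reported by the extractor (assumed shape).
    static final class Value {
        final String text;
        final boolean mappedAsObject; // true if the entity was named and mapped

        Value(String text, boolean mappedAsObject) {
            this.text = text;
            this.mappedAsObject = mappedAsObject;
        }
    }

    // The attribute rules of Figure 1, as read for this sketch.
    static Map<String, Rule> figure1Rules() {
        String link = "dsto.c3id.ia.link.";
        String prop = "dsto.c3id.ia.property.";
        Map<String, Rule> rules = new LinkedHashMap<>();
        rules.put("target", new Rule(Requirement.REQNAMED, link + "Target", null));
        rules.put("attacker", new Rule(Requirement.REQ, link + "Attacker", prop + "Attacker"));
        rules.put("weapon", new Rule(Requirement.OPT, null, prop + "Weapon"));
        rules.put("time", new Rule(Requirement.OPT, null, prop + "Time"));
        rules.put("place", new Rule(Requirement.OPT, link + "Place", prop + "Place"));
        return rules;
    }

    // Returns the translated statements, or empty if the event is discarded
    // as too information-poor.
    static Optional<List<String>> translate(Map<String, Rule> rules,
                                            Map<String, Value> attributes) {
        List<String> statements = new ArrayList<>();
        for (Map.Entry<String, Rule> entry : rules.entrySet()) {
            Rule rule = entry.getValue();
            Value value = attributes.get(entry.getKey());
            if (value == null) {
                if (rule.requirement != Requirement.OPT) {
                    return Optional.empty(); // a required attribute is missing
                }
                continue;
            }
            if (rule.requirement == Requirement.REQNAMED && !value.mappedAsObject) {
                return Optional.empty(); // required by name, only described
            }
            if (value.mappedAsObject && rule.linkType != null) {
                statements.add("link " + rule.linkType + " -> " + value.text);
            } else if (rule.propertyType != null) {
                statements.add("property " + rule.propertyType + " = " + value.text);
            }
        }
        return Optional.of(statements);
    }
}
```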

2.3.3 Mapping Links


The primary issue to resolve in mapping links from NetOwl Extractor output to a Palantir
knowledge base is that there is a fundamental difference between the way links are treated
by the two systems. In NetOwl output, links are a first-class element of the data model;
that is, a link instance has an identity, can be referred to by its identity, and can in principle
have any number of attributes. However in the data model of the Palantir knowledge base,
a link is a second-class element; that is, an instance of a link is always attached to a parent
object and the link does not itself have an identity, so it cannot be referred to and it cannot
have any properties associated with it.
In effect, a link in a Palantir knowledge base is a special kind of property that contains a
value that is always interpreted as a reference to another object. Viewed in graph-theoretic
terms, in NetOwl output a link is a node that has outgoing directed edges to other nodes
that it is linking together. But in Palantir a link is itself a directed edge from one node to
another. This difference is elucidated in an example shown in Figure 2.

Example text: "Alice is an associate of Bob."

NetOwl output structure: a link:person:person_associate node whose person attribute refers to an entity:person node (name=Bob) and whose associate attribute refers to an entity:person node (name=Alice).

Knowledge base structure expected by Palantir: a Person object (Name=Bob) and a Person object (Name=Alice) joined directly by an Associate link.

Figure 2    Difference in nature of links between NetOwl and Palantir
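The graph-theoretic difference can be made concrete with two small data structures: a link node that points at the entities it connects, and a plain directed edge. The sketch below collapses the former into the latter; the class names are hypothetical, and the choice of Bob as parent and Alice as child is an assumption about how the example would be oriented.

```java
import java.util.HashMap;
import java.util.Map;

final class LinkFlattener {

    // NetOwl-style link: a node in its own right, with named outgoing edges.
    static final class LinkNode {
        final String type;
        final Map<String, String> roles; // attribute name -> entity identifier

        LinkNode(String type, Map<String, String> roles) {
            this.type = type;
            this.roles = roles;
        }
    }

    // Palantir-style link: a directed edge attached to a parent object.
    static final class Edge {
        final String parentId;
        final String linkType;
        final String childId;

        Edge(String parentId, String linkType, String childId) {
            this.parentId = parentId;
            this.linkType = linkType;
            this.childId = childId;
        }
    }

    // Collapses a link node into a single edge. Anything else attached to the
    // node has no place on an edge and is dropped.
    static Edge flatten(LinkNode node, String parentRole, String childRole,
                        String palantirLinkType) {
        String parent = node.roles.get(parentRole);
        String child = node.roles.get(childRole);
        if (parent == null || child == null) {
            throw new IllegalArgumentException("Link lacks role " + parentRole
                    + " or " + childRole);
        }
        return new Edge(parent, palantirLinkType, child);
    }

    public static void main(String[] args) {
        Map<String, String> roles = new HashMap<>();
        roles.put("person", "Bob");
        roles.put("associate", "Alice");
        Edge edge = flatten(new LinkNode("link:person:person_associate", roles),
                "person", "associate", "Associate");
        System.out.println(edge.parentId + " --" + edge.linkType + "--> " + edge.childId);
    }
}
```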

Given the above consideration, there are two approaches available:


1. Directly map the heavy-weight representation in NetOwl output into Palantir.
This can be achieved by creating object types in a Palantir ontology that correspond
to the link types in the NetOwl ontology.
2. Translate the NetOwl output to the light-weight representation that is natural for
the data model of Palantir knowledge bases.
The advantage of the first option is that all the information contained in the NetOwl
output is mapped across to the Palantir knowledge base. The disadvantage is that the
resulting content in the Palantir knowledge base is not in a natural structure that is
assumed by the Palantir Workbench application that analysts would use to access,
visualise, and analyse the content. Conversely the disadvantage of the second option is the
potential to lose information when a link is translated from a first-class to a second-class
data model element, while the advantage is the production of Palantir knowledge base
content that is in the structure expected by the analyst-facing components of the Palantir
system.
Both of the options listed above were tested, and the conclusion was that the second
option was preferable.
Firstly, in practice there was little actual information lost by translating from NetOwl's
first-class links to Palantir's second-class links. Of the link types that the Link and Event
rule base can recognise, only two types actually have attributes that would need to be
discarded by the translation process. In testing it was found that these two link types
typically constituted less than 1.2% of the link instances recognised.
Secondly it was found that the Palantir Workbench application did not have sufficient user
interface mechanisms to allow an analyst to readily accommodate the first option. The
application does have a mechanism in its graph view to allow an object that represents a
link between two other objects to be collapsed so that the intermediary object appears
like a direct link. However this mechanism must be invoked manually by a user and there
is no convenient way for a user to direct the application to visually collapse intermediary
objects to direct links in bulk.
Thus it was found that, in practice, the second option provided the better cost-benefit
trade-off. Given this, a flexible translation mechanism was developed based on a simple
text file format similar to the mechanism described previously for handling events. An
example is shown below:
link:organization:org_founder
organization
founder / dsto.c3id.ia.link.Founder / dsto.c3id.ia.property.Founder

Figure 3    Example configuration for translating links

The translation code makes use of this format as follows.


The first line specifies the NetOwl ontology link type that the translation details following
apply to.
The second line specifies the attribute of the NetOwl link that specifies the entity that will
be the parent object for the translation into Palantir's knowledge base. This implicitly
mandates that this attribute is both present and referring to an entity that was mentioned
by name (otherwise the entity will not have been translated; see Section 2.3.1).
The third line gives the attribute of the NetOwl link that specifies the entity that will be the
child object for the translation. In the case where this attribute refers to an entity that was
mentioned by name (and hence translated as an entity object), the NetOwl link can be
translated as the Palantir link type given in the second token of the third line. However if
this attribute refers to an entity that was only mentioned by description (and hence not
translated as an entity object), then the NetOwl link must be translated as a property on
the parent object, and the property type to use is given in the third token of the third line.
As an example, consider the specification in Figure 3 in the context of the text The crime
gang La Putatos was founded by Fred Bloggs. Assume NetOwl recognised the two entities
and the link from organisation to founder. Since the two entities were mentioned by name,
they will have been translated as entity objects. So the link between them would be
translated as a link of type dsto.c3id.ia.link.Founder with the object representing the La
Putatos entity as the parent of the link and the object representing the Fred Bloggs entity
UNCLASSIFIED
9

UNCLASSIFIED
DSTO-GD-0748

as the child of the link. In contrast, consider this in the context of the text "The crime gang
La Putatos was founded by a shadowy underworld figure". In this case the founder entity has
only been mentioned by description, so will not have been translated as an entity object.
Thus the translation of the link will be a property added to the object representing the La
Putatos entity, of type "dsto.c3id.ia.property.Founder" and value "a shadowy underworld
figure".
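
The rule described above can be sketched in a few lines of Python. This is an illustration only, not the translation code used in the experiment: the LinkSpec structure, the attribute names "organisation" and "founder", the NetOwl link type name "founder", and the dictionary shapes for links and entities are all assumptions made for the example. Only the two Palantir type names are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSpec:
    """One three-line link specification (field names are hypothetical)."""
    netowl_link_type: str  # line 1: NetOwl ontology link type
    parent_attr: str       # line 2: attribute naming the parent entity
    child_attr: str        # line 3, token 1: attribute naming the child entity
    link_type: str         # line 3, token 2: Palantir link type
    property_type: str     # line 3, token 3: Palantir property type


def translate_link(spec, link, entities):
    """Translate one NetOwl link into a Palantir link or property.

    `link` maps attribute names to entity ids; `entities` maps entity ids to
    dicts with a `text` field and a `named` flag. Both shapes are assumed.
    """
    parent = entities.get(link.get(spec.parent_attr))
    child = entities.get(link.get(spec.child_attr))
    # The parent must be present and mentioned by name; otherwise it was
    # never translated as an entity object and nothing can be attached to it.
    if parent is None or not parent["named"] or child is None:
        return None
    if child["named"]:
        # Both ends are entity objects: emit a link between them.
        return {"kind": "link", "type": spec.link_type,
                "parent": parent["text"], "child": child["text"]}
    # Child known only by description: record it as a property on the parent.
    return {"kind": "property", "type": spec.property_type,
            "object": parent["text"], "value": child["text"]}


# Hypothetical specification mirroring the founder example.
FOUNDER = LinkSpec("founder", "organisation", "founder",
                   "dsto.c3id.ia.link.Founder",
                   "dsto.c3id.ia.property.Founder")
```

Given an entity table in which "La Putatos" and "Fred Bloggs" are flagged as named and "a shadowy underworld figure" is not, the first example sentence would yield a link record and the second a property record on the La Putatos object.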

2.4 Importation of DocXML and Entity Resolution in Palantir


The DocXML files generated by translating NetOwl Extractor output using the process
described above can easily be ingested into a Palantir knowledge base using the
Workspace client application. The Import function can be used to select any number of
DocXML documents and load their content into an investigation, and the imported data
can then be published to the base realm (i.e. shared knowledge base).
As noted in the introduction, this was the only step of the process that was not
automated. The effort required to import DocXML manually was judged to be quite
small: click a button, use a file selection dialog to select the files, choose a few options, then
click OK and wait for the process to finish. Although there are client-side APIs that allow
this import process to be done automatically, the effort required to develop and test such
code was not warranted for the purposes of this experiment.
The primary issue encountered in this process was choosing the behaviour of the entity
resolution that can optionally be performed when importing DocXML-encoded data.
The NetOwl Extractor information extraction routines work on the basis of individual
documents. When an entity is mentioned in more than one document, there will be
different entity objects created from each document. Palantir supports the resolution of
multiple objects that actually describe the same logical entity into a single logical object.
When multiple objects are resolved together the properties and links of all the individual
objects are combined together in the resolved object.
In the case of ingesting a large number of documents and a frequently mentioned entity,
there will potentially be a large number of separate objects created to describe the one
entity. To produce a knowledge base that is not cluttered, and convenient for analysts to
use, these objects must be resolved and fused together.
Palantir allows the creation of object resolution suites, which are sets of rules describing
criteria for automatically resolving and fusing knowledge base objects that describe
the same entity. These criteria are generally of the form "If two objects of type X have a
matching value for property Y, assume they are describing the same logical entity and
resolve them into a single object". For this experiment the criterion used was to resolve and
fuse any given set of person objects that shared an exactly matching value in their name
property, and similarly for place or organisation objects.
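
The resolution criterion used in the experiment can be sketched as follows. This is not Palantir's object resolution API; it is a stand-alone illustration in which each object is assumed to be a dictionary with type, name, properties and links fields. It groups objects by type and exact name, then combines the properties and links of each group into one resolved object.

```python
from collections import defaultdict


def resolve_by_exact_name(objects):
    """Fuse objects that share the same type and an exactly matching name.

    Each object is a dict with `type`, `name`, `properties` (a list of
    (property_type, value) pairs) and `links` (a list of (link_type, target)
    pairs). This shape is an assumption, not Palantir's data model.
    """
    groups = defaultdict(list)
    for obj in objects:
        groups[(obj["type"], obj["name"])].append(obj)

    resolved = []
    for (obj_type, name), members in groups.items():
        fused = {"type": obj_type, "name": name, "properties": [], "links": []}
        for member in members:
            # The resolved object carries the union of its members' data,
            # with exact duplicates dropped.
            for prop in member["properties"]:
                if prop not in fused["properties"]:
                    fused["properties"].append(prop)
            for link in member["links"]:
                if link not in fused["links"]:
                    fused["links"].append(link)
        resolved.append(fused)
    return resolved
```

Because the grouping key is the exact name string, any variation in spelling keeps objects apart and any coincidence of name merges them, regardless of what the objects actually describe.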


3. Observations
The process described above was applied to 2000 documents from the trial corpus, using
the QuickStart version 3.3.1.1 of Palantir. Although the corpus was much larger, ingestion
was halted at 2000 documents due to performance issues that were encountered. Also, it
was found that there were serious quality issues with the knowledge base that was being
constructed through the process. These observations are elaborated upon below.

3.1 Poor Quality Knowledge Base


The knowledge base that was constructed was generally of poor quality. This appeared to
be due to a compounding effect from three primary sources of error. Firstly, some of the
source documents were malformed. Secondly, errors were made in the information
extraction by NetOwl (both false positives and false negatives). Thirdly, there were errors
in entity resolution and fusion at the intra-document level by NetOwl, and at the
inter-document level by Palantir. Examples of some of the types of error seen are given below.

3.1.1 Source Document Artifacts Confusing Information Extraction


Information extraction tools generally struggle to correctly interpret documents if the text
deviates from modes of expression that the tool expects. For example, exploration of the
knowledge base revealed a person entity with a name value "Abd" and a description
value "Former State Minister". It was found that this entity came from a document in the
trial corpus that gave a synopsis of a Turkish television news broadcast. The format of the
document was a series of numbered points, which included the following:
FORMER STATE MINISTER ABD(U DIERESIS)LHALIK (C CEDILLA)AY HAS [...]

This can be explained as the translator capturing the correct orthography and
pronunciation of the Turkish name "Abdülhalik Çay" in a document that can only contain
standard English letters. Turkish orthography is more phonemic than English in that the
spelling of a word completely determines its pronunciation; hence "ü" and "u" are considered
distinct letters with different pronunciation, as are "ç" and "c".
A human analyst marking up this document could easily recognise the intent of the
translator; however, an automatic information extraction routine fails to understand the
translator's intent and does not recognise the text correctly.
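
To make the translator's convention concrete, the sketch below decodes markers of the form seen in the example back into accented letters. No such pre-processing was performed in the experiment; the marker pattern and the two-entry table are assumptions based solely on the single line quoted above, and other documents may use other conventions.

```python
import re
import unicodedata

# Combining marks for the two annotations seen in the example. The table would
# need extending for other diacritics (an assumption about the convention).
COMBINING = {"DIERESIS": "\u0308", "CEDILLA": "\u0327"}
MARKER = re.compile(r"\(([A-Z]) (DIERESIS|CEDILLA)\)")


def decode_markers(text):
    """Replace markers such as "(U DIERESIS)" with the accented letter."""
    decoded = MARKER.sub(lambda m: m.group(1) + COMBINING[m.group(2)], text)
    # NFC composes the base letter and combining mark into one code point.
    return unicodedata.normalize("NFC", decoded)
```

Applied to the quoted line, this turns "ABD(U DIERESIS)LHALIK (C CEDILLA)AY" into "ABDÜLHALIK ÇAY", which is the form a named entity recogniser would need to see in order to treat the name as a single unit.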

3.1.2 Erroneous Named Entity Recognition


In addition to the source of confusion noted above, the named entity recognition routines
in information extraction tools can fail to correctly parse even well-formed text. Examples
seen in the trial corpus include:

Text: "JAMMU-KASHMIR".
Entity recognition result: A place entity named "Kashmir"; i.e. the "Jammu-"
fragment was not recognised as part of the place name.


Text: "IN 2001 USAMA BIN LADIN ROSE TO PROMINENCE [...]".
Entity recognition result: A person entity named "Usama Bin Ladin Rose"; i.e. a
verb following the name was erroneously considered part of the name.

In these examples, and many more observed in the results from the trial corpus, fragments
of text adjacent to a name were erroneously included in the name, or fragments of a name
separated by punctuation marks were omitted from the name.

3.1.3 Erroneous Named Entity Resolution


The experiment's naive approach to the resolution and fusion of entities between
documents leads to numerous errors as well. There are cases where recognised entities in
different documents should have been resolved but were not, and conversely cases where
resolution occurred but should not have. Examples of both these forms of error can be seen
in the diagram shown in Figure 4.

Figure 4: Example of erroneous and inadequate knowledge base content - the Defence Ministry of
multiple countries being treated as a single organisation.

Figure 4 shows the result of using the graph visualisation component of the Palantir
Workspace to search for an organisation entity named "Defence Ministry" and display
entities related to it.
The trial corpus contained many documents discussing entities named "Defence
Ministry". In particular, there were documents discussing the Defence Ministry of France,
Tajikistan, and Afghanistan. But the naive entity resolution based only on name has
combined these entities. The result is a knowledge base that contains a single "Defence
Ministry" entity linked to numerous other entities including the countries of France and
Tajikistan, and the persons of Alain Richard (French Defence Minister at the time), Sherali
Khayrulloyev (Tajik Defence Minister at the time), Emomali Rahmonov (Tajik President at
the time), Mohammad Qasim Fahim (Afghan Defence Minister at the time), and Abdul
Rashid Dostam (Afghan Deputy Defence Minister at the time).
Also apparent in Figure 4 is the way naive fusion based on resolving objects with an exact
name match can fail when a non-English name is transcribed into English in different
ways in multiple documents. "Mohammad Qasim Fahim" and "Mohammad Qasem
Fahim" are two of the many possible romanisations of the same name, but exact name
matching does not recognise this. Also, the entities labelled "Sherali Khayrulloyev" and
"Khayrulloyev" should be resolved, but when some documents refer to a person
exclusively by last name and others refer to that person exclusively by full
name, exact name matching will not resolve the entities.
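
Both failure modes can be reproduced with a few lines of Python using the names discussed above. The pairing of each name with a country is a labelling added here for illustration (it stands in for the document context that a human reader would use), and the grouping function is a stand-in for exact name matching rather than Palantir's implementation.

```python
from collections import defaultdict

# (name as extracted, country the source document was discussing).
# The country labels are illustrative context, not extractor output.
mentions = [
    ("Defence Ministry", "France"),
    ("Defence Ministry", "Tajikistan"),
    ("Defence Ministry", "Afghanistan"),
    ("Mohammad Qasim Fahim", "Afghanistan"),
    ("Mohammad Qasem Fahim", "Afghanistan"),
    ("Sherali Khayrulloyev", "Tajikistan"),
    ("Khayrulloyev", "Tajikistan"),
]


def group_by_exact_name(mentions):
    """Group mentions by exact name, collecting the contexts of each group."""
    groups = defaultdict(set)
    for name, context in mentions:
        groups[name].add(context)
    return dict(groups)


groups = group_by_exact_name(mentions)
# Names whose single resolved object spans more than one country.
over_merged = {name for name, contexts in groups.items() if len(contexts) > 1}
```

The seven mentions collapse into five objects. The three distinct ministries become one object spanning three countries, while the two romanisations of Fahim and the two forms of Khayrulloyev each remain as separate objects.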

3.2 Performance Issues


Two particular performance issues were observed during the experiment.
The first was that the Palantir QuickStart edition, which uses the open source database
software PostgreSQL as its backend, performed poorly when running on a machine that
also had anti-virus software active. In particular, I/O throughput and CPU utilisation on a
multi-core CPU were poor (most of the time only one core would be active). Investigation
of the issue suggested that PostgreSQL was interacting poorly with the anti-virus scanning
software. This was raised with a Palantir Technologies engineer who confirmed this was a
known issue; their recommendation was to configure the anti-virus scanning software to
ignore the data directory used by PostgreSQL. However, in the context of this experiment it
was not possible to verify the extent to which this would resolve the observed
performance issue.
The second issue was that as increasing numbers of documents processed by NetOwl were
ingested into the Palantir knowledge base, the performance of the system steadily
degraded. Most notably, the time taken to ingest a batch of documents and publish the
ingested knowledge to the core knowledge base rapidly increased. The first batch of 200
documents was ingested and published in less than 10 minutes, while the fifth batch took
over 1.5 hours to process, and the tenth batch approximately 5 hours. Also, as the number
of documents ingested into the system increased, the time taken by the Workspace
application to display any given document increased. After 2000 documents had been
ingested, to load and display any given document within the Workspace would take as
long as 30 seconds, compared to less than 10 seconds when fewer than 1000 documents had
been ingested.
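
The reported batch times can be converted to a per-document cost to show the scale of the slowdown. The calculation below assumes that every batch contained 200 documents, as the first did, and treats the reported bounds ("less than 10 minutes", "over 1.5 hours", "approximately 5 hours") as point values, so the results are indicative only.

```python
BATCH_SIZE = 200  # documents per batch (reported for the first batch)

# Reported batch times in minutes, keyed by batch number.
batch_minutes = {1: 10, 5: 90, 10: 300}

# Average ingest-and-publish time per document, in seconds.
seconds_per_document = {
    batch: minutes * 60 / BATCH_SIZE
    for batch, minutes in batch_minutes.items()
}

# Ratio of the tenth batch time to the first batch time.
slowdown = batch_minutes[10] / batch_minutes[1]
```

On these figures the cost rises from about 3 seconds per document in the first batch to about 27 seconds in the fifth and about 90 seconds in the tenth, a slowdown of roughly thirty times over the course of ingesting 2000 documents.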
Investigation of the causes of this performance degradation, including an examination of
the database structure used by the software, the architecture of the client application, and
the way these two components interact, led to the conclusion that the Palantir software
had not been designed and optimised to support the usage model of this experiment.
Specifically, the system was not designed to store large quantities of documents where
each document carries a large number of automatically generated tags concerning the
entities, events, and links mentioned in it. This suggests that this performance issue would
also be observed, albeit to a lesser extent, in the full version of the software that runs over
Oracle databases.

4. Conclusion
The mapping from NetOwl output to a Palantir knowledge base via DocXML worked
well. There are some issues with the expressivity of the Palantir knowledge base and
DocXML data models. However, in practice there did not appear to be much impedance
mismatch, and the information useful for knowledge base construction that was produced
by NetOwl Extractor could be mapped over.
However, the general approach taken is not suitable or sensible in practice. Notably, it is a
usage model of the Palantir system that the software does not appear to have been
designed to support efficiently. Even if this were not the case, the
performance of NetOwl Extractor when used "out of the box", combined with simplistic
entity resolution, produces a low-quality knowledge base. It seems unlikely that an analyst
would find the knowledge base useful, as any content used for analysis would need to be
rigorously inspected and validated. Given this requirement for effort on the analyst's part,
it would seem more sensible for that effort to be spent creating a focused, high-quality
knowledge base by hand.


Page classification: UNCLASSIFIED

DEFENCE SCIENCE AND TECHNOLOGY ORGANISATION
DOCUMENT CONTROL DATA

1. PRIVACY MARKING/CAVEAT (OF DOCUMENT):
2. TITLE: Mostly-Automatic Construction of a Palantir Knowledge Base with NetOwl Extractor
3. SECURITY CLASSIFICATION (FOR UNCLASSIFIED REPORTS THAT ARE LIMITED RELEASE USE (L) NEXT TO DOCUMENT CLASSIFICATION): Document (U); Title (U); Abstract (U)
4. AUTHOR(S): Matthew C. Lowry
5. CORPORATE AUTHOR: DSTO Defence Science and Technology Organisation, PO Box 1500, Edinburgh South Australia 5111 Australia
6a. DSTO NUMBER: DSTO-GD-0748
6b. AR NUMBER: AR-015-636
6c. TYPE OF REPORT: General Document
7. DOCUMENT DATE: June 2013
8. FILE NUMBER: 2013/1012539/1
9. TASK NUMBER: 07/329
10. TASK SPONSOR: DCDS(I&WS)
11. NO. OF PAGES: 14
12. NO. OF REFERENCES:
13. DSTO Publications Repository: http://dspace.dsto.defence.gov.au/dspace/
14. RELEASE AUTHORITY: Chief, Command, Control, Communications and Intelligence Division
15. SECONDARY RELEASE STATEMENT OF THIS DOCUMENT: Approved for public release
OVERSEAS ENQUIRIES OUTSIDE STATED LIMITATIONS SHOULD BE REFERRED THROUGH DOCUMENT EXCHANGE, PO BOX 1500, EDINBURGH, SA 5111
16. DELIBERATE ANNOUNCEMENT: No Limitations
17. CITATION IN OTHER DOCUMENTS: Yes
18. DSTO RESEARCH LIBRARY THESAURUS: Knowledge management, Natural language processing, Information extraction, Information fusion

