RELEASE LIMITATION
Approved for public release
UNCLASSIFIED
Published by
Command, Control, Communications and Intelligence Division
DSTO Defence Science and Technology Organisation
PO Box 1500
Edinburgh South Australia 5111 Australia
Telephone: 1300 DEFENCE
Fax: (08) 7389 6567
© Commonwealth of Australia 2013
AR-015-636
June 2013
Executive Summary
A knowledge base describing entities (people, places, organisations, etc.), events, and
links between them is a valuable tool for intelligence analysis. However, for analysts to
manually construct a knowledge base from unstructured (i.e. free text) source material
is a labour-intensive process. When an analyst's immediate goal is to produce
reporting on a topic of interest, any effort spent in the construction of a knowledge
base detracts from their efficiency in achieving that immediate task. So while
the existence of a knowledge base with clear, accurate, and topical knowledge is useful
to the analyst and will increase their efficiency in the long term, the effort to construct
the knowledge base is seen as a distraction. This leads to the desire for intelligence
processing systems that can automatically construct a knowledge base from available
source material.
To date, commercial offerings have focussed on individual tasks required for an
automatic knowledge base construction process. Two commercial tools that
are particularly relevant are NetOwl Extractor from SRA International, Inc. and Palantir from
Palantir Technologies, Inc. NetOwl Extractor is an information extraction system that
processes unstructured text documents to extract structured information, in particular
information about entities, events, and the links between them. However, NetOwl
Extractor works on the level of individual documents, and has no knowledge base
capability. Palantir is a knowledge base system that enables information that is in a
structured form (databases, spreadsheets, etc.) to be combined into a single knowledge
base and effectively exploited by analysts. However, Palantir does not have any
mechanism to automatically process unstructured textual source material.
This report describes the result of an experiment to combine these two systems. The
goal was to examine the potential for a fully automated process to generate a
knowledge base from a large corpus of unstructured text documents. It was found that
although the process was straightforward, the knowledge base obtained was of poor
quality and questionable utility for an intelligence analyst. It was also found that the
architecture and design of the Palantir system were not optimised to efficiently support
consumption of large amounts of output from information extraction systems such as
NetOwl Extractor.
DSTO-GD-0748
Contents
1. INTRODUCTION
2. PROCEDURE
   2.1 Trial Corpus and its Preparation
   2.2 Processing with NetOwl Extractor
   2.3 Generation of DocXML Files
       2.3.1 Mapping Entities
       2.3.2 Mapping Events
       2.3.3 Mapping Links
   2.4 Importation of DocXML and Entity Resolution in Palantir
3. OBSERVATIONS
   3.1 Poor Quality Knowledge Base
       3.1.1 Source Document Artifacts Confusing Information Extraction
       3.1.2 Erroneous Named Entity Recognition
       3.1.3 Erroneous Named Entity Resolution
   3.2 Performance Issues
4. CONCLUSION
1. Introduction
A knowledge base describing entities (people, places, organisations, etc.), events, and links
between them is a valuable tool for intelligence analysis. However, for analysts to
manually construct a knowledge base from unstructured (i.e. free text) source material is a
labour-intensive process. When an analyst's immediate goal is to produce reporting on a
topic of interest, any effort spent in the construction of a knowledge base detracts
from their efficiency in achieving that immediate task. So while the existence of a
knowledge base with clear, accurate, and topical knowledge is useful to the analyst and
would increase their efficiency in the long term, the construction of the knowledge base is
seen as a distraction. This leads to the desire for intelligence processing systems that can
automatically construct a knowledge base from available source material.
To date, commercial offerings have focussed on individual tasks required for an automatic
knowledge base construction process. Two commercial tools that are particularly relevant
are NetOwl Extractor from SRA International, Inc. and Palantir from Palantir
Technologies, Inc. NetOwl Extractor is an information extraction system that processes
unstructured text documents to extract structured information, in particular information
about entities, events, and the links between them. However, NetOwl Extractor works on
the level of individual documents, and has no knowledge base capability. Palantir is a
knowledge base system that enables information that is in a structured form (databases,
spreadsheets, etc.) to be combined into a single knowledge base and effectively exploited
by analysts. However, Palantir does not have any mechanism to automatically process
unstructured textual source material.
This report describes the result of an experiment to combine these two systems. The goal
was to examine the potential for a fully automated process to generate a knowledge base
from a large corpus of unstructured text documents. However, the process is described as
"mostly-automatic" because one aspect of the process was not automated for the sake of
expediency. This is explained in more detail in Section 2.4.
The core activity in the experiment was writing Java code that translates from the output
of NetOwl Extractor to a format suitable for passing to Palantir. NetOwl is capable of
generating a range of output formats, including a compact XML format that describes all
information extracted by the tool from a document. This format can be readily processed
using the XML parsing facilities in the standard Java runtime environment.
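As a minimal sketch of this kind of processing, the following uses only the XML parser shipped with the standard Java runtime. The NetOwl output format itself is not reproduced in this report, so the element name "entity", the attribute name "type", and the class and method names are all hypothetical, chosen purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ExtractorOutputReader {

    /** One extracted entity: its ontology type and the text of its mention. */
    public static final class Mention {
        public final String type;
        public final String text;

        Mention(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /**
     * Reads every <entity type="..."> element from an XML string.
     * HYPOTHETICAL schema: the real extractor output uses its own element names.
     */
    public static List<Mention> readEntities(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("entity");
            List<Mention> mentions = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                mentions.add(new Mention(element.getAttribute("type"),
                        element.getTextContent().trim()));
            }
            return mentions;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalArgumentException("Could not parse extractor output", e);
        }
    }
}
```

A DOM parser is used here because each document's output is small enough to hold in memory; a streaming parser would be an equally valid choice for very large files.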
Palantir includes a facility for importing information into its knowledge base in a format
called DocXML. This is essentially an XML schema that is specialised for combining the
content of a text document together with information that can be inferred from the
document. The extensive Java API provided by Palantir includes convenience classes for
constructing DocXML document object models and writing DocXML files.
The fundamental issue in translating from the output of NetOwl Extractor to Palantir's
DocXML format was the different data models the two tools use. The way NetOwl
expresses the information it extracts from a document can be characterised as a full
entity-relationship model; that is, entities, events, and links between entities or events are
all first-class elements of the data model. However, the data model in Palantir's knowledge
base can be characterised as a simple entity model: entities and events are first-class
elements of the data model, while links between entities and events are second-class
elements. The consequences of this difference, and an approach to resolving it, are
discussed in the following section.
2. Procedure
The process for mostly-automatic construction of a Palantir knowledge base that was
explored is summarised as follows:
1. Select and preprocess a trial corpus.
2. Process the corpus documents with NetOwl Extractor.
3. Translate the output of NetOwl Extractor to Palantir's DocXML format.
4. Load the generated DocXML into a Palantir knowledge base.
The details of these steps are described below.
user requests the source of the property, they are simply shown the document text but no
part of the document text is highlighted.
Within this general approach to mapping, there are some differences in how the classes of
entities, events, and links identified by NetOwl Extractor need to be handled. This is
described below.
    event:conflict:attack_target
    dsto.c3id.ia.object.ConflictEvent
    dsto.c3id.ia.property.Description
    dsto.c3id.ia.property.EventType   Attack Target
    target   = reqnamed / dsto.c3id.ia.link.Target   /
    attacker = req      / dsto.c3id.ia.link.Attacker / dsto.c3id.ia.property.Attacker
    weapon   = opt      /                            / dsto.c3id.ia.property.Weapon
    time     = opt      /                            / dsto.c3id.ia.property.Time
    place    = opt      / dsto.c3id.ia.link.Place    / dsto.c3id.ia.property.Place

Figure 1
The first line specifies the NetOwl ontology event type to which the following
translation details apply.
The second line specifies the Palantir ontology object type that the NetOwl type is
mapped to.
The third line specifies the Palantir ontology property type that is used for
mapping the descriptive mentions of the event instance from NetOwl's ontology to
property values in the Palantir ontology.
The fourth line specifies the property type that is used to capture the third level of
the NetOwl ontology hierarchy as a property value rather than a subtype.
For example, consider a document containing the text "Yesterday a group of tribal militants
mounted an attack on Yemeni President Ali Abdullah Saleh". Assume that NetOwl has
recognised this sentence as describing an event:conflict:attack_target event, and specifically
identified the phrase "mounted an attack" as the mention of the event. In this case, the
translation based on the specification in Figure 1 would result in a Palantir event object of
type dsto.c3id.ia.object.ConflictEvent with two property values. There would be a
property of type dsto.c3id.ia.property.EventType with a value of "Attack Target", and a
property of type dsto.c3id.ia.property.Description with a value of "mounted an attack".
In Figure 1 the fifth and subsequent lines describe how attributes of the event identified by
NetOwl are to be mapped into the Palantir ontology. Each line contains four tokens, with
the following meaning.
The first token names the attribute of the NetOwl event (for example target or
attacker).
The second token specifies the level of information required for this attribute
before the event instance recognised by NetOwl will be mapped over to the
Palantir knowledge base. The keyword reqnamed means the attribute is required
and must refer to an entity that has been identified by name; req means the
attribute is required but may refer to an entity that has only been identified by
description; opt means the attribute is optional and its absence does not
result in the event instance being discarded for being too information-poor.
The third and fourth tokens specify, respectively, the link type to use when the
attribute refers to an entity that has been mapped into the Palantir knowledge base
as an object, and the property type to use when the attribute refers to an entity that
has not been mapped into the Palantir knowledge base.
Continuing the example started above, assume that NetOwl has identified the phrase "a
group of tribal militants" as an entity mentioned by description, and the attacker attribute of
the event refers to this entity. Also assume that "Ali Abdullah Saleh" has been identified as
a person mentioned by name, and the target attribute of the event refers to this entity.
Further assume that "Yesterday" has been identified as a temporal entity, and the time
attribute of the event refers to this entity. Following the specification in Figure 1, the
translation will be as follows. There will be a link of type dsto.c3id.ia.link.Target
created from the event object to the entity object corresponding to the name mention of
"Ali Abdullah Saleh". There will be a property of type dsto.c3id.ia.property.Attacker
that contains the value "a group of tribal militants", and a property of type
dsto.c3id.ia.property.Time that contains the value "Yesterday".
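The attribute rules described above can be sketched in Java as follows. This is an illustration only, not the code used in the experiment. It assumes each attribute line has the form "name = level / linkType / propertyType", that an empty token means no type was given, and that only entities mentioned by name exist as objects in the knowledge base. The class name, method names, and outcome strings are hypothetical.

```java
public class EventAttributeMapper {

    /** One attribute line of an event specification. */
    public static final class AttributeRule {
        public final String name;
        public final String level;        // reqnamed, req, or opt
        public final String linkType;     // empty if no link type is given
        public final String propertyType; // empty if no property type is given

        AttributeRule(String name, String level, String linkType, String propertyType) {
            this.name = name;
            this.level = level;
            this.linkType = linkType;
            this.propertyType = propertyType;
        }
    }

    /** Parses a line in the ASSUMED form "name = level / linkType / propertyType". */
    public static AttributeRule parseRule(String line) {
        String[] nameAndRest = line.split("=", 2);
        // Limit -1 keeps trailing empty tokens, e.g. a missing property type.
        String[] tokens = nameAndRest[1].split("/", -1);
        return new AttributeRule(nameAndRest[0].trim(), tokens[0].trim(),
                tokens[1].trim(), tokens[2].trim());
    }

    /**
     * Decides what to emit for one attribute of an event.
     * mentionText is null when the extractor found no value for the attribute;
     * named is true when the referenced entity was identified by name.
     */
    public static String apply(AttributeRule rule, String mentionText, boolean named) {
        if (mentionText == null) {
            // Only optional attributes may be absent without discarding the event.
            return rule.level.equals("opt") ? "SKIP" : "DISCARD_EVENT";
        }
        if (rule.level.equals("reqnamed") && !named) {
            return "DISCARD_EVENT";
        }
        if (named && !rule.linkType.isEmpty()) {
            return "LINK " + rule.linkType + " -> " + mentionText;
        }
        if (!rule.propertyType.isEmpty()) {
            return "PROPERTY " + rule.propertyType + " = " + mentionText;
        }
        return "SKIP";
    }
}
```

Applied to the worked example, the target rule yields a link to "Ali Abdullah Saleh", while the attacker and time rules yield properties, because neither value refers to an entity identified by name.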
Figure 2 (diagram): the NetOwl elements link:person:person_associate (roles person and
associate) and two entity:person entities (name=Bob, name=Alice), alongside the Palantir
elements Person (Name=Bob), Associate, and Person (Name=Alice).
discarded by the translation process. In testing it was found that these two link types
typically constituted less than 1.2% of the link instances recognised.
Secondly, it was found that the Palantir Workbench application did not have sufficient user
interface mechanisms to allow an analyst to readily accommodate the first option. The
application does have a mechanism in its graph view to allow an object that represents a
link between two other objects to be collapsed so that the intermediary object appears
like a direct link. However, this mechanism must be invoked manually by a user and there
is no convenient way for a user to direct the application to visually collapse intermediary
objects to direct links in bulk.
Thus it was found that, in practice, the second option provided the better cost-benefit
trade-off. Given this, a flexible translation mechanism was developed based on a simple
text file format similar to the mechanism described previously for handling events. An
example is shown below:
link:organization:org_founder
organization
founder / dsto.c3id.ia.link.Founder / dsto.c3id.ia.property.Founder
Figure 3
as the child of the link. In contrast, consider this in the context of the text "The crime gang
La Putatos was founded by a shadowy underworld figure". In this case the founder entity has
only been mentioned by description, so will not have been translated as an entity object.
Thus the translation of the link will be a property added to the object representing the La
Putatos entity, of type "dsto.c3id.ia.property.Founder" and value "a shadowy underworld
figure".
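The two outcomes for a link can be sketched in Java as follows. This is a hedged illustration rather than the translator actually written for the experiment. It assumes a three-line specification laid out as in Figure 3 (NetOwl link type, parent role, then "childRole / linkType / propertyType"), and the class name, method names, and textual output format are hypothetical.

```java
public class LinkTranslator {

    /** Parsed form of a three-line link specification. */
    public static final class LinkSpec {
        public final String netOwlType;
        public final String parentRole;
        public final String childRole;
        public final String linkType;
        public final String propertyType;

        LinkSpec(String netOwlType, String parentRole, String childRole,
                 String linkType, String propertyType) {
            this.netOwlType = netOwlType;
            this.parentRole = parentRole;
            this.childRole = childRole;
            this.linkType = linkType;
            this.propertyType = propertyType;
        }
    }

    /** Parses the ASSUMED layout: type, parent role, "child / link / property". */
    public static LinkSpec parse(String spec) {
        String[] lines = spec.trim().split("\\R");
        String[] tokens = lines[2].split("/", -1);
        return new LinkSpec(lines[0].trim(), lines[1].trim(),
                tokens[0].trim(), tokens[1].trim(), tokens[2].trim());
    }

    /**
     * Describes the translation of one link instance.
     * childIsObject is true when the child entity exists as a knowledge base
     * object; otherwise its text becomes a property on the parent object.
     */
    public static String translate(LinkSpec spec, String parent, String child,
                                   boolean childIsObject) {
        if (childIsObject) {
            return parent + " --[" + spec.linkType + "]--> " + child;
        }
        return parent + "." + spec.propertyType + " = \"" + child + "\"";
    }
}
```

For the La Putatos text, the founder is known only by description, so the second branch applies and the description is stored as a property of the organisation object.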
3. Observations
The process described above was applied to 2000 documents from the trial corpus, using
the QuickStart version 3.3.1.1 of Palantir. Although the corpus was much larger, ingestion
was halted at 2000 documents due to performance issues that were encountered. Also, it
was found that there were serious quality issues with the knowledge base that was being
constructed through the process. These observations are elaborated upon below.
This can be explained as the translator capturing the correct orthography and
pronunciation of the Turkish name Abdülhalik Çay in a document that can only contain
standard English letters. Turkish orthography is more phonemic than English in that the
spelling of a word completely determines its pronunciation; hence ü and u are considered
distinct letters with different pronunciation, as are ç and c.
A human analyst marking up this document could easily recognise the intent of the
translator; however, an automatic information extraction routine fails to understand the
translator's intent and does not recognise the text correctly.
Text: "JAMMU-KASHMIR".
Entity recognition result: A place entity named "Kashmir"; i.e. the "Jammu-"
fragment was not recognised as part of the place name.
In these examples, and many more observed in the results from the trial corpus, fragments
of text adjacent to a name were erroneously included in the name, or fragments of a name
separated by punctuation marks were omitted from the name.
Figure 4
Example of erroneous and inadequate knowledge base content - the Defence Ministry of
multiple countries being treated as a single organisation.
Figure 4 shows the result of using the graph visualisation component of the Palantir
Workspace to search for an organisation entity named "Defence Ministry" and display
entities related to it.
The trial corpus contained many documents discussing entities named "Defence
Ministry". In particular, there were documents discussing the Defence Ministry of France,
Tajikistan, and Afghanistan. But the naïve entity resolution based only on name has
combined these entities. The result is a knowledge base that contains a single "Defence
Ministry" entity linked to numerous other entities including the countries of France and
Tajikistan, and the persons of Alain Richard (French Defence Minister at the time), Sherali
Khayrulloyev (Tajik Defence Minister at the time), Emomali Rahmonov (Tajik President at
the time), Mohammad Qasim Fahim (Afghan Defence Minister at the time), and Abdul
Rashid Dostam (Afghan Deputy Defence Minister at the time).
Also apparent in Figure 4 is the way naïve fusion based on resolving objects with an exact
name match can fail when a non-English name is transcribed into English in different
ways in multiple documents. "Mohammad Qasim Fahim" and "Mohammad Qasem
Fahim" are two of the many possible romanisations of the same name, but exact name
matching does not recognise this. Also, the entities labelled "Sherali Khayrulloyev" and
"Khayrulloyev" should be resolved, but when there are documents that refer to a person
exclusively by last name and other documents that refer to a person exclusively by full
name, exact name matching will not resolve the entities.
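Both failure modes can be reproduced with a few lines of Java. The sketch below is a toy model of exact-name resolution, not Palantir's implementation: every mention whose name string matches exactly is treated as the same object. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class ExactNameResolver {

    // Resolved object name -> names of the objects it is linked to.
    private final Map<String, Set<String>> objects = new LinkedHashMap<>();

    /** Records a mention; mentions with an identical name share one object. */
    public void addMention(String name, String linkedTo) {
        objects.computeIfAbsent(name, key -> new LinkedHashSet<>()).add(linkedTo);
    }

    /** Number of distinct objects after resolution. */
    public int objectCount() {
        return objects.size();
    }

    /** Names linked to the object with the given name (empty if unknown). */
    public Set<String> linksOf(String name) {
        return objects.getOrDefault(name, Collections.emptySet());
    }

    public static void main(String[] args) {
        ExactNameResolver resolver = new ExactNameResolver();
        // Over-merging: ministries of different countries share one name.
        resolver.addMention("Defence Ministry", "France");
        resolver.addMention("Defence Ministry", "Tajikistan");
        // Under-merging: two romanisations of one person stay separate.
        resolver.addMention("Mohammad Qasim Fahim", "Defence Ministry");
        resolver.addMention("Mohammad Qasem Fahim", "Defence Ministry");
        System.out.println(resolver.linksOf("Defence Ministry"));
        System.out.println(resolver.objectCount());
    }
}
```

The single "Defence Ministry" object ends up linked to both France and Tajikistan, while the two spellings of Fahim remain distinct objects, mirroring what Figure 4 shows.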
long as 30 seconds compared to less than 10 seconds when fewer than 1000 documents had
been ingested.
Investigation of the causes of this performance degradation led to the conclusion that the
Palantir software had not been designed and optimised to support the usage model of this
experiment. Specifically, the system was not designed to store large quantities of
documents where each document had a large number of automatically generated tags
concerning the entities, events, and links mentioned in the document. An examination of
the database structure used by the software, the architecture of the client application, and
the way these two components of the software interact led to the conclusion that ingesting
large amounts of automatically annotated documents was not a usage model that Palantir
was designed to support efficiently. This suggests that this performance issue would also be
observed, albeit to a lesser extent, in the full version of the software that runs over Oracle
databases.
4. Conclusion
The mapping from NetOwl output to a Palantir knowledge base via DocXML worked
well. There are some issues with the expressivity of the Palantir knowledge base and
DocXML data models. However, in practice there did not appear to be much impedance
mismatch, and the information useful for knowledge base construction that was produced
by NetOwl Extractor could be mapped over.
However, the general approach taken is not a sensible one to adopt in
practice. Notably, it is a usage model of the Palantir system that the software does not
appear to have been designed to support efficiently. Even if this were not the case, the
performance of NetOwl Extractor when used "out of the box", combined with simplistic
entity resolution, produces a low-quality knowledge base. It seems unlikely that an analyst
would find the knowledge base useful, as any content used for analysis would need to be
rigorously inspected and validated. Given this requirement for effort on the analyst's part,
it would seem more sensible for that effort to be spent creating a focused, high-quality
knowledge base by hand.
Security classification: Document (U); Title (U); Abstract (U)
4. AUTHOR(S): Matthew C. Lowry
5. CORPORATE AUTHOR
DSTO-GD-0748
6b. AR NUMBER: AR-015-636
General Document
7. DOCUMENT DATE: June 2013
8. FILE NUMBER: 2013/1012539/1
9. TASK NUMBER: 07/329
DCDS(I&WS)
14
http://dspace.dsto.defence.gov.au/dspace/
No Limitations
17. CITATION IN OTHER DOCUMENTS: Yes
18. DSTO RESEARCH LIBRARY THESAURUS
A knowledge base describing entities, events, and links between them is a valuable tool for intelligence analysis. However, constructing
a knowledge base from unstructured source material is a labour-intensive process. This leads to the desire for a process to automatically
construct a knowledge base from unstructured source material. NetOwl Extractor is an information extraction system that processes
unstructured text documents to extract structured information. Palantir is a knowledge base system that enables structured information
to be combined into a single knowledge base and effectively exploited by analysts. This report describes the result of an experiment to
combine these two systems; specifically, to translate the output of NetOwl Extractor into a form that Palantir can ingest into its
knowledge base. It was found that although the translation process was straightforward, the knowledge base obtained was of poor
quality and questionable utility for an intelligence analyst.