
UNIVERSITY OF CALIFORNIA,

IRVINE

A Part-of-Query Tagger for Search Queries

THESIS

submitted in partial satisfaction of the requirements


for the degree of

MASTER OF SCIENCE

in Computer Science

by

Tommy Chheng

Thesis Committee:
Professor Ramesh Jain, Chair
Professor Bill Tomlinson
Professor Chen Li

2010

© 2010 Tommy Chheng
TABLE OF CONTENTS

Page

LIST OF FIGURES iv

LIST OF TABLES v

ACKNOWLEDGMENTS vi

CURRICULUM VITAE vii

ABSTRACT viii

1 Introduction 1
1.1 Problems with Full Text Search . . . . . . . . . . . . . . . . . . . . . 3
1.2 Faceted Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Facet Classification 11
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Structure of Search Queries . . . . . . . . . . . . . . . . . . . 12
2.1.2 Part-of-Speech Tagger . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 WordNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.4 Named Entity Recognition . . . . . . . . . . . . . . . . . . . . 14
2.1.5 Semantic Tagging . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.1 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Unifying Facet Classification with Query Segmentation 27


3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.1 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1.2 Noun Compound Bracketing . . . . . . . . . . . . . . . . . . . 30
3.1.3 Machine Learning Approaches . . . . . . . . . . . . . . . . . . 30
3.1.4 External Information . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Design Parameters . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Mutual Information and Generalized Bayes Classifier . . . . . 34
3.2.3 Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4 ResearchWatch: Case Study 41


4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Search Engine Components . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Apache Lucene . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Apache Solr . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Application Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.1 Faceted Search . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.2 Aggregated Data Displays . . . . . . . . . . . . . . . . . . . . 47
4.4.3 Search Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 Qualitative Comparison With Research.gov . . . . . . . . . . . . . . . 49
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5 Conclusion 53
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Bibliography 56

Appendices 59
A Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
A.1 Test Query Generation . . . . . . . . . . . . . . . . . . . . . . 59

LIST OF FIGURES

Page

1.1 An example of field search. . . . . . . . . . . . . . . . . . . . . . . . 2


1.2 A selection of the ‘male’ facet for nobel prize winners in UC Berkeley’s
Flamenco nobel prize faceted search system. . . . . . . . . . . . . . . 5
1.3 Melvyl is a University of California Library search engine using a
faceted search interface. . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1 Bing results for “barack obama” and “uci” show customized results
due to query entity recognition. . . . . . . . . . . . . . . . . . . . . . 15
2.2 An example of conditional independence assumption in Naive Bayes. 19
2.3 Facet Classification Accuracy Table using Naive Bayes. . . . . . . . 22

3.1 Query segmentation and classification accuracy. . . . . . . . . . . . . 37

4.1 A high-level architecture view . . . . . . . . . . . . . . . . . . . . . . 43


4.2 An overview of our search system. . . . . . . . . . . . . . . . . . . . . 45
4.3 ResearchWatch front page. . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Search results for “biodiversity stanford” with and without part-of-
query tagging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Research.gov’s main search interface, a field search. . . . . . . . . . . 50

LIST OF TABLES

Page

1.1 NSF Grant Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 Facets with an example record for classification. . . . . . . . . . . . . 17


2.2 Facet Classification Accuracy Table using Naive Bayes. . . . . . . . 21
2.3 Naive Bayes probability of word given class calculations. Each prob-
ability is the count of the class term frequency divided by the total
amount of terms in the class. It is greater than 1 because of Laplace
smoothing which is used to safeguard against unknown words. . . . . 23
2.4 The confusion matrix for Naive Bayes classification predictions with
60,000 training examples. . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Examples of Naive Bayes classification errors. . . . . . . . . . . . . . 24

3.1 Possible segmentations for “solar university of washington” . . . . . 28


3.2 Query segmentation and classification accuracy comparing Naive Bayes
assumption and word frequency threshold parameter. . . . . . . . . . 38
3.3 Incorrect segmentation/classification examples . . . . . . . . . . . . 39

ACKNOWLEDGMENTS

I would like to thank my advisors and committee members: Bill Tomlinson, Ramesh
Jain and Chen Li who have taken time to help and aid this project. I thank Amazon
Web Services for providing EC2 hosting and the Alfred P. Sloan Foundation for
providing the funds to request the NSF grants dataset. The Social Code group for
the helpful discussions. My parents for their inspiration and hard work ethic. And
lastly, my girlfriend Gina, for dragging me away from the computer to enjoy the sun.

CURRICULUM VITAE

Tommy Chheng

EDUCATION
University of California, Irvine Irvine, California
Master of Science in Computer Science 2010

University of California, San Diego La Jolla, California


Bachelor of Science in Computer Engineering 2007

ABSTRACT
A Part-of-Query Tagger for Search Queries

By

Tommy Chheng

Master of Science in Computer Science

University of California, Irvine, 2010

Professor Ramesh Jain, Chair

In this thesis, we explore expanding beyond keyword searches for structured data
when using a text field search interface. By associating meaning with each term in a
query, a search system can provide more relevant results. This thesis will cover the
implementation of a part-of-query tagging system which tags a search query’s terms
with their associated metadata fields. Specifically, we create the system using a unified
segmentation and classification algorithm based on a generalized Bayes classifier. We
perform evaluations based on a generated test set and achieve an 85% accuracy in
tagging the terms in a search query to the correct metadata fields. We incorporate the
system into ResearchWatch, an unofficial search engine for NSF research grants, and
demonstrate the applicability of the system by providing a qualitative comparison to
Research.gov, the official NSF research grant search engine.

Chapter 1

Introduction

In 2009, the Obama administration pushed for increased public access to high quality
datasets through data.gov. These datasets come in a variety of formats: XML, CSV,
KML, shapefile. An important trait of many of these datasets is that they include
metadata rather than just raw text. With high quality metadata, practitioners
can implement higher quality systems for utilizing the datasets.

In particular, we look at developing search applications. When given a raw text
dataset, search applications are forced to use a keyword search approach with a
search text box as the interface. This general purpose approach is commonly accepted
because it is a flexible method of allowing users to input search queries. When the
search system only contains documents with unstructured text, the text field interface
is sufficient. However, when the underlying data contains metadata information, the
text box has limitations. For example, a keyword search for “toyota tacoma” would
search for the words “toyota” and/or “tacoma.” This is less precise than searching
for “automobile make:toyota” and “automobile model:tacoma.”

Field search is one possible solution. Field search works by searching for the terms
only in the specified field. One criticism of field search is the number of options
presented to the user. Additionally, as the search data becomes broader, it becomes
difficult to select the right fields to display. See figure 4.5 for an example of field
search.

Figure 1.1: An example of field search.

Another option is to create a search system which collapses the metadata and appends
each metadata field to just one text field for search. For example, imagine a product
database for automobiles with an example entry of (make: “toyota”, model: “prius”,
review text:“The Toyota Prius offers better gas mileage than the Honda Insight.”).
Search systems can collapse this entry to index the one field as “toyota prius The
Toyota Prius offers better gas mileage than the Honda Insight.” This makes the
indexing and searching simple to implement but this is not an optimal solution as it
reduces the precision of the search. If a user searches for “honda insight”, the “toyota
prius” entry will be returned because it included the words “honda insight” in the
entry.

In this thesis, we seek to improve search accuracy by harnessing metadata information.
Our goal is to use the traditional text field interface but process the query like a field
search. We call this a part-of-query (POQ) tagger because the query is segmented and
each segment is tagged with its associated metadata field. This approach retains the
familiar text box interface while taking advantage of structured data information.
To prove the effectiveness of the approach, we perform experiments testing the
classification and segmentation accuracy. In addition, we demonstrate the approach
by integrating the POQ tagger into a custom-developed search engine for NSF research
grants called ResearchWatch (http://researchwatch.net), and qualitatively compare
search tasks against the official NSF research grant search at http://research.gov.

1.1 Problems with Full Text Search

We set up a traditional full text search engine to show the problems associated with
full text search (also referred to as keyword search). A query for “university of
washington” most likely means the user intended grants from the organization “university
of washington.” In a full text search, we were provided with 5354 results. However,
only 1812 results were actually from the University of Washington. A full text search
engine will by default search for “university AND of AND washington.” Even if we
double quote the term to match the exact phrase, we still get 2126 results. The
phrase results included 8 possibly unrelated grants from Cornell University because
“university of washington” was mentioned in the grant’s abstract: “...organizing
committee members are Dr. Philip Liu (Cornell University), Dr. Harry Yeh (University
of Washington...”

The situation is more troublesome when we want the search engine to aggregate
information about the results. Aggregated statistics can be a useful summary of a
result set, but if the query is imprecise, as in the case of a full text search, then the
aggregation is incorrect as well. Consider showing the user the total amount awarded
for the grants in the result set. When we performed the full text search for “university
of washington,” we received a total awarded amount of $2,692,464,893.00, whereas the
real amount is $791,505,950.00.

1.2 Faceted Search

A faceted search system allows a user to search and filter content using a hybrid
search/browse interface. The interface provides structural metadata hints for the
user to refine their search. It has become a standard interface for e-commerce, digital
libraries and other types of applications; see figures 1.2 and 1.3 for examples of faceted
search. In a user study on faceted search [10], Kules reported that a typical search task
was composed of a user spending 50 seconds looking at the results, 25 seconds on
the facets and only 6 seconds on the query. In fact, the study noted that the facets
appeared to be as important as or more important than the results. The facets serve
as a distributive view and summary of the results. One particularly well-suited task
for faceted search is exploratory search, in which users lack the knowledge to perform
meaningful search queries in the particular domain. Faceted search helps exploratory
search by using the facets to give the user query suggestions. In this sense, faceted
search merges browsing and searching together.

Figure 1.2: A selection of the ‘male’ facet for nobel prize winners in UC Berkeley’s
Flamenco nobel prize faceted search system.

One downside of some faceted search implementations is the lack of integration with
text search. In figure 1.3, we performed a search for “mark twain fiction” on Melvyl,
the UC Library Catalog site. Unfortunately, this only searches the text body
rather than the metadata. The user’s intent for “mark twain fiction” most likely
was books by the author “mark twain” in the category “fiction,” yet the results
included books in all categories, not just “fiction.” Our system aims to improve
faceted search by automatically selecting the particular facets using our POQ tagger.

Figure 1.3: Melvyl is a University of California Library search engine using a faceted
search interface.

1.3 Definitions

Throughout this thesis, we reference the terms below:

Facet
A facet is a component of an object. The principal investigator (PI) is a facet of
a research grant. Other commonly used names are fields, attributes, metadata
or filters.

Inverted Index
Search engines typically store their index using an inverted index. An inverted
index works by breaking up a document’s text and using the words as the
lookup keys and the document ids as the values. Other databases commonly
store data using document ids as the lookup keys and the full text as the values.
The inverted structure gives search engines efficient performance because users
search by words rather than document ids.
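
As a minimal illustration in Scala (the language of our implementation), the sketch below builds a toy inverted index; the names and structure are illustrative only, not the index used by Lucene or Solr.

// Build a toy inverted index: word -> set of document ids.
object InvertedIndexSketch {
  def buildIndex(docs: Map[Int, String]): Map[String, Set[Int]] =
    docs.toSeq
      .flatMap { case (id, text) => text.toLowerCase.split("\\s+").map(word => word -> id) }
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).toSet }

  def main(args: Array[String]): Unit = {
    val docs = Map(1 -> "solar energy research",
                   2 -> "solar cells at the university of washington")
    val index = buildIndex(docs)
    println(index("solar"))       // Set(1, 2): lookup is by word, not by document id
    println(index("university"))  // Set(2)
  }
}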

N-gram
An n-gram is a subsequence of n items of a sequence. In this thesis, n-gram
items are words. In information retrieval, an n-gram model is used to index
phrases for search. For example, the phrase “toyota prius” would be indexed as
one entity in a bigram indexing whereas it would split into “toyota” and “prius”
for a unigram indexing.
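
For illustration, a word n-gram window can be produced with a short Scala helper (a hypothetical sketch, not the thesis code):

// Extract word n-grams with a sliding window.
def ngrams(text: String, n: Int): Seq[String] =
  text.toLowerCase.split("\\s+").toSeq.sliding(n).map(_.mkString(" ")).toSeq

// ngrams("toyota prius gas mileage", 2) => Seq("toyota prius", "prius gas", "gas mileage")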

Precision and recall are common metrics for evaluating search results.

Precision
Precision measures how many of the returned results are actually relevant.

Recall
Recall measures how many of the relevant results the system was able to retrieve.
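
For reference, the standard set-based definitions (a reminder, not specific to this thesis) are:

\text{precision} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|}
\qquad
\text{recall} = \frac{|\{\text{relevant}\} \cap \{\text{retrieved}\}|}{|\{\text{relevant}\}|}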

Features
Machine learning algorithms use features as the basis for accomplishing the
needed task (classification, clustering, etc.). For example, the number of times the
word ‘drug’ appears in a message can be a feature for a spam filter.

Supervised Learning
A machine learning algorithm that requires a set of correctly labeled data to
make inferences about an unknown example. For example, given a set of spam
and legitimate emails, a supervised learning spam filter’s goal is to predict whether
an unknown email is spam.

Unsupervised Learning
In contrast, an unsupervised learning algorithm does not need any correctly labeled
data to make inferences about an unknown example. For example, given a set of
newsgroup messages from 20 unlabeled newsgroups, an unsupervised learning
algorithm can group the messages into 20 clusters.

1.4 Dataset

Our data is the set of 120,000 NSF grant abstracts from 1999 to 2009. Each abstract
is a semi-structured XML document; the provided data is a flat tree per record.
We retrieved the dataset by writing a script to query the Research.gov search for
XML output. A list of the metadata fields is shown in table 1.1.

awardnumber awardedamounttodate
principalinvestigator piemailaddress
title abstract
co pinames programs
startdate lastamendmentdate
state organization
awardinstrument programmanager
expirationdate fieldofapplications
organizationstreetaddress city
organizationstate organizationzip
organizationphone nsforganization
nsfdirectorate programelementcodes
programreferencecodes

Table 1.1: NSF Grant Fields

Like most datasets, the data was not completely clean. We performed the bare minimum
data cleaning needed to index the data into the open source search system, Apache Solr.
Some data cleaning steps included whitespace trimming, correcting date formats and
removing invalid XML characters. There was inconsistent formatting, and possibly
invalid values, which we did not remove. The name fields principalinvestigator,
co pinames and programmanager had different formats: principalinvestigator was
formatted “Last Name, First Name” while co pinames and programmanager had the
format “First Name Last Name.” We did not normalize these name fields. There were
some entries which we suspect were invalid. There was a PI named “None, None,”
and “Brodsky, Emily” exists in the principalinvestigator field while another entry has
“Brodsky, Emily” in the organization field. City names had different values for the
same canonical city; the city “La Jolla” had entries of “LA JOLLA” and “LaJolla.”
We did not attempt to fix these examples because unclean data is a natural part of
data sets and keeping it is more representative of real data.

1.5 Contributions

Our contributions can help researchers and practitioners:

* A unified query segmentation and facet classification algorithm for search queries
over structured data. Our approach uses a Bayes classifier for query segmentation
and facet classification of search query terms. We call the unified approach
a part-of-query tagger.

* Our system is the first to use the NSF grants as a data set for research. We
propose that other researchers make use of NSF grants as a data set for future
research.

* We present the design and implementation of our algorithms as a complete web
application, ResearchWatch.

* For government officials and researchers, ResearchWatch will be available as an
alternative search interface to the NSF’s official Research.gov grant search site.

1.6 Thesis Overview

This thesis is divided into three main chapters:

In Chapter 2, we cover the facet classification algorithm. Chapter 3 will integrate the
facet classifier with a query segmentation layer which together we call the part-of-
query tagger. Chapter 4 is a case study of implementing the proposed solutions on
the search engine, ResearchWatch.

We conclude in Chapter 5 with a discussion of the limitations of our approach as well


as possible future research directions.

Chapter 2

Facet Classification

When a user performs a search query using just a keyword search, there is ambiguity
in the search results. If the user searches for “solar ca” against a dataset of NSF
grants, the user’s intent was most likely NSF grants involving “solar” in the state of
“CA”. However, a full text search will also include any results with the word “ca”.
In this case, results included “Ca” as the chemical symbol for calcium because case
is often ignored. The keyword search method maximizes recall, while a structured
search which searches for text within metadata fields maximizes precision.

We hypothesize that most queries will be primarily based on searches against
key metadata fields in the dataset. This is in contrast to how keyword search prioritizes
the query. The goal in most keyword searches is to perform the most general
search possible; more specifically, the most general search is simply searching
for each word in the search query. When there is a small number of records in
a dataset, performing a general search is fine because the user can sift through the short
list of results. However, when we have a large number of records in our dataset, it
is better to provide a more restrictive filter. Thus, we take a more restrictive stance and
tag each phrase in the query with a metadata field. If we can search for each phrase
according to its most likely field, we hypothesize the results will be more accurate.

2.1 Related Work

2.1.1 Structure of Search Queries

In 2004, Beitzel et al. [3] reported that search queries are often very short, with an average of
2.2 words. They examined query logs from 50 million America Online users during
a 7-day period from Dec 2003 to Jan 2004. Additionally, 81% of users only viewed page one of
the results. Beyond query length, we investigated research related to the structure
and meaning of search queries.

There are commercial products that attempt to process queries as natural language
questions. Powerset is a recent application that aims to process such queries and
provide an answer rather than just documents which may contain the answer. For example,
a natural language query is “who is the current president of the usa?” An NLP search
engine would provide the answer “obama” rather than documents containing the
answer. Unfortunately, the majority of searches are not in natural question form [2].
Most queries are written as short phrases, and our work focuses on queries
based on short phrases rather than grammatically correct questions.

2.1.2 Part-of-Speech Tagger

An approach to understanding a user’s intent is to first tag each query token with
its part of speech. By tagging a word with its part-of-speech (POS), the word can
be disambiguated. Consider that the word “bike” has multiple meanings related to its
POS: the verb form can refer to “places to bike” or “how to bike” and the noun form
can mean “information about types of bike” or “i want to buy a bike”. Allan and
Raghavan [1] proposed an integrated solution by combining a computational method
with a human interface for feedback. A POS tagger is executed on the content at
index time. Once a user performs a query, they are prompted with the possible word
senses along with examples for disambiguation. In this chapter, we will focus our
research on computational algorithms and follow up on interface design in a later
chapter.

There are two main approaches for automated POS taggers: rule-based and probabilistic.
Rule-based taggers use a set of hand-written or machine-generated rules to
assign tags to a word. For example, a rule can tag a capitalized word in the middle of
a sentence as a proper noun. Probabilistic taggers use a training set to learn the most
probable tag for a word. The probabilistic method is more flexible but requires much
work in generating good training data. The Brill Tagger is a well-known transformation-based
POS tagger in the NLP community [6]. The transformation-based tagger is a hybrid
of the two approaches: it uses a probabilistic approach to learn strong rules. Part-of-speech
tagging systems are typically evaluated using a Wall Street Journal corpus from the
Penn Treebank [15]. By training the Brill Tagger on the WSJ corpus, Barr’s test on
search query logs yielded only a 48.2% accuracy rate. When the Brill Tagger was
re-trained with labeled queries, the accuracy rate rose to 69.7%. The explanation is
that the structure of search queries is very different from the structure of a traditional
text corpus.

We chose not to use POS taggers in our system because our preliminary experiment
with a POS tagger confirmed that the grammatical structure it relies on is not present
in search queries. Many POS taggers rely on grammatical cues, including capitalization
for proper nouns. Web queries do not abide by grammatically correct rules. In fact, Barr
showed capitalization by searchers is often incorrect and harms POS accuracy. To
better determine the user’s intent, Barr recommends using external information or
entity detection.

2.1.3 WordNet

WordNet [7] is a large English dictionary which groups words together into sets of cognitive
synonyms known as synsets. A synset can be connected to another synset by a
semantic relation such as hypernymy, hyponymy, holonymy, meronymy and others.
For our usage, we would have liked to use WordNet to determine the hypernym of a
particular term. A hyponym describes an ‘is-a’ relationship, as in a dog is a hyponym
of canine. Unfortunately, WordNet is most useful for general English literature;
it does poorly on more domain-specific terms. We were able to use WordNet to
look up the hypernym of “Stanford University” as “university”, but the majority of other
proper organization names came up blank, including “Washington University” and
“University of California-Irvine.”

2.1.4 Named Entity Recognition

Named Entity Recognition (NER) is the process of recognizing named individuals or
organizations in a body of text. Traditional entity detection has been reported to achieve
high accuracy. A state-of-the-art NER tagger such as the Illinois Named
Entity Tagger has a best F1 score¹ of 90.8 on the Wikipedia corpus [19]. Like POS
tagging systems, NER relies on grammatically correct syntax to infer entities. For
instance, the interactive demo from [19] was able to detect “University of California
Irvine” as an organization but “university of california irvine” was not detectable.

¹ The F1 score is a combination of precision and recall scores.


Figure 2.1: Bing results for “barack obama” and “uci” show customized results due
to query entity recognition.

Named entity recognition in queries is an important step toward understanding the user’s
intent. A Microsoft study reports that 20-30% of queries were solely named entities and
71% of queries included entities [21]. One particular application of NER for queries
can be seen in Microsoft’s Bing search. In figure 2.1 (a) and (b) respectively, we see
the Bing search results page for “barack obama” and “uci”. By knowing whether the query
is for a person or organization rather than plain keywords, search engines can incorporate
more related information. We see the results for “barack obama” include links for
biography and quotes which are only useful for a person entity. The results for “uci”
show metadata information about the university such as the number of students
enrolled.

To accomplish NER, researchers apply machine learning algorithms. In [8], Guo


applied named entity recognition to search queries using a weakly supervised learning
method known as Weakly Supervised Latent Dirichlet Allocation (WS-LDA) on query
logs. The probabilistic method represents each query as a triple (e, t, c) where e is
the entity, t the context of e and c the class. The learning problem is posed as:

\max \prod_{i}^{T} p(e_i, t_i, c_i) \qquad (2.1)

As an example, the optimal triple for “harry potter walkthrough” would be the
parameters (e = ‘harry potter’, t = ‘# walkthrough’, c = ‘game’).

NER is a more restrictive case of facet classification. Our facet classification classifies
a term against metadata fields; it makes no judgment as to whether the term is an
entity. Our methods differ from Guo in that we start with a Naive Bayes classifier,
an arguably simpler supervised learning method.

2.1.5 Semantic Tagging

Semantic tagging is another similar topic presented in [14] where Medhi presents
a hybrid probabilistic/NLP method. Semantic tagging assumes a query has been
categorized into a particular domain such as product or job. Instead it focuses
on how to break the assumed query into its semantic components. For example:
“cheap garmin streetpilot c340 gps” can be tagged as “sortOrder:cheap Brand:garmin
Model:streetpilot Model:c340 Type:gps”.

Medhi’s work is different from other semantic extraction models such as hidden
Markov models (HMMs), maximum entropy Markov models (MEMMs) [16], and
conditional random fields (CRFs) [11] because Medhi’s approach does not model based
on sequential ordering. Medhi reports sequential models are not suited for search
query tagging because sequential models treat input text as ordered text whereas
web search queries are more often a bag of words. A bag of words model uses term
counting as the fundamental feature and it does not consider ordering in the model.

Medhi cites the example of “garmin gps cheap” and “cheap garmin gps” as examples
of the same query to support the theory that queries mostly have a bag-of-words
construction.

Like our differences with NER, we seek only to tag a query with a particular
metadata field. Our work is similar to Medhi’s goal of tagging the terms, but we
attempt to solve the problem using a simpler probabilistic supervised learning
model. Additionally, we use the content itself to build the learning model rather than query
logs.

2.2 Implementation

principalinvestigator Tomlinson, William


nsforganization IIS
organization University of California-Irvine
organizationstate CA
city Irvine
nsfdirectorate CSE

Table 2.1: Facets with an example record for classification.

We approach the problem of tagging a query with a metadata field by transforming
it into a classification problem. We consider the facets as classes and query
terms as the words to classify. For example, “university of washington” should
be classified as an “organization” and “los angeles” should be classified as a “city.”
Table 2.1 shows a list of the facets we will classify. To solve this classification
problem, we start with the Naive Bayes classifier.

2.2.1 Naive Bayes Classifier

Naive Bayes is a fast, robust and accurate classification model. The classifier relies
on a simple probabilistic method, using Bayes’ Rule with a “naive assumption” to
classify data. In this classification problem, we want to find the probability of a term
belonging to a category, p(C|T), where C and T are random variables for classes and
terms respectively. Let c_i and t_j be values of a class and a term respectively. We want
to select the class c_i which maximizes p(C = c_i | T = t_j), which can be
expressed in terms of Bayes’ Rule:

p(C = c_i \mid T = t_j) = \frac{p(T = t_j \mid C = c_i)\, p(C = c_i)}{\sum_{k}^{M} p(T = t_j \mid C = c_k)\, p(C = c_k)} \qquad (2.2)

where p(T = t_j | C = c_i) is the information we can compute using our training data and
p(C = c_i) is the prior probability of the class c_i. The denominator is a normalizing
factor.

The query term random variable T is a feature vector composed of W_1 ... W_n, where
n is the length of the query term and each W_i is a random variable representing a
word. We do this decomposition because our training data has the probability
for each W_i. Thus, we can rewrite the equation as:

p(C = c_i \mid W_1 \ldots W_n) = \frac{p(W_1 \ldots W_n \mid C = c_i)\, p(C = c_i)}{\sum_{k}^{M} p(W_1 \ldots W_n \mid C = c_k)\, p(C = c_k)} \qquad (2.3)

The “naive” part of the classifier comes from the fact that we assume the terms in
the conditional probability p(W_1, W_2 ... W_n | C) are conditionally independent, as seen
in figure 2.2. By assuming the terms are conditionally independent, we can expand
the joint conditional probability into a product:

p(W_1 \ldots W_n \mid C) = \prod_{j}^{n} p(W_j \mid C) \qquad (2.4)

University of Washington
P(university, of, washington | C)

University of Washington
P(university | C) P(of | C) P(washington | C)

Figure 2.2: An example of the conditional independence assumption in Naive Bayes.

This simplification allows the original joint conditional probability p(W_1 ... W_n | C) to
be computable. In reality, we know the “naive” assumption is not true; certain words
are unlikely to be independent. For instance, we know that the two words “HIV” and
“AIDS” will occur together more frequently than “HIV” and “grasshoppers.” Even
knowing that the independence assumption isn’t true, the Naive Bayes classifier has been
reported to work extremely well. Zhang offers a detailed explanation of Naive Bayes’
optimality as a classifier in [22].

We replace the conditional probability with the product:

p(C = c_i \mid W_1 \ldots W_n) = \frac{p(C = c_i) \prod_{j}^{n} p(W_j \mid C = c_i)}{\sum_{k}^{M} \prod_{j}^{n} p(W_j \mid C = c_k)\, p(C = c_k)} \qquad (2.5)

We can now compute each term in the equation above. Given a query term
<W_1 ... W_n>, we can compute the probability that it belongs to a class c_i. Now we
want to pick the most likely class C:

\arg\max_{c_i} \frac{p(C = c_i) \prod_{j}^{n} p(W_j \mid C = c_i)}{\sum_{k}^{M} \prod_{j}^{n} p(W_j \mid C = c_k)\, p(C = c_k)} \qquad (2.6)

Since the denominator does not depend on c_i, we can discard it without affecting the
result:

\arg\max_{c_i}\; p(C = c_i) \prod_{j}^{n} p(W_j \mid C = c_i) \qquad (2.7)

To estimate p(C = c_i), we use the maximum likelihood estimate, which is just the
frequency of the class c_i over all training examples:

p(C = c_i) = \frac{\#(C = c_i)}{|C|} \qquad (2.8)

To estimate the parameters p(W_j | C = c_i), we use term frequencies from our training
set: #(W_j | C). However, there is a problem when we encounter unknown words:
there is insufficient data and a term frequency can be 0 in the training set. We
address this by applying Laplace smoothing, which adds 1 to the frequency:

\arg\max_{c_i}\; \frac{\#(C = c_i)}{|C|} \prod_{j}^{n} \left( \#(W_j \mid C = c_i) + 1 \right) \qquad (2.9)

Naive Bayes is an extremely efficient algorithm. The time complexity for training is
linear, O(m), where m is the number of words in all training examples. The training
is a single-pass iteration across all words to build the p(W_i | C) entries as a word count
dictionary. Testing a target consists of looking up each word in the test document in
the word count dictionary. The time complexity is also linear, O(kn), where k is the
number of classes and n is the number of words in the test document.
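
To make the training and scoring steps concrete, here is a minimal Scala sketch (Scala being the language of our implementation) that follows equations 2.8 and 2.9. The class and method names are illustrative and the log-space scoring is an implementation convenience; this is not the thesis code itself.

// Minimal Naive Bayes sketch over (facet, field value) training pairs.
// Illustrative names; log-space scoring is an assumption made to avoid floating-point underflow.
class NaiveBayesSketch(training: Seq[(String, String)]) {
  private val tokenized: Seq[(String, Seq[String])] =
    training.map { case (facet, value) => (facet, value.toLowerCase.split("\\s+").toSeq) }

  // p(C = c_i): class frequency over all training examples (equation 2.8).
  private val classPrior: Map[String, Double] = {
    val total = training.size.toDouble
    tokenized.groupBy(_._1).map { case (c, xs) => (c, xs.size / total) }
  }

  // #(W_j | C): per-class word counts used by the Laplace-smoothed estimate in equation 2.9.
  private val wordCounts: Map[String, Map[String, Int]] =
    tokenized.groupBy(_._1).map { case (c, xs) =>
      (c, xs.flatMap(_._2).groupBy(identity).map { case (w, ws) => (w, ws.size) })
    }

  // Pick the facet maximizing prior * product of (count + 1), computed as a sum of logs.
  def classify(query: String): String = {
    val words = query.toLowerCase.split("\\s+").toSeq
    classPrior.keys.maxBy { c =>
      val counts = wordCounts(c)
      math.log(classPrior(c)) + words.map(w => math.log(counts.getOrElse(w, 0) + 1.0)).sum
    }
  }
}

Training is the single pass that builds classPrior and wordCounts; classification is the dictionary lookups described above.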

2.3 Experiment

We ran our experiments on a collection of 120,000 NSF grants. In our experiment,
we selected six classes as the possible facets; examples can be seen in table 2.1.

It would have been optimal to use query logs to determine the likely search targets
performed by users. Unfortunately, we could not secure this information from the
NSF. Instead, we use training data from the grant records. We split the provided
NSF grant records into a variable number of grants for training and 4,000 grants
for testing. The testing set is disjoint from the training set, meaning there are no
overlapping grants. For example, the training set of 10,000 grants provided 59,983
fields, the training set of 20,000 grants provided 119,962 fields, and the test set of
4,000 grants provided 23,987 fields. For each grant, we used each field-value pair as a
training example. If a grant had a missing field-value pair, we skipped the pair but
still included the grant’s other non-null field-value pairs.

Training Grants Training Examples Total Correct / Total Ran Accuracy


1000 5998 19539.0 / 23987 0.815
2500 14995 20322.0 / 23987 0.847
5000 29993 20789.0 / 23987 0.867
10000 59983 21265.0 / 23987 0.887
20000 119962 21822.0 / 23987 0.910
30000 179928 22139.0 / 23987 0.923
40000 239862 22375.0 / 23987 0.933
50000 299665 22517.0 / 23987 0.939
60000 359639 22667.0 / 23987 0.945

Table 2.2: Facet Classification Accuracy Table using Naive Bayes.

Figure 2.3: Facet Classification Accuracy Table using Naive Bayes.

2.4 Discussion

Term                                   Probability
p("university" | "organization")       1.202
p("of" | "organization")               1.132
p("washington" | "organization")       1.007
p("university" | "city")               1.01
p("of" | "city")                       1.00
p("washington" | "city")               1.01

Table 2.3: Naive Bayes probability-of-word-given-class calculations. Each probability
is the count of the term in the class divided by the total number of terms in the
class; values are greater than 1 because of the Laplace smoothing that is used to
safeguard against unknown words.

The results are shown in table 2.2. Even with minimal training examples, the simple
Naive Bayes classifier achieves high accuracy. The dataset itself is very separable because,
for the most part, the metadata fields contain very little overlap: principal investigators
are people, organizations are places and cities are locations. In table 2.3, we
show an example of the probabilities for the term “university of washington.” We see
that the decision came down to choosing between “organization” and “city.”
The word “of” is given a score of 1 because the word did not appear in the “city” class,
and we assign it a default score of 1 because of Laplace smoothing. When the
scores for each class are multiplied, we see a higher probability for “organization.”

p(\text{“university of washington”} \mid \text{“organization”}) = 1.202 \times 1.132 \times 1.007 = 1.37 \qquad (2.10)

and

p(\text{“university of washington”} \mid \text{“city”}) = 1.01 \times 1.0 \times 1.01 = 1.0201 \qquad (2.11)

The other fields did not even have an occurrence of the words and thus had default
scores of 1.0.
In our implementation, we used the full string for calculating the probabilities. We did
not remove stopwords like “and” or “of.” While stopword removal is used in many keyword
search systems, stopwords can be useful information for distinguishing between two
different entities. Retaining stopwords is also an important decision because we treat
the probabilities as a bag of words; more specifically, we do not consider ordering
in our model. Had we removed stopwords, a query for “university of washington” would
have dropped “of” and “washington university” would have been an equally good match.

2.4.1 Errors

Predicted \ Correct   pi     state   city   nsfdirectorate   organization   nsforganization
pi                    3520   -       8      -                13             -
state                 10     3964    4      -                1              -
city                  64     29      3842   -                139            -
nsfdirectorate        -      -       -      4000             -              5
organization          14     -       87     -                3837           -
nsforganization       2      -       -      -                -              3993

Table 2.4: The confusion matrix for Naive Bayes classification predictions with 60,000
training examples.

Test Entry Predicted Actual


UNIVERSITY PARK organization city
GEO nsfdirectorate nsforganization
LA city state
Florence principalinvestigator city
de Silva, Shanaka state principalinvestigator

Table 2.5: Examples of Naive Bayes classification errors.

The confusion matrix can be seen in table 2.4. This confusion matrix shows the
predicted classes against the actual classes. The majority of misclassifications were
between organization and city; the two classes have overlapping data. “Stanford” the city and
“Stanford” the university are both likely candidates. PIs were misclassified because
many PI names are not frequently occurring. This is especially true of PIs with
non-western names. Misclassification examples can be seen in table 2.5.

Some of the errors were unavoidable: GEO occurs in NSF Directorate 15,584 times
while the same word GEO occurs in NSF Organization 352 times. Naturally, by
predicting that GEO will occur in NSF Directorate, we will have an error whenever the
user meant the NSF Organization.

The PI name “de Silva, Shanaka” was mistaken as a state because “de” is also the
state initial for Delaware. “Silva” occurred 29 times and “Shanaka” occurred 6 times;
however, since the training set only takes a subset of the entire content, the two
examples may not have been selected. A possible solution is to use case sensitivity
because “de” is different from “DE.” Unfortunately, the data for other fields is not
normalized. For the city “La Jolla”, we saw both “La Jolla” and “LA JOLLA.”
Considering case sensitivity in our model would have broken the non-normalized
entries.

LA is an interesting case: the word “LA” occurs in the city facet because of “La Jolla”
and “La Crosse”. UC San Diego and the affiliated Scripps Research Institute are
both located in La Jolla and, surprisingly, receive more grants than the whole state
of Louisiana (“LA”). La Jolla organizations received 1,678 grants totaling $930 million
while Louisiana received 1,131 grants totaling $362 million.

2.5 Conclusion

Compared to other classifiers such as neural networks or SVMs, Naive Bayes is very
transparent in its predictions. The transparency is an attractive point for system
implementors in determining why a query was classified as a certain class. This is
especially true when the classes have overlapping features. For example, if both the
organization and city classes contain the word “Washington”, we can easily discover which class
is more probable because the predicted value is just the maximized product of term
frequencies from each class.

It is also important to note the speed of a Naive Bayes classifier. Both training and
testing ran in a range of 3-10 seconds. We implemented the algorithm in Scala²
using the Sun HotSpot Java Virtual Machine (JVM) 1.6. The only adjustment was
setting the JVM heap space to -Xmx1024m for increased memory; the remainder of
the settings were the defaults. Because of the linear time complexity, a Naive Bayes
classifier can scale very well for both training and testing.

While the results have shown high accuracy, each entry in the test set was already
segmented because the test set was extracted from the data itself. For a more
realistic study, a likely search query could contain more than one facet. For
example, “stanford university open source” is a search query for grants at “stanford
university” mentioning “open source.” In the next chapter, we utilize query
segmentation to segment a search query and tag each segment with the likely facet.

² Scala is a hybrid functional and object-oriented programming language on the JVM platform.
Chapter 3

Unifying Facet Classification with


Query Segmentation

Query segmentation is the problem of separating a search query’s tokens into phrases
that reflect the user’s intent. A correct segmentation improves precision because
the exact query phrase is matched as opposed to separate keyword matches. For
example, a typical search for “chinese restaurants new york city” is often expressed as
a combination of each term: “chinese AND restaurants AND new AND york AND
city.” A more precise search would group the phrases: “chinese restaurants”
and “new york city.”

In the previous chapter, we introduced facet classification and used content from
the dataset for evaluation. We showed that the Naive Bayes classifier can achieve an
accuracy of 94% on a test set restricted to one facet per testing record. In this chapter,
we tackle the problem of facet classification on search queries which may contain more
than one facet. We create a unified algorithm which uses a Bayes classifier to segment
phrases and classify them into facets.

(solar) (university) (of) (washington)
(solar university) (of) (washington)
(solar university of) (washington)
(solar university of washington)
(solar) (university) (of washington)
(solar) (university of) (washington)
(solar) (university of washington)
(solar university) (of washington)

Table 3.1: Possible segmentations for “solar university of washington”

Formally, we define the query segmentation problem as:

Given: a sequence of words S = w_1 ... w_n
Output: a sequence of grouped words (w_1 ... w_x)(w_{x+1} ... w_n) ...

If a researcher needs to find information on solar research at the University of
Washington, a realistic query could be “solar university of washington.” The possible
segmentations are listed in table 3.1. The segmentation which best represents the
user’s intent is (solar) (university of washington).

While generating all possible ordered segmentations is O(2^n), n is the number of
query tokens and is typically less than 8. Because n is low, testing against all possible
ordered segmentations is feasible and still valid for a scalable system.
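
To illustrate the enumeration step, the following Scala sketch generates every contiguous grouping of a token list; the function name and types are illustrative rather than the thesis code.

// Enumerate every contiguous segmentation of a token list; there are 2^(n-1) of them.
def segmentations(tokens: List[String]): List[List[List[String]]] = tokens match {
  case Nil => List(Nil)
  case head :: tail =>
    segmentations(tail).flatMap { rest =>
      // Either `head` starts a new segment, or it is prepended to the first segment of `rest`.
      val startNew = List(head) :: rest
      rest match {
        case first :: others => List(startNew, (head :: first) :: others)
        case Nil             => List(startNew)
      }
    }
}

// segmentations(List("solar", "university", "of", "washington")) yields the
// 2^3 = 8 groupings of table 3.1, e.g. List(List("solar"), List("university", "of", "washington")).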

3.1 Related Work

A simple approach to query segmentation is to require the user to double quote


the phrases. Many search engines use the double quotes as a common practice to
distinguish a phrase from a set of keywords. However, this method requires the user
to be aware of the feature and thus is not frequently used.

We discuss a variety of query segmentation algorithms currently in the literature.
The main limitation of the current research in query segmentation is its focus on
unstructured text data. This is not surprising since the majority of the Internet is
unstructured data. Thus, we cannot use any existing work directly because our data
set contains structured data. However, query segmentation research on unstructured
data still warrants a look as some of its key ideas still apply to our problem.

3.1.1 Mutual Information

One of the earliest works we found on query segmentation was in [9]. Risvik et al. built
a query segmentation system using connexity measures. The connexity measure is
defined as the product of the sequence frequency and the mutual information of the
longest two sub-sequences:

\text{connexity}(S) = \text{freq}(S) \cdot I(w_1 \ldots w_{n-1};\, w_2 \ldots w_n) \qquad (3.1)

where mutual information is defined as:

I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left( \frac{p(x, y)}{p(x)\, p(y)} \right) \qquad (3.2)

p(x, y) is the probability of both terms appearing in the same query, and p(x) and
p(y) are the probabilities of each term appearing in a query on its own. Query logs
are used to calculate the probabilities. Mutual information is limited to two-word
usage [20]. For our purpose, we need to be able to consider single entities as a possible
segment as opposed to the two-word minimum. While we cannot solely use mutual
information, we use it to supplement our segmentation scoring.

3.1.2 Noun Compound Bracketing

Noun Compound (NC) bracketing has generally been a research topic in natural
language processing rather than in query segmentation for information retrieval. Bergsma
and Wang noted the similarity between query segmentation and noun compound
bracketing [4]. NC bracketing determines the structure of the NC’s binary parse tree
rather than just grouping phrases together. The primary problem in NC bracketing is
determining if the noun compound is left bracketed or right bracketed. For example,
the phrase “personal data policy” has a left bracketing of “[[personal data] policy]” or
a right bracketing of “[personal [data policy]].” The typical approach is to compare
the bigram occurrence of each bracketing: “personal data” vs “data policy.”

Lapata and Keller’s work in [12] showed using n-gram statistics on large corpora
training sets is competitive with many supervised learning algorithms. However, their
solution for noun compound bracketing using a large training set did not outperform
complex supervised learning models trained on smaller corpora. Nakov and Hearst [17]
took the idea of using statistics from a very large corpus further by using statistics
from a web search engine rather than a text corpus. They applied a variation of
Lapata and Keller’s work on noun compound bracketing to produce higher accuracy
than the supervised learning models based on smaller corpora.

3.1.3 Machine Learning Approaches

A supervised learning method is a more traditional and straightforward approach
for query segmentation. Bergsma and Wang use NLP features such as POS tags and
basic occurrence statistics. In mutual information, the terms p(x, y), p(x) and p(y)
are all tied into one calculation. Instead, Bergsma and Wang use these basic features
to allow the learning algorithm to adjust the weight of each raw feature rather than
grouping all three features p(x, y), p(x), p(y) into one calculation.

Tan and Peng [20] take a different approach from previous work by using an
unsupervised learning method. Their system consists of three components: a generative
language model, an expectation-maximization (EM) step, and the use of Wikipedia as
a source for inferring concepts. Their use of the EM algorithm is unique in that it only
runs on a sub-corpus level rather than the entire corpus.

Zhang et al.’s work on unsupervised query segmentation reported an improved score
over Tan and Peng’s work [5]. Their system operates in an eigenspace¹ to compute
likely word co-occurrences.

Tan and Peng’s rationale for using an unsupervised learning method over a supervised
method is that manually annotating a large dataset is too costly. We do not run into
the annotation problem and can use a supervised approach to make use of our labeled
data.

Our work is most similar in goal to Li et al.’s approach in [13]. Li et al. approached the
problem of tagging a query term with a pre-defined category using conditional
random fields. They applied their algorithms to e-commerce data, which may exhibit
different properties than our NSF grants dataset. Their phrase matching is limited to
unigram and/or bigram indexing. This approach is restrictive in our domain because
terms can be longer, such as “university of washington.” Their work did not compare
their techniques to more general supervised learning algorithms like the Naive Bayes
classifier. In the spirit of Occam’s Razor, our approach aims to build a simpler
model to solve a similar problem.
¹ A method similar to PCA, Latent Semantic Indexing has been noted to outperform TFIDF
in a traditional IR problem, under the theory that noise reduction is the reason for its better
performance [18].
3.1.4 External Information

A common theme among the work in query segmentation is the use of external
information. Tan and Peng’s work [20] uses Wikipedia as the features for unsupervised
learning. Wikipedia is frequently updated, and using new snapshots means incorporating
the latest trends in the English language for query segmentation. Their work
incorporated a probability score adjustment by counting the number of Wikipedia
titles and links which contain the candidate query segment. Zhang et al.’s work uses
Google to count the number of occurrences of a particular query segment candidate
as a feature in their query segmentation system.

Information from the Internet or any general-purpose data source is useful, but it can
come at the expense of noise. For example, “CSE” in the domain language
of NSF grants most likely means the NSF Directorate “Computer & Information
Science and Engineering.” In the general corpus of the web, statistics for
the term “CSE” will also match “council of science editors” or Google’s “Custom Search
Engine.” To avoid noisy data, we restrict our evaluation to our NSF grants dataset.

3.2 Implementation

In the previous chapter, we constructed our Naive Bayes classifier without regard to
query segmentation. We assumed each provided test term consisted of only one facet.
Incorporating query segmentation will support queries based on multiple facets.

While many of the related works reported high accuracy in their experiments, they
address query segmentation alone. Our implementation needs to work
in conjunction with facet classification. To use the algorithms mentioned in the related
work, we would need to create a two-pass algorithm: one for query segmentation and
one for facet classification. Our unified approach instead uses the classification score
as the feature score for the query segmentation.

Our structured data gives us the option to avoid n-gram indexing of the full text
data. Since n-gram indexing grows quickly with n, avoiding it reduces indexing time
and memory requirements. The structured data already gives us the necessary phrases
without the noise introduced by n-gram indexing; with bigram indexing of the phrase
“...grants at University of Washington...”, a noisy bigram such as “at University”
would be counted as an entry.

3.2.1 Design Parameters

There are four design parameters we need to consider in our system: the maximum
number of tokens per segmentation, ordering importance, stemming and stopwords.

If the maximum number of tokens per segmentation is high, the system can return more
precise results. But if it is too high, the number of computations increases, as the
growth is O(2^n). We do not place a restriction on the maximum number of tokens
and set the parameter to the total number of tokens.

If we do consider ordering of the terms, then we will get exact phrase matches: the
phrase query “john smith” will always search for “john smith.” If we do not consider
ordering in the algorithm, we will also match “smith, john.” Because the field values
for names exhibit this ordering problem, we do not consider ordering in our approach.
Additionally, phrase ordering is more likely a problem when the field value is large,
as in the case of the abstract text itself; however, the facets we search across all
have token lengths of less than 10, so this is not a major problem. In real-world situations,
we may want to apply our part-of-query tagging to external search systems. We have
no control over how external services index their data; such systems can store their
name fields as either “john smith” or “smith, john.” This design parameter makes
the system more flexible at the cost of some precision.

Stemming algorithms like the Porter stemmer reduce words to a base form. The
Porter stemmer achieves the reduction with a set of rules; for example, if a word
ends in ‘ing’, the stemmer removes it. This is useful when we consider the two
words ‘run’ and ‘running’: a query for either word would match both words. We
choose not to stem because the words in our field values are more entity-based.

Another consideration is the usage of stopwords. Stopwords are a list of commonly
used words like ‘the’, ‘a’ and ‘of.’ In a keyword search, search engines remove words in
a stopword list to increase precision. Additionally, removing stopwords increases
runtime performance: an inverted index maps a token to a document list, and if a
user searches for ‘to be or not’, it is likely that every document contains one of
these words, so the document list would be the entire document space. We choose
not to use stopword removal because retaining stopwords allows our part-of-query
tagging system to help distinguish phrases. If we had used stopword removal (and did
not consider term ordering), the ‘of’ in “university of washington” would have been
removed and “washington university” would have been an equally good match.

3.2.2 Mutual Information and Generalized Bayes Classifier

Our approach uses a generalized Bayes classifier. Given a query, we compute
all possible segmentations of the query. In each segmentation, we assign each possible
segment a facet probability score. We select the segmentation with the highest additive
score as our segmentation choice. To model a limited dependence structure between
phrases, we modified our original Naive Bayes classifier to use mutual information
if the query contains two or more tokens. The mutual information score helps
separate terms which occur in two categories. Our revised model includes the absolute
value of the point-wise mutual information:

p(d \mid c) = \prod_{i}^{n-1} p(w_i \mid c) \cdot \left| \log \frac{p(w_i, w_{i+1} \mid c)}{p(w_i \mid c)\, p(w_{i+1} \mid c)} \right| \qquad (3.3)

p(w_i, w_{i+1} | c) is the probability of the two terms appearing together in the given
category c. p(w_i | c) and p(w_{i+1} | c) are the probabilities of each term appearing in the given
category c. The point-wise mutual information of w_i and w_{i+1} measures how likely
these two words are to co-occur. If the two words are independent and not likely to co-occur,
then we have:

p(w_i, w_{i+1} \mid c) = p(w_i \mid c)\, p(w_{i+1} \mid c)

This results in the point-wise mutual information being equal to 0, which effectively
zeroes out p(d|c). This is a useful characteristic because if the two words do not
co-occur, the probability of the words being a phrase is zero as well. If the terms are
not independent, then the ratio of p(w_i, w_{i+1} | c) to p(w_i | c) p(w_{i+1} | c) is a measure of
their dependence: the higher the value, the more likely they are dependent. We
compute the point-wise mutual information of each pair of succeeding tokens. We only
use mutual information if there is more than one token in the query.

Now that we have incorporated mutual information into the scoring, our classifier
has become a generalized Bayes classifier because the mutual information scoring
removes the independence assumption made in a Naive Bayes classifier.
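
A minimal Scala sketch of how equation 3.3 could be evaluated for one candidate segment and one facet, assuming smoothed unigram and bigram probabilities per facet have been precomputed from the training fields (the parameter names and probability sources are assumptions, not the thesis implementation):

// Score a candidate segment for one facet using the MI-weighted model of equation 3.3.
// unigramProb(c)(w) and bigramProb(c)(w1, w2) are assumed to return smoothed, nonzero probabilities.
def segmentScore(words: Seq[String],
                 facet: String,
                 unigramProb: String => String => Double,
                 bigramProb: String => (String, String) => Double): Double = {
  val p = unigramProb(facet)
  if (words.size == 1) p(words.head)          // single token: fall back to the plain Naive Bayes term
  else words.sliding(2).map { case Seq(w1, w2) =>
    // |log(p(w1, w2 | c) / (p(w1 | c) p(w2 | c)))| is zero when the pair co-occurs no more than chance,
    // which zeroes out the score for treating the pair as a phrase in this facet.
    val pmi = math.log(bigramProb(facet)(w1, w2) / (p(w1) * p(w2)))
    p(w1) * math.abs(pmi)
  }.product
}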

3.2.3 Heuristics

We added a heuristic to assign all tokens occurring below a word frequency parameter
to a default class called “text.” The “text” field is composed of all the text of
a particular record in one field and is used for full-text search. We
need this catch-all field to search on terms which do not occur frequently in the
structured data fields. For example, both words in “solar energy” have a low (but nonzero)
probability in the “organization” field because a few organizations contain “solar” or “energy,”
such as “PrimeStar Solar” or “Department of Energy.” The word frequency
parameter ensures these rare terms are searched on the “text” field instead.
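
A sketch of this fallback rule is shown below; the per-facet wordFrequency lookup and the parameter names are assumptions for illustration only.

// Route a token to the catch-all "text" field when its highest count in any
// structured field is below the word frequency threshold.
def facetForToken(token: String,
                  wordFrequency: Map[String, Map[String, Int]], // facet -> term -> count
                  wfThreshold: Int): String = {
  val (bestFacet, bestCount) =
    wordFrequency.map { case (facet, counts) => facet -> counts.getOrElse(token, 0) }
                 .maxBy(_._2)
  if (bestCount < wfThreshold) "text" else bestFacet
}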

3.3 Experiment

We generated a test set of 214 queries by conjoining terms from one facet with
terms from another facet or with text from the grant abstract (see Appendix A for the
source code that generates the test set). Each generated query is guaranteed to
return at least one result because each query was constructed from a grant record’s
fields. There was no need for a human annotator because the annotation is given
by the structured data. We selected terms from the following facets: organization,
principalinvestigator, nsfdirectorate, city, state. An example generated test query
is “fuel cell university of california-irvine,” and the correct segmentation
is “text:“fuel cell” organization:“university of california-irvine”.”

It would not be valid to compare our approach with other query segmentation al-
gorithms because the goals differ. The query segmentation algorithms mentioned
in the related work focus on full-text content, while our algorithm targets
structured data. For the related query tagging work, we were not able to obtain the
datasets.

Figure 3.1: Query segmentation and classification accuracy.

We compared our Naive Bayes classifier with our generalized Bayes classifier using
mutual information to test the independence assumption in Naive Bayes. Addition-
ally, both classifiers were tested with different word frequency thresholds. We
measured the performance of our query segmentation with an accuracy test: a test
query was considered correct if the terms were both segmented and classified to the
correct facet.
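
The evaluation loop itself is straightforward. A sketch along these lines is shown below, assuming a tagQuery function that wraps our tagger and the tab-separated file produced by the Appendix A generator; both names are illustrative.

import scala.io.Source

// Each line of test.tsv is "<raw query>\t<expected field-tagged query>".
def evaluate(testFile: String, tagQuery: String => String): Double = {
  val lines = Source.fromFile(testFile).getLines().toList
  val correct = lines.count { line =>
    val Array(raw, expected) = line.split("\t")
    tagQuery(raw) == expected // exact match on both segmentation and facet labels
  }
  correct.toDouble / lines.size
}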

Method                      WF Threshold   Correct / Total   Accuracy
Naive Bayes Classifier           0             145/214        0.696
                                 1             152/214        0.710
                                 2             166/214        0.775
                                 3             166/214        0.775
                                 4             166/214        0.775
                                 5             166/214        0.775
Bayes Classifier using MI        0             163/214        0.761
                                 1             168/214        0.785
                                 2             184/214        0.859
                                 3             184/214        0.859
                                 4             184/214        0.859
                                 5             184/214        0.859

Table 3.2: Query segmentation and classification accuracy comparing the Naive Bayes
assumption and the word frequency threshold parameter.

3.4 Discussion

The results are shown in figure 3.1, with the raw numbers in table 3.2. Our experiments
tested the effect of the Naive Bayes independence assumption and of the word
frequency threshold on accuracy.

In both our base Naive Bayes classifier and our generalized classifier, the accuracy
converged at a word frequency threshold of two. The word frequency threshold
reduces the effect of infrequently occurring terms. When the word frequency
threshold is low, infrequently occurring terms are misclassified. In
one instance, the term “solar” is categorized as an “organization” because there are
existing grants with organization names such as “PrimeStar Solar.” However, a user
who searches for “solar” most likely intends the term to be searched in the text field
rather than the organization field. Other terms, such as “fuel cell” or “cancer,”
exhibit the same behavior.

Our results show that modeling dependence between terms is important for query
segmentation. The Naive Bayes classifier assumes that the terms in a phrase
are independent. This assumption is harmful when the terms are dependent, because
the independence assumption contributes each term’s score for a category individually.
Our Naive Bayes classifier mistook “stanford university” for a ‘city’ rather than an ‘or-
ganization.’ Individually, the terms “stanford” and “university” both appear in ‘city’
more often than in ‘organization,’ but they do not appear in ‘city’ within a single record. By
using mutual information to model dependence between the terms, we can classify
“stanford university” as an ‘organization’ correctly. The mutual information score for
“stanford” and “university” is 40.135 in ‘organization’ and 0.0 in ‘city’ because the
two terms do not co-occur there. Using our baseline Naive Bayes classifier, our results con-
verged at 77.5%, whereas adding mutual information to model dependence increased
the accuracy to 85.9%.

Query              Incorrect                                 Correct
san diego cancer   city:“san diego” organization:“cancer”    city:“san diego” text:“cancer”
raytheon solar     text:“raytheon” text:“solar”              organization:“raytheon” text:“solar”

Table 3.3: Incorrect segmentation/classification examples

The other incorrect results were infrequent ‘organizations’ and incorrectly labeled
grant abstract text; see table 3.3 for examples. ‘Raytheon’ is a defense
company which only had 5 grants, and these grants may not have been in the training
data. Interestingly, ‘cancer’ was classified as an organization because there are 41
organizations with the term ‘cancer’ in the organization name.

3.5 Summary

In this chapter, we provided an overview of the current work in query segmentation.
Our work is unique in that we create a simple model for our POQ tagger and apply it
to structured data. In our implementation, we decided not to use ordering importance,
Porter stemming or stopword removal in our model. Ordering importance was not
used because names can appear in either the form “First Last” or “Last,
First.” We avoid Porter stemming because we want to keep each term unaltered.
Stopword removal is commonly used to remove noisy words such as “a” or “of,” but
for short phrases, stopwords can be very important in deciding between “university
of washington” and “washington university.”

We use query segmentation to extend our previous chapter’s work in facet classi-
fication to segment on phrases. We decided to add a word frequency threshold parameter to
assign infrequently occurring terms to a general “text” field which searches
across all the fields. Our first system simply applied our Naive Bayes classifier on
every subsequence and selected the class with the maximum score. We tested the
validity of the independence assumption in our Naive Bayes classifier by comparing
the classifier to a second classifier where we added mutual information to model de-
pendence between terms. We found that the generalized Bayes classifier using mutual
information had more accurate results, 85.9% vs 77.5% for the Naive Bayes classifier.
The tests were performed on a test set which was generated from the dataset.

It is unfortunate that we cannot compare our approach’s results to the related work [13]
because of differing datasets. Their experiments used internal product data sources
and query logs from Microsoft which are not publicly accessible. The NSF grant
dataset is freely available and can be used by future part-of-query
tagging systems. We believe that our baseline part-of-query tagger using Naive Bayes
is a valid baseline system for comparison with future work.

Our next chapter will describe the implementation of our approach to build ResearchWatch.

Chapter 4

ResearchWatch: Case Study

We demonstrate the real-world applicability of our part-of-query tagger by incorpo-
rating it into a search engine for NSF research grants called ResearchWatch. We
first discuss the motivation behind the need for accessible search engines on government
data. This is followed by a technical discussion of how we incorporated our POQ
tagger into a search engine. Then, we discuss how our POQ tagger complements
the user interface. Finally, we compare search tasks qualitatively between
ResearchWatch and Research.gov, the official search engine for NSF research grants.

4.1 Motivation

The government’s push for open and transparent data, with data.gov as a central repository,
is just the first step in providing citizens with actual visibility into
the data. While the datasets are available, they are not readily accessible
or understandable to end-users. When these datasets are large, they cannot be easily
browsed with an off-the-shelf tool such as a spreadsheet program. To make the data
understandable and useful, it requires a computational backend, such as a
search engine or a database, along with an intuitive interface to service lookups. We
built ResearchWatch as a model for exploring a representative government dataset.

4.2 Search Engine Components

To avoid reinventing the wheel, we used an open source search engine package, Apache
Solr 1.4. Our task then consists of packaging our POQ tagger as a modular component
that can be easily plugged into Solr.

4.2.1 Apache Lucene

Apache Lucene is a popular search library written in Java; Solr relies on Lucene to
provide the basic search capabilities. It was built by Doug Cutting in 2000 when he
needed a search library for a search engine. It is important to know what Lucene
can and cannot do. Lucene provides developers the ability to index and query
documents. On the indexing side, Lucene transforms documents using pre-defined (or
custom) analyzers and stores them in a compact inverted index. On the query side,
Lucene provides a set of query parsers to map both simple and complex search queries
for retrieving documents from the index. Lucene also provides a set of scoring func-
tions to rank the documents.
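
As a brief illustration of these two sides of the library, the sketch below indexes one grant-like document and queries it back. It approximates the Lucene 3.x API that was current when this system was built; exact class names and signatures vary across Lucene versions, and the field names are examples from our dataset.

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field}
import org.apache.lucene.index.IndexWriter
import org.apache.lucene.queryParser.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

val dir      = new RAMDirectory()
val analyzer = new StandardAnalyzer(Version.LUCENE_30)

// Indexing side: analyze the fields and add the document to the inverted index.
val writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED)
val doc    = new Document()
doc.add(new Field("organization", "university of california-irvine",
                  Field.Store.YES, Field.Index.ANALYZED))
doc.add(new Field("text", "solar energy storage research",
                  Field.Store.YES, Field.Index.ANALYZED))
writer.addDocument(doc)
writer.close()

// Query side: parse a query against a default field and retrieve the top hits.
val searcher = new IndexSearcher(dir)
val query    = new QueryParser(Version.LUCENE_30, "text", analyzer).parse("solar")
val hits     = searcher.search(query, 10) // TopDocs with the ten best-scoring documents
println(hits.totalHits)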

Although Lucene provides many of the capabilities of a search application, it lacks
external API access. Programmers who need to index or query data with
Lucene have to create their own access layer on top of it. Apache Solr solves this
oft-repeated task by providing a robust API layer.

Figure 4.1: A high-level architecture view. A user’s browser talks to the Ruby on Rails
application layer, which queries the Apache Solr search system; an index data
importer/crawler (run by an admin) feeds the index from the NSF grant XML web service.

4.2.2 Apache Solr

Solr is a search server built on the Lucene search library. As seen in figure 4.1, the
application layer issues web-service HTTP calls to Solr, and Solr responds in
a specified format: XML, JSON, etc. Solr was created by Yonik Seeley while building
a product search engine for CNet. It has since been popularized in the computer
industry and is used by many of the top 100 websites, including Yelp, Netflix and
Zappos.

Solr provides developers with an HTTP request-processing API, caching functionality,
an administration UI, XML configuration files and common search components. In terms
of search features, one of the key components is the built-in faceting component. As
mentioned in the introduction, faceted search is a strong search UI for exploratory
search because it provides users with search and browsing capabilities at the same
time. We make use of faceted search in our search engine, which is covered in
more detail in the interface section.
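
For illustration, a faceted request against a local Solr instance can be issued with a plain HTTP GET; the host, port and field names below are assumptions for this sketch, while q, facet, facet.field and wt are standard Solr request parameters.

import java.net.URLEncoder
import scala.io.Source

// Ask a (hypothetical) local Solr instance for grants matching "climate change",
// faceting on the organization and state fields.
val q   = URLEncoder.encode("\"climate change\"", "UTF-8")
val url = "http://localhost:8983/solr/select?q=" + q +
          "&facet=true&facet.field=organization&facet.field=state&wt=json"
val response = Source.fromURL(url).mkString // JSON body with results and facet counts
println(response)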

4.3 Application Layer

Because Solr only provides a web service-like API for search results, we need to build
an application layer to provide an interface for users. Our application layer is built
using Ruby on Rails, a web framework built around the Model-View-Controller (MVC)
pattern to process user requests into viewable HTML pages. Because
the application layer is not the focus of this thesis, we refrain from discussing it in detail.

Web search applications typically keep the data in a database and use the
search engine only for indexing and querying. For the purposes of this project, we chose to
store all the fields of each grant in Solr. It is more efficient to store only the necessary
information in Solr to keep the index small, but for the sake of simplicity and fewer moving
parts, we stored all fields in Solr and avoided a separate database system.

Solr allows programmers to write custom components which can be chained to-
gether to create a custom search request handler. Existing components include
QueryComponent, FacetComponent, MoreLikeThisComponent, StatsComponent and
DebugComponent. We chose to expose our POQ tagger through a component
module called PartOfQueryComponent.

Figure 4.2: An overview of our search system. A query request from the application
layer flows through the SearchHandler, our PartOfQueryComponent (query segmentation
and Naive Bayes classification of the raw query into a reformulated query), the
QueryComponent, and finally a ResponseWriter that formats the response as XML or JSON.

We have an overview of the search system in figure 4.2. The application layer passes
Solr a query request. The SearchHandler’s role is to use the query request to build
both SolrQueryRequest and SolrQueryResponse objects. The SearchHandler will

call the prepare, process and handleResponses methods of its listed components with
SolrQueryRequest and SolrQueryResponse as its arguments. In other words, if the
SearchHandler has SearchComponent and StatsComponent as its two components,
SearchHandler will first call SearchComponent.prepare(request, response), StatsCom-
ponent.prepare(request, response) and then SearchComponent.process(request, response),
StatsComponent.process(request, response). In each of these method calls, the com-
ponents will alter the SolrQueryResponse object.

Figure 4.3: ResearchWatch front page.

In our case, our PartOfQueryComponent is first in the component list. In the
prepare stage, PartOfQueryComponent performs the query reformulation using the
query segmentation and classification algorithms discussed in the previous two chapters.
By hooking into Solr in this manner, we are able to minimally reformulate the
query and let Solr perform the remainder of the search processing operations in
the other components.

Solr allows field searches using the syntax “field:value,” and we use this format to
perform the query reformulation. For example, if the query string is “solar university
of washington,” the PartOfQueryComponent performs query segmentation
and classification to reformulate the query as “text:solar organization:“university of
washington”.” Solr then performs its remaining operations with the other search
components, including the main QueryComponent, which performs the actual search
operation.
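
A minimal sketch of such a component is shown below. The Solr 1.4 classes referenced (SearchComponent, ResponseBuilder, ModifiableSolrParams, CommonParams) are the real extension points, but the body is illustrative rather than our exact implementation, and tagQuery is a placeholder for the POQ tagger.

import org.apache.solr.common.params.{CommonParams, ModifiableSolrParams}
import org.apache.solr.handler.component.{ResponseBuilder, SearchComponent}

// Sketch of a component that rewrites the raw query string before the standard
// QueryComponent runs (targets the Solr 1.4 plugin API used in this project).
class PartOfQueryComponent extends SearchComponent {

  override def prepare(rb: ResponseBuilder): Unit = {
    val params = new ModifiableSolrParams(rb.req.getParams)
    val raw = params.get(CommonParams.Q)
    if (raw != null) {
      // tagQuery stands in for the segmentation + classification step, e.g.
      // "solar university of washington" -> text:solar organization:"university of washington"
      params.set(CommonParams.Q, tagQuery(raw))
      rb.req.setParams(params)
    }
  }

  override def process(rb: ResponseBuilder): Unit = () // all work happens in prepare

  private def tagQuery(raw: String): String = raw // placeholder for the POQ tagger

  // SolrInfoMBean metadata required by the 1.4 API.
  override def getDescription(): String = "part-of-query tagging component"
  override def getSource(): String = "ResearchWatch"
  override def getSourceId(): String = "poq"
  override def getVersion(): String = "1.0"
}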

4.4 User Interface

Our search interface starts with a user typing a query into a traditional search
box, seen in figure 4.3. On the results page, we display three main user interface (UI)
components: the faceted search navigation, the aggregated data displays and the search
results.

4.4.1 Faceted Search

One of the benefits of faceted search is allowing a user to search and browse at the
same time. This paradigm is possible when the dataset contains structured data.
The common design pattern for faceted search is a left-hand sidebar which
contains a list of facets and their values. A user can filter the result set by one or
more of these facets. See figure 1.3 for an example of faceted search. Our POQ tagger
complements faceted search by letting the user verify whether the predicted facet value
is correct. In figure 4.4(a), the POQ tagging of “stanford university” correctly
shows a nonzero count in the organization facet while the other facet values are
listed as zero. When the POQ tagger is not used, the user must select the facet value
after performing the initial keyword search, as in figure 4.4(b).

4.4.2 Aggregated Data Displays

With structured data, we get the benefit of being able to aggregate data from the
result set. However, if the query is ambiguous, the aggregated data can be misleading.
For example, if a user queries for the total amount awarded for “biodiversity stan-
ford university,” the keyword search includes award information from “Stanford,
Jack” of the “University of Montana.” A correct POQ tagging would display only aggre-
gated results for organizations named “stanford university” and the keyword
“biodiversity.” See figure 4.4 for a comparison of ResearchWatch with and without
our POQ tagging.

Figure 4.4: Search results for “biodiversity stanford” with and without part-of-query
tagging.

4.4.3 Search Results

The results are ranked according to the document’s sum of term-frequency inverse-
document-frequency (TF-IDF) scores for the search tokens. This is the standard ranking
policy used by Lucene. In a keyword search, the TF-IDF score counts all the
terms in the text field, where the text field is a concatenation of all fields into one. The
POQ tagging preprocesses the query so that each token is searched in its most likely
field. For example, a query for “solar university of washington”
would be POQ tagged as organization:“university of washington” and text:solar.
The system then counts the terms within the specified field. This is in contrast
to the keyword search, which searches for all the tokens in the text field.
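
In simplified form (ignoring Lucene’s additional length norms, boosts and coordination factor), the score of a document d for a query q is

\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = 1 + \log \frac{N}{\mathrm{df}(t) + 1}

where tf(t, d) is based on the frequency of term t in document d (Lucene’s default similarity uses its square root), N is the number of documents in the index, and df(t) is the number of documents containing t. With POQ tagging, tf and df are taken over the predicted field rather than over the catch-all text field.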

4.5 Qualitative Comparison With Research.gov

We perform a qualitative comparison with Research.gov, the official government
search for NSF grants. A qualitative comparison is more appropriate than a quanti-
tative experiment because we seek to determine the extent to which each search system
can perform a selected task rather than perform a full HCI study. We generated a
list of search tasks with the input of faculty members from UC Irvine.

Each task description is followed by an analysis of how to complete the task in both
systems:

List the grants related to biodiversity at Stanford University.

Research.gov: The main search interface is a field search, seen in figure 4.5. We
first tried a search query of “biodiversity stanford university” in the “Awardee or Award
Information” field but received zero results. We also tried “biodiversity stanford”
but got zero results. Strangely, we received 2000 results each for “stanford
university” and “biodiversity” as separate queries, but zero together.

Figure 4.5: Research.gov’s main search interface, a field search.

ResearchWatch: We used the query “stanford university biodiversity” and received
14 grants in the results. The POQ tagger recognized “stanford university”
as an organization and selected it as the facet value; “biodiversity” was applied to
the keyword search because no facet matched this term.

A list of professors at the University of California-Irvine who received NSF grants during
a given time period, sortable by the amount they received.

Research.gov: We used the query “University of California-Irvine” and selected the
time period of year 2009. We were unable to perform a group sort by amount, but
we were able to sort by individual grant amount. Each professor/PI is listed
with the grant, and a user can tally and group manually, but this is tedious when
the list of grants is large.

ResearchWatch: We used the query “University of California-Irvine 2009.” The POQ
tagger recognized “University of California-Irvine” and used “2009” as a
keyword search. The POQ tagger is limited here in that it can only process 2009 as
text rather than include it in a date range query. The faceted search interface allows
a user to instantly see the list of professors/PIs with the amount of grants received.
The listing is not sorted by amount received, but it would be possible to implement such a
feature at the application layer.

How much money has been spent, on a yearly basis from 2000 to 2009, on research
that included the term “climate change.”

Research.gov: We used the query “climate change.” We received 2000 results, but
it appears that 2000 is an upper limit on the total number of grants returned by Re-
search.gov’s search system. In reality, there are more grants containing the phrase
“climate change.” We confirmed this limitation in Research.gov by doing a date
search on a year range, which also returned 2000 results. Because of this limitation, a
user cannot accurately tally the total amount awarded in one query. To complete
this task, a user would have to make ten separate queries and manually tally the
grant amount awarded information.

ResearchWatch: We used the same query “climate change” and received 4369 results.
ResearchWatch is not bounded by an upper limit on the results. This enables a user
to visualize a wider set of information, such as the number of grants grouped by year
or by state. Although the graph only displays the number of grants rather than the
total amount awarded, the user can select the date facet to find the amount awarded
each year. Our POQ tagger had no effect on this query because the query was not
tagged to any facet.

4.6 Conclusion

The larger goal of ResearchWatch is to facilitate information discovery on the NSF
research grants dataset by improving both the relevancy of the results and the
user interface. Our POQ tagger improves the search interface and helps users find
information more easily by integrating the text input box with faceted search
and aggregated data displays. These UI improvements helped us complete
the hypothetical search tasks in our comparison of ResearchWatch to Re-
search.gov. In reviewing the tasks, we noticed that ResearchWatch is limited by data
type support. Our POQ tagger works only with string data types and is not set up to
work with date or numeric types for range queries. Solr does support range queries
using syntax such as “startdate:[* TO TODAY]”, but it would be more natural to
support part-of-query tagging for dates and numeric fields. The other lim-
itations of ResearchWatch are missing features, such as selectable group-by fields;
these can be implemented at the application layer and are not affected by
the POQ tagger. To conclude, we propose that Research.gov incorporate some of the
useful methods in ResearchWatch, such as our POQ tagger or faceted search, as they can
greatly help scientists find information in the NSF research grants.

Chapter 5

Conclusion

Arguably the most popular form of a search interface is the single text box. Typi-
cally, the text box is used to perform a keyword search, but a keyword search lacks
precision. This thesis proposed a part-of-query (POQ) tagger which segments a query
and classifies the search terms to a facet (also known as a metadata field). Previ-
ous work in query segmentation has mainly looked at the problem from a keyword
search perspective. We compared two POQ tagger implementations: one based on
the Naive Bayes classifier and a generalized Bayes classifier utilizing mutual infor-
mation to model dependence. The Naive Bayes classifier performed worse because
of its independence assumption over the tokens in a search query. The POQ tagger
using the generalized Bayes classifier achieved a reasonable accuracy of 85.9% on the
generated test set. We created ResearchWatch to implement the POQ tagger in a
search engine for NSF grants, but our POQ system can be applied to any dataset
with structured data. Our implementation of ResearchWatch was directly compared
to the official research grant search site, Research.gov, and we showed that our POQ
tagger helped us complete search tasks. This thesis can be a foundation for future
studies in POQ tagging and its related topics, including “named entity recognition in
query” (NERQ) [21] and “semantic tagging” [14].

5.1 Future Work

We hope that our study of part-of-query tagging for search queries inspires more
activity in both research and commercial applications.

In commercial settings, the tagger system can work particularly well in mobile appli-
cations, especially when paired with a speech recognition package. Imagine a workflow
where the speech recognition package translates the query into text and the POQ
tagger parses the text to infer the user’s search intent. This can be particularly
useful in the automobile setting. For instance, if a user is driving to San Diego, she
can speak “tacos san diego” and have a Yelp-like application process this query into
a search query of “restaurant name:“tacos” city:“san diego””.

A key limitation of our POQ tagger was its compatibility with only textual data.
Date and numeric values were treated as text. How can we utilize dates or numeric
fields? Are regular expression matching rules sufficient?

Our system is not robust against fuzzy matches such as spelling mistakes and synonym
matches. In ResearchWatch, it is very likely a user would search for the synonym
“uci” rather than the full term “university of california irvine.” How could a system
incorporate synonyms or spelling mistakes? A preliminary investigation revealed
that domain-specific knowledge is important: if a system were to use features from
the web, a general Google or Wikipedia search for “uci” shows results for numerous
organizations besides the university, such as the “International Cycling Union”.
Additionally, how would the term scoring be computed with likely synonyms?

A future study of the POQ tagger on other datasets, perhaps a larger dataset with
more facets such as an e-commerce product database, could show how the accuracy
scales with different types of data. It would also be important to compare our
work with similar work such as [13] using a common dataset. We think the
NSF research grant dataset is a viable baseline dataset, as it is readily available.

We would also like to see how a POQ tagging system can be effectively generalized to
the entire web rather than a single dataset. Can semantic web technologies (RDF, etc.)
help? Can structured data resources such as DBpedia or Freebase be used?

Bibliography

[1] James Allan and Hema Raghavan. Using part-of-speech patterns to reduce query
ambiguity. In SIGIR ’02: Proceedings of the 25th annual international ACM
SIGIR conference on Research and development in information retrieval, pages
307–314, New York, NY, USA, 2002. ACM.

[2] Cory Barr, Rosie Jones, and Moira Regelson. The linguistic structure of english
web-search queries. In EMNLP ’08: Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing, pages 1021–1030, Morristown, NJ,
USA, 2008. Association for Computational Linguistics.

[3] Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David Grossman, and
Ophir Frieder. Hourly analysis of a very large topically categorized web query
log. In In SIGIR 04: Proceedings of the 27th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 321–
328. ACM Press, 2004.

[4] Shane Bergsma and Qin Iris Wang. Learning noun phrase query segmentation.
In EMNLP-CoNLL 2007, 2007.

[5] Chao Zhang, Nan Sun, Xia Hu, Tingzhu Huang, and Tat-Seng Chua. Query seg-
mentation based on eigenspace similarity. In Proceedings of the 47th Annual Meeting
of the Association for Computational Linguistics and the 4th International Joint
Conference on Natural Language Processing of the Asian Federation of Natural
Language Processing (ACL-IJCNLP), 2009.

[6] Fahim Muhammad Hasan, Naushad UzZaman, and Mumit Khan. Comparison of dif-
ferent POS tagging techniques (n-grams, HMM and Brill’s tagger) for Bangla. In
International Conference on Systems, Computing Sciences and Software Engi-
neering (SCS2 06) of International Joint Conferences on Computer, Informa-
tion, and Systems Sciences, and Engineering (CIS2E 06), 2006.

[7] Christiane Fellbaum, editor. WordNet An Electronic Lexical Database. The MIT
Press, Cambridge, MA ; London, May 1998.

[8] Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. Named entity recognition
in query. In SIGIR ’09: Proceedings of the 32nd international ACM SIGIR
conference on Research and development in information retrieval, pages 267–
274, New York, NY, USA, 2009. ACM.

[9] K. M. Risvik, T. Mikolajewski, and P. Boros. Query segmentation for web search.
In The Twelfth International World Wide Web Conference, 2003.

[10] Bill Kules, Robert Capra, Matthew Banta, and Tito Sierra. What do exploratory
searchers look at in a faceted search interface? In JCDL ’09: Proceedings of the
9th ACM/IEEE-CS joint conference on Digital libraries, pages 313–322, New
York, NY, USA, 2009. ACM.

[11] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random
fields: Probabilistic models for segmenting and labeling sequence data, 2001.

[12] Mirella Lapata and Frank Keller. Web-based models for natural language pro-
cessing. ACM Trans. Speech Lang. Process., 2(1):3, 2005.

[13] Xiao Li, Ye-Yi Wang, and Alex Acero. Extracting structured information from
user queries with semi-supervised conditional random fields. In SIGIR ’09: Pro-
ceedings of the 32nd international ACM SIGIR conference on Research and de-
velopment in information retrieval, pages 572–579, New York, NY, USA, 2009.
ACM.

[14] Mehdi Manshadi and Xiao Li. Semantic tagging of web search queries. In ACL-
IJCNLP ’09: Proceedings of the Joint Conference of the 47th Annual Meeting
of the ACL and the 4th International Joint Conference on Natural Language
Processing of the AFNLP: Volume 2, pages 861–869, Morristown, NJ, USA,
2009. Association for Computational Linguistics.

[15] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building
a large annotated corpus of english: the penn treebank. Comput. Linguist.,
19(2):313–330, 1993.

[16] Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum en-
tropy markov models for information extraction and segmentation. In ICML ’00:
Proceedings of the Seventeenth International Conference on Machine Learning,
pages 591–598, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers
Inc.

[17] Preslav Nakov and Marti Hearst. Search engine statistics beyond the n-gram:
application to noun compound bracketing. In CONLL ’05: Proceedings of the
Ninth Conference on Computational Natural Language Learning, pages 17–24,
Morristown, NJ, USA, 2005. Association for Computational Linguistics.

[18] Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh
Vempala. Latent semantic indexing: a probabilistic analysis. In PODS ’98:
Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on
Principles of database systems, pages 159–168, New York, NY, USA, 1998. ACM.

[19] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named
entity recognition. In CoNLL ’09: Proceedings of the Thirteenth Conference
on Computational Natural Language Learning, pages 147–155, Morristown, NJ,
USA, 2009. Association for Computational Linguistics.

[20] Bin Tan and Fuchun Peng. Unsupervised query segmentation using generative
language models and wikipedia. In WWW ’08: Proceeding of the 17th inter-
national conference on World Wide Web, pages 347–356, New York, NY, USA,
2008. ACM.

[21] Xiaoxin Yin and Sarthak Shah. Building taxonomy of web search intents for name
entity queries. In 19th International World Wide Web Conference, 2010.

[22] H. Zhang. The optimality of naive bayes. In Proceedings of the 17th International
FLAIRS conference. AAAI Press, 2004.

Appendices

A Appendix A

A.1 Test Query Generation

We used the code below to generate the test queries used to evaluate our POQ tagger.

def writeTestList {
  val organizationFieldName = "organization_ft"
  val organizationCityFieldName = "organizationcity_ft"
  val nsfdirectorateFieldName = "nsfdirectorate_ft"
  val stateFieldName = "state_ft"
  val principalinvestigatorFieldName = "principalinvestigator_ft"
  val keywordsFieldName = "text"

  val organization = List("university of washington", "university of california-irvine",
    "columbia university", "Massachusetts Institute of Technology", "Stanford University", "raytheon")
  val organizationcity = List("san diego", "new york", "austin", "boston")
  val nsfdirectorate = List("mps", "eng", "bio", "cse")
  val principalinvestigator = List("william smith", "william tomlinson")
  val state = List("ca", "ny", "wa", "ma", "tx")
  val keywords = List("solar", "fuel cell", "wind", "biodiesel", "battery", "cloning",
    "aids", "cancer", "map reduce", "open source")

  val l1 = for (x <- organization; y <- nsfdirectorate)
    yield List(x -> organizationFieldName, y -> nsfdirectorateFieldName)
  val l2 = for (x <- nsfdirectorate; y <- organizationcity)
    yield List(x -> nsfdirectorateFieldName, y -> stateFieldName)
  val l3 = for (x <- organizationcity; y <- keywords)
    yield List(x -> organizationCityFieldName, y -> keywordsFieldName)
  val l4 = for (x <- organization; y <- keywords)
    yield List(x -> organizationFieldName, y -> keywordsFieldName)
  val l5 = for (x <- principalinvestigator; y <- keywords)
    yield List(x -> principalinvestigatorFieldName, y -> keywordsFieldName)
  val l6 = for (x <- nsfdirectorate; y <- state)
    yield List(x -> nsfdirectorateFieldName, y -> stateFieldName)
  val l7 = for (x <- state; y <- keywords)
    yield List(x -> stateFieldName, y -> keywordsFieldName)

  val testList = l1 ::: l3 ::: l4 ::: l5 ::: l6 ::: l7

  val answerKey = testList.map {
    query => {
      (query.map(y => y._1).mkString(" "),
       query.map(y => "%s:\"%s\"".format(y._2, y._1)).mkString(" "))
    }
  }

  val filename = "src/main/resources/querysegmentTestData/test.tsv"

  write(filename, answerKey.map(q => q._1 + "\t" + q._2).mkString("\n"))
}

