White Paper
Version 2.6.3
November 2014
2014 Search Technologies
1 Summary
The number one complaint about search engines is that they are not accurate.
Customers complain to us that their engine brings back irrelevant, old, or even
bizarre off-the-wall documents in response to their queries.
This problem is often compounded by the secretive nature of search engines and search engine
companies. Relevancy ranking algorithms are often veiled in secrecy (described variously as
"intellectual property" or, more quaintly, as "the secret sauce"). And even when algorithms are open to
the public (for Open Source search engines, for example), the algorithms are often so complex and
convoluted that they defy simple understanding or analysis.
Poor search accuracy carries real business costs:
- For Corporate Wide Search: Wasted employee time. Missed opportunities for new business. Re-work, mistakes, and re-inventing the wheel due to lack of knowledge sharing. Wasted investment in corporate search when minimum user requirements for accuracy are not met.
- For e-commerce Search: Lower conversion rates. Higher abandonment rates. Missed opportunities. Lower sales revenue. Loss of mobile revenue.
- For Publishing: Unsatisfied customers. Less value to the customer. Lower renewal rates. Fewer subscriptions. Loss of subscription revenue.
- For Government: Unmet mission objectives. Less public engagement. Lower search and download activity. More difficulty justifying one's mission. Incomplete intelligence analysis. Missed threats from foreign agents. Missed opportunities for mission advancement.
- For Recruiting: Lower fill rate. Lower margins or spread. Unhappy candidates assigned to inappropriate jobs. Loss of candidate and hiring manager goodwill. Loss of revenue.
The difficult truth is that all search engines require tuning, and all content requires processing. The
more tuning and the more processing you do, the better your search results will be.
Search engines are designed and implemented by software programmers to handle standard use
cases such as news stories and web pages. They are not delivered out-of-the-box to handle the wide
variety and complexity of content and use cases found around the world. They need to be
tuned, and part of that tuning is to process the content so that it is easily interpretable.
And so there is no easy fix, no silver bullet, and no substitute for a little bit of elbow grease (a.k.a.
hard work) when it comes to creating satisfying search results.
2. Create a snapshot of the log files and search engine index for testing.
...
8. Perform A/B testing to validate improvements and calculate Return on Investment (ROI).
These steps have been successfully implemented by Search Technologies and are known to provide
reliable, methodical, measurable accuracy improvements.
(Visual perception) Is it easy to see that the results contain good documents?
With many competing goals for search, it is easy to get lost when trying to figure out what is
important and what should be fixed (and why).
Most search accuracy metrics are from a query perspective. They ask questions like: What queries
worked? What are the most frequently executed queries? What queries returned zero results? How
do I improve my queries? etc.
In contrast, this paper presents a user-focused approach to search accuracy.
In this paper, we are interested in the central question: is the user satisfied? We attempt to
answer this question by analyzing user activity to see if the search engine is providing results which
the user has found worthy of further activity.
The user-centered approach is a powerful approach with some subtle consequences. For example, if
a query is executed by 10 different users, then that query will be analyzed 10 times, once from each
user's point of view (analyzing the activity stream for each user individually).
Further, this approach is more accurate because it provides scores which are automatically normalized
to the user population. A very small number of highly interactive users will not adversely skew the
score, unlike with query-based approaches.
And finally, user-based scoring provides more detailed information for use in system analysis. It
scores every user and every user's query (as well as the system as a whole). It identifies the least
effective queries and the most effective queries, and traces every query back to the user session so
the query can be viewed in context.
The user-based approach can answer all of these questions. It can aggregate
all customer activity and leverage that activity to determine the success (or failure) of every
query for every user, from each customer's unique point of view.
Search Technologies has shown that this approach can dramatically improve conversion rates for
e-commerce sites. In one implementation, we produced a 7.5% increase in conversion rate for
products purchased from search results on one site, and a 3% increase overall, validated by A/B
testing.
This white paper covers several techniques for measuring and improving accuracy, including:
- Continuous improvement
- Manual scoring
- A/B testing
Note that not all techniques will be appropriate for all situations. Some techniques are appropriate
for production systems with sufficient log information to use for analysis while others are better for
evaluating brand-new systems. Some techniques are for on-line analysis and others are for off-line
analysis.
Search Technologies recommends using this white paper as a reference guide. We have architects
and data scientists who can help you determine which methods and processes are best suited to
your situation. Once you have decided on a plan, we have lead engineers, senior developers, and
experienced project managers who can ensure that the plan is delivered efficiently, on-time, and
with the best possible outcome.
2 Problem Description
What often happens with fuzzy algorithms like search engine relevancy is that some
improvements make relevancy better (more accurate) for some queries and worse
(less accurate) for other queries.
Only by evaluating a statistically valid sample set can one know if the algorithm is better in the
aggregate.
- For each new release of the search engine, the accuracy of the overall system must be measured to see if it has improved (or worsened).
  o Simple bugs can easily cause dramatic degradation in the overall quality of the system.
  o Therefore, it is much too easy for a new bug to slip into production unnoticed.
- The relative benefit of each parameter of the search relevancy algorithm must be measured individually. For example:
  o How much will other query adjustments (weighting by document type, exact phrase weighting, etc.) help the relevancy?
- A data-directed method for determining what types of queries are performing poorly, and what fixes are most likely to improve accuracy, needs to be implemented.
  o Today, this information on what queries to fix typically comes from anecdotal end-user requests and evaluations.
1. The date and time (down to the second) when the search was executed.
2. A unique ID for the search.
   a. This is optional, but highly desired to connect searches with click-log events.
   b. If the user executes the same search twice in a row, each search should have a unique ID.
3.
4.
5.
6.
7.
8. The number of documents found by the search engine in response to the search.
9. The start row of the results requested; a number > 0 indicates the user was asking for page 2-n of the results.
10. The number of rows requested (e.g. the page size of the results).
11. The URL from where the search was executed (like the Web log referer field).
12. A code for the type of search.
    a. Should indicate if the user clicked on a facet value, an advanced search, or a simple search.
13. A list of filters (advanced search or facets) turned on for the search.
14. The time it took for the search to execute (milliseconds).
Notes:
Some of these parameters may be parsed from the URL used to submit the search.
Test searches should be removed (such as searches on the word "test" or "john smith").
Searches from test accounts should be removed (see auditing logs, section 3.5 below).
Searches executed automatically behind the scenes when the user clicks on a link
should be removed.
Other sorts of administrative searches (for example, to download a list of all documents for
index auditing) should be removed.
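To make the field list above concrete, here is a minimal sketch of a single search-log record captured as a JSON line. The field names and format are illustrative assumptions, not a schema prescribed by this paper, and only the fields enumerated above are shown.

```python
import json
from datetime import datetime, timezone

# Illustrative search-log record; field names are assumptions, not a prescribed schema.
search_event = {
    "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),  # date/time to the second
    "search_id": "q-000123",          # unique ID, even for repeated identical searches
    "num_found": 1742,                # documents found by the engine
    "start_row": 0,                   # > 0 means the user requested page 2..n
    "rows_requested": 10,             # page size
    "referer_url": "https://example.com/products",   # where the search was executed
    "search_type": "facet",           # e.g. "simple", "advanced", or "facet"
    "filters": ["brand:Acme", "color:blue"],          # advanced-search or facet filters
    "elapsed_ms": 87,                 # time the search took to execute
}

print(json.dumps(search_event))       # one JSON line per search event
```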
1. The time stamp (date and time to the second) when the search result was clicked.
2. The document ID (from the index) or URL of the search result selected.
3. The unique search ID (see previous section, 3.1) of the search which provided the search result.
4.
5. The position within the search results of the document, where 0 = the first document returned by the search engine.
In order to do this, search results must be wrapped (they cannot be bare URLs), so that when clicked,
the event is captured in the search engine logs.
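One way to implement this wrapping (a sketch; the /click endpoint and parameter names are hypothetical) is to route every result through a tracking URL that logs the click event and then redirects to the real document:

```python
from urllib.parse import urlencode

# Hypothetical click-tracking redirect: endpoint and parameter names are illustrative.
def wrap_result_url(doc_url: str, search_id: str, position: int) -> str:
    """Build a tracking URL that logs the click before redirecting to doc_url."""
    params = {
        "url": doc_url,        # the real document URL
        "sid": search_id,      # unique search ID from the query log (section 3.1)
        "pos": position,       # 0 = first result returned by the engine
    }
    return "/click?" + urlencode(params)

# Example: the third result (position 2) of search q-000123
print(wrap_result_url("https://example.com/docs/widget-manual.pdf", "q-000123", 2))
```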
Further, e-commerce and informational sites can identify users (especially when they have logged in),
along with their population groups and interaction history, using cookies or back-end server data.
Gathering additional information about users is optional, and can be deferred to a future
implementation when the basics of search engine accuracy have been improved.
- Log file entries which are not queries or user clicks at all (HEAD requests, status requests, etc.)
- Paging requests: simply clicking on page 2 for a query should not increase the total count for that query or the words it contains.
- Canned queries automatically executed (typically through the APIs) to provide summary information for special use cases
  o Where a click on the [SEARCH] button actually fires off multiple queries behind the scenes
  o In other words, queries where the query expression is fixed and executed behind the scenes
- On public sites, there may be spam such as random queries, fake comment text containing URLs, attempts to influence the search results or suggestions, or even targeted probes.
Log cleanup is a substantial task by itself and requires some investigative work on how the internal
search system works. It must be treated carefully.
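As a rough illustration of this kind of cleanup, the sketch below drops a few categories of records described above. The record fields, test-account names, and test queries are assumptions made for the example, not values taken from any real system.

```python
# Illustrative cleanup pass over parsed log records (dicts like the search_event above).
TEST_ACCOUNTS = {"qa_user", "search_admin"}           # assumed names of test accounts
TEST_QUERIES = {"test", "john smith"}                 # known test searches

raw_records = [
    {"user": "alice", "query": "long sleeved shirt", "start_row": 0, "search_type": "simple"},
    {"user": "qa_user", "query": "test", "start_row": 0, "search_type": "simple"},
    {"user": "alice", "query": "long sleeved shirt", "start_row": 10, "search_type": "simple"},
]

def keep_record(rec: dict) -> bool:
    if rec.get("http_method") in ("HEAD", "OPTIONS"):         # not a query or click at all
        return False
    if rec.get("user") in TEST_ACCOUNTS:                      # searches from test accounts
        return False
    if rec.get("query", "").strip().lower() in TEST_QUERIES:  # test searches
        return False
    if rec.get("start_row", 0) > 0:                           # paging request, not a new query
        return False
    if rec.get("search_type") == "canned":                    # canned / behind-the-scenes queries
        return False
    return True

cleaned = [r for r in raw_records if keep_record(r)]
print(len(cleaned), "of", len(raw_records), "records kept")
```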
- The snapshot must include a complete data set and log database.
- Any time-based query parameters will need to be fixed to the time of the snapshot.
- Naturally, the snapshot will need to be periodically refreshed with the latest production data. However, when this occurs:
  o The same engine and accuracy metrics must be computed on both the old and new snapshots.
  o This way, score changes based merely on differences in the document database and recent user activity can be determined and factored into future analysis.
Similarly, when queries are scored, these scores use data derived from the user who executed the
query. If a query is executed by multiple users, then that query is judged multiple times, from each
user's individual perspective.
This provides useful, user-based information on queries:
- What are the queries which satisfy the highest percentage of users?
- What are the queries which satisfy the smallest percentage of users?
- What are the most and least satisfying queries to the users who execute them?
Did the user hover over to view details of the product or a document preview?
All of these signals indicate that the user found the document (or product) sufficiently worthy of
some additional investigative action on their part. It is these documents that we consider to be
relevant to the user.
It is the assumption of this model that users will execute multiple queries to find the document or
product which they ultimately want. They may start with "shirt", then move to "long sleeved shirt",
then to "long sleeved dress shirt", and so on. Or their first query may be misspelled. Or it may be
ambiguous. Another way of saying this is that a relevant document found by query #3 is also relevant
if it was returned by query #1.
In all of these examples, it is desirable for the search engine to short circuit the process and bring
back relevant documents earlier in the query sequence. If this can be achieved, then the user gets to
their document or product much faster and is therefore more satisfied.
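To make this concrete, here is a small sketch of propagating relevance back through a user's query chain: any document the user eventually acted on is treated as relevant for every earlier query in the session that also returned it. The data structures and document IDs are illustrative, not a prescribed format.

```python
# One user session: ordered list of (query_text, ranked results);
# acted_on holds documents the user ultimately clicked, previewed, or purchased.
queries = [
    ("shirt",                    ["doc9", "doc4", "doc7"]),
    ("long sleeved shirt",       ["doc4", "doc2", "doc1"]),
    ("long sleeved dress shirt", ["doc1", "doc5"]),
]
acted_on = {"doc1"}

# A document acted on later in the chain counts as relevant for any earlier
# query that also returned it -- ideally the engine surfaces it sooner.
relevant_per_query = {
    q: [d for d in results if d in acted_on]
    for q, results in queries
}
print(relevant_per_query)
# {'shirt': [], 'long sleeved shirt': ['doc1'], 'long sleeved dress shirt': ['doc1']}
```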
1.
   a.
2. Send every query to the search engine and gather the results RESULTS_SET[q].
   b.
   c.
3. Compute the total engine score = the average of all userScore[u] values for all users.
Notes:
- A random sampling of users can be used to generate the results, if computing the score across all users requires too much time or resource.
- All search team members should be removed from engine score computations, to ensure an unbiased score.
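As a hedged sketch of this computation: the per-query scoring shown below, a K-to-the-position discount applied to the first relevant result, is an illustrative assumption rather than the exact formula, but the overall shape (score each query, average per user, then average across users) follows the steps above. K is the discount factor discussed below.

```python
K = 0.75   # positional discount factor (see the K curves discussed below)

def query_score(results, relevant_docs, k=K):
    """Assumed scoring: k ** (position of the first relevant result), 0.0 if none.

    This is an illustrative choice, not the paper's exact formula; it yields 1.0
    when a relevant document is first and decays toward 0 for lower positions.
    """
    for pos, doc in enumerate(results):
        if doc in relevant_docs:
            return k ** pos
    return 0.0

def user_score(user_queries):
    """userScore[u]: average of the per-query scores for one user."""
    scores = [query_score(results, relevant) for results, relevant in user_queries]
    return sum(scores) / len(scores) if scores else 0.0

def engine_score(all_users):
    """Total engine score: average of userScore[u] over all users."""
    return sum(user_score(qs) for qs in all_users.values()) / len(all_users)

# all_users maps user -> list of (RESULTS_SET[q], documents that user found relevant)
all_users = {
    "u1": [(["d1", "d2", "d3"], {"d2"})],   # first relevant result at position 1
    "u2": [(["d4", "d5", "d6"], set())],    # no relevant result for this query
}
print(engine_score(all_users))              # (0.75 + 0.0) / 2 = 0.375
```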
[Figure: score versus result position (1 through 49) for discount factors K = 0.5, 0.667, 0.75, and 0.9, with scores ranging from 1.0 down toward 0.]
Depending on the data, scores can be as low as 0.05 (this typically means that there are other
external factors affecting the score, such as filters not being applied, etc.). A score of 0.25 is generally
thought to be very good.
Note that the score does have an upper limit, which can be computed based on the discount factor K.
1. Top 100 lowest scoring (i.e. least satisfied) users with more than 5 queries.
2. Top 100 highest scoring (i.e. most satisfied) users with more than 5 queries.
3.
4.
5.
6. Engine score per unit, for all users from each unit.
7. Engine score per location, for all users from each location.
8. Engine score per job title, for all users with the same job title.
This information is critical for determining what is working, what needs work and how the search
engine is serving various sub-sets of the user community when evaluating search engine accuracy.
6 Additional Metrics
The engine score is the most complex and the most helpful score to compute to
determine engine accuracy. However, additional metrics provide useful insights
to help determine where work needs to be applied.
- Most frequent queries: Gives an idea of hot topics in the community and what's most on the mind of the community.
- Most frequent typed queries: Information needs that are not satisfied by the queries on the auto-suggest menu.
- Most frequent query terms: Identifies individual words which may be worth improving with synonyms.
- 100 random queries, categorized: When manually grouped into use cases, gives an excellent idea of the scope of the overall user search experience.
- Most frequent spelling corrections: Verifies the spell-checker algorithm and identifies possible conflicts.
- Percent of queries which result in a click on a search result: Should increase as the search engine is improved.
- Histogram of clicks on search results, by position: Shows how often users click on the first, second, third document, etc. An abnormally large drop-off between positions may indicate a user interface problem.
- Histogram of number of clicks per query: Ideally, the search engine should return many good results, so a higher number of clicks per query is better.
- Bounce rate: Rate at which users execute one query and then leave the search site. High bounce rates indicate very unsatisfied users.
- Total number and percentage of queries with zero results: Large numbers here generally indicate that more attention should be spent on spell checking, synonyms, or phonetic searching for people's names.
- Most frequent queries with zero results: Identifies those words which may require spell checking or synonyms.
- Histogram of results per query: Gives an idea of how generic or specific user queries are. If the median is large, then providing alternative queries (i.e. query suggestions) may be appropriate.
- Top 1,000 least successful documents: Documents most frequently returned by the search engine which are never clicked.
  o Should be further categorized by time (i.e. "top hidden documents introduced this week", this month, this quarter, this half-year, this year, etc.).
  o Indicates documents which have language mismatch problems with the queries. Consider adding additional search terms to these documents.
- Documents with best and worst language coverage: These are documents with the most (or fewest) words found in queries.
  o This involves processing all documents against a dictionary made up of all query words. The goal is to identify documents which do not connect with users because there is little language in common.
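Several of these metrics fall straight out of the cleaned logs. The sketch below computes a few of them (zero-result rate, click-through rate, click-position histogram, most frequent queries) from assumed record structures; the field names are illustrative.

```python
from collections import Counter

# Cleaned log records with illustrative field names.
searches = [
    {"search_id": "q1", "query": "shirt", "num_found": 120},
    {"search_id": "q2", "query": "shrit", "num_found": 0},
    {"search_id": "q3", "query": "dress shirt", "num_found": 45},
]
clicks = [
    {"search_id": "q1", "position": 0},
    {"search_id": "q3", "position": 2},
    {"search_id": "q3", "position": 0},
]

zero_result_rate = sum(s["num_found"] == 0 for s in searches) / len(searches)
clicked_searches = {c["search_id"] for c in clicks}
click_through_rate = sum(s["search_id"] in clicked_searches for s in searches) / len(searches)
position_histogram = Counter(c["position"] for c in clicks)          # clicks by result position
most_frequent_queries = Counter(s["query"] for s in searches).most_common(10)

print(zero_result_rate, click_through_rate, position_histogram, most_frequent_queries)
```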
7 Continuous Improvement
A key goal for accuracy measurements is to produce metrics which can participate in a continuous
improvement process.
2. Gather the search engine response for all unique queries specified in the log files (or the log file sample set).
3.
4.
5.
- A QA system with all data indexed with the search engine under test
- Automated tools to produce engine scores and metrics as described above in 7.1.
Note that the scoring algorithms specified above in section 7.1 can be run off-line on the QA server.
This is a requirement for a continuous improvement cycle, since only off-line analysis will be
sufficiently agile to allow for running the dozens of iterative tests necessary to optimize relevancy.
Revision   Engine Score   % Queries with a Match   % Queries with Zero Results
rev3       0.226567992    28.84%                   33.62%
rev83      0.237346148    29.97%                   31.71%
rev103     0.241202611    30.95%                   28.01%
latest     0.251230958    32.14%                   25.83%
As you can see, the overall engine score steadily improved, while the percentage of queries with at least
one relevant document increased and the percentage of queries which returned zero results steadily
decreased.
It is this sort of analysis which engenders confidence that the process is working and is steadily
improving the accuracy of the search engine, step by step.
The customer for whom we ran this analysis next performed an A/B test on production with
live users to verify that the improved search engine performance would, in fact, lead to improved user
behavior.
This was an e-commerce site, and the result of the A/B test was a 3% improvement in
conversion rate, which equated to (roughly) a $4 million improvement in total sales that year.
Queries may need to be annotated to clarify the intent of the query for whoever is
performing the relevancy judgments.
Executes each query on the search engine and saves the results
Provides buttons to judge "relevant", "non-relevant", "relevant but old", and "unknown" for
every query
The same 200 queries are used from run to run, and so the database of what documents are relevant
for what queries can be maintained and grown over time.
To increase judger consistency, Search Technologies recommends creating a relevancy judgers'
handbook: a document which identifies how to determine whether a document is relevant to a
particular query. The handbook will help judgers decide the relevancy of documents which are
principally about the subject, versus documents which contain only an aside reference to the subject,
and similar judgment decisions. Search Technologies has written such handbooks in the past and can
help with writing a handbook appropriate to your data and user population.
8.2 Advantages
Manual relevancy judgments have a number of advantages over strictly log-based metrics:
This helps the team recognize patterns across queries in terms of why good
documents are missed and bad documents are retrieved.
The manual judgments are uncluttered by external factors which affect log data
(machine crashes, network failures, a phone call received while searching, the user
landing on the wrong site, etc.)
The last point is perhaps the most important. Log-based relevancy can only identify as relevant
those documents which are shown to users by the search engine. Naturally, this is the most important
aspect of relevancy ("Are the documents that I see relevant?").
But it ignores the second aspect of relevancy: "Did the search engine retrieve all possible relevant
documents for me?" This second factor can only be approached with a relevancy database which is
expanded over time.
- Percent relevant documents in the top 10: This is perhaps the most useful score, since it provides an easy-to-understand number on what to expect in the search results.
- Percent queries with at least one relevant document retrieved in the top 10.
- Percent total relevant retrieved in the top 100: This is recall at 100, which identifies how well this configuration of the search can respond to deeper research requirements.
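Given the maintained judgment database, these three numbers are straightforward to compute. A minimal sketch follows, assuming the judgments are stored as a query-to-relevant-documents mapping and the engine's ranked results are available per query; the sample data is invented for illustration.

```python
# judgments: query -> set of documents judged relevant (grown over time).
judgments = {
    "pension plan": {"d1", "d2", "d3"},
    "expense report": {"d7"},
}
# results: query -> ranked list of document IDs returned by the engine under test.
results = {
    "pension plan": ["d1", "d9", "d2"] + [f"x{i}" for i in range(97)],
    "expense report": [f"y{i}" for i in range(100)],
}

def pct_relevant_in_top(n=10):
    """Average fraction of the top-n results that are judged relevant."""
    return sum(len(set(results[q][:n]) & rel) / n for q, rel in judgments.items()) / len(judgments)

def pct_queries_with_hit(n=10):
    """Fraction of queries with at least one relevant document in the top n."""
    return sum(bool(set(results[q][:n]) & rel) for q, rel in judgments.items()) / len(judgments)

def recall_at(n=100):
    """Fraction of all judged-relevant documents retrieved in the top n."""
    found = sum(len(set(results[q][:n]) & rel) for q, rel in judgments.items())
    total = sum(len(rel) for rel in judgments.values())
    return found / total

print(pct_relevant_in_top(10), pct_queries_with_hit(10), recall_at(100))
```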
Is the user looking for a particular document or web site which they already know exists?
Naturally, it may be difficult to determine intent simply from the query. See section 9.1.4 for how this
process can be refined and extended.
Histogram (mean, median, min, max) of the number of queries executed in a session
Histogram (mean, median, min, max) of the number of queries executed over 3 months
This will provide a good look at how often the user population turns to search as part of their daily
work, and how often they find it to be a valuable tool, overall.
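A minimal sketch for these per-session statistics follows, assuming sessions are delimited by a 30-minute gap between a user's consecutive queries; the gap threshold and the epoch-timestamp input format are assumptions made for the example.

```python
from statistics import mean, median

SESSION_GAP_SECONDS = 30 * 60   # assumed session boundary

def queries_per_session(timestamps):
    """Count queries in each session for one user, given epoch timestamps."""
    counts, current, last = [], 0, None
    for t in sorted(timestamps):
        if last is not None and t - last > SESSION_GAP_SECONDS:
            counts.append(current)
            current = 0
        current += 1
        last = t
    if current:
        counts.append(current)
    return counts

counts = queries_per_session([0, 60, 120, 7200, 7260])   # two sessions: 3 and 2 queries
print(mean(counts), median(counts), min(counts), max(counts))
```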
- Token statistics
  o Dictionary lookups
  o Stop-word lists
  o Ticker symbols
- Cross-query comparisons
  o For example, identifying all word pairs which exist as single tokens in other queries (and vice versa)
  o This is used for analyzing query chains, to determine if one type of query often leads to another type of query.
- Statistical analysis
  o Extracting statistics from the queries is essential for using these statistics to determine query types and the impact that can be achieved by working on a class of query.
Generally, these sorts of statistics are computed using a series of ad-hoc tools and programs,
including UNIX utilities and custom software. If the query database is very large, then Big
Data and/or MapReduce algorithms may be required.
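As a small example of this kind of ad-hoc tooling (a sketch; it assumes the cleaned queries have been written one per line to a hypothetical queries.txt), the script below computes term frequencies and flags word pairs that also appear fused as single tokens elsewhere:

```python
from collections import Counter
from itertools import pairwise   # Python 3.10+

with open("queries.txt") as f:                # assumed input: one query per line
    queries = [line.strip().lower() for line in f if line.strip()]

term_counts = Counter(t for q in queries for t in q.split())
pair_counts = Counter(" ".join(p) for q in queries for p in pairwise(q.split()))

# Word pairs that also occur as single tokens in other queries (e.g. "data base" vs "database")
fused = {pair for pair in pair_counts if pair.replace(" ", "") in term_counts}

print(term_counts.most_common(20))
print(sorted(fused))
```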
Tabs
Facets
Side-bar results
Sorting
Paging
Advanced search
A thorough understanding of the usage of these features will help determine what is working and
what is not, and what should be kept and what should be abandoned.
9.3.2 Sequences
Second, click sequences can help determine if users are leveraging user interface features to their
best advantage.
The goal here is to determine if a user interface feature leads the user to information which helps
solve their problem.
In this way, we can determine if user interface features are leading to actual improvements in the
end-user's ability to find information.
- Status flags (new, analyzed, unknown, deferred, tentatively analyzed, problem, solved, good results, bad results, etc.)
- The date and time the query was entered into the database
- A grouping or pattern for the query, so that sets of similar queries can be searched, grouped, and processed together (for example, all queries that start with "I need" or "account information", etc.)
- A list of times (for trend lines) when the query was executed
- A link to the user who executed the query, so that information on an individual user (i.e. user events) can be quickly brought up and analyzed when analyzing use cases
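A lightweight way to hold such a supporting database is sketched below using SQLite; the table and column names mirror the fields above but are otherwise assumptions, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect("query_analysis.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS queries (
    query_text   TEXT PRIMARY KEY,
    status       TEXT,          -- new, analyzed, unknown, deferred, problem, solved, ...
    date_entered TEXT,          -- when the query was entered into the database
    pattern      TEXT           -- grouping, e.g. starts-with 'I need', 'account information'
);
CREATE TABLE IF NOT EXISTS query_executions (
    query_text   TEXT REFERENCES queries(query_text),
    executed_at  TEXT,          -- execution times, for trend lines
    user_id      TEXT           -- link back to the user, for use-case analysis
);
""")
conn.commit()
```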
Query patterns
Nouns / verbs
Jargon / lay-language
Synonyms
Along with each use case comes an idea of the scope of the use case and of what sorts of fixes are
required to improve user satisfaction.
10 A/B Testing
Where possible, A/B testing is recommended to validate the engine score and other
improvements, and to determine the exact relationship between engine score
improvements and other web site metrics (e.g. abandonment rates and conversion rates).
There can be no universal step-by-step process for doing A/B testing, since it will involve your production
system and production data. Therefore, it must be handled carefully and with production
considerations in mind.
The following are some broad guidelines for A/B testing.
The only way for A/B testing to be accurate is if incoming requests are randomly
assigned to A or B for a period of time.
This will ensure that the users for A or B are drawn from the exact same user
population.
This way, if the B system is severely deficient, then only a small percentage of
production users will be affected.
User interface
Multiple indexes
Ideally, of course, a system will be designed and implemented with A/B testing as a primary goal
from the very beginning. If it is, then turning on a B system for testing becomes a standard part of
system administration and testing procedures.
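As an illustration of the random-assignment guideline above (a sketch; the traffic split and the hashing scheme are assumptions, not a prescribed design), incoming requests can be assigned deterministically by hashing a stable user or session identifier:

```python
import hashlib

B_TRAFFIC_PERCENT = 10   # send only a small share of production users to the B system

def assign_bucket(user_id: str) -> str:
    """Deterministically assign a user to 'A' or 'B' from a hash of their ID."""
    h = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16)
    return "B" if (h % 100) < B_TRAFFIC_PERCENT else "A"

print(assign_bucket("session-42"), assign_bucket("session-43"))
```

Hashing a stable identifier keeps each user in the same bucket for the duration of the test, while still drawing both buckets from the same user population.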
11 Conclusions
This paper is the result of some 22 years of search engine accuracy testing, evaluation, and practical
experience. The journey started in 1992 with the first TREC conference, which several Search
Technologies employees attended. Many of the philosophies and strategies described in this paper
trace their roots back to TREC.
The journey continued throughout the 1990s, as we experimented with varying relevancy ranking
technologies and were the first (to my knowledge) to use machine learning to optimize relevancy
ranking formulae based on manual judgments.
As we enter the age of the Cloud and Big Data, we now have vastly more data available and the
machine resources required to process it. This has opened up new and exciting possibilities for
improving search not just for a select few but for everyone.
At Search Technologies, we continue to work every day to make this vision of high-quality,
optimized, tuned, targeted, engaging, and powerful search a reality for everyone. This paper is just
another step in the journey toward that ultimate goal.