White Paper
Version 2.6.3
November 2014
2014 Search Technologies
1 Summary
The number one complaint about search engines is that they are not accurate.
Customers complain to us that their engine brings back irrelevant, old, or even
bizarre off-the-wall documents in response to their queries.
This problem is often compounded by the secretive nature of search engines and search engine
companies. Relevancy ranking algorithms are often veiled in secrecy (described variously as
"intellectual property" or, more quaintly, as "the secret sauce"). And even when algorithms are open to
the public (for Open Source search engines, for example), the algorithms are often so complex and
convoluted that they defy simple understanding or analysis.
Poor search accuracy carries real business costs:
- For Corporate Wide Search: Wasted employee time. Missed opportunities for new business. Re-work, mistakes, and re-inventing the wheel due to lack of knowledge sharing. Wasted investment in corporate search when minimum user requirements for accuracy are not met.
- For e-commerce Search: Lower conversion rates. Higher abandonment rates. Missed opportunities. Lower sales revenue. Loss of mobile revenue.
- For Publishing: Unsatisfied customers. Less value to the customer. Lower renewal rates. Fewer subscriptions. Loss of subscription revenue.
- For Government: Unmet mission objectives. Less public engagement. Lower search and download activity. More difficulty justifying one's mission. Incomplete intelligence analysis. Missed threats from foreign agents. Missed opportunities for mission advancement.
- For Recruiting: Lower fill rate. Lower margins or spread. Unhappy candidates assigned to inappropriate jobs. Loss of candidate and hiring manager goodwill. Loss of revenue.
The difficult truth is that all search engines require tuning, and all content requires processing. The
more tuning and the more processing you do, the better your search results will be.
Search engines are designed and implemented by software programmers to handle standard use
cases such as news stories and web pages. They are not delivered out-of-the-box to handle the wide
variety and complexity of content and use cases found around the world. They need to be
tuned, and part of that tuning is to process the content so that it is easily interpretable.
And so there is no easy fix, no silver bullet, and no substitute for a little bit of elbow grease (a.k.a.
hard work) when it comes to creating satisfying search results.
2. Create a snapshot of the log files and search engine index for testing.
...
8. Perform A/B testing to validate improvements and calculate Return on Investment (ROI).
These steps have been successfully implemented by Search Technologies and are known to provide
reliable, methodical, measurable accuracy improvements.
(Visual perception) Is it easy to see that the results contain good documents?
With many competing goals for search, it is easy to get lost when trying to figure out what is
important and what should be fixed (and why).
Most search accuracy metrics are from a query perspective. They ask questions like: What queries
worked? What are the most frequently executed queries? What queries returned zero results? How
do I improve my queries? etc.
In contrast, this paper presents a user-focused approach to search accuracy.
In this paper, we are interested in the central question: is the user satisfied? We attempt to
answer this question by analyzing user activity to see if the search engine is providing results which
the user has found worthy of further activity.
The user-centered approach is a powerful approach with some subtle consequences. For example, if
a query is executed by 10 different users, then that query will be analyzed 10 times, once from each
user's point of view (analyzing the activity stream for each user individually).
Further, this approach is more accurate because it provides scores which are automatically normalized
to the user population. A very small number of highly interactive users will not adversely skew the
score, unlike with query-based approaches.
And finally, user-based scoring provides more detailed information for use in system analysis. It
scores every user and every user's query (as well as the system as a whole). It identifies the least
effective queries and the most effective queries, and traces every query back to the user session so
the query can be viewed in context.
The user-based approach can answer all of these questions. It can aggregate
all customer activity and leverage that activity to determine the success (or failure) of every
query for every user, from each customer's unique point of view.
Search Technologies has shown that this approach can dramatically improve conversion rates for
e-commerce sites. In one implementation, we produced a 7.5% increase in conversion rate for
products purchased from search results on one site, and a 3% increase overall, validated by A/B
testing.
This white paper covers several techniques for measuring and improving accuracy, including:
- Continuous improvement
- Manual scoring
- A/B testing
Note that not all techniques will be appropriate for all situations. Some techniques are appropriate
for production systems with sufficient log information to use for analysis while others are better for
evaluating brand-new systems. Some techniques are for on-line analysis and others are for off-line
analysis.
Search Technologies recommends using this white paper as a reference guide. We have architects
and data scientists who can help you determine which methods and processes are best suited to
your situation. Once you have decided on a plan, we have lead engineers, senior developers, and
experienced project managers who can ensure that the plan is delivered efficiently, on-time, and
with the best possible outcome.
2 Problem Description
What often happens with fuzzy algorithms like search engine relevancy is that some
improvements make relevancy better (more accurate) for some queries and worse
(less accurate) for other queries.
Only by evaluating a statistically valid sample set can one know if the algorithm is better in the
aggregate.
- For each new release of the search engine, the accuracy of the overall system must be measured to see if it has improved (or worsened).
  o Simple bugs can easily cause dramatic degradation in the overall quality of the system.
  o Therefore, it is much too easy for a new bug to slip into production unnoticed.
- The relative benefit of each parameter of the search relevancy algorithm must be measured individually. For example:
  o How much will other query adjustments (weighting by document type, exact phrase weighting, etc.) help the relevancy?
- A data-directed method for determining what types of queries are performing poorly, and what fixes are most likely to improve accuracy, needs to be implemented.
  o Today, this information on what queries to fix typically comes from anecdotal end-user requests and evaluations.
1. The date and time (down to the second) when the search was executed.
2. A unique ID for the search.
   a. This is optional, but highly desired to connect searches with click-log events.
   b. If the user executes the same search twice in a row, each search should have a unique ID.
3.
4.
5.
6.
7.
8. The number of documents found by the search engine in response to the search.
9. The start row of the results requested; a number > 0 indicates the user was asking for page 2-n of the results.
10. The number of rows requested (e.g. the page size of the results).
11. The URL from where the search was executed (like the Web log referer field).
12. A code for the type of search.
    a. Should indicate if the user clicked on a facet value, an advanced search, or a simple search.
13. A list of filters (advanced search or facets) turned on for the search.
14. The time it took for the search to execute (milliseconds).
Notes:
Some of these parameters may be parsed from the URL used to submit the search.
Test searches should be removed (such as searches on the word "test" or "john smith").
Searches from test accounts should be removed (see auditing logs, section 3.5 below).
Searches executed automatically behind the scenes when the user clicks on a link
should be removed.
Other sorts of administrative searches (for example, to download a list of all documents for
index auditing) should be removed.
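To make the field list above concrete, here is a minimal sketch of a single search-log record captured as a JSON line. The field names and format are illustrative assumptions, not a schema prescribed by this paper, and only the fields enumerated above are shown.

```python
import json
from datetime import datetime, timezone

# Illustrative search-log record; field names are assumptions, not a prescribed schema.
search_event = {
    "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),  # date/time to the second
    "search_id": "q-000123",          # unique ID, even for repeated identical searches
    "num_found": 1742,                # documents found by the engine
    "start_row": 0,                   # > 0 means the user requested page 2..n
    "rows_requested": 10,             # page size
    "referer_url": "https://example.com/products",   # where the search was executed
    "search_type": "facet",           # e.g. "simple", "advanced", or "facet"
    "filters": ["brand:Acme", "color:blue"],          # advanced-search or facet filters
    "elapsed_ms": 87,                 # time the search took to execute
}

print(json.dumps(search_event))       # one JSON line per search event
```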
1. The time stamp (date and time to the second) when the search result was clicked.
2. The document ID (from the index) or URL of the search result selected.
3. The unique search ID (see previous section, 3.1) of the search which provided the search result.
4.
5. The position within the search results of the document, where 0 = the first document returned by the search engine.
In order to do this, search results must be wrapped (they cannot be bare URLs), so that when clicked,
the event is captured in the search engine logs.
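One way to implement this wrapping (a sketch; the /click endpoint and parameter names are hypothetical) is to route every result through a tracking URL that logs the click event and then redirects to the real document:

```python
from urllib.parse import urlencode

# Hypothetical click-tracking redirect: endpoint and parameter names are illustrative.
def wrap_result_url(doc_url: str, search_id: str, position: int) -> str:
    """Build a tracking URL that logs the click before redirecting to doc_url."""
    params = {
        "url": doc_url,        # the real document URL
        "sid": search_id,      # unique search ID from the query log (section 3.1)
        "pos": position,       # 0 = first result returned by the engine
    }
    return "/click?" + urlencode(params)

# Example: the third result (position 2) of search q-000123
print(wrap_result_url("https://example.com/docs/widget-manual.pdf", "q-000123", 2))
```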
Further, e-commerce and informational sites can identify users (especially when they have logged in),
along with their population groups and interaction history, using cookies or back-end server data.
Gathering additional information about users is optional, and can be deferred to a future
implementation when the basics of search engine accuracy have been improved.
- Log file entries which are not queries or user clicks at all (HEAD requests, status requests, etc.)
- Paging requests: simply clicking on page 2 for a query should not increase the total count for that query or the words it contains.
- Canned queries automatically executed (typically through the APIs) to provide summary information for special use cases
  o Where a click on the [SEARCH] button actually fires off multiple queries behind the scenes
  o In other words, queries where the query expression is fixed and executed behind the scenes
- On public sites, there may be spam such as random queries, fake comment text containing URLs, attempts to influence the search results or suggestions, or even targeted probes.
Log cleanup is a substantial task by itself and requires some investigative work on how the internal
search system works. It must be treated carefully.
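As a rough illustration of this kind of cleanup, the sketch below drops a few categories of records described above. The record fields, test-account names, and test queries are assumptions made for the example, not values taken from any real system.

```python
# Illustrative cleanup pass over parsed log records (dicts like the search_event above).
TEST_ACCOUNTS = {"qa_user", "search_admin"}           # assumed names of test accounts
TEST_QUERIES = {"test", "john smith"}                 # known test searches

raw_records = [
    {"user": "alice", "query": "long sleeved shirt", "start_row": 0, "search_type": "simple"},
    {"user": "qa_user", "query": "test", "start_row": 0, "search_type": "simple"},
    {"user": "alice", "query": "long sleeved shirt", "start_row": 10, "search_type": "simple"},
]

def keep_record(rec: dict) -> bool:
    if rec.get("http_method") in ("HEAD", "OPTIONS"):         # not a query or click at all
        return False
    if rec.get("user") in TEST_ACCOUNTS:                      # searches from test accounts
        return False
    if rec.get("query", "").strip().lower() in TEST_QUERIES:  # test searches
        return False
    if rec.get("start_row", 0) > 0:                           # paging request, not a new query
        return False
    if rec.get("search_type") == "canned":                    # canned / behind-the-scenes queries
        return False
    return True

cleaned = [r for r in raw_records if keep_record(r)]
print(len(cleaned), "of", len(raw_records), "records kept")
```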
- The snapshot must include a complete data set and log database.
- Any time-based query parameters will need to be fixed to the time of the snapshot.
- Naturally, the snapshot will need to be periodically refreshed with the latest production data. However, when this occurs:
  o The same engine and accuracy metrics must be computed on both the old and new snapshots.
  o This way, score changes based merely on differences in the document database and recent user activity can be determined and factored into future analysis.
Similarly, when queries are scored, these scores use data derived from the user who executed the
query. If a query is executed by multiple users, then that query is judged multiple times, from each
user's individual perspective.
This provides useful, user-based information on queries:
- What are the queries which satisfy the highest percentage of users?
- What are the queries which satisfy the smallest percentage of users?
- What are the most and least satisfying queries to the users who execute them?
Did the user hover over to view details of the product or a document preview?
All of these signals indicate that the user found the document (or product) sufficiently worthy of
some additional investigative action on their part. It is these documents that we consider to be
relevant to the user.
It is the assumption of this model that users will execute multiple queries to find the document or
product which they ultimately want. They may start with "shirt", then move to "long sleeved shirt",
then to "long sleeved dress shirt", and so on. Or their first query may be misspelled. Or it may be
ambiguous. Another way of saying this is that a relevant document found by query #3 is also relevant
if it was returned by query #1.
In all of these examples, it is desirable for the search engine to short circuit the process and bring
back relevant documents earlier in the query sequence. If this can be achieved, then the user gets to
their document or product much faster and is therefore more satisfied.
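To make this concrete, here is a small sketch of propagating relevance back through a user's query chain: any document the user eventually acted on is treated as relevant for every earlier query in the session that also returned it. The data structures and document IDs are illustrative, not a prescribed format.

```python
# One user session: ordered list of (query_text, ranked results);
# acted_on holds documents the user ultimately clicked, previewed, or purchased.
queries = [
    ("shirt",                    ["doc9", "doc4", "doc7"]),
    ("long sleeved shirt",       ["doc4", "doc2", "doc1"]),
    ("long sleeved dress shirt", ["doc1", "doc5"]),
]
acted_on = {"doc1"}

# A document acted on later in the chain counts as relevant for any earlier
# query that also returned it -- ideally the engine surfaces it sooner.
relevant_per_query = {
    q: [d for d in results if d in acted_on]
    for q, results in queries
}
print(relevant_per_query)
# {'shirt': [], 'long sleeved shirt': ['doc1'], 'long sleeved dress shirt': ['doc1']}
```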
1.
   a.
2. Send every query to the search engine and gather the results RESULTS_SET[q].
   b.
   c.
3. Compute the total engine score = the average of all userScore[u] values for all users.
Notes:
- A random sampling of users can be used to generate the results, if computing the score across all users requires too much time or resource.
- All search team members should be removed from engine score computations, to ensure an unbiased score.
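As a hedged sketch of this computation: the per-query scoring shown below, a K-to-the-position discount applied to the first relevant result, is an illustrative assumption rather than the exact formula, but the overall shape (score each query, average per user, then average across users) follows the steps above. K is the discount factor discussed below.

```python
K = 0.75   # positional discount factor (see the K curves discussed below)

def query_score(results, relevant_docs, k=K):
    """Assumed scoring: k ** (position of the first relevant result), 0.0 if none.

    This is an illustrative choice, not the paper's exact formula; it yields 1.0
    when a relevant document is first and decays toward 0 for lower positions.
    """
    for pos, doc in enumerate(results):
        if doc in relevant_docs:
            return k ** pos
    return 0.0

def user_score(user_queries):
    """userScore[u]: average of the per-query scores for one user."""
    scores = [query_score(results, relevant) for results, relevant in user_queries]
    return sum(scores) / len(scores) if scores else 0.0

def engine_score(all_users):
    """Total engine score: average of userScore[u] over all users."""
    return sum(user_score(qs) for qs in all_users.values()) / len(all_users)

# all_users maps user -> list of (RESULTS_SET[q], documents that user found relevant)
all_users = {
    "u1": [(["d1", "d2", "d3"], {"d2"})],   # first relevant result at position 1
    "u2": [(["d4", "d5", "d6"], set())],    # no relevant result for this query
}
print(engine_score(all_users))              # (0.75 + 0.0) / 2 = 0.375
```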
[Figure: score versus result position (1 through 49) for discount factors K = 0.5, 0.667, 0.75, and 0.9, with scores ranging from 1.0 down toward 0.]
Depending on the data, scores can be as low as 0.05 (this typically means that there are other
external factors affecting the score, such as filters not being applied, etc.). A score of 0.25 is generally
thought to be very good.
Note that the score does have an upper limit, which can be computed based on the discount factor K.
1. Top 100 lowest scoring (i.e. least satisfied) users with more than 5 queries.
2. Top 100 highest scoring (i.e. most satisfied) users with more than 5 queries.
3.
4.
5.
6. Engine score per unit, for all users from each unit.
7. Engine score per location, for all users from each location.
8. Engine score per job title, for all users with the same job title.
This information is critical for determining what is working, what needs work and how the search
engine is serving various sub-sets of the user community when evaluating search engine accuracy.
6 Additional Metrics
The engine score is the most complex and the most helpful score to compute to
determine engine accuracy. However, additional metrics provide useful insights
to help determine where work needs to be applied.
- Most frequent queries: Gives an idea of hot topics in the community and what's most on the mind of the community.
- Most frequent typed queries: Information needs that are not satisfied by the queries on the auto-suggest menu.
- Most frequent query terms: Identifies individual words which may be worth improving with synonyms.
- 100 random queries, categorized: When manually grouped into use cases, gives an excellent idea of the scope of the overall user search experience.
- Most frequent spelling corrections: Verifies the spell-checker algorithm and identifies possible conflicts.
- Percent of queries which result in a click on a search result: Should increase as the search engine is improved.
- Histogram of clicks on search results, by position: Shows how often users click on the first, second, third document, etc. An abnormally large drop-off between positions may indicate a user interface problem.
- Histogram of number of clicks per query: Ideally, the search engine should return many good results, so a higher number of clicks per query is better.
- Bounce rate: Rate at which users execute one query and then leave the search site. High bounce rates indicate very unsatisfied users.
- Total number and percentage of queries with zero results: Large numbers here generally indicate that more attention should be spent on spell checking, synonyms, or phonetic searching for people's names.
- Most frequent queries with zero results: Identifies those words which may require spell checking or synonyms.
- Histogram of results per query: Gives an idea of how generic or specific user queries are. If the median is large, then providing alternative queries (i.e. query suggestions) may be appropriate.
- Top 1,000 least successful documents: Documents most frequently returned by the search engine which are never clicked.
  o Should be further categorized by time (i.e. "top hidden documents introduced this week", this month, this quarter, this half-year, this year, etc.).
  o Indicates documents which have language mismatch problems with the queries. Consider adding additional search terms to these documents.
- Documents with best and worst language coverage: These are documents with the most (or fewest) words found in queries.
  o This involves processing all documents against a dictionary made up of all query words. The goal is to identify documents which do not connect with users because there is little language in common.
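Several of these metrics fall straight out of the cleaned logs. The sketch below computes a few of them (zero-result rate, click-through rate, click-position histogram, most frequent queries) from assumed record structures; the field names are illustrative.

```python
from collections import Counter

# Cleaned log records with illustrative field names.
searches = [
    {"search_id": "q1", "query": "shirt", "num_found": 120},
    {"search_id": "q2", "query": "shrit", "num_found": 0},
    {"search_id": "q3", "query": "dress shirt", "num_found": 45},
]
clicks = [
    {"search_id": "q1", "position": 0},
    {"search_id": "q3", "position": 2},
    {"search_id": "q3", "position": 0},
]

zero_result_rate = sum(s["num_found"] == 0 for s in searches) / len(searches)
clicked_searches = {c["search_id"] for c in clicks}
click_through_rate = sum(s["search_id"] in clicked_searches for s in searches) / len(searches)
position_histogram = Counter(c["position"] for c in clicks)          # clicks by result position
most_frequent_queries = Counter(s["query"] for s in searches).most_common(10)

print(zero_result_rate, click_through_rate, position_histogram, most_frequent_queries)
```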
7 Continuous Improvement
A key goal for accuracy measurements is to produce metrics which can participate in a continuous
improvement process.
2. Gather the search engine response for all unique queries specified in the log files (or the log file sample set).
3.
4.
5.
- A QA system with all data indexed with the search engine under test
- Automated tools to produce engine scores and metrics as described above in 7.1.
Note that the scoring algorithms specified above in section 7.1 can be run off-line on the QA server.
This is a requirement for a continuous improvement cycle, since only off-line analysis will be
sufficiently agile to allow for running the dozens of iterative tests necessary to optimize relevancy.
Revision   Engine Score   % Queries with a Match   % Queries with Zero Results
rev3       0.226567992    28.84%                   33.62%
rev83      0.237346148    29.97%                   31.71%
rev103     0.241202611    30.95%                   28.01%
latest     0.251230958    32.14%                   25.83%
As you can see, the overall engine score steadily improved, while the percentage of queries with at least
one relevant document increased and the percentage of queries which returned zero results steadily
decreased.
It is this sort of analysis which engenders confidence that the process is working and is steadily
improving the accuracy of the search engine, step by step.
The customer for whom we ran this analysis next performed an A/B test on production with
live users to verify that the improved search engine performance would, in fact, lead to improved user
behavior.
This was an e-commerce site, and the result of the A/B test was a 3% improvement in
conversion rate, which equated to (roughly) a $4 million improvement in total sales that year.
Queries may need to be annotated to clarify the intent of the query for whoever is
performing the relevancy judgments.
Executes each query on the search engine and saves the results
Provides buttons to judge "relevant", "non-relevant", "relevant but old", and "unknown" for
every query
The same 200 queries are used from run to run, and so the database of what documents are relevant
for what queries can be maintained and grown over time.
To increase judger consistency, Search Technologies recommends creating a relevancy judgers'
handbook: a document which identifies how to determine whether a document is relevant to a
particular query. The handbook will help judgers decide the relevancy of documents which are
principally about the subject, versus documents which contain only an aside reference to the subject,
and similar judgment decisions. Search Technologies has written such handbooks in the past and can
help with writing a handbook appropriate to your data and user population.
8.2 Advantages
Manual relevancy judgments have a number of advantages over strictly log-based metrics:
This helps the team recognize patterns across queries in terms of why good
documents are missed and bad documents are retrieved.
The manual judgments are uncluttered by external factors which affect log data
(machine crashes, network failures, a phone call received while searching, the user
landing on the wrong site, etc.)
The last point is perhaps the most important. Log-based relevancy can only identify as relevant
those documents which are shown to users by the search engine. Naturally, this is the most important
aspect of relevancy ("Are the documents that I see relevant?").
But it ignores the second aspect of relevancy: "Did the search engine retrieve all possible relevant
documents for me?" This second factor can only be approached with a relevancy database which is
expanded over time.
- Percent relevant documents in the top 10: This is perhaps the most useful score, since it provides an easy-to-understand number on what to expect in the search results.
- Percent queries with at least one relevant document retrieved in the top 10.
- Percent total relevant retrieved in the top 100: This is recall at 100, which identifies how well this configuration of the search can respond to deeper research requirements.
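Given the maintained judgment database, these three numbers are straightforward to compute. A minimal sketch follows, assuming the judgments are stored as a query-to-relevant-documents mapping and the engine's ranked results are available per query; the sample data is invented for illustration.

```python
# judgments: query -> set of documents judged relevant (grown over time).
judgments = {
    "pension plan": {"d1", "d2", "d3"},
    "expense report": {"d7"},
}
# results: query -> ranked list of document IDs returned by the engine under test.
results = {
    "pension plan": ["d1", "d9", "d2"] + [f"x{i}" for i in range(97)],
    "expense report": [f"y{i}" for i in range(100)],
}

def pct_relevant_in_top(n=10):
    """Average fraction of the top-n results that are judged relevant."""
    return sum(len(set(results[q][:n]) & rel) / n for q, rel in judgments.items()) / len(judgments)

def pct_queries_with_hit(n=10):
    """Fraction of queries with at least one relevant document in the top n."""
    return sum(bool(set(results[q][:n]) & rel) for q, rel in judgments.items()) / len(judgments)

def recall_at(n=100):
    """Fraction of all judged-relevant documents retrieved in the top n."""
    found = sum(len(set(results[q][:n]) & rel) for q, rel in judgments.items())
    total = sum(len(rel) for rel in judgments.values())
    return found / total

print(pct_relevant_in_top(10), pct_queries_with_hit(10), recall_at(100))
```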
Is the user looking for a particular document or web site which they already know exists?
Naturally, it may be difficult to determine intent simply from the query. See section 9.1.4 for how this
process can be refined and extended.
Histogram (mean, median, min, max) of the number of queries executed in a session
Histogram (mean, median, min, max) of the number of queries executed over 3 months
This will provide a good look at how often the user population turns to search as part of their daily
work, and how often they find it to be a valuable tool, overall.
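A minimal sketch for these per-session statistics follows, assuming sessions are delimited by a 30-minute gap between a user's consecutive queries; the gap threshold and the epoch-timestamp input format are assumptions made for the example.

```python
from statistics import mean, median

SESSION_GAP_SECONDS = 30 * 60   # assumed session boundary

def queries_per_session(timestamps):
    """Count queries in each session for one user, given epoch timestamps."""
    counts, current, last = [], 0, None
    for t in sorted(timestamps):
        if last is not None and t - last > SESSION_GAP_SECONDS:
            counts.append(current)
            current = 0
        current += 1
        last = t
    if current:
        counts.append(current)
    return counts

counts = queries_per_session([0, 60, 120, 7200, 7260])   # two sessions: 3 and 2 queries
print(mean(counts), median(counts), min(counts), max(counts))
```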
- Token statistics
  o Dictionary lookups
  o Stop-word lists
  o Ticker symbols
- Cross-query comparisons
  o For example, identifying all word pairs which exist as single tokens in other queries (and vice versa)
  o This is used for analyzing query chains, to determine if one type of query often leads to another type of query.
- Statistical analysis
  o Extracting statistics from the queries is essential for using these statistics to determine query types and the impact that can be achieved by working on a class of query.
Generally, these sorts of statistics are computed using a series of ad-hoc tools and programs,
including UNIX utilities and custom software. If the query database is very large, then Big
Data and/or MapReduce algorithms may be required.
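As a small example of this kind of ad-hoc tooling (a sketch; it assumes the cleaned queries have been written one per line to a hypothetical queries.txt), the script below computes term frequencies and flags word pairs that also appear fused as single tokens elsewhere:

```python
from collections import Counter
from itertools import pairwise   # Python 3.10+

with open("queries.txt") as f:                # assumed input: one query per line
    queries = [line.strip().lower() for line in f if line.strip()]

term_counts = Counter(t for q in queries for t in q.split())
pair_counts = Counter(" ".join(p) for q in queries for p in pairwise(q.split()))

# Word pairs that also occur as single tokens in other queries (e.g. "data base" vs "database")
fused = {pair for pair in pair_counts if pair.replace(" ", "") in term_counts}

print(term_counts.most_common(20))
print(sorted(fused))
```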
Tabs
Facets
Side-bar results
Sorting
Paging
Advanced search
A thorough understanding of the usage of these features will help determine what is working and
what is not, and what should be kept and what should be abandoned.
9.3.2 Sequences
Second, click sequences can help determine if users are leveraging user interface features to their
best advantage.
The goal here is to determine if a user interface feature leads the user to information which helps
solve their problem.
In this way, we can determine if user interface features are leading to actual improvements in the
end-user's ability to find information.
- Status flags (new, analyzed, unknown, deferred, tentatively analyzed, problem, solved, good results, bad results, etc.)
- The date and time the query was entered into the database
- A grouping or pattern for the query, so that sets of similar queries can be searched, grouped, and processed together (for example, all queries that start with "I need" or "account information", etc.)
- A list of times (for trend lines) when the query was executed
- A link to the user who executed the query, so that information on an individual user (i.e. user events) can be quickly brought up and analyzed when analyzing use cases
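A lightweight way to hold such a supporting database is sketched below using SQLite; the table and column names mirror the fields above but are otherwise assumptions, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect("query_analysis.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS queries (
    query_text   TEXT PRIMARY KEY,
    status       TEXT,          -- new, analyzed, unknown, deferred, problem, solved, ...
    date_entered TEXT,          -- when the query was entered into the database
    pattern      TEXT           -- grouping, e.g. starts-with 'I need', 'account information'
);
CREATE TABLE IF NOT EXISTS query_executions (
    query_text   TEXT REFERENCES queries(query_text),
    executed_at  TEXT,          -- execution times, for trend lines
    user_id      TEXT           -- link back to the user, for use-case analysis
);
""")
conn.commit()
```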
Query patterns
Nouns / verbs
Jargon / lay-language
Synonyms
Along with each use case comes an idea of the scope of the use case and of what sorts of fixes are
required to improve user satisfaction.
10 A/B Testing
Where possible, A/B testing is recommended to validate the engine score and other
improvements, and to determine the exact relationship between engine score
improvements and other web site metrics (e.g. abandonment rates and conversion rates).
There can be no universal step-by-step process for doing A/B testing, since it will involve your production
system and production data. Therefore, it must be handled carefully and with production
considerations in mind.
The following are some broad guidelines for A/B testing.
The only way for A/B testing to be accurate is if incoming requests are randomly
assigned to A or B for a period of time.
This will ensure that the users for A or B are drawn from the exact same user
population.
This way, if the B system is severely deficient, then only a small percentage of
production users will be affected.
User interface
Multiple indexes
Ideally, of course, a system will be designed and implemented with A/B testing as a primary goal
from the very beginning. If it is, then turning on a B system for testing becomes a standard part of
system administration and testing procedures.
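As an illustration of the random-assignment guideline above (a sketch; the traffic split and the hashing scheme are assumptions, not a prescribed design), incoming requests can be assigned deterministically by hashing a stable user or session identifier:

```python
import hashlib

B_TRAFFIC_PERCENT = 10   # send only a small share of production users to the B system

def assign_bucket(user_id: str) -> str:
    """Deterministically assign a user to 'A' or 'B' from a hash of their ID."""
    h = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16)
    return "B" if (h % 100) < B_TRAFFIC_PERCENT else "A"

print(assign_bucket("session-42"), assign_bucket("session-43"))
```

Hashing a stable identifier keeps each user in the same bucket for the duration of the test, while still drawing both buckets from the same user population.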
11 Conclusions
This paper is the result of some 22 years of search engine accuracy testing, evaluation, and practical
experience. The journey started in 1992 with the first TREC conference, which several Search
Technologies employees attended. Many of the philosophies and strategies described in this paper
trace their roots back to TREC.
The journey continued throughout the 1990s, as we experimented with varying relevancy ranking
technologies and were the first (to my knowledge) to use machine learning to optimize relevancy
ranking formulae based on manual judgments.
As we enter the age of the Cloud and Big Data, we now have vastly more data available and the
machine resources required to process it. This has opened up new and exciting possibilities for
improving search not just for a select few but for everyone.
At Search Technologies, we continue to work every day to make this vision of high-quality,
optimized, tuned, targeted, engaging, and powerful search a reality for everyone. This paper is just
another step in the journey toward that ultimate goal.