
Why you can't just 'Google' for

Enterprise Knowledge

Title: Why you can't just 'Google' for Enterprise Knowledge
Document #: VI-401565-AN
Revision: 1
Author: Paul Walsh
Issue Date:

Revision Author Date Details


A PW 26 June 2009 First draft
B PW 26 June 2009 Added content
C PW 30 June 2009 Incorporated review comments
D PW 01 July 2009 Added content
E/F PW 01 July 2009 Minor corrections
G PW 02 July 2009 Incorporated review comments
1 PW 07 July 2009 Minor additions and issued

© Cognidox Limited, 2009 Page 1 VI-401565-AN


Summary

This white paper reviews opinion on the key differences between Internet Search using
services such as Google or Bing and Enterprise Search behind the firewall. Some of the leading
commercial Enterprise Search solutions are introduced, before comparison with Open Source
alternatives (Swish-e, Lucene/Solr, Xapian/Flax and Sphinx). Our experience in developing a
plug-in search capability for CogniDox is described before some concluding observations on
the state of this industry sector.

Contents

Enterprise Search ................................................................................................................. 2
Enterprise Search Solutions .................................................................................................. 4
Proprietary Solutions.......................................................................................................... 4
Open Source Solutions ...................................................................................................... 5
Conclusions .......................................................................................................................... 7

Enterprise Search
Organizations create content quickly. Most of it is unstructured – emails, spreadsheets,
documents, presentations and images – as opposed to structured database-type content.
Estimates vary, but there is said to be around three times as much unstructured as structured
content in the typical company (a 75% to 25% ratio).

Researchers at UC Berkeley are updating the “How Much Information?" report originally published
in 2003. According to the WSJ1, it will state that “information workers ... are each bombarded with
1.6 gigabytes of information on average every day through emails, reports, blogs, text messages,
calls and more”.

Companies are discovering that the cost of storage is minuscule compared to the cost of maintaining
their content – a recent AIIM estimate is that 1 GB of content costs $0.20 to store but $3,500 to
manage.

Companies are squeezed on one side by the need to help employees find information (a 2008
AIIM survey found that 69% of respondents felt 50% or less of their organization's information can
be searched online) and on the other by the need to implement e-Discovery processes and
technology to meet the risk of litigation.

1 http://online.wsj.com/article/SB124252211780027326.html



Categorization and tagging help to structure content and make it easier to navigate, but it soon
becomes necessary to search the content itself. This requires search engine technology.

This is why Enterprise Search is an important topic for Document Management Systems (DMS) and
Enterprise Content Management (ECM). There is an oft-quoted IDC research statistic that knowledge
workers spend 25% of their average working week engaged in search-related activity, looking for the
information they need to do their job. A good content management application would seek to make
that time more productive.

Many people may think: “So what? Isn’t that just what people do 200 million times a day or more on
Google?” However, Enterprise search requires a different solution from the one provided by Internet
search engines such as Google.

Why is Enterprise search different from Internet search?

The primary difference lies in what we as users are trying to achieve. When we type a search string
into Google we don’t especially mind that there are about 44.2 million results for “Enterprise
Search”. This is partly because the search took only 0.28 seconds, and partly because results 1-10
have a good chance of being relevant. This is because we use Google to find information, and we are
using it to search for an answer rather than the answer. It is what psychologists call “satisficing”
cognitive behaviour – we are not looking for the sharpest needle in the haystack, but rather one
sharp enough to sew with. We are quite tolerant of inefficiency in the Internet search experience. As
John Allen Paulos puts it: “The Internet is the world's largest library. It's just that all the books are on
the floor.”

The Google PageRank algorithm, which drives Internet search, works much like the citation index
used for scientific papers: the more citations or links a web page has, the more relevant or
important it is judged to be. This makes it very useful for finding new information, because millions
of other users have trodden the same path.
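The link-counting idea can be sketched in a few lines. The following is a minimal, illustrative PageRank implementation over an invented three-page link graph – it is not Google’s production algorithm, just the power-iteration scheme the citation analogy describes.

```python
# Minimal PageRank sketch: rank pages by incoming links, the way a
# citation index ranks papers. The toy link graph below is invented.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # Dangling page: spread its rank evenly over all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {
    "home": ["pricing", "docs"],
    "pricing": ["home"],
    "docs": ["home", "pricing"],
}
ranks = pagerank(links)
# "home" attracts the most incoming links, so it ends up ranked highest
```

Pages accumulate rank in proportion to the rank of the pages linking to them, which is exactly why link counts say nothing about whether a document is the latest approved version.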

But if, instead of finding new information, you are looking to re-find information that you know
exists, then millions of results suddenly become an issue. You can’t even be happy with dozens of
results. You need to be able to trust that the content you find is exactly what you are looking for.
This is the same problem solved by the version control subsystem in a document management
system – when I search for my company’s “Product Pricebook” I really only want to see the latest,
approved version. It is irrelevant whether “Product Pricebook” has been looked at by hundreds of my
co-workers or only one. The number of links and previous hits is not useful to my search. In fact, it
may mislead me by directing me to stale links, a common problem on company intranets that don’t
use versioning.

On the other hand, some aspects of Enterprise search work in our favour. Unlike on the Internet, we
can use a more controlled vocabulary; search structured as well as unstructured data; constrain the
search parameters using metadata and/or categories (taxonomies); and place greater trust in the
security and validity of the information.

For more on why Enterprise search differs from Internet search, read these blogs:
• http://www.ideaeng.com/tabId/98/itemId/154/20-Differences-Between-Internet-vs-Enterprise-Se.aspx
• http://bexhuff.com/2009/02/why-google-will-never-be-good-at-enterprise-search
• http://thenoisychannel.com/2008/08/10/why-enterprise-search-will-never-be-google-y/



What does an Enterprise Search Engine do?

There are three main stages in Enterprise Search, and a key factor in search engine technology is
how much is supported, as well as the degree of customization, for each of these stages.

The first stage is Content Indexing - creating an index by crawling the content directories, databases
and other repositories using an automated process (a bot). The aim is to create an Index: a
searchable key to a collection. In Enterprise Search, the indexing mechanism should be able to
access company private data (with access privileges maintained), and you should have control over
the indexing schedule, so that rapidly changing content is indexed quickly and other content more
slowly. The bot may also be augmented with feed-in sources, which submit information to the index
rather than having the bot look for the data. Indexing may also support Metadata extraction and
Auto-summarization - analysing the collection and grouping its content into categories or
clusters. These in turn become facets that can be used to tune a query towards a particular
category.
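The core data structure behind this stage is the inverted index: a map from each term to the set of documents containing it. The sketch below builds one over two invented documents, carrying an invented "groups" access field alongside each record so that access privileges can be enforced later.

```python
# Sketch of the Content Indexing stage: crawl a document collection and
# build an inverted index mapping each term to the documents containing
# it. The document names, text and "groups" field are invented.
import re
from collections import defaultdict

documents = {
    "pricebook.doc": {"text": "Product Pricebook latest approved version",
                      "groups": {"sales"}},
    "datasheet.pdf": {"text": "Product datasheet for the approved design",
                      "groups": {"engineering", "sales"}},
}

def build_index(docs):
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        # Lowercase and split into alphanumeric terms before indexing
        for term in re.findall(r"[a-z0-9]+", doc["text"].lower()):
            index[term].add(doc_id)
    return index

index = build_index(documents)
# index["approved"] contains both documents; index["pricebook"] only one
```

A real enterprise indexer adds stemming, stop-word handling and per-document ACLs on top of this basic structure, but the term-to-documents map is the searchable key the text describes.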

The second stage is Querying - the engine processes the query, using the index, and finds any
content that matches the search subject. The results can be further processed by sorting them by
relevance or other clustering logic such as 'recommended' or 'best bets'.
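Relevance sorting is commonly done with a TF-IDF style score: terms that are rare across the collection weigh more than common ones, and documents mentioning a query term more often score higher. The corpus below is invented; this is one simple scoring scheme, not the only one engines use.

```python
# Sketch of the Querying stage: score documents against a query with
# TF-IDF so results come back sorted by relevance. The corpus is invented.
import math
import re
from collections import Counter

corpus = {
    "release-notes.txt": "search engine release notes for the search index",
    "user-guide.txt": "user guide covering the search syntax",
    "minutes.txt": "meeting minutes and action points",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Term frequencies per document, and document frequency per term
tf = {doc: Counter(tokenize(text)) for doc, text in corpus.items()}
df = Counter(term for counts in tf.values() for term in counts)
n_docs = len(corpus)

def search(query):
    scores = Counter()
    for term in tokenize(query):
        if not df[term]:
            continue
        idf = math.log(n_docs / df[term])  # rare terms weigh more
        for doc, counts in tf.items():
            if term in counts:
                scores[doc] += counts[term] * idf
    return [doc for doc, _ in scores.most_common()]

results = search("search index")
```

Here "release-notes.txt" ranks first because it contains "search" twice and is the only document containing the rarer term "index".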

When the data being searched has high-quality metadata, it lends itself to search using a faceted
classification system. As Wikipedia puts it, this "allows the assignment of multiple classifications to
an object, enabling the classifications to be ordered in multiple ways, rather than in a single, pre-determined,
taxonomic order". This may sound complex, but you have almost certainly used such a
system on an e-commerce site that separates products into different departments, e.g. Electronics
> Computers > Laptops. Enterprises are now bringing faceted search to document control, where the
facets are metadata such as author, edition, keyword, access permissions, and so on.
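Mechanically, each facet is just a metadata field, and drilling down intersects the result set with one facet value at a time. A minimal sketch, using invented metadata records:

```python
# Sketch of faceted search over document metadata: each facet value
# narrows the result set, like drilling down through departments on an
# e-commerce site. The metadata records are invented.
documents = [
    {"id": 1, "author": "PW", "type": "whitepaper", "edition": "1"},
    {"id": 2, "author": "PW", "type": "datasheet",  "edition": "2"},
    {"id": 3, "author": "JS", "type": "whitepaper", "edition": "1"},
]

def facet_search(docs, **facets):
    """Keep only documents matching every requested facet value."""
    return [d for d in docs
            if all(d.get(k) == v for k, v in facets.items())]

def facet_counts(docs, facet):
    """Count documents per value of one facet - the numbers shown
    beside each drill-down link in a faceted UI."""
    counts = {}
    for d in docs:
        counts[d[facet]] = counts.get(d[facet], 0) + 1
    return counts

hits = facet_search(documents, author="PW", type="whitepaper")
# hits is the single document with id 1
```

Because classifications are independent fields rather than one fixed hierarchy, the same collection can be drilled into by author first, by type first, or by any other order the user chooses.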

The third stage is Formatting – the results are tabulated in some way for the end-user, usually
according to a template. But this could also be any other sort of data visualization, such as a tag
cloud. Again, this should take into account local user rights to the indexed data, and access control
must be applied to the search result content.
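The access-control requirement in this stage can be sketched as a filter applied before the template renders anything, so that a user never sees even the title of a hit they cannot read. The document records, group names and template are all invented for illustration.

```python
# Sketch of the Formatting stage: apply the user's access rights to the
# raw hit list, then render the survivors through a simple template.
from string import Template

hits = [
    {"title": "Product Pricebook", "groups": {"sales"}, "rev": "C"},
    {"title": "Board Minutes", "groups": {"directors"}, "rev": "A"},
]

row = Template("$title (rev $rev)")

def render(hits, user_groups):
    # Drop any hit the user has no group in common with, THEN format;
    # filtering after display would leak restricted titles.
    visible = [h for h in hits if h["groups"] & user_groups]
    return [row.substitute(h) for h in visible]

lines = render(hits, {"sales", "engineering"})
# The Board Minutes hit is filtered out before formatting
```

The same pattern applies whether the output is a plain results table, a tag cloud or any other visualization: rights checks come first, presentation second.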

Enterprise Search Solutions


As with so many other areas of Enterprise Software, there are commercial/proprietary solutions and
open source solutions. Some of the open source solutions are completely “roll your own”, while
others are augmented by paid-for consultancy and support services.

Proprietary Solutions

Detailed analysis of commercial Enterprise search products and companies is available from analysts
such as Gartner and Overflight (http://arnoldit.com/overflight/). The “Big 3” proprietary software
solutions in this field according to Gartner are:
(1) Autonomy / Verity K2 (Verity was acquired by Autonomy in 2005 for $500M)
(2) Endeca IAP
(3) FAST ESP (Microsoft acquired FAST for $1.3B in 2008)
Some would extend this to include the Google Search Appliance (GSA), a hardware solution running
Linux and available in Mini and full-scale formats. The Mini caters for Enterprises with indexing
requirements up to 300K documents and costs in the region of $10,000, whereas a GB-5005 GSA
with capacity for up to 10 million documents would cost around $495,000. These appliances are
supported for 2 years – after that you are expected to replace them. The search algorithms used by
the Mini and the GSA are not public, but ranking uses all the Google algorithms, including PageRank.
There is more scope to tune the relevance ranking on the GSA, and it has other features not present
on the Mini such as direct database indexing, incremental indexing for new content, and query
stemming.

At least one commercial DMS system offers an Enterprise Search capability that is a bundle of their
software with a Google GSA. This is not as advanced a feature as it might seem, and in fact it may be
just an expensive way to acquire a server on which to run the DMS.

As an example of a commercial software solution, Autonomy’s IDOL technology is used by larger
Enterprises, Government and Intelligence Agencies. It uses “conceptual modelling to assess the
similarity between pieces of content within its index”. Autonomy recently acquired a content
management software company – Interwoven – for $775M. A Forrester report published in 2006 said
their average deal size was $360,000 with an $80,000 minimum license cost, so this is clearly not a
cheap, entry-level option. Autonomy’s revenue was $500M in FY2008.

As these proprietary solutions become challenged by open source solutions, it raises the question of
whether the latter are “industrial strength” enough to scale to really large document database sizes.
The following diagram (heavily adapted from the original2) shows how these solutions compare to
each other and to one open source solution (Lucene) that we will discuss in the next section.

For the purpose of this analysis, mid-level is 5,000 to 100,000 documents, and high-end is anything
from that up to 100 million documents.

2 http://www.ideaeng.com/tabId/98/itemId/85/Enterprise-Search-Matrix.aspx

Open Source Solutions

As with so many other areas now in Enterprise software, there are Open Source alternatives that
attempt to provide equivalent functionality at a more favourable cost-of-entry.



These include:

Swish-e
Swish-e (http://swish-e.org/) is the oldest of the group. A major update in 2000 saw the release of
Swish-e version 2 (where the -e stands for “enhanced”). It provides a way to index collections of
HTML, plain text, XML, PDF, MS Word, Excel and other formats. V2.4.7 is the latest stable release, as
of April 2009, but there is also a parallel development called Swish3, which will be a complete
overhaul of the code. One significant change is the adoption of the Xapian backend for indexing and search.

Apache Lucene / Solr / Nutch / Tika


Lucene Java (http://lucene.apache.org/) is an information retrieval API written in Java and ported to
many other languages. It is the base indexing and query library and is intended to be embedded in
other applications. Lucene is used as the search engine for the Wikipedia website, and it is one of
the ‘top 5’ Apache projects in terms of user downloads.

Apache Solr (http://lucene.apache.org/solr/) was released in 2007 and adds search server features
(such as faceted search, queries, de-duplication, and a web administration interface) on top of
Lucene. You also need to consider Apache Nutch (http://lucene.apache.org/nutch/) for web
crawling. Another tool worth a mention is Apache Tika (http://lucene.apache.org/tika/) which
automatically extracts metadata and structured text content from various document formats,
including MS-Office and PDF. All are available under the Apache Software license v2.0.
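Solr is queried over HTTP, which is one reason it is easy to bolt onto an existing application. The sketch below only builds a query URL for Solr's standard select handler (the q, fq and facet parameters are standard Solr options); the host, core name and field names are placeholders, and a real deployment would fetch the URL and parse the JSON response.

```python
# Sketch of constructing a query URL for a Solr server's select handler.
# Host, core and field names are invented placeholders.
from urllib.parse import urlencode

def solr_query_url(base, core, text, **filters):
    params = [("q", text), ("wt", "json"),
              ("facet", "true"), ("facet.field", "author")]
    # Each keyword argument becomes a filter query (fq) clause
    params += [("fq", f"{field}:{value}") for field, value in filters.items()]
    return f"{base}/solr/{core}/select?{urlencode(params)}"

url = solr_query_url("http://localhost:8983", "docs",
                     "product pricebook", status="approved")
# A real client would now fetch this URL with urllib.request and parse JSON
```

Keeping the search engine behind a plain HTTP interface like this is what lets a DMS treat it as a pluggable component rather than an embedded library.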

Comcast, the US cable service, high-speed Internet, and telephone service provider, recently
reported their experience in a trial evaluation of Lucene/Solr versus an unnamed commercial
product. Using a test-bench of up to 4 million documents, they found Solr outperformed the
commercial product in response rates and failure-handling.

Xapian / Flax
Xapian (http://www.xapian.org/) is a probabilistic information retrieval library written in C++ with
bindings to most common languages. Xapian is used by German newspaper Die Zeit, Debian and
Mydeco.

Just as Lucene is augmented by Solr, the Flax project (http://www.flax.co.uk/) provides an enterprise
search engine application on top of the Xapian search engine library. It provides features such as file
system indexers, file format translators, web spiders, sentiment analysis and automatic categorisers.
Currently in alpha, the Flax Search Service (FSS) provides a web services interface that supports
indexing and basic search functionality. Their roadmap is to add support for advanced features of
the Xapian engine such as facets and tagging, geolocation and image search. The Flax team was
chosen to be part of a DTI-funded collaborative project undertaken by e-Therapeutics PLC. As part of
this project a searchable index of over 100 million web pages was created, which demonstrates the
scalability of the Flax platform.

Sphinx Search
Sphinx (http://sphinxsearch.com/) is a full-text search daemon that has native support for indexing
SQL databases such as MySQL and PostgreSQL. It's used on sites such as Craigslist and DailyMotion.
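Sphinx itself runs as a separate daemon, but the underlying idea of full-text search over SQL-stored content can be sketched with SQLite's FTS5 extension, which is bundled with most Python builds. This is a stand-in to illustrate the concept, not Sphinx's own API; the table and rows are invented.

```python
# Full-text search over SQL-stored content, sketched with SQLite FTS5
# as a stand-in for a dedicated daemon like Sphinx. Data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(title, body)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", [
    ("Search daemon setup", "installing the search daemon on Linux"),
    ("Release notes", "minor corrections and fixes"),
])
# MATCH runs a full-text query; ORDER BY rank sorts by relevance
rows = conn.execute(
    "SELECT title FROM posts WHERE posts MATCH ? ORDER BY rank",
    ("search",),
).fetchall()
```

The appeal of this style of engine is exactly what the Sphinx description suggests: the content already lives in SQL tables, so indexing is a matter of pointing the engine at them rather than crawling a file system.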

It's not my intention to systematically compare these different Open Source projects here. Each has
a support base; users tend to have good reasons for using their preferred choice; and in any event,
there are currently no reliable benchmark tests.



However, a quick trawl around various blog sites does throw up some observations:

• The development community for Lucene/Solr is both large and active; for Xapian it is mid-size
and active; and for Sphinx it is smaller but active.
• Swish-e is more in a state of transition than the others, and will need to be re-evaluated
when Swish3 is widely deployed.
• Compared to both Xapian and Sphinx, Lucene/Solr is considered difficult to install and
configure correctly unless you are familiar with Java deployment technologies. However, this
could be seen as a reflection of the power and flexibility of its configuration options.
• Xapian is slower than Lucene/Solr at indexing, but is faster and possibly more accurate at
querying.
• Xapian is slower than Sphinx at indexing, but faster at querying.
• Lucene/Solr is more scalable than Xapian with large datasets, and Xapian is more scalable
than Sphinx.
• Lucene/Solr is claimed to be better suited than the others to distributed or 'federated'
search across a cluster of servers.
• Sphinx may be better suited to sites with a large amount of unstructured text (e.g. a
newspaper website), but it will require a good deal of SQL optimisation.

Our experience with CogniDox is as follows: we were early adopters of Swish-e (around 2001) and it
has served us reasonably well. However, we have been getting feedback that the relevancy of search
results was an issue, and users were surprised to find that some known content was not showing up
in search results. There was also dissatisfaction with the search syntax and use of operators – anything
that isn’t Google-like can create a negative learning-transfer problem, because users are so
accustomed to Google from their everyday use of the Internet. We were also unhappy that the stable
Swish-e version didn’t support incremental indexing – we had to re-index once a day, and content
was unsearchable until we did.
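The incremental-indexing gap is worth making concrete. A minimal sketch of the idea, using invented documents and timestamps: track each document's modification time at the last run and re-index only what changed, instead of rebuilding the whole index nightly.

```python
# Sketch of incremental indexing: re-index only documents whose
# modification time changed since the last run. Data is invented.
from collections import defaultdict

last_indexed = {"a.txt": 100, "b.txt": 100}            # doc -> mtime at last run
current = {"a.txt": (100, "old text unchanged"),
           "b.txt": (180, "budget figures revised"),
           "c.txt": (170, "brand new budget memo")}    # doc -> (mtime, text)

index = defaultdict(set)                               # term -> set of docs

def reindex(index, last_indexed, current):
    changed = [d for d, (mtime, _) in current.items()
               if last_indexed.get(d) != mtime]
    for doc in changed:
        # Drop the document's stale entries, then re-add its terms
        for terms in index.values():
            terms.discard(doc)
        for term in current[doc][1].lower().split():
            index[term].add(doc)
        last_indexed[doc] = current[doc][0]
    return changed

changed = reindex(index, last_indexed, current)
# Only b.txt and c.txt are touched; a.txt is skipped as unchanged
```

With a scheme like this, new content becomes searchable as soon as the next short incremental pass runs, rather than waiting for the nightly rebuild.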

We could have waited for Swish3, but decided to first try Lucene/Solr/Tika as a replacement. We are
impressed with the results it produces and the range of search engine features. It was reasonably
straightforward to set up (and we’ll be hiding much of that complexity in our application wrapper in
any event). The real effort lies in tailoring it to the data source.

We have also experimented with Xapian/FSS. We were equally impressed with the results, and this
time we found it even easier to set up. FSS is still in alpha release, but we will monitor it carefully;
if it continues to advance at its current rate, it is likely to become our default choice.

Both the Lucene and the Xapian-based solutions look as though they are very scalable, and will
support the type of synchronized multi-server clusters we need for document management.
Scalability is not as big an issue as perhaps in other sectors, but we have worked with Electronics
Design companies who have generated document stores in excess of 100,000 files.

Conclusions
We think:

• The argument that Enterprise Search requires different solutions than Internet Search is
compelling (even Google says so to some extent), but the commercial software and/or
appliance solutions are too expensive for small and medium-sized Enterprises



• Buying a document management company (as Autonomy did with Interwoven) to bundle
into your search technology offering may be a good idea, but is swimming against the tide
when open source technology is becoming as good, without the high price tag

• Integrating your document management software with a Google Search Appliance may not
be really adding significant Enterprise search value – it depends to a large extent whether
the Google algorithms are appropriate, and also whether you use a Mini or a GSA

• Open source text search engines such as Lucene/Solr and Xapian are getting very close to
matching the likes of Autonomy IDOL, but they are still libraries for programmers and need
integration into Enterprise applications to exploit their full potential

• There is critical mass behind Lucene/Solr as the leading open source enterprise search
technology, but it is still early enough in the cycle to keep alternatives open in Xapian/Flax
and Sphinx

• We especially intend to track the Flax project because it seems to have the power and
scalability of Solr yet is easier to set up and configure, at least for us



Company Information
Registered Office : Cognidox Limited
St John’s Innovation Centre
Cowley Road
Cambridge CB4 0WS
UK

Registered in England and Wales No. 06506232

Email salesinfo@cognidox.com

Telephone +44 (0) 1223 911986

Smart Document Management


CogniDox helps teams in Engineering, Marketing, Sales, Operations and other
departments to capture, share and publish product and design documentation.

This easy-to-use tool helps break down the barriers to finding information, sharing
solutions and enjoying a faster, more productive development workflow inside your
company. In addition, CogniDox will help you manage and publish documents and
other content to licensed customers. It reduces technical support load and
accelerates your customers' time to market.

www.cognidox.com
