
The Power of Elasticsearch

What is Search?
A horizontally scalable, distributed database built on Apache's Lucene that delivers a full-featured search experience across terabytes of data with a simple yet powerful API.
Search is a feature that can make or break an application or service. The ability to do full-text search across a body of text, to narrow search queries by values or ranges of specific fields, to use advanced features like faceted search, geo queries, or similarity searching can add a wealth of functionality and act as a differentiator for your product. When search fails, either because of inflexible query parameters, irrelevant results, or an inability to scale to meet demands of volume or usage, users notice immediately -- and they will be upset.

What is Elasticsearch?
Elasticsearch is a new database built to handle huge volumes of data with very high availability and to distribute itself across many machines for fault tolerance and scalability, all while maintaining a simple but powerful API that gives applications written in any language or framework access to the database.

2012 Infochimps, Inc. All rights reserved.

Built on Lucene

Apache's Lucene is an open-source Java library for text search. The Lucene project has been growing for more than a decade and has now become the standard reference for how to build a powerful yet easy to integrate, open-source search library. Its feature set includes but is not limited to:

Performance/Scalability
- Can index over 95 GB/hr/node
- Low RAM overhead (1 MB heap)

Search Features
- Ranked/relevance searching with fine-grained control
- Allows querying on phrases, wildcards, geographical proximity, variable ranges, &c.

Accessibility
- Completely open-source
- Implemented in Java so inherently cross-platform

Wrapping Lucene for Big Data

Lucene, as a search library, must be wrapped with an interface to allow its features to be used by an application. Many such interfaces have been built for different platforms and use cases. One of the most popular is Apache's own SOLR project, which creates an interface around Lucene tailored for something like a traditional web application.

An interface like SOLR, however, is designed for a world in which a single server can handle the full workload of indexing and querying the data. When the data volume grows past this limit, SOLR (and similar interfaces to Lucene) becomes unwieldy to use: the same problems of sharding, replication, and query dispatching that occur in RDBMS systems begin to occur again in this context. And just as various methods exist for dealing with these difficulties in the RDBMS world, various tools exist for shard creation and distribution around SOLR.

But just as the right solution to big data databases means moving away from RDBMS into NoSQL technologies, the right solution to scaling Lucene is to move away from tools like SOLR and use a tool built from the ground up to work with terabytes of data in a horizontally scalable, distributed, and fault-tolerant way: Elasticsearch!


Elasticsearch Features


Elasticsearch is best thought of as an interface to Lucene designed for big data from the ground up. The complex feature set that Lucene provides for searching data is directly available through Elasticsearch, as Lucene is ultimately the library that's used for indexing and querying data. This also means that plugins that work with Lucene will work with Elasticsearch out of the box.

The features that Elasticsearch itself provides around Lucene are designed to make it the perfect tool for full-text search on big data:

Performance/Scalability
- An 8-node cluster can provide sub-200ms response latency when performing complex searches on 10B+ records!
- Add or subtract nodes on the fly to dynamically scale the cluster to the current load.
- Ability to independently scale the indexing and querying performance of the cluster to deal with different sorts of use cases.

Robustness
- No single point of failure.
- Automatically back up all data in the cluster to local disk or permanent, remote storage (like AWS's S3 service).
- Tune the replication factor of data on a per-index level.
- If a node fails, data will automatically be migrated through the cluster to maintain performance and replication factor.

Ease of Use
- Simple, JSON-based REST API means any language can index or query records in an Elasticsearch cluster.
- Java and Thrift APIs exist for finer-grained or more performant access.
- Flexible schemas allow for complex treatments of types like dates without forcing all documents in a table to be identical.
- Multiple indices enable multi-tenancy out of the box.
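As a rough illustration of the per-index replication factor mentioned above, the sketch below builds the JSON body one might send when creating an index. The setting names number_of_shards and number_of_replicas are Elasticsearch index settings; the index name "tweets", the values chosen, and the helper function itself are assumptions made for this example, and no request is actually sent.

```python
import json


def index_settings(shards, replicas):
    """Build a create-index request body with a per-index replication factor."""
    return {
        "settings": {
            "number_of_shards": shards,      # primary shards the index is split into
            "number_of_replicas": replicas,  # extra copies kept of each primary shard
        }
    }


# Hypothetical index "tweets": 5 primary shards, 2 replicas of each.
body = index_settings(shards=5, replicas=2)
print("PUT /tweets")
print(json.dumps(body, indent=2))
```

A different index in the same cluster could be created with a different replica count, which is what tuning replication "on a per-index level" amounts to.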


Example Use Cases


There's a lot you can do with Elasticsearch besides just searching for phrases. The following examples act as a quick guide to just a few of the features Elasticsearch provides.

Powerful Query Syntax

The simplest way to interface with Elasticsearch is also one of the most powerful: the query string. Elasticsearch exposes the full Lucene query syntax through query strings that can be passed from a user in an application directly to the database to be evaluated.

Feature                      Query String                                     Notes
Boolean logic                (coke OR pepsi) AND health
Wildcards                    apple AND ip*d                                   Wildcards can be applied for a single character (?) or for groups of characters (*).
Specific search fields       coffee AND author:Smith                          Can search on deeply nested fields like author.lastName as well.
Search within a range        apple AND date:[20100101 TO 20100201]
Boost results in relevance   taxicab AND ("New York"^2 OR "San Francisco")    Boosting can also be configured at index time.
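To make the query-string idea concrete, here is a minimal sketch of how an application might wrap a user's raw query in a search request body. The query_string query type and its default_field option are part of Elasticsearch's query DSL; the helper function and the choice of example queries are assumptions for illustration, and nothing is sent over the network.

```python
import json


def query_string_body(user_query, default_field=None):
    """Wrap a raw Lucene-syntax query in a query_string search body."""
    inner = {"query": user_query}
    if default_field is not None:
        # Field searched when the query itself names no field.
        inner["default_field"] = default_field
    return {"query": {"query_string": inner}}


# Query strings taken from the feature list above.
examples = [
    "(coke OR pepsi) AND health",
    "apple AND ip*d",
    "coffee AND author:Smith",
    "apple AND date:[20100101 TO 20100201]",
]

for q in examples:
    print(json.dumps(query_string_body(q)))
```

Because the user's text is passed through unchanged, an application that does this should expect malformed queries from users and handle the resulting errors.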


Records can be Complex Documents

A record in Elasticsearch doesn't have to be flat like a record in a traditional RDBMS. Elasticsearch allows documents to be hierarchical, and for sub-fields within a document to themselves have hierarchical structure. This makes data modeling very flexible. An example of how one might store a blog post:

    {
      "id": 1001,
      "author": {
        "name": "Alexander Hamilton"
      },
      "date": "1787-10-07 12:31:00 -0600 CST",
      "title": "The Federalist Papers",
      "subtitle": "Paper #1",
      "text": "AFTER an unequivocal experience of the inefficiency...",
      "similar_posts": [1002, 1003, 1005],
      "comments": [
        {
          "id": 3874,
          "author": "John Adams",
          "text": "I must beg to differ..."
        }
      ]
    }

We could query these records using author.name or even comments.text, giving us a great deal of flexibility in how we choose to denormalize and access the data in the database.
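The dotted field names above can be read as paths into the nested document. The sketch below is only an illustration of that addressing scheme in plain Python, not a description of how Elasticsearch resolves fields internally; the get_path helper is hypothetical, and the document is a trimmed copy of the blog post example.

```python
def get_path(document, dotted_path):
    """Collect every value found at a dotted field path such as 'comments.text'."""
    values = [document]
    for key in dotted_path.split("."):
        next_values = []
        for value in values:
            # A list of sub-documents contributes each of its elements.
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_values.append(item[key])
        values = next_values
    return values


post = {
    "id": 1001,
    "author": {"name": "Alexander Hamilton"},
    "title": "The Federalist Papers",
    "comments": [
        {"author": "John Adams", "text": "I must beg to differ..."},
    ],
}

print(get_path(post, "author.name"))
print(get_path(post, "comments.text"))
```

A path that passes through a list, like comments.text, yields one value per comment, which is why a single query on that field can match any comment on a post.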


Geo Queries

Elasticsearch understands geography. Geolocations can be stored within records as (latitude, longitude) pairs or as geohashes. In either case, Elasticsearch provides the ability to query using a variety of geo-methods:

Geo queries defined with a bounding box

Geo queries defined by distance range from a given point
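The two kinds of geo query named above can be understood through the tests they apply to each record's location. The following is a plain-Python sketch of those tests, written for illustration only: it is not Elasticsearch's implementation, the function names and sample coordinates are assumptions, and the bounding-box check ignores boxes that cross the 180th meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def in_bounding_box(lat, lon, top_left, bottom_right):
    """True if (lat, lon) lies inside the box given by two (lat, lon) corners."""
    top, left = top_left
    bottom, right = bottom_right
    return bottom <= lat <= top and left <= lon <= right


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance_range(lat, lon, center, min_km, max_km):
    """True if (lat, lon) is between min_km and max_km from center."""
    d = haversine_km(lat, lon, center[0], center[1])
    return min_km <= d <= max_km


austin = (30.2672, -97.7431)
dallas = (32.7767, -96.7970)

print(in_bounding_box(*austin, top_left=(31.0, -98.0), bottom_right=(30.0, -97.0)))
print(round(haversine_km(*austin, *dallas)), "km between the two points")
print(within_distance_range(*dallas, center=austin, min_km=0, max_km=100))
```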


Time Series

Things change. It's important to see how. Elasticsearch understands dates and times and can return time series data which represent an aggregation of the search results binned by time interval.

Raw tweets stored in Elasticsearch can be binned into a time series on the fly at query time.
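Binning by time interval amounts to rounding each record's timestamp down to the start of its interval and counting records per bucket. The sketch below shows that idea in plain Python over a handful of made-up tweet timestamps; it is an illustration of the concept rather than Elasticsearch's own date histogram code, and the function name and sample data are assumptions.

```python
from collections import Counter
from datetime import datetime, timezone


def date_histogram(timestamps, interval_seconds):
    """Count timestamps per fixed-width bucket, keyed by each bucket's start time."""
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        # Round down to the start of the interval containing this timestamp.
        bucket_start = epoch - (epoch % interval_seconds)
        counts[bucket_start] += 1
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), n)
        for start, n in sorted(counts.items())
    ]


# Hypothetical tweet timestamps (UTC).
tweets = [
    datetime(2012, 6, 1, 12, 5, tzinfo=timezone.utc),
    datetime(2012, 6, 1, 12, 40, tzinfo=timezone.utc),
    datetime(2012, 6, 1, 13, 10, tzinfo=timezone.utc),
]

for bucket, count in date_histogram(tweets, interval_seconds=3600):
    print(bucket.isoformat(), count)
```

Changing interval_seconds re-bins the same raw records, which is the sense in which the time series is produced on the fly at query time.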


Application Support

Elasticsearch isn't just a search engine; it's a full-fledged database, and you can build an entire frontend application on top of it. Elasticsearch supports multiple indices (databases) and multiple mappings (tables) per index. This feature, combined with the complex document structure Elasticsearch allows, lets you build the complex data models that support applications.

In addition to executing rich search queries across the data, Elasticsearch allows the more traditional operations that define an application database: listing records, creating records, updating records, and deleting records. These features give you what you need to build a traditional database-driven, read/write application on top of the same database that lets you do full-text search and complex queries, all with horizontal scalability built in from the ground up.

Administration & Monitoring

Elasticsearch also exposes a complete administrative and monitoring interface over the same API that powers the indexing, retrieval, and search of data. Creating indices, updating their indexing or storage properties, defining rules for dealing with specific fields in specific mappings, &c. can all be accomplished via this same API. Detailed information about the cluster's availability state, health, individual nodes' memory footprint, &c. is also available through this API, making monitoring of Elasticsearch easy.
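Since record operations and monitoring share one REST API, an application mostly needs to know which HTTP method and path to use. The sketch below maps the operations described above to requests, assuming the /index/mapping/id URL layout that matches the index and mapping model in this paper (later Elasticsearch versions changed this layout). The helper functions and the "blog" and "post" names are hypothetical; /_cluster/health is the cluster health endpoint. No requests are sent.

```python
def document_request(action, index, mapping, doc_id):
    """Return the (HTTP method, path) pair for a single-record operation."""
    methods = {
        "create": "PUT",     # index a new document under the given id
        "read": "GET",       # fetch the stored document
        "update": "PUT",     # re-index the whole document under the same id
        "delete": "DELETE",  # remove the document
    }
    return methods[action], "/{}/{}/{}".format(index, mapping, doc_id)


def cluster_health_request():
    """Return the (HTTP method, path) pair for a cluster health check."""
    return "GET", "/_cluster/health"


# Hypothetical index "blog" with mapping "post".
for action in ("create", "read", "update", "delete"):
    method, path = document_request(action, "blog", "post", 1001)
    print(action, "->", method, path)

print("health ->", *cluster_health_request())
```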


About Infochimps
Our mission is to make the world's data more accessible. Infochimps helps companies understand their data. We provide tools and services that connect their internal data, leverage the power of cloud computing and new technologies such as Hadoop, and provide a wealth of external datasets, which organizations can connect to their own data.

Contact Us
Infochimps, Inc.
1214 W 6th St., Suite 202
Austin, TX 78703
1-855-DATA-FUN (1-855-328-2386)
www.infochimps.com
info@infochimps.com
Twitter: @infochimps

Get a free Big Data consultation


Let's talk Big Data in the enterprise!
Get a free conference with the leading big data experts regarding your enterprise big data project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop about your project objectives, design, infrastructure, tools, etc. Find out how other companies are solving similar problems. Learn best practices and get recommendations free.

