
Elasticsearch for indexing documents

Installing Elasticsearch:

- As of Nov 2016, the latest version of ES is 5.0. It requires JDK 1.8.
- Download the tar or zip file and extract it into a working folder.
- Add the Elasticsearch bin directory to the PATH.
- Run elasticsearch from the command line.
The server starts and listens on port 9200.
This is a single-node setup.
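
To confirm that the node is up, a request to the root URL on port 9200 is enough. The sketch below uses only the Python standard library; the host name localhost and the helper names are assumptions for illustration, not part of ES.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: node started locally on the default port


def parse_node_info(body: bytes) -> dict:
    """Pick the cluster name and version number out of the JSON returned by GET /."""
    info = json.loads(body.decode("utf-8"))
    return {
        "cluster_name": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def fetch_node_info(base_url: str = ES_URL) -> dict:
    """Send GET / to the node and summarize the answer."""
    with urllib.request.urlopen(base_url + "/", timeout=5) as response:
        return parse_node_info(response.read())
```

Calling fetch_node_info() against a running node should return the cluster name and the version number. If nothing is listening on the port, urlopen raises urllib.error.URLError.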

Terminology:
- Node: a single running instance of Elasticsearch.
- Cluster: a collection of one or more nodes that run together and share indices.
- Document: the basic unit of data in Elasticsearch, equivalent to a row in a database table.
This is not the same as a physical document such as an MS Word, PDF, or XLS file.
A document is represented in JSON format.
- Type: a category that a document is assigned to. It is equivalent to a table in an
RDBMS or a class in OOP.
- Mapping: the definition of a type. It specifies the attributes of the JSON
document and is equivalent to a schema definition in an RDBMS.
- Index: the overall collection of types (= tables). As a whole it corresponds to
a database in an RDBMS.
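
The terms above map directly onto the REST paths and request bodies. The following sketch assumes ES 5.x, where an index still holds types; the names my_index and my_type and the two field names are hypothetical. It builds the body that defines a type with its mapping and the path that addresses one document.

```python
import json


def build_index_body(type_name: str, properties: dict) -> str:
    """Return the JSON body for PUT /<index> that defines one type and its mapping."""
    return json.dumps({"mappings": {type_name: {"properties": properties}}}, indent=2)


def document_path(index: str, type_name: str, doc_id: str) -> str:
    """Return the REST path of a single document: /<index>/<type>/<id>."""
    return "/{}/{}/{}".format(index, type_name, doc_id)


# Hypothetical names, for illustration only.
body = build_index_body(
    "my_type",
    {"filename": {"type": "keyword"}, "location": {"type": "keyword"}},
)
print("PUT /my_index")
print(body)
print("GET " + document_path("my_index", "my_type", "1"))
```

In RDBMS terms, the first request plays the role of creating a database with one table and its schema, and the path plays the role of selecting one row by primary key.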

REST APIs for creating and accessing an index:
- Elasticsearch exposes a REST API that accepts and returns JSON for indexing and retrieving data.
Refer to the appendix for most of the REST API commands.
Tools are available for creating and manipulating an index manually: Kibana, Sense.
curl can also be used from the command line.
- To do this programmatically, write a client using one of the client APIs provided by ES:
https://www.elastic.co/guide/en/elasticsearch/client/index.html
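
Every REST call has the same shape: an HTTP method, a path, and an optional JSON body. As a minimal sketch of that shape, the helper below (es_request is a hypothetical name, and the host is assumed to be a local node) builds such a request with the Python standard library, much as one would with curl.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node on the default port


def es_request(method: str, path: str, body: dict = None) -> urllib.request.Request:
    """Build an HTTP request for the ES REST API; the body, if any, is sent as JSON."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(ES_URL + path, data=data, headers=headers, method=method)


# Two REST commands expressed as requests (index and type names are hypothetical).
create_index = es_request("PUT", "/my_index")
get_document = es_request("GET", "/my_index/my_type/1")
for request in (create_index, get_document):
    print(request.get_method(), request.full_url)
```

Passing one of these objects to urllib.request.urlopen sends it to the node; the response body is JSON as well.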

Client implementation in Java:

- For Java, two APIs are provided: the Java API and the Java REST API.
- The Java REST API is used in this prototype.
- The package org.elasticsearch.client needs to be downloaded from Maven:
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_maven_repository.html
- Dependent jars:
org.apache.httpcomponents:httpasyncclient
org.apache.httpcomponents:httpcore-nio
org.apache.httpcomponents:httpclient
org.apache.httpcomponents:httpcore
commons-codec:commons-codec
commons-logging:commons-logging

Indexing documents (Word, PDF, etc.):
- As of the latest version, the Ingest Attachment Processor plugin must be installed.
Follow the instructions:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html
- The processor is used in a pipeline:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/using-ingest-attachment.html

Example:
PUT http://localhost:9200/_ingest/pipeline/<pipelinename>
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties" : [
          "title",
          "author",
          "content",
          "date",
          "keywords",
          "content_type",
          "content_length",
          "language"
        ]
      }
    },
    { "remove" : { "field" : "data" } }
  ]
}
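
The pipeline definition can also be sent from code. The sketch below builds the request with the Python standard library; the pipeline name "attachment", the shortened property list, and the local host are assumptions chosen for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node on the default port


def build_pipeline_request(name: str, properties: list) -> urllib.request.Request:
    """Build PUT /_ingest/pipeline/<name> with an attachment and a remove processor."""
    body = {
        "description": "Extract attachment information",
        "processors": [
            # Parse the base64 content held in the "data" field.
            {"attachment": {"field": "data", "properties": properties}},
            # Drop the raw base64 content once it has been parsed.
            {"remove": {"field": "data"}},
        ],
    }
    return urllib.request.Request(
        url="{}/_ingest/pipeline/{}".format(ES_URL, name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


pipeline_request = build_pipeline_request("attachment", ["title", "author", "content"])
print(pipeline_request.get_method(), pipeline_request.full_url)
```

Passing pipeline_request to urllib.request.urlopen registers the pipeline on a running node that has the plugin installed.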

- After creating the pipeline, use it when indexing:

PUT my_index/my_type/my_id?pipeline=<pipelinename>
{
  "data" : "<base64content>",
  "filename" : "<filename>",
  "location" : "<file uri>"
}

- The request above includes a filename field so that the indexed document can be matched
to the physical file. A location URI can be included as well. These fields are user-defined;
they are not part of the ES API.
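
The data field must hold the file content encoded as base64. As a minimal sketch, the function below encodes raw bytes and builds the indexing request; the index, type, id, pipeline name, and file location are hypothetical values, and the host is assumed to be a local node.

```python
import base64
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node on the default port


def build_index_request(index, doc_type, doc_id, pipeline, filename, location, content):
    """Build PUT /<index>/<type>/<id>?pipeline=<pipeline> for one file."""
    body = {
        "data": base64.b64encode(content).decode("ascii"),  # file bytes as base64 text
        "filename": filename,  # user-defined field
        "location": location,  # user-defined field
    }
    url = "{}/{}/{}/{}?{}".format(
        ES_URL,
        index,
        doc_type,
        urllib.parse.quote(doc_id, safe=""),
        urllib.parse.urlencode({"pipeline": pipeline}),
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


index_request = build_index_request(
    "my_index", "my_type", "1", "attachment",
    "hello.txt", "file:///tmp/hello.txt", b"hello world",
)
print(index_request.get_method(), index_request.full_url)
print(index_request.data.decode("utf-8"))
```

For a real file, the content argument would be the bytes read from disk, for example open(path, "rb").read().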

Searching the index:
- The index is searched through the ES REST API as well. A query can be sent either as a
query string (the q parameter) or as a JSON object in the request body. ES defines its own
query language (the Query DSL) for the JSON form.
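
The two ways of sending a query can be sketched as follows. The index name is hypothetical, and the field name attachment.content assumes the attachment processor wrote its output to its default target field; adjust it if the pipeline is configured differently.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node on the default port


def build_uri_search(index: str, query_string: str) -> urllib.request.Request:
    """Search with the q parameter: GET /<index>/_search?q=<query>."""
    url = "{}/{}/_search?{}".format(
        ES_URL, index, urllib.parse.urlencode({"q": query_string})
    )
    return urllib.request.Request(url, method="GET")


def build_body_search(index: str, field: str, text: str) -> urllib.request.Request:
    """Search with a JSON object: POST /<index>/_search with a match query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        "{}/{}/_search".format(ES_URL, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


uri_search = build_uri_search("my_index", "attachment.content:elasticsearch")
body_search = build_body_search("my_index", "attachment.content", "elasticsearch")
print(uri_search.get_method(), uri_search.full_url)
print(body_search.get_method(), body_search.full_url, body_search.data.decode("utf-8"))
```

The q form is convenient for quick checks from a browser or curl; the JSON form gives access to the full query language.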

Appendix: ES REST API (ver <5.0)

Each entry lists the action name, the HTTP method, and the URL path.


AliasesExist HEAD /{index}/_alias/{name}
Analyze GET /_analyze
Analyze GET /{index}/_analyze
Analyze POST /_analyze
Analyze POST /{index}/_analyze
Bulk POST /_bulk
Bulk POST /{index}/_bulk
Bulk POST /{index}/{type}/_bulk
Bulk PUT /_bulk
Bulk PUT /{index}/_bulk
Bulk PUT /{index}/{type}/_bulk
ClearIndicesCache GET /_cache/clear
ClearIndicesCache GET /{index}/_cache/clear
ClearIndicesCache POST /_cache/clear
ClearIndicesCache POST /{index}/_cache/clear
ClearScroll DELETE /_search/scroll
ClearScroll DELETE /_search/scroll/{scroll_id}
CloseIndex POST /_close
CloseIndex POST /{index}/_close
ClusterGetSettings GET /_cluster/settings
ClusterHealth GET /_cluster/health
ClusterHealth GET /_cluster/health/{index}
ClusterReroute POST /_cluster/reroute
ClusterSearchShards GET /_search_shards
ClusterSearchShards GET /{index}/_search_shards
ClusterSearchShards GET /{index}/{type}/_search_shards
ClusterSearchShards POST /_search_shards
ClusterSearchShards POST /{index}/_search_shards
ClusterSearchShards POST /{index}/{type}/_search_shards
ClusterState GET /_cluster/state
ClusterUpdateSettings PUT /_cluster/settings
Count GET /_count
Count GET /{index}/_count
Count GET /{index}/{type}/_count
Count POST /_count
Count POST /{index}/_count
Count POST /{index}/{type}/_count
Create POST /{index}/{type}/{id}/_create
Create PUT /{index}/{type}/{id}/_create
CreateIndex POST /{index}
CreateIndex PUT /{index}
Delete DELETE /{index}/{type}/{id}
DeleteByQuery DELETE /{index}/_query
DeleteByQuery DELETE /{index}/{type}/_query
DeleteIndex DELETE /
DeleteIndex DELETE /{index}
DeleteIndexTemplate DELETE /_template/{name}
DeleteMapping DELETE /{index}/{type}/_mapping
DeleteWarmer DELETE /{index}/_warmer
DeleteWarmer DELETE /{index}/_warmer/{name}
DeleteWarmer DELETE /{index}/{type}/_warmer/{name}
Explain GET /{index}/{type}/{id}/_explain
Explain POST /{index}/{type}/{id}/_explain
Flush GET /_flush
Flush GET /{index}/_flush
Flush POST /_flush
Flush POST /{index}/_flush
GatewaySnapshot POST /_gateway/snapshot
GatewaySnapshot POST /{index}/_gateway/snapshot
Get GET /{index}/{type}/{id}
GetAliases GET /_alias/{name}
GetAliases GET /{index}/_alias/{name}
GetIndexTemplate GET /_template
GetIndexTemplate GET /_template/{name}
GetIndicesAliases GET /_aliases
GetIndicesAliases GET /{index}/_aliases
GetMapping GET /_mapping
GetMapping GET /{index}/_mapping
GetMapping GET /{index}/{type}/_mapping
GetSettings GET /_settings
GetSettings GET /{index}/_settings
GetSource GET /{index}/{type}/{id}/_source
GetWarmer GET /{index}/_warmer
GetWarmer GET /{index}/_warmer/{name}
GetWarmer GET /{index}/{type}/_warmer/{name}
Head HEAD /{index}/{type}/{id}
HeadIndexTemplate HEAD /_template/{name}
HeadSource HEAD /{index}/{type}/{id}/_source
Index POST /{index}/{type}
Index POST /{index}/{type}/{id}
Index PUT /{index}/{type}/{id}
IndexDeleteAliases DELETE /{index}/_alias/{name}
IndexFilteredStatsCompletion GET /{index}/_stats/completion
IndexFilteredStatsCompletion GET /{index}/_stats/completion/{fields}
IndexFilteredStatsDocs GET /{index}/_stats/docs
IndexFilteredStatsFielddata GET /{index}/_stats/fielddata
IndexFilteredStatsFielddata GET /{index}/_stats/fielddata/{fields}
IndexFilteredStatsFilter_cache GET /{index}/_stats/filter_cache
IndexFilteredStatsFlush GET /{index}/_stats/flush
IndexFilteredStatsGet GET /{index}/_stats/get
IndexFilteredStatsId_cache GET /{index}/_stats/id_cache
IndexFilteredStatsIndexing GET /{index}/_stats/indexing
IndexFilteredStatsIndexing GET /{index}/_stats/indexing/{indexingTypes2}
IndexFilteredStatsMerge GET /{index}/_stats/merge
IndexFilteredStatsPercolate GET /{index}/_stats/percolate
IndexFilteredStatsRefresh GET /{index}/_stats/refresh
IndexFilteredStatsSearch GET /{index}/_stats/search
IndexFilteredStatsSearch GET /{index}/_stats/search/{searchGroupsStats2}
IndexFilteredStatsStore GET /{index}/_stats/store
IndexFilteredStatsWarmer GET /{index}/_stats/warmer
IndexPutAlias PUT /_alias
IndexPutAlias PUT /{index}/_alias
IndexPutAlias PUT /{index}/_alias/{name}
IndexPutAliasByName PUT /_alias/{name}
Indices GET /_cat/indices
Indices GET /_cat/indices/{index}
IndicesAliases POST /_aliases
IndicesExists HEAD /{index}
IndicesSegments GET /_segments
IndicesSegments GET /{index}/_segments
IndicesStatCompletion GET /_stats/completion
IndicesStatCompletion GET /_stats/completion/{fields}
IndicesStatDocs GET /_stats/docs
IndicesStatFielddata GET /_stats/fielddata
IndicesStatFielddata GET /_stats/fielddata/{fields}
IndicesStatFilter_cache GET /_stats/filter_cache
IndicesStatFlush GET /_stats/flush
IndicesStatGet GET /_stats/get
IndicesStatId_cache GET /_stats/id_cache
IndicesStatIndexing GET /_stats/indexing
IndicesStatIndexing GET /_stats/indexing/{indexingTypes1}
IndicesStatMerge GET /_stats/merge
IndicesStatPercolate GET /_stats/percolate
IndicesStatRefresh GET /_stats/refresh
IndicesStatSearch GET /_stats/search
IndicesStatSearch GET /_stats/search/{searchGroupsStats1}
IndicesStatStore GET /_stats/store
IndicesStatWarmer GET /_stats/warmer
IndicesStats GET /_stats
IndicesStats GET /{index}/_stats
IndicesStatus GET /_status
IndicesStatus GET /{index}/_status
Main GET /
Main HEAD /
Master GET /_cat/master
MoreLikeThis GET /{index}/{type}/{id}/_mlt
MoreLikeThis POST /{index}/{type}/{id}/_mlt
MultiGet GET /_mget
MultiGet GET /{index}/_mget
MultiGet GET /{index}/{type}/_mget
MultiGet POST /_mget
MultiGet POST /{index}/_mget
MultiGet POST /{index}/{type}/_mget
MultiPercolate POST /_mpercolate
MultiPercolate POST /{index}/_mpercolate
MultiPercolate POST /{index}/{type}/_mpercolate
MultiSearch GET /_msearch
MultiSearch GET /{index}/_msearch
MultiSearch GET /{index}/{type}/_msearch
MultiSearch POST /_msearch
MultiSearch POST /{index}/_msearch
MultiSearch POST /{index}/{type}/_msearch
MultiTermVectors GET /_mtermvectors
MultiTermVectors GET /{index}/_mtermvectors
MultiTermVectors GET /{index}/{type}/_mtermvectors
MultiTermVectors POST /_mtermvectors
MultiTermVectors POST /{index}/_mtermvectors
MultiTermVectors POST /{index}/{type}/_mtermvectors
NodeInfoHttp GET /_nodes/{nodeId}/http
NodeInfoJvm GET /_nodes/{nodeId}/jvm
NodeInfoNetwork GET /_nodes/{nodeId}/network
NodeInfoOs GET /_nodes/{nodeId}/os
NodeInfoPlugin GET /_nodes/{nodeId}/plugin
NodeInfoProcess GET /_nodes/{nodeId}/process
NodeInfoSettings GET /_nodes/{nodeId}/settings
NodeInfoThread_pool GET /_nodes/{nodeId}/thread_pool
NodeInfoTransport GET /_nodes/{nodeId}/transport
NodeStatsFs GET /_nodes/{nodeId}/stats/fs
NodeStatsHttp GET /_nodes/{nodeId}/stats/http
NodeStatsIndices GET /_nodes/{nodeId}/stats/indices
NodeStatsIndices GET /_nodes/{nodeId}/stats/indices/{flags}
NodeStatsIndices GET /_nodes/{nodeId}/stats/indices/{flags}/{fields}
NodeStatsJvm GET /_nodes/{nodeId}/stats/jvm
NodeStatsNetwork GET /_nodes/{nodeId}/stats/network
NodeStatsOs GET /_nodes/{nodeId}/stats/os
NodeStatsProcess GET /_nodes/{nodeId}/stats/process
NodeStatsThread_pool GET /_nodes/{nodeId}/stats/thread_pool
NodeStatsTransport GET /_nodes/{nodeId}/stats/transport
Nodes GET /_cat/nodes
NodesHotThreads GET /_nodes/hot_threads
NodesHotThreads GET /_nodes/{nodeId}/hot_threads
NodesInfo GET /_nodes
NodesInfo GET /_nodes/{nodeId}
NodesInfoHttp GET /_nodes/http
NodesInfoJvm GET /_nodes/jvm
NodesInfoNetwork GET /_nodes/network
NodesInfoOs GET /_nodes/os
NodesInfoPlugin GET /_nodes/plugin
NodesInfoProcess GET /_nodes/process
NodesInfoSettings GET /_nodes/settings
NodesInfoThread_pool GET /_nodes/thread_pool
NodesInfoTransport GET /_nodes/transport
NodesRestart POST /_cluster/nodes/_restart
NodesRestart POST /_cluster/nodes/{nodeId}/_restart
NodesShutdown POST /_cluster/nodes/{nodeId}/_shutdown
NodesShutdown POST /_shutdown
NodesStats GET /_nodes/stats
NodesStats GET /_nodes/{nodeId}/stats
NodesStatsFs GET /_nodes/stats/fs
NodesStatsHttp GET /_nodes/stats/http
NodesStatsIndices GET /_nodes/stats/indices
NodesStatsIndices GET /_nodes/stats/indices/{flags}
NodesStatsIndices GET /_nodes/stats/indices/{flags}/{fields}
NodesStatsJvm GET /_nodes/stats/jvm
NodesStatsNetwork GET /_nodes/stats/network
NodesStatsOs GET /_nodes/stats/os
NodesStatsProcess GET /_nodes/stats/process
NodesStatsThread_pool GET /_nodes/stats/thread_pool
NodesStatsTransport GET /_nodes/stats/transport
OpenIndex POST /_open
OpenIndex POST /{index}/_open
Optimize GET /_optimize
Optimize GET /{index}/_optimize
Optimize POST /_optimize
Optimize POST /{index}/_optimize
PendingClusterTasks GET /_cluster/pending_tasks
Percolate GET /{index}/{type}/_percolate
Percolate GET /{index}/{type}/{id}/_percolate
Percolate POST /{index}/{type}/_percolate
Percolate POST /{index}/{type}/{id}/_percolate
PercolateCount GET /{index}/{type}/_percolate/count
PercolateCount GET /{index}/{type}/{id}/_percolate/count
PercolateCount POST /{index}/{type}/_percolate/count
PercolateCount POST /{index}/{type}/{id}/_percolate/count
PutIndexTemplate POST /_template/{name}
PutIndexTemplate PUT /_template/{name}
PutMapping POST /{index}/_mapping
PutMapping POST /{index}/{type}/_mapping
PutMapping PUT /{index}/_mapping
PutMapping PUT /{index}/{type}/_mapping
PutWarmer PUT /{index}/_warmer/{name}
PutWarmer PUT /{index}/{type}/_warmer/{name}
Refresh GET /_refresh
Refresh GET /{index}/_refresh
Refresh POST /_refresh
Refresh POST /{index}/_refresh
Search GET /_search
Search GET /{index}/_search
Search GET /{index}/{type}/_search
Search POST /_search
Search POST /{index}/_search
Search POST /{index}/{type}/_search
SearchScroll GET /_search/scroll
SearchScroll GET /_search/scroll/{scroll_id}
SearchScroll POST /_search/scroll
SearchScroll POST /_search/scroll/{scroll_id}
Shards GET /_cat/shards
Suggest GET /_suggest
Suggest GET /{index}/_suggest
Suggest POST /_suggest
Suggest POST /{index}/_suggest
TermVector GET /{index}/{type}/{id}/_termvector
TermVector POST /{index}/{type}/{id}/_termvector
TypesExists HEAD /{index}/{type}
Update POST /{index}/{type}/{id}/_update
UpdateSettings PUT /_settings
UpdateSettings PUT /{index}/_settings
ValidateQuery GET /_validate/query
ValidateQuery GET /{index}/_validate/query
ValidateQuery GET /{index}/{type}/_validate/query
ValidateQuery POST /_validate/query
ValidateQuery POST /{index}/_validate/query
ValidateQuery POST /{index}/{type}/_validate/query

References:
https://www.elastic.co/blog/ingesting-and-exploring-scientific-papers-using-elastic-cloud
https://www.elastic.co/guide/en/elasticsearch/plugins/master/using-ingest-attachment.html
https://gist.github.com/karmi/5594127 - old version
https://github.com/rahulsinghai/elasticsearch-ingest-attachment-plugin-example - old version, example with node
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_performing_requests.html - Java REST 5.0, making requests
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/_getting_started.html
https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-docs.html - check whether a bulk API is available for the REST API
https://github.com/mauricio/elasticsearch-with-attachment/blob/master/src/main/java/org/techbot/Sample.java - Java sample for base64 conversion and old-style indexing

http://elasticsearch-cheatsheet.jolicode.com/
http://okfnlabs.org/blog/2013/07/01/elasticsearch-query-tutorial.html
