Elasticsearch is a highly scalable search engine that stores data in a structure optimized for language-based searches, and it is a whole lot of fun to work with. In this 101 I'll give you a hands-on introduction to Elasticsearch and a glimpse at some of the key concepts.
- distributed: clustering, replication, failover, and master election out of the box
- schema-less: document based, automated type mapping, JSON
- RESTful API
- highly configurable
- sane defaults
- data exploration
- based on Lucene
- runs on the JVM
Elasticsearch is a standalone Java application, so getting up and running is a piece of cake. Make sure you have Java ≥ 1.6.0 and that no one else is running Elasticsearch in your network, because nodes discover each other via multicast by default and your node might otherwise join a foreign cluster. You can either download a packaged distribution from elasticsearch.org and unpack it, or get the latest source from GitHub.
Start a node in foreground with
$ bin/elasticsearch -f
You should see something like this
You can see in the output that Elasticsearch started, that it assigned the node a random name, and that the node formed a cluster and elected itself as master. The node is listening on HTTP port 9200 (the default). We use this port to interact with the cluster. Now you can run
$ curl localhost:9200
and you get some data about your node
Elasticsearch has a JSON-based REST API. Administrative operations, indexing, and searching: everything is done with HTTP and JSON. I use cURL for concise examples, but it may be more convenient for you to use a graphical HTTP client. There are also a number of browser extensions and plugins. If you use Google Chrome, I recommend the Sense plugin, a JSON-aware developer console for Elasticsearch.
Let's start up another node and pass it the name NODE_2 as a parameter
$ bin/elasticsearch -f -Des.node.name=NODE_2
You can see that NODE_2 started. The node detected the other node as master and joined the cluster. The new node publishes HTTP on port 9201. You can talk to 9201 as well, as every node behaves the same. Any node starting up will join this cluster if, and only if, it shares the cluster name. Since we haven't defined a cluster name, the default name "elasticsearch" is in effect.
You now have an Elasticsearch cluster running with two nodes.
Install Head plugin
This is an optional step. The Head plugin or elasticsearch-head is a web front end for browsing and interacting with an Elasticsearch cluster. For more details on available plugins refer to the plugin guide.
To install it run
$ bin/plugin -install mobz/elasticsearch-head
Then open http://localhost:9200/_plugin/head/ in your browser
A search engine is something like a database that differs in how the data is stored. Its structure maps loosely onto that of a conventional relational database like MySQL or Postgres:
- Elasticsearch – Database server (e.g. MySQL)
- Index – Database
- Type – Table
- Document – Row
- Field – Column
Elasticsearch indexes everything in an inverted index (http://en.wikipedia.org/wiki/Inverted_index). This data structure allows it to quickly find all the documents that contain a particular word, much like the index at the back of a book.
Let's index something. We send off an HTTP PUT with a URL made up of the index name, type name, and ID, and in the HTTP payload we supply a JSON document with the fields and values. Notice that the field author has another JSON object nested inside it.
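A request of that shape might look like the following; the field values are illustrative assumptions, while the index documents, type blog, and ID one match the examples later in this post:

```shell
# Index (create or overwrite) a document of type "blog" with ID "one"
# in the index "documents". The field values are made up for illustration.
curl -XPUT "http://localhost:9200/documents/blog/one" -d '{
  "title": "Getting started with Elasticsearch",
  "language": "english",
  "word_count": 1250,
  "author": {
    "name": "Pip the Troll"
  }
}'
```

Note the author field holding a nested JSON object.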
Indexing in Elasticsearch covers both create and update from CRUD: if we index a document with an ID that already exists, it is overwritten. Index and type are required, while the ID is optional; if we don't specify one, Elasticsearch generates an ID for us.
And we get a response that verifies that the operation was successful
We get the name of the index, a type, and the ID. We also get a version, which is not a historical version. The versioning is used for optimistic concurrency control and is always incremented with any changes. The data we have supplied was all Elasticsearch needed. Elasticsearch automatically created the index for us.
Elasticsearch is schema-less only in the sense that you don't have to define a schema up front. Internally it uses mappings, which are essentially a schema, but one that is derived automatically, which makes working with it much easier.
To see the mapping of our indexed blog
$ curl localhost:9200/documents/_mapping
Here you get the mapping for the index documents. There is the type blog and a list of all properties from our blog document. Elasticsearch infers the data types automatically. If it doesn't infer the right type for a field, you can supply a mapping when you create the index.
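As a sketch of what that looks like (the field names are assumptions carried over from the indexing example), supplying an explicit mapping at index creation time could be done like this:

```shell
# Delete and recreate the index with an explicit mapping for type "blog".
# Here we pin "title" to a string and "word_count" to an integer,
# instead of letting Elasticsearch infer the types.
curl -XDELETE "http://localhost:9200/documents"
curl -XPUT "http://localhost:9200/documents" -d '{
  "mappings": {
    "blog": {
      "properties": {
        "title":      { "type": "string" },
        "word_count": { "type": "integer" },
        "author": {
          "properties": {
            "name": { "type": "string" }
          }
        }
      }
    }
  }
}'
```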
To retrieve a document, we send off an HTTP GET with the index name, type, and ID.
$ curl -XGET "http://localhost:9200/documents/blog/one"
The response looks like this
We get metadata information and the source field containing the JSON that we have indexed.
A document needs to be indexed before you can search for it. Elasticsearch refreshes every second by default.
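If you need a document to be searchable immediately, for example in a test, you can trigger a refresh by hand:

```shell
# Force a refresh so that all recently indexed documents
# become searchable right away instead of within the next second
curl -XPOST "http://localhost:9200/documents/_refresh"
```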
Let’s search for our data set
$ curl -XGET "http://localhost:9200/documents/blog/_search?q=_id:one"
This is a query on the ID one. It uses the search facilities, but since we are looking up an ID, it can only match one document or none.
Lucene under the hood
Elasticsearch is built on top of Lucene, a mature Java library that is proven, tested, and the best of its kind in open source search software. Everything related to indexing and searching text is implemented in Lucene; Elasticsearch builds an infrastructure around it. While Lucene is a great tool, it can be cumbersome to use directly, and it doesn't provide any facilities to scale past a single node. Elasticsearch provides an easier, more intuitive API plus the infrastructure and operational tools for simple scaling across multiple nodes. The REST API also allows interoperability with non-Java languages.
A shard in Elasticsearch is a Lucene index. By default Elasticsearch uses five shards for each Elasticsearch index, and a document is stored in exactly one of them. Elasticsearch also supports replica shards; one replica per shard is configured by default.
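Both numbers can be set per index at creation time. A minimal sketch (the index name is just an example):

```shell
# Create an index with 3 primary shards and 2 replicas per shard,
# instead of the defaults of 5 and 1
curl -XPUT "http://localhost:9200/myindex" -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}'
```

Note that the number of shards cannot be changed after the index has been created, while the number of replicas can be updated at any time.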
Let's index another document
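Continuing with the assumed fields from the first example, the second document might look like this:

```shell
# Index a second blog document under the ID "two";
# the field values are again made up for illustration
curl -XPUT "http://localhost:9200/documents/blog/two" -d '{
  "title": "Why english is everywhere",
  "language": "english",
  "word_count": 750,
  "author": {
    "name": "Jane Doe"
  }
}'
```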
Now let's search for the term english
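Using the query-string search as before, the request could look like:

```shell
# Search across all fields of type "blog" for the term "english"
curl -XGET "http://localhost:9200/documents/blog/_search?q=english"
```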
we get two hits
We can also search on a specific field. Nested fields are addressed with a dot separator, as in the next example.
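A sketch of the request the text describes, a prefix query on the nested author.name field (the prefix value "pip" is an assumption based on the expected hit):

```shell
# Prefix query: match documents whose author.name starts with "pip"
curl -XGET "http://localhost:9200/documents/blog/_search" -d '{
  "query": {
    "prefix": {
      "author.name": "pip"
    }
  }
}'
```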
We get a result that matches the author name Pip the Troll
In the last example we used a prefix query. For more types of queries, refer to the elasticsearch.org guide. We did not provide the full value of the field, and that is the gist of a search engine: we don't need to know exactly what we are searching for. We can provide what we know and get results on what might be true, as opposed to only what must be true. We can find word stems, synonyms, and misspellings, and we can even provide autocompletion.
You can get some information about your cluster with
$ curl -XGET "http://localhost:9200/_cluster/health"
Alternatively, if you have installed the Head plugin, you can open http://localhost:9201/_plugin/head/. We have a status, which is green. We have 5 active primary shards and 10 active shards in total, because with two nodes each of the 5 primaries has its replica assigned. Our indexed documents are available on each node.
Let's shut down our master. Go to the console of your master node, press CTRL-C, and then look at the console output of NODE_2.
You can see that NODE_2 noticed that the master has left the cluster and elected itself as new master. Check the status again
$ curl -XGET "http://localhost:9201/_cluster/health"
Make sure to use port 9201 of NODE_2 not 9200.
You can see that the cluster status is now yellow, because the replica shards are now unassigned: there is no second node left to hold them. The search functionality of the cluster still works, though. If you send off our earlier search requests again, this time against port 9201 of NODE_2, you still get the search results, because everything we have indexed is available on the remaining node.
If you start the previous master back up and check the cluster status, it will be back to green.
At some point you will be interested in information about the data you have indexed. In our case we might want to know the average number of words across all indexed blog articles. Elasticsearch has a feature called facets that provides aggregated statistics about a query. This is a core part of Elasticsearch and belongs to the search API. Facets are always bound to a query and provide aggregate statistics alongside the query results. They are highly configurable and can return complex groupings of nested filters, spans of amounts or spans of time; even full Elasticsearch queries can be nested inside a facet. In Elasticsearch 1.0 this feature will be called aggregations and is supposed to have more features and be more composable.
Here we define a facet together with a match_all query. The facet is a predefined statistical facet for number fields, in this case our word_count field.
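Such a request might look like this; the facet name word_count_stats is an arbitrary label of our choosing:

```shell
# Match all documents and compute statistics (count, min, max, mean, ...)
# over the numeric field "word_count"
curl -XGET "http://localhost:9200/documents/blog/_search" -d '{
  "query": { "match_all": {} },
  "facets": {
    "word_count_stats": {
      "statistical": { "field": "word_count" }
    }
  }
}'
```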
In the response we first receive the query results, then the facet response with statistical numbers on our word_count field. There are further predefined facets you can choose from to get statistical information about your data, and there is also Kibana, a graphical front-end data analysis tool tailored specifically to Elasticsearch, which has become very popular.
No, of course that's not it. It's just all I wanted to show you in this 101. I've introduced you to quite a few topics, but we have barely scratched the surface of what there is to discover about Elasticsearch.
The query API has much more to offer than we have covered, for instance. There are many interesting types of queries and filters that can be used. To get the most out of natural-language searches and other complex types of data, you will get in touch with analyzers. Analyzers are the tools that slice and dice words into stems to create an efficient search space for natural languages; stemming is what allows Elasticsearch to find linguistically similar words. Percolators are another very interesting topic: they let you index queries and then send documents to Elasticsearch to find out which queries match each document. The whole operation is turned around, running in the reverse direction. And there is even more.
I hope you found this post interesting and useful on your quest to discover this awesome piece of technology. Thank you for reading, and stay in the loop for more posts to come on Elasticsearch. In the meantime I've put together a list of links; you can find them below. There is also a great interview with GitHub about Elasticsearch at scale.
Where to go from here
To learn more about Lucene go to the Lucene documentation or visit the Lucene wiki.
Interview with GitHub about Elasticsearch at scale