Cache invalidation using Elasticsearch percolation

This article shows how to use Elasticsearch's percolation feature to invalidate cached query results.

For the website of one of our customers, we make extensive use of Elasticsearch – an open source search server. When a page is requested, the Elasticsearch search index is queried and the retrieved information is used to build up the web page dynamically. Basically we're using Elasticsearch as a database which the web application queries.

Elasticsearch as web application database

Rendering pages this way offers great flexibility. For instance, it enables serving fully customized pages based on who is requesting the page, showing personalised content, related content, etc. However, this flexibility comes at a price: the number of search queries executed to build up the homepage alone is around 37. Especially when performing more complex queries, the overhead of executing them adds up to a significant portion of the time it takes to load the page.

Dynamically build up web page using Elasticsearch

Caching the query results saves you the time it takes to execute the query. This can be implemented quite easily by using a key/value cache store like Memcached, where the key is a unique identifier computed from the query text and the value is the result of executing the query. Having the cache store close to the web server also saves you the latency of sending the search request to the search server and receiving the results back.
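As a sketch, such a key can be derived from the query text using standard tools (the `cache_key` helper name is ours, and we assume coreutils' `md5sum` is available):

```shell
# Derive a Memcached key from the query text.
# Hypothetical helper; assumes coreutils' md5sum.
cache_key() {
  printf '%s' "$1" | md5sum | cut -d' ' -f1   # 32-character hex digest
}

cache_key '"filter":{"term":{"category":"software_engineering"}}'
```

The same query text always produces the same key, so the web application can compute it both when storing and when looking up results.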

Using caching when dynamically building up a web page

The difficulty with caching is cache invalidation: knowing when and which entries to remove in order to prevent serving stale data. In this case, the question of when to remove cached entries is easily answered: whenever a new or updated document is indexed. The question of what to remove is a lot harder to answer. One solution is to simply clear the complete cache. However, until the cache is filled up again, web pages will load very slowly. We must be able to do better than that! So how can we purge only the cached queries relevant to the newly indexed data?

It would probably be very helpful to know which queries would return the newly indexed document. Enter percolation.


Percolation is basically the opposite of searching. Searching can be described as follows: given a query, return all documents matching that query. Percolation is: given a document, return all queries that, when executed, would return that document.

Percolation versus searching in Elasticsearch

So whenever a new or updated document is indexed, the document is percolated, and all matching queries are returned. Now it is straightforward to remove the cached entries for those queries from the cache store: compute the unique identifier from the query text and remove the entry having that identifier from the cache.

Using Elasticsearch to do cache invalidation


In order to illustrate our ideas, we'll use a simple example that you can replicate on your own computer. The only prerequisites are running instances of Elasticsearch (1.0 or later) and Memcached.

First, create a new index named 'library':

              curl -XPUT localhost:9200/library

Next, we'll add some books:

              curl -XPOST localhost:9200/library/software-engineering/book1 -d '{
                "title": "Refactoring",
                "author": "Martin Fowler",
                "category": "software_engineering"
              }'
              curl -XPOST localhost:9200/library/software-engineering/book2 -d '{
                "title": "Object-oriented software and design",
                "author": "Bertrand Meyer",
                "category": "software_engineering"
              }'
              curl -XPOST localhost:9200/library/fiction/book3 -d '{
                "title": "My uncle Oswald",
                "author": "Roald Dahl",
                "category": "fiction"
              }'

A search query for all books in the category 'software_engineering' could look like this:

              curl -XGET localhost:9200/library/_search -d '{
                "filter": {"term": {"category": "software_engineering"}}
              }'

As long as the content doesn't change, the result of the queries can be cached. In order to cache the results we could take an MD5 hash of the search predicates:

              MD5("filter":{"term":{"category":"software_engineering"}}) = "7e7e49fef9a0dbb96b09be84912fb50b"

We can now store the search results under that key in the cache store.

The next time the query is performed, the hash is calculated from the search predicates again, and the cache is checked to see if there already is a response for that query. If so, the response is returned from cache. If not, the query is sent to Elasticsearch and the result is cached using the calculated key.
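This lookup flow can be sketched as follows, using a temporary directory as a stand-in for Memcached and a placeholder for the Elasticsearch call, so the sketch runs without either service (function and variable names are illustrative):

```shell
# Cache-aside lookup: return the cached result if present, otherwise
# "execute" the query and store the result under the computed key.
CACHE_DIR=$(mktemp -d)

cached_search() {
  query="$1"
  key=$(printf '%s' "$query" | md5sum | cut -d' ' -f1)
  if [ -f "$CACHE_DIR/$key" ]; then
    cat "$CACHE_DIR/$key"                           # cache hit: skip Elasticsearch
  else
    # Stand-in for: curl -XGET localhost:9200/library/_search -d "$query"
    result="results-for:$query"
    printf '%s' "$result" > "$CACHE_DIR/$key"       # cache miss: store under the key
    printf '%s' "$result"
  fi
}
```

The first call for a given query takes the miss branch; every subsequent call for the same query is served from the cache until the entry is invalidated.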

Cache invalidation

The query itself must be indexed in order for Elasticsearch to return all queries that match a certain document. Elasticsearch provides a reserved index type named .percolator for this purpose. A query is just a JSON document, and it can be indexed as follows:

              curl -XPUT localhost:9200/library/.percolator/7e7e49fef9a0dbb96b09be84912fb50b -d '{
                "query": {"filtered": {
                  "filter": {"term": {"category": "software_engineering"}}
                }}
              }'

Note that the original filter must be wrapped inside a query so that it can be indexed.

Let's see what happens when we percolate a new document now that the query has been added to the percolator:

              curl -XGET localhost:9200/library/books/_percolate -d '{
                "doc": {
                  "title": "Erlang Programming",
                  "author": "Francesco Cesarini &amp; Simon Thompson",
                  "category": "software_engineering"
                }
              }'

The response of that request is:

              {
                "took": 2, "_shards": {"total": 5, "successful": 5, "failed": 0}, "total": 1,
                "matches": [
                  {"_index": "library", "_id": "7e7e49fef9a0dbb96b09be84912fb50b"}
                ]
              }

In this example 7e7e49fef9a0dbb96b09be84912fb50b is the key to remove from the cache store. The next time the query is executed a cache miss will occur and the query will be sent to Elasticsearch. The updated search results will be cached again under the same key.
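Tying it together, the invalidation step can be sketched as follows: extract the matching ids from the percolate response and delete the corresponding cache entries. The response is inlined here for illustration; grep and sed stand in for a real JSON parser such as jq, and the Memcached delete is shown as a comment since it needs a running instance:

```shell
# Percolate response (inlined; in practice this comes from the _percolate call).
RESPONSE='{"took":2,"total":1,"matches":[{"_index":"library","_id":"7e7e49fef9a0dbb96b09be84912fb50b"}]}'

# Extract every "_id" value from the matches array.
ids=$(printf '%s' "$RESPONSE" | grep -o '"_id":"[^"]*"' | sed 's/"_id":"\(.*\)"/\1/')

for id in $ids; do
  echo "invalidating cache key $id"
  # printf 'delete %s\r\nquit\r\n' "$id" | nc localhost 11211   # Memcached text protocol
done
```

Each extracted id is exactly the MD5-based cache key computed when the query result was stored, so no extra bookkeeping is needed to map queries to cache entries.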

The _percolate request shown above creates a temporary index in memory just for the document that is being percolated. Next, all queries registered in the percolator index are executed one by one to see if they would return the document.


Using a search index for building up web pages dynamically is ideal for serving customized pages to users. However, this flexibility comes at a price: increased page response times due to search queries being executed to load the page content. The page response times can be significantly reduced by using a typical key/value cache store such as Memcached close to the web server to cache the search results.

Using a simple example, we have demonstrated how Elasticsearch's percolation feature can be used to determine which cached entries should be removed, preventing stale data from being served.