Four ways to index relational data in Elasticsearch

Updated on
Please note that the information in this blogpost is outdated. Consider building a feeder to push your data instead. See how we keep our Elasticsearch index updated with data from Microsoft SQL Server.

We will explain how to make relational databases searchable using a search index. We use four different cases to show how the indexing strategy depends on the data model.

When an application requires advanced search, for example faceted search or full text search, a relational database alone will not suffice. That's when a search index will come in handy.

This article is useful for people who want to create a search index for (part of) their relational data. We explain how to use the JDBC river plugin with Elasticsearch to index any data that can be retrieved through a JDBC connection.

Elasticsearch and the JDBC river

At Voormedia we’re all a big fan of Elasticsearch. We like it’s simplicity, scalability, speed and the easy to use REST API. In this article we will use Elasticsearch together with the JDBC river plugin to index and synchronize data from a relational database.

An Elasticsearch river represents a dataflow between an external datasource and the Elasticsearch index. Rivers are put into their own _river index, which is shown below. The following command shows how to create a river that automatically updates the index every day at 1 AM:

curl -XPUT localhost:9200/_river/product/_meta -d '
  {
    "type": "jdbc",
    "jdbc": {
      "url": "jdbc:sqlserver://db.example.com;databaseName=products",
      "user": "user",
      "password": "pass",
      "sql": "SELECT * FROM products",
      "index": "example",
      "type": "product",
      "schedule": "00 00 01 * * ?"
    }
  }'

Case 1: Index a row as a document

Our example furniture company starts out with a very simple database consisting of only a products table storing the product name and the material used to craft it. The information is used in an application in which users will search for products based on crafting material. We will index this table into one index type:

Products example

The JDBC river query will select all columns from the products table, changing id to _id to match Elasticsearch's document id. This river will map each resulting row to one document. After indexing we can find all wooden products by means of:

Query

curl localhost:9200/example/product/_search -d '
  {
    "filter": { "term": { "material": "wood" } }
  }'

Result

"hits": [ {
  "_id": "0",
  "_source": {
    "name": "chair",
    "material": "wood"
} } ]

Case 2: Indexing multiple rows in one document

The furniture company decided to start producing doors as well. The doors are made of both wood and glass. Their simple database only allowed storing one material per product so it had to be redesigned. The table materials was introduced in order to store material as a repeating attribute for each product:

Products and materials

There are multiple ways to store this information in Elasticsearch. One option is to create an index type for both products and materials, but the big drawback of this is that we will need multiple queries to find products with their materials. Therefore we will store the material within the product. We join the two tables, creating the following view:

View View

The JDBC river plugin will group all the values for a column into an array per product id. We again search for wooden products:

Query

curl localhost:9200/example/product/_search -d '
  {
    "filter": { "term": { "material": "wood" } }
  }'

Result

"hits": [ {
  "_id": "0",
  "_source": {
    "name": "chair",
    "material": "wood"
  }
}, {
  "_id": "2",
  "_source": {
    "name": "door",
    "material": ["glass", "wood"]
  }
} ]

Case 3: Indexing a set of columns as nested object

The company decided that users should also be able to see the size of their furniture. Each product will have a width, height and depth and the application will allow users to set these values in their search query.

Products and sizes

We extend our database model and add a size table which has a 1-to-1 relation with product. Using structured objects we can add the values of this table as an object to our product. Prefixing the column names as done below provides us with a neatly structured document.

View

Let's find all the products that are higher than 200:

Query

curl localhost:9200/example/product/_search -d '
  {
    "filter" : { "numeric_range" : { "size.height" : { "gt": 200 } } }
  }'

Result

"hits" : [ {
  "_id" : "2",
  "_source" : {
    "name":"door",
    "size":{
      "width": 80,
      "height": 220,
      "depth": 10
      }
    }
  } ]
}

The problem with this approach is that it will only work if the products table has a 1-to-1 relationship with the sizes table. To illustrate the problem in a 1-to-n situation, lets assume the chair is available in two different sizes. Part of the resulting query will be:

View of multiple sizes

and if we were to index this information using the same river it will result in the following document:

Query

curl localhost:9200/example/product/0

Result

"_id": "0",
"_source": {
  "name": "chair",
  "size": {
    "width": 50,
    "height": [
      50,
      80
    ],
    "depth": 50
  }
}

Different values with the same name are combined into an array, which does not make sense in this case. Take a look at the next scenario for a solution to this problem.

Case 4: Indexing multiple nested objects per document

The furniture company now wants to be able to make products in different sizes. We will use the sizes table, but now in a 1-to-n relation to products.

Our goal is to store all sizes in a separate object array nested in the product document. This way we won't lose the original combination of width, height and depth as was the case in the previous example. We create the following view, using square brackets to group columns into a nested object:

Using square brackets to group columns

Let's search for products containing chair that are smaller than 70:

Query

curl localhost:9200/example/product/_search -d '
  { "query": { "filtered": {
      "query":  { "match": { "name": "chair" } },
      "filter": { "numeric_range": { "size.height": { "lt": 70 } } } } }
  }'

Result

"hits" : [ {
  "_id" : "0",
  "_source" : {
    "product": "chair",
    "size":[ {
        "width": 50,
        "height": 50,
        "depth": 50
      },{
        "width": 75,
        "height": 75,
        "depth": 75
      }
    ]
  }
} ]

The chair comes up as result, containing both sizes as objects in an array.

Summary

Indexing relational data into a search index is a great way to make it easily searchable. We've demonstrated four different ways to index relational data into Elasticsearch and provided clear examples to easily match with real scenarios. We explained how to:

  • directly map a table row to a document;
  • combine attributes from multiple rows into one document;
  • index a set of columns as nested object in a document;
  • index multiple nested objects per document;

Hopefully this article helps you with indexing your relational data in a way that suits your search needs!