Dev Dive: Index-time vs. Query-time ElasticSearch Synonyms

08 April 2014

(I gave a general, "tech-light" overview of ElasticSearch here. In this post, I want to give a much more technical example of a problem that I particularly enjoyed solving while working with ElasticSearch.)

Around a year ago, our startup poured a lot of energy into creating a strong internal search engine - since we were a deal aggregator, we wanted our users to be able to search our deal database to find exactly the deal they wanted (or be assured that the deal they were looking for didn't exist). This simple idea, however, led to many interesting challenges with ElasticSearch that stemmed from various UX needs. This post is one example, a case study where we had to use different synonyms at indexing time vs. at querying time.

The Situation: We have lots of local daily deals in our database (from sources like LivingSocial, AmazonLocal, and Groupon), and we learned through our internal logging that the queries "restaurants" and "food" were some of the most common ones from users. Do these words mean the same thing? Based on our content, we decided that they did not. For example, we have deals that are for food purchased at a store (e.g. "$10 off of $20 of Food at Market X") - these would be appropriate for a "food" query, but not necessarily for "restaurant". However, deals at restaurants (e.g. "$20 off of $40 at Pete's Italian Restaurant") should be a reasonable result for both of the example queries. Essentially, we wanted restaurant deals to return for the query "food", but we did not want all food deals to return for the query "restaurant".

So, how do we create the desired behavior? The answer seems to involve synonyms. Synonyms are implemented as a token filter inside an analyzer: the filter processes the tokens produced from text being indexed or from a query, and it allows you to "equate" words (e.g. we could declare that "restaurant" and "food" mean the same thing). Here is a simple implementation of synonyms for our example:

{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "tokenizer" : "whitespace",
          "filter" : ["standard", "asciifolding", "lowercase", "kstem", "custom_synonym"]
        }
      },
      "filter" : {
        "custom_synonym" : {
          "type" : "synonym",
          "synonyms" : ["restaurant, food"]
        }
      }
    }
  }
}

Note that above, we equate 'restaurant' (singular) with 'food' because the synonyms filter comes after the kstem filter, so by the time incoming tokens get processed by the synonyms filter, the word 'restaurants' will have already been stemmed down to 'restaurant'.
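To make the ordering point concrete, here is a minimal sketch in plain Python. It is a toy model, not ElasticSearch code: `naive_stem` is a hypothetical stand-in for kstem that only strips a trailing "s", and the `SYNONYMS` table simply mirrors the "restaurant, food" rule above.

```python
# Toy model of filter ordering. Nothing here calls ElasticSearch.

def naive_stem(token):
    # Hypothetical stand-in for kstem: strip one trailing "s".
    # The real kstem filter is far more careful than this.
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

# Mirrors the rule "restaurant, food": each word expands to both words.
SYNONYMS = {
    "restaurant": ["restaurant", "food"],
    "food": ["restaurant", "food"],
}

def synonym_filter(token):
    # Tokens without a rule pass through unchanged.
    return SYNONYMS.get(token, [token])

def stem_then_synonym(token):
    # Same order as my_analyzer: lowercase, stem, then synonyms.
    return synonym_filter(naive_stem(token.lower()))

def synonym_then_stem(token):
    # Reversed order: the plural never matches the singular rule.
    return [naive_stem(t) for t in synonym_filter(token.lower())]

print(stem_then_synonym("Restaurants"))  # ['restaurant', 'food']
print(synonym_then_stem("Restaurants"))  # ['restaurant']
```

With the stemmer first, the plural is reduced to "restaurant" before the synonym lookup, so the rule only has to be written once, in its stemmed form.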

This does not solve our UX problem, however, because this will treat "food" and "restaurants" equally. We can check that this is so by using the Analyze API, which shows us how ElasticSearch will treat incoming text:

GET _analyze/?analyzer=my_analyzer&text=food

# Result:
{
  "tokens": [
    {
      "token": "restaurant",
      "start_offset": 0,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "food",
      "start_offset": 0,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 1
    }
  ]
}

GET _analyze/?analyzer=my_analyzer&text=restaurant

# Result:
{
   "tokens": [
      {
         "token": "restaurant",
         "start_offset": 0,
         "end_offset": 10,
         "type": "SYNONYM",
         "position": 1
      },
      {
         "token": "food",
         "start_offset": 0,
         "end_offset": 10,
         "type": "SYNONYM",
         "position": 1
      }
   ]
}

There would be no difference between the queries "restaurant" and "food"!

Another idea is to leverage ElasticSearch's ability to create one-directional synonyms (aka "explicit mappings"). In the index settings, we could define the synonyms to be "restaurant => food", which suggests that the word "restaurant" means "food" but not vice versa.

Turns out that this will not work either. The problem here is that the synonym is being applied at index time. This means that if we point "restaurant" to "food", any instance of the word "restaurant" that we put into the index will essentially be replaced by the word "food". Here is what the Analyze API looks like now:

GET _analyze/?analyzer=my_analyzer&text=food

# Result:
{
  "tokens": [
    {
      "token": "food",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}

GET _analyze/?analyzer=my_analyzer&text=restaurant

# Result:
{
  "tokens": [
    {
      "token": "food",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 1
    }
  ]
}
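The difference between the two rule styles tried so far can be sketched in a few lines of Python. This is only a toy model of how the rules rewrite a single token, assuming the filter's default behavior of expanding comma-separated rules to every word in the group; `parse_rule` and `analyze` are hypothetical helper names, not part of any ElasticSearch client.

```python
# Toy model of the two synonym rule styles. Nothing here calls ElasticSearch.

def parse_rule(rule):
    """Map each input token to the list of tokens it is rewritten to."""
    if "=>" in rule:
        # Explicit mapping: the left side is replaced by the right side.
        left, right = rule.split("=>")
        inputs = [t.strip() for t in left.split(",")]
        outputs = [t.strip() for t in right.split(",")]
        return {token: outputs for token in inputs}
    # Equivalent synonyms: every word expands to the whole group.
    group = [t.strip() for t in rule.split(",")]
    return {token: group for token in group}

def analyze(token, rule):
    # Tokens that the rule does not mention pass through unchanged.
    return parse_rule(rule).get(token, [token])

for rule in ("restaurant, food", "restaurant => food"):
    print(rule, "|", analyze("restaurant", rule), analyze("food", rule))
# restaurant, food | ['restaurant', 'food'] ['restaurant', 'food']
# restaurant => food | ['food'] ['food']
```

In both cases the two words end up with identical token lists, which is why neither rule can tell a "restaurant" query apart from a "food" query when the same analyzer is used for indexing and searching.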

To further illustrate the problem, we can put a couple of documents into the index (assuming that we process the "description" field with the my_analyzer from above)...

POST deal/1
{
  "description": "This is a great deal for food from the grocery store."
}

POST deal/2
{
  "description": "This is an awesome deal for a restaurant."
}

... and then we can see that a query for "restaurant" is actually acting like a query just for the word "food":

POST _search/
{
  "query": {
    "match": {
      "description": "restaurant"
    }
  }
}

# Result:
{
  ...
  "hits": {
    "total": 1,
    "max_score": 0.11506981,
    "hits": [
      {
        "_index": "simple_index",
        "_type": "deal",
        "_id": "1",
        "_score": 0.11506981,
        "_source": {
          "description": "This is a great deal for food from the grocery store."
        }
      }
    ]
  }
}

The solution requires us to be more nuanced. We want to maintain a distinction between "restaurant" and "food" in the index, but we want to muddle this distinction when processing a user's query. The way to do this is to differentiate between index-time synonyms and query-time synonyms.

Specifically, here we want to say that there is no link between "restaurants" and "food" at index time - restaurant deals are restaurant deals, and food deals are food deals. But, at query time, we want a query for "food" to return all "food" deals and all "restaurant" deals. Perhaps unintuitively, this means we want to create the synonym "food => food, restaurant". And this finally will give us our desired result.

Here's what the new settings and mapping look like:

{
  "settings": {
    "index": {
      "analysis" : {
        "analyzer" : {
          "my_search_analyzer" : {
            "tokenizer" : "whitespace",
            "filter" : ["standard", "asciifolding", "lowercase", "kstem", "search_synonym"]
          }
        },
        "filter" : {
          "search_synonym" : {
            "type" : "synonym",
            "synonyms" : ["food => food, restaurant"]
          }
        }
      }
    }
  },
  "mappings": {
    "deal": {
      "properties": {
        "description": {
          "type": "string",
          "index_analyzer": "standard",
          "search_analyzer": "my_search_analyzer"
        }
      }
    }
  }
}

Here's how the analyzer processes the different terms:

GET _analyze/?analyzer=my_search_analyzer&text=restaurant

# Result:
{
  "tokens": [
    {
      "token": "restaurant",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 1
    }
  ]
}

GET _analyze/?analyzer=my_search_analyzer&text=food

# Result:
{
 "tokens": [
    {
      "token": "food",
      "start_offset": 0,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "restaurant",
      "start_offset": 0,
      "end_offset": 4,
      "type": "SYNONYM",
      "position": 1
    }
 ]
}

And finally, we have the desired behavior for both "restaurant" and "food" queries:

POST _search/
{
  "query": {
    "match": {
      "description": "restaurant"
    }
  }
}

# Result:
{
  ...
  "hits": {
    "total": 1,
    "max_score": 0.15342641,
    "hits": [
      {
        "_index": "simple_index_8",
        "_type": "deal",
        "_id": "2",
        "_score": 0.15342641,
        "_source": {
          "description": "This is an awesome deal for a restaurant."
        }
      }
    ]
  }
}

POST _search/
{
  "query": {
    "match": {
      "description": "food"
    }
  }
}

# Result:
{
  ...
  "hits": {
    "total": 2,
    "max_score": 0.04500804,
    "hits": [
      {
        "_index": "simple_index_8",
        "_type": "deal",
        "_id": "2",
        "_score": 0.04500804,
        "_source": {
          "description": "This is an awesome deal for a restaurant."
        }
      },
      {
        "_index": "simple_index_8",
        "_type": "deal",
        "_id": "1",
        "_score": 0.033756033,
        "_source": {
          "description": "This is a great deal for food from the grocery store."
        }
      }
    ]
  }
}
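The whole index-time vs. query-time idea can also be reproduced with a tiny in-memory inverted index. The sketch below is a simplified model under stated assumptions, not what ElasticSearch does internally: `index_analyze` is a rough stand-in for the standard analyzer (lowercase, split on non-alphanumerics), there is no stemming and no scoring, and all function names are hypothetical.

```python
import re

# Toy model of index-time vs. query-time analysis. Nothing here calls ElasticSearch.

DOCS = {
    1: "This is a great deal for food from the grocery store.",
    2: "This is an awesome deal for a restaurant.",
}

# Mirrors the query-time rule "food => food, restaurant".
QUERY_SYNONYMS = {"food": ["food", "restaurant"]}

def index_analyze(text):
    # Rough stand-in for the standard analyzer: no synonyms applied.
    return re.findall(r"[a-z0-9]+", text.lower())

def query_analyze(text):
    # Same tokenization, plus synonym expansion for query terms only.
    tokens = []
    for token in index_analyze(text):
        tokens.extend(QUERY_SYNONYMS.get(token, [token]))
    return tokens

def build_index(docs):
    # Inverted index: token -> set of document ids containing it.
    inverted = {}
    for doc_id, text in docs.items():
        for token in index_analyze(text):
            inverted.setdefault(token, set()).add(doc_id)
    return inverted

def search(inverted, query):
    # A document matches if it contains any of the expanded query tokens.
    hits = set()
    for token in query_analyze(query):
        hits |= inverted.get(token, set())
    return sorted(hits)

INDEX = build_index(DOCS)
print(search(INDEX, "restaurant"))  # [2]
print(search(INDEX, "food"))        # [1, 2]
```

Because the index keeps "restaurant" and "food" as separate terms, the asymmetry lives entirely in how the query is expanded, which matches the hit counts in the search results above.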

This is a pretty good example of the power of ElasticSearch, but also of how the creator of an index has to be careful with the settings in order to create behavior that makes sense from a UX point of view. Of course, in our production database we have many more synonyms, including ones that we do equate with "restaurant" at indexing time (e.g. "dining"), and also synonyms that fulfill a different purpose (e.g. standardizing spellings, like "work out => workout"). But in short, it's pretty awesome how ElasticSearch can handle all of these nuances!

