ElasticSearch: Advanced Tips & Tricks

06 May 2014

The fun with ElasticSearch never ends. Since our company implemented ElasticSearch around a year ago, I've learned a lot about it (see my previous posts on a non-tech introduction to ElasticSearch, as well as a case study on using different index-time vs. query-time synonyms). Recently, I attended one of ElasticSearch's "Core" training sessions, and I wanted to share some of the more interesting tips that were mentioned. This is again a more "tech-heavy" post, one which demonstrates some of the product-oriented (as opposed to infrastructure-oriented) strengths of ElasticSearch.

(Since our company is a deal aggregator, bringing together deals from sources like LivingSocial, AmazonLocal, and Groupon, the basic "document" that I will assume we are indexing is a deal, which can have fields like a title, price, description, etc.)

Setting null values when indexing helps avoid problems when scripting

We have a custom scoring script for our queries, which allows us to emphasize results that we think would be more relevant for users. For example, we might want to give deals from businesses with higher Yelp ratings a higher score.

The default custom scoring language is MVEL, a somewhat archaic language that is, to put it lightly, neither the friendliest language nor the best-documented one (here is ElasticSearch's scripting documentation, and here is the documentation for MVEL. Good luck!). One annoyance we had was dealing with documents that did not have a value for a field used in our custom scoring script. For example, some deals in our index do not have a Yelp rating, perhaps because the business is not rated. In these cases, we originally indexed the deal document without a Yelp rating - easy, right?

We ended up contorting our script to accommodate these cases. For basically any field used in the script that might have been null, we did something like:

// Fall back to a rating of 0 when the document has no yelp_rating value
effective_yelp = 0;
if (!doc['yelp_rating'].empty) {
    effective_yelp = (Double) doc['yelp_rating'].value;
}

Yikes. It turns out that the easier solution would've been to just set a "null_value" in our mapping for any field like this, e.g.:

{
    "deal": {
        "properties": {
            "price": {
                "type": "string",
                "store": true,
                "null_value": "na"
            },
            ...
        }
    }
}

This is a relatively straightforward way to avoid the messy scripting, and in hindsight it should've been much more obvious to us!
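
As a rough sketch of how this would apply to the rating example (our real mapping differs, the default of 0 is just an assumption, and note that a mapping's null_value only kicks in when the field is indexed with an explicit null), the yelp_rating field could be mapped like this:

{
    "deal": {
        "properties": {
            "yelp_rating": {
                "type": "float",
                "null_value": 0
            }
        }
    }
}

With that in place, the script could read doc['yelp_rating'].value directly instead of checking for emptiness first.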

Indexing the same text fields in multiple ways to achieve different query results

Let's say that we have some deals that have a title mentioning the business "Healthy Yoga" by name. Normally, we might process this business's name through analyzers that stem the words, storing that phrase as "health yoga".

But this poses an interesting UX problem: if a user types in the query "Healthy Yoga", we may want to return deals from this business, but if that particular business didn't exist, we'd want to return deals with the words "health" and "yoga" somewhere in the description. How can we do this?

One trick suggested at training was to index the title of the deal twice, using a feature that used to be called multi-fields. One of these fields would hold the title in a relatively less-processed form (e.g. just lower-casing the words), while the other would hold the normal, more highly-processed form (e.g. with stemmers and synonyms).
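
As a sketch of what that mapping might look like (the analyzer and sub-field names are made up, and the built-in english analyzer stands in for our stemming/synonym chain), the index could be created with something like:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "just_lowercase": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "deal": {
            "properties": {
                "title": {
                    "type": "string",
                    "analyzer": "english",
                    "fields": {
                        "exactish": {
                            "type": "string",
                            "analyzer": "just_lowercase"
                        }
                    }
                }
            }
        }
    }
}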

You could apply a shingle token filter and a boost factor to the less-processed field, massively boosting the correct result for the query "Healthy Yoga". Then you can add some application-side logic to, for example, hide lower-scored hits (i.e. ones where the shingled, exact phrase did not appear) when a highly-scored hit is available.
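
Continuing the sketch above, the just_lowercase analyzer would get the built-in shingle token filter added after lowercase, and the search could then boost the less-processed sub-field, e.g.:

{
    "query": {
        "multi_match": {
            "query": "Healthy Yoga",
            "fields": ["title", "title.exactish^5"]
        }
    }
}

A document whose shingled title contains the exact phrase "healthy yoga" will then score far above documents that only match the stemmed terms, which is the gap the application-side logic can key off of.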

Filtered Aliases

For complex queries, filters in the query JSON can get fairly messy. It turns out, though, that you can create filtered aliases. These work like regular aliases on an index, but restrict it based on the filter without you having to include that filter in the query.

The documentation has a good example of using this. With a filtered alias in place, you then send the query to the filtered alias name instead of to the underlying index name. I'm not sure yet if you can stack aliases together dynamically.
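
For instance, here is a sketch of adding a filtered alias over a hypothetical deals index via the _aliases endpoint (the index, alias, and field names are all made up):

POST /_aliases
{
    "actions": [
        {
            "add": {
                "index": "deals",
                "alias": "active_deals",
                "filter": { "term": { "status": "active" } }
            }
        }
    ]
}

Searches sent to /active_deals/_search would then only ever see documents matching that term filter, with no filter clause needed in the query body.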

Some smaller hints

Don't offer users a "last page" of search results. As a coder who cares about the UX, you might think that users want to be able to jump to the last page of search results. It turns out that calculating this last page for larger result sets is extremely expensive for ElasticSearch, much more expensive than returning an earlier page of hits: to serve a deep page, every shard has to collect and sort all of the hits that come before it. The ElasticSearch reps recommended removing the "last page" feature if at all possible, and instead offering reverse sorting of results.
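
A sketch of the suggested alternative (the sort field here is hypothetical): instead of letting users page to the end, flip the sort order so the "end" becomes the first page:

{
    "query": { "match": { "title": "yoga" } },
    "sort": [
        { "created_at": { "order": "desc" } }
    ]
}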

Consider using "match" queries instead of "query_string". The query_string search uses a query parser, which allows users to do "x AND y OR z". You might not need this feature; if that's true, the reps suggested that "match" queries might be more efficient.
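
For example, if your search box just takes free text, a plain match query (field name assumed) covers it without exposing the query-parser syntax at all:

{
    "query": {
        "match": {
            "title": "healthy yoga"
        }
    }
}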

Use bool filters for all filters not involving geo calculations or custom scripts. This has to do with filter caching: boolean filters work on cached bitsets, which lets ElasticSearch filter documents much more efficiently. Filters involving geo-points and custom scripts, however, cannot be cached, so those should still use and/or/not filters. You can combine boolean filters and and/or/not filters, and in those cases, you should invoke the and/or/not filters after the boolean ones.
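
Here is a sketch of that combination (field names and values are invented): the cacheable term and range filters sit inside a bool filter, and the non-cacheable geo filter runs afterwards inside an and filter:

{
    "query": {
        "filtered": {
            "query": { "match": { "title": "yoga" } },
            "filter": {
                "and": [
                    {
                        "bool": {
                            "must": [
                                { "term": { "source": "groupon" } },
                                { "range": { "yelp_rating": { "gte": 4 } } }
                            ]
                        }
                    },
                    {
                        "geo_distance": {
                            "distance": "10km",
                            "location": { "lat": 40.73, "lon": -73.99 }
                        }
                    }
                ]
            }
        }
    }
}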

Use bounding boxes instead of radius filters for geo-points. Determining if a point is inside a radius is more expensive than asking if a point is inside a geo-bounding-box. Intuitively this makes sense: when using a box, you can just check if the lat/long values are within the ranges of the box's borders, whereas a radius filter involves a calculation of distance (and between two points on a sphere, no less).
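
The sketch above used a geo_distance filter; the cheaper bounding-box version would look like this (field name and coordinates are again invented):

{
    "query": {
        "filtered": {
            "query": { "match_all": {} },
            "filter": {
                "geo_bounding_box": {
                    "location": {
                        "top_left": { "lat": 40.80, "lon": -74.05 },
                        "bottom_right": { "lat": 40.68, "lon": -73.90 }
                    }
                }
            }
        }
    }
}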

These were some of the tips offered by actual developers at ElasticSearch during the training; hopefully they will help you as you work with ElasticSearch!

