To understand how to do exact matching in Elasticsearch, it is important to understand how it works internally. Here is what I understood of it.
How does ES work internally?
When we insert data into ES using the PUT or POST APIs, ES creates an inverted index for the inserted data. Any search query that is executed looks up this inverted index, not the stored documents themselves.
Let's take a very simple example to understand it.
Let's say two documents are inserted (using the Sense Chrome extension for ES) with the requests below:
PUT facebook/post/1
{
  "user": "Tywin Lanister",
  "message": "Post1"
}
PUT facebook/post/2
{
  "user": "Tyrion lanister",
  "message": "post2"
}
The inverted index created for these docs would look something like this:
tywin: doc1
tyrion: doc2
lanister: doc1, doc2
post1: doc1
post2: doc2
As you may have noticed, the string fields are tokenized and converted to lowercase before the index is created.
This is called the analysis phase.
More happens in this phase, but let us come back to that later.
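To make the analysis step concrete, here is a minimal Python sketch of it. This is a toy model, not how ES is implemented: the `analyze` and `build_inverted_index` helpers are hypothetical names, the analyzer only lowercases and splits on non-word characters, and the sketch builds one index per field (the listing above merges both fields for readability).

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer (assumption): lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs, field):
    """Map each token of `field` to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# The two documents from the example above.
docs = {
    "doc1": {"user": "Tywin Lanister", "message": "Post1"},
    "doc2": {"user": "Tyrion lanister", "message": "post2"},
}

user_index = build_inverted_index(docs, "user")
message_index = build_inverted_index(docs, "message")

print(user_index)
print(message_index)
```

Both "Lanister" and "lanister" end up under the same lowercase token, which is why that token points to both documents.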
So when we execute a search query like:
GET /facebook/post/_search
{
  "query": {
    "match": {
      "user": "Tywin Lanister"
    }
  }
}
If one is expecting only doc1 in the results, then (s)he is in for a surprise.
The results of the above query would contain both doc1 and doc2.
The match query is not an equals query; it is a contains query.
So the above query checks whether the word "tywin" or the word "lanister" exists in the inverted index.
If we refer to the inverted index above, we see that the word "tywin" fetches only doc1, but the word "lanister" fetches both doc1 and doc2. That is the reason the results contain both doc1 and doc2.
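The lookup described above can be sketched in a few lines of Python. Again, this is only a toy model under stated assumptions: `match` is a hypothetical helper that analyzes the query text and then ORs together the documents found for each token, and the index is hard-coded from the listing above.

```python
import re

def analyze(text):
    """Toy analyzer (assumption): lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

# Inverted index for the "user" field, copied from the listing above.
user_index = {
    "tywin": {"doc1"},
    "tyrion": {"doc2"},
    "lanister": {"doc1", "doc2"},
}

def match(index, query_text):
    """Toy match query: analyze the query, then OR together the docs of every token."""
    hits = set()
    for token in analyze(query_text):
        hits |= index.get(token, set())
    return sorted(hits)

result = match(user_index, "Tywin Lanister")
print(result)  # ['doc1', 'doc2']
```

The token "lanister" alone is enough to pull in doc2, even though "tywin" matches only doc1.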
As you may have noticed, the match query searches for the words "tywin" or "lanister", instead of "Tywin" or "Lanister" as passed in the query. The reason is that the search query also goes through the analysis phase.
An important thing to note is that not all queries are analyzed.
For example, match is a high-level query that is analyzed, whereas term is a low-level query that is not analyzed.
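The difference between an analyzed and a non-analyzed query can be illustrated with the same toy model. The `match` and `term` helpers below are hypothetical stand-ins, not the ES implementation: the first analyzes the query text before the lookup, the second looks it up verbatim.

```python
import re

def analyze(text):
    """Toy analyzer (assumption): lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

# Inverted index for the analyzed "user" field, copied from the listing above.
user_index = {
    "tywin": {"doc1"},
    "tyrion": {"doc2"},
    "lanister": {"doc1", "doc2"},
}

def match(index, query_text):
    """Toy match query: the query text is analyzed before the lookup."""
    hits = set()
    for token in analyze(query_text):
        hits |= index.get(token, set())
    return sorted(hits)

def term(index, query_text):
    """Toy term query: the query text is looked up as-is, with no analysis."""
    return sorted(index.get(query_text, set()))

print(match(user_index, "Tywin"))  # ['doc1'], the query is lowercased first
print(term(user_index, "Tywin"))   # [], the index only holds lowercase "tywin"
print(term(user_index, "tywin"))   # ['doc1']
```

In this model, a term query against an analyzed field finds a document only when the query text is already in the same form as the indexed token.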
Refer to these links for more details on how match works:
So to summarize, ES performs analysis when data is inserted and creates an inverted index.
Later, when a search query is executed, the query is analyzed and the resulting tokens are looked up in this inverted index.
For more details on analysis, refer to this link: Analysis & analyzers.
By default, all string fields are analyzed.
However, ES provides a way to skip the analysis phase for a particular field.
This is configured through the mapping.
Note: The mapping of an existing field cannot be switched to "not_analyzed" in place, so the old data must be deleted (and inserted again afterwards) before applying this setting.
The mapping for the above example, with the "user" field set to "not_analyzed", can be defined like this:
PUT /facebook/post/_mapping
{
  "post": {
    "properties": {
      "user": { "type": "string", "store": "true", "index": "not_analyzed" },
      "message": { "type": "string", "store": "true", "index": "analyzed" }
    }
  }
}
How to do exact matching in ES?
Coming to how to do exact matching in ES: we need to define the field on which we want exact matching as "not_analyzed".
In the above example, if we define the field "user" as "not_analyzed", the inverted index created would look something like this:
Tywin Lanister: doc1
Tyrion lanister: doc2
post1: doc1
post2: doc2
So when the above match query is run, the query string is not analyzed either, because match applies the analysis settings of the field it targets. It searches for the string "Tywin Lanister" exactly as passed in the query, and the results rightly contain only doc1.
For exact matching, we need to pass the exact string. If we pass "tywin lanister" in the match clause instead, the results will contain no docs.
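The effect of "not_analyzed" can be sketched with one more toy model. The helper names `build_exact_index` and `exact_lookup` are hypothetical, and the only assumption is that each field value is stored as a single untouched token and the query text is looked up verbatim.

```python
from collections import defaultdict

# The "user" values of the two documents from the example above.
docs = {
    "doc1": {"user": "Tywin Lanister"},
    "doc2": {"user": "Tyrion lanister"},
}

def build_exact_index(docs, field):
    """Index each value of `field` as one untouched token (toy stand-in for not_analyzed)."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[doc[field]].add(doc_id)
    return index

def exact_lookup(index, query_text):
    """Look the query text up verbatim: no tokenizing, no lowercasing."""
    return sorted(index.get(query_text, set()))

user_index = build_exact_index(docs, "user")

print(exact_lookup(user_index, "Tywin Lanister"))  # ['doc1']
print(exact_lookup(user_index, "tywin lanister"))  # []
print(exact_lookup(user_index, "Lanister"))        # []
```

Because the whole value is a single token, a partial value such as "Lanister" no longer matches anything, and neither does a value that differs only in case.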
In this context, I found this link very useful:
http://gauth.fr/2012/08/exact-search-with-elasticsearch/