Elasticsearch custom analyzer to ignore special characters
Posted By : Mohd Adnan | 31-Dec-2018
Elasticsearch ships with a wide range of built-in analyzers that can be used with any index directly, without further configuration, for example the Standard Analyzer, Simple Analyzer, Whitespace Analyzer, and Keyword Analyzer.
By default, Elasticsearch sorts string values by their ASCII (character code) order, so results come back with special characters such as # first, followed by numbers, uppercase letters, and then lowercase letters.
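To make that ordering concrete, here is a minimal Python sketch. It assumes that Python's plain string comparison (by code point) is a fair stand-in for how raw terms are ordered, and the sample titles are made up for illustration.

```python
# Hypothetical sample titles. sorted() compares strings by code point,
# used here as a stand-in for term-order sorting on an unanalyzed field.
titles = ["apple", "Banana", "#cherry", "1date"]

default_order = sorted(titles)
print(default_order)
```

The title starting with # comes first, then the one starting with a digit, then the uppercase title, and the lowercase one last.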
Problem
To achieve alphabetical sorting that ignores special characters and numbers.
Solution
In Elasticsearch 6, this can be achieved with a custom analyzer, which is the route to take when the built-in analyzers do not fulfill your needs.
The approach is to write a custom analyzer that ignores non-alphabetical characters and then sort on a field that uses it.
Step 1: Create a custom analyzer by using a pattern replace character filter
Define a pattern replace character filter in the index settings to remove any non-alphabetical characters:
    "char_filter": {
      "alphabets_char_filter": {
        "type": "pattern_replace",
        "pattern": "[^a-zA-Z]",
        "replacement": ""
      }
    }
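The effect of this pattern can be tried outside Elasticsearch. The sketch below applies the same regular expression with Python's re module. The helper name strip_non_alpha is hypothetical, and the sketch only imitates the character filter, it does not call Elasticsearch.

```python
import re

# Same pattern as alphabets_char_filter: anything that is not an ASCII letter.
NON_ALPHA = re.compile(r"[^a-zA-Z]")


def strip_non_alpha(text: str) -> str:
    """Remove every character that is not a-z or A-Z (spaces included)."""
    return NON_ALPHA.sub("", text)


print(strip_non_alpha("#1 Hello, World!"))
```

Note that spaces are removed as well, so a multi-word title collapses into one unbroken run of letters.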
Then use the “alphabets_char_filter” defined above to create a custom analyzer in the same index settings:
    "analysis": {
      "analyzer": {
        "alphabetsStringAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "alphabets_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "alphabets_char_filter": {
          "type": "pattern_replace",
          "pattern": "[^a-zA-Z]",
          "replacement": ""
        }
      }
    }
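As a rough mental model of what this analyzer produces, the following Python sketch chains the three stages in order: character filter, tokenizer, lowercase filter. It is only an approximation under stated assumptions: the function name analyze is hypothetical, and the real standard tokenizer also splits very long tokens (its max_token_length setting), which is not modelled here.

```python
import re
from typing import List

NON_ALPHA = re.compile(r"[^a-zA-Z]")


def analyze(text: str) -> List[str]:
    """Approximate alphabetsStringAnalyzer: char filter, tokenizer, lowercase."""
    # 1. char_filter: drop everything except ASCII letters (spaces too).
    stripped = NON_ALPHA.sub("", text)
    # 2. tokenizer: with no separators left, the text is one unbroken run of
    #    letters, so it is treated as a single token (empty input gives none).
    tokens = [stripped] if stripped else []
    # 3. filter: lowercase each token.
    return [token.lower() for token in tokens]


print(analyze("The Quick-Brown Fox 42"))
```

Because the character filter runs before the tokenizer, each value ends up as a single lowercase token, which is what makes the field usable as a sort key.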
Step 2: Define field mapping of the index using the custom analyzer
The next step is to define a field mapping with a new sub-field, title.raw, that uses the new “alphabetsStringAnalyzer” analyzer. Setting fielddata to true is what allows sorting on a text field:
    "title": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "text",
          "analyzer": "alphabetsStringAnalyzer",
          "fielddata": true
        }
      }
    }
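The fragments from Steps 1 and 2 belong in one index-creation request. The Python sketch below assembles them into a single body and prints it as JSON. It assumes the Elasticsearch 6 layout with a single mapping type, named "_doc" here as an assumption, and the body would be sent when creating an index whose name is up to you. Nothing is sent to a cluster in this sketch.

```python
import json

# Assumption: Elasticsearch 6 style body with one mapping type called "_doc".
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "alphabetsStringAnalyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["alphabets_char_filter"],
                    "filter": ["lowercase"],
                }
            },
            "char_filter": {
                "alphabets_char_filter": {
                    "type": "pattern_replace",
                    "pattern": "[^a-zA-Z]",
                    "replacement": "",
                }
            },
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "title": {
                    "type": "text",
                    "fields": {
                        "raw": {
                            "type": "text",
                            "analyzer": "alphabetsStringAnalyzer",
                            "fielddata": True,  # needed to sort on a text field
                        }
                    },
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

Keeping the analyzer and the mapping in the same body makes it easy to check that the analyzer name used by title.raw is actually defined in the settings.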
Step 3: Run a query against the new field
    {
      "sort": {
        "title.raw": "asc"
      },
      "query": {
        "term": {
          "title": "random"
        }
      }
    }
This returns the matching documents sorted alphabetically with non-alphabetical characters ignored, which is the expected result.
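The resulting order can be previewed without a cluster. The Python sketch below sorts a few made-up titles by the key that the custom analyzer is expected to produce (letters only, lowercased). The titles and the helper name sort_key are hypothetical, and the sketch imitates only the sort, not the term query.

```python
import re

NON_ALPHA = re.compile(r"[^a-zA-Z]")


def sort_key(title: str) -> str:
    """Letters only, lowercased: the single token title.raw is expected to hold."""
    return NON_ALPHA.sub("", title).lower()


# Hypothetical titles that all contain the word "random".
titles = [
    "(random) zebra",
    "123 random numbers",
    "#random facts",
    "Random Access",
]

ordered = sorted(titles, key=sort_key)
for title in ordered:
    print(title)
```

The leading #, digits, and brackets no longer affect the position of a title. Only its letters do.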
Hope that helps.