How to add support for more languages in your Elastic Enterprise Search engines

Apr 26, 2023 by iHash

Engines in Elastic App Search let you index documents and provide out-of-the-box, tunable search capabilities. By default, engines support a predefined list of languages. If your language is not on that list, this post explains how to add support for it: we’ll create an App Search engine on top of an index that has analyzers set up for that language.

Before we dive into the details, let’s define what an Elasticsearch analyzer is:

An Elasticsearch analyzer is a package that contains three lower-level building blocks: character filters, tokenizers, and token filters. Analyzers can be built-in or custom. Built-in analyzers pre-package building blocks into analyzers suitable for different languages and types of text.

Analyzers for each field are used to:

  • Index. Each document field will be processed with its corresponding analyzer and broken down into tokens to ease searching.
  • Search. The search query will be analyzed to ensure a proper match with the indexed fields that have already been analyzed.
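To see what an analyzer does to a piece of text, you can call the _analyze API from the Kibana Dev Tools console. This is a minimal sketch that uses the built-in standard analyzer and one of the Romanian titles indexed later in this post:

```console
# Run the built-in standard analyzer on a sample title
POST _analyze
{
  "analyzer": "standard",
  "text": "Un veac de singurătate"
}
```

The response lists the tokens the analyzer produces, which for the standard analyzer should be the lowercased words un, veac, de, and singurătate. Swapping "standard" for the built-in "romanian" analyzer should additionally remove stop words and stem the remaining tokens.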

Elasticsearch index-based engines enable you to create App Search engines from existing Elasticsearch indices. We’ll create an Elasticsearch index with our own analyzers and mappings and use that index in App Search.

There are four steps in this process:

  1. Create an Elasticsearch index and index documents
  2. Add language analyzers to that index
  3. Update the index mapping to use analyzers
  4. Reindex the documents

1. Create an Elasticsearch index and index documents

To get started, let’s take an index that has not been optimized for any language. Let’s assume this is a new index with no predefined mappings, created automatically when the first document is indexed.

In Elasticsearch, mapping is the process that defines how a document and the fields it contains are stored and indexed. Each document is a collection of fields, which each have their own data type. When mapping your data you create a mapping definition, which contains a list of fields pertinent to the document.
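As an illustration of a mapping definition, an index can also be created with explicit mappings instead of relying on dynamic mapping. This is only a sketch: the index name books-explicit is hypothetical and is not used anywhere else in this guide.

```console
# Hypothetical index with an explicit mapping definition for two text fields
PUT books-explicit
{
  "mappings": {
    "properties": {
      "title":  { "type": "text" },
      "author": { "type": "text" }
    }
  }
}
```

The rest of this guide takes the other route and lets Elasticsearch infer the mapping from the first indexed document.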

Back to our example. The index is called books, and each book title is in Romanian. We picked Romanian because it is my language and it is not included in the list of languages that App Search supports by default.

POST books/_doc/1
{
  "title": "Un veac de singurătate",
  "author": "Gabriel García Márquez"
}

POST books/_doc/2
{
  "title": "Dragoste în vremea holerei",
  "author": "Gabriel García Márquez"
}

POST books/_doc/3
{
  "title": "Obosit de viaţă, obosit de moarte",
  "author": "Mo Yan"
}

POST books/_doc/4
{
  "title": "Maestrul și Margareta",
  "author": "Mihail Bulgakov"
}

2. Add language analyzers to the books index

When we inspect the books index, we see it is not optimized for Romanian. You can tell because there is no analysis field in the settings block and the text fields do not use a custom analyzer.

GET books
{
  "books": {
    "aliases": {},
    "mappings": {
      "properties": {
        "author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "books",
        "creation_date": "1679310576178",
        "number_of_replicas": "1",
        "uuid": "0KuiDk8iSZ-YHVQGg3B0iw",
        "version": {
          "created": "8080099"
        }
      }
    }
  }
}

If we try to create an App Search engine with the books index, we’ll run into two issues. First, search results will not be optimized for Romanian. Second, features like Precision Tuning will be disabled.

A quick note on different types of Elastic App Search engines:

  • The default option is an App Search managed engine, which will automatically create and manage a hidden Elasticsearch index. With this option, you have to use the App Search documents API to ingest data into your engine.
  • With the other option, App Search creates an engine from an existing Elasticsearch index. In this case, App Search will use the index as it is, and you can ingest data directly into the underlying index by using the Elasticsearch index documents API.


When you create an engine from an existing Elasticsearch index, not all features will be enabled for that engine if the mappings do not follow App Search conventions. Let’s have a closer look at App Search mapping conventions by looking at an engine that is fully managed by App Search. This engine has two fields, title and author, and uses the English language.

GET .ent-search-engine-documents-app-search-books/_mapping/field/title
{
  ".ent-search-engine-documents-app-search-books": {
    "mappings": {
      "title": {
        "full_name": "title",
        "mapping": {
          "title": {
            "type": "text",
            "fields": {
              "date": {
                "type": "date",
                "format": "strict_date_time||strict_date",
                "ignore_malformed": true
              },
              "delimiter": {
                "type": "text",
                "index_options": "freqs",
                "analyzer": "iq_text_delimiter"
              },
              "enum": {
                "type": "keyword",
                "ignore_above": 2048
              },
              "float": {
                "type": "double",
                "ignore_malformed": true
              },
              "joined": {
                "type": "text",
                "index_options": "freqs",
                "analyzer": "i_text_bigram",
                "search_analyzer": "q_text_bigram"
              },
              "location": {
                "type": "geo_point",
                "ignore_malformed": true,
                "ignore_z_value": false
              },
              "prefix": {
                "type": "text",
                "index_options": "docs",
                "analyzer": "i_prefix",
                "search_analyzer": "q_prefix"
              },
              "stem": {
                "type": "text",
                "analyzer": "iq_text_stem"
              }
            },
            "index_options": "freqs",
            "analyzer": "iq_text_base"
          }
        }
      }
    }
  }
}

You’ll see the title field has several subfields. The date, float, and location subfields are not text fields.

Here, we are interested in how to set the text fields that App Search requires. There are more than a few fields! This documentation page explains the text fields used in App Search. Let’s look at the analyzers that App Search sets for a hidden index belonging to an App Search managed engine:

GET .ent-search-engine-documents-app-search-books/_settings/index.analysis*
{
  ".ent-search-engine-documents-app-search-books": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "front_ngram": {
              "type": "edge_ngram",
              "min_gram": "1",
              "max_gram": "12"
            },
            "bigram_joiner": {
              "max_shingle_size": "2",
              "token_separator": "",
              "output_unigrams": "false",
              "type": "shingle"
            },
            "bigram_max_size": {
              "type": "length",
              "max": "16",
              "min": "0"
            },
            "en-stem-filter": {
              "name": "light_english",
              "type": "stemmer"
            },
            "bigram_joiner_unigrams": {
              "max_shingle_size": "2",
              "token_separator": "",
              "output_unigrams": "true",
              "type": "shingle"
            },
            "delimiter": {
              "split_on_numerics": "true",
              "generate_word_parts": "true",
              "preserve_original": "false",
              "catenate_words": "true",
              "generate_number_parts": "true",
              "catenate_all": "true",
              "split_on_case_change": "true",
              "type": "word_delimiter_graph",
              "catenate_numbers": "true",
              "stem_english_possessive": "true"
            },
            "en-stop-words-filter": {
              "type": "stop",
              "stopwords": "_english_"
            }
          },
          "analyzer": {
            "i_prefix": {
              "filter": [
                "cjk_width",
                "lowercase",
                "asciifolding",
                "front_ngram"
              ],
              "tokenizer": "standard"
            },
            "iq_text_delimiter": {
              "filter": [
                "delimiter",
                "cjk_width",
                "lowercase",
                "asciifolding",
                "en-stop-words-filter",
                "en-stem-filter"
              ],
              "tokenizer": "whitespace"
            },
            "q_prefix": {
              "filter": [
                "cjk_width",
                "lowercase",
                "asciifolding"
              ],
              "tokenizer": "standard"
            },
            "iq_text_base": {
              "filter": [
                "cjk_width",
                "lowercase",
                "asciifolding",
                "en-stop-words-filter"
              ],
              "tokenizer": "standard"
            },
            "iq_text_stem": {
              "filter": [
                "cjk_width",
                "lowercase",
                "asciifolding",
                "en-stop-words-filter",
                "en-stem-filter"
              ],
              "tokenizer": "standard"
            },
            "i_text_bigram": {
              "filter": [
                "cjk_width",
                "lowercase",
                "asciifolding",
                "en-stem-filter",
                "bigram_joiner",
                "bigram_max_size"
              ],
              "tokenizer": "standard"
            },
            "q_text_bigram": {
              "filter": [
                "cjk_width",
                "lowercase",
                "asciifolding",
                "en-stem-filter",
                "bigram_joiner_unigrams",
                "bigram_max_size"
              ],
              "tokenizer": "standard"
            }
          }
        }
      }
    }
  }
}

If we want to create an index that we can use in App Search, for a different language — for example Norwegian, Finnish, or Arabic — we would need similar analyzers. For our example, we need to ensure that the stem and stopword filters use the Romanian version.
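Before touching any index, you can try a language-specific filter chain with the _analyze API, which accepts inline filter definitions. The sketch below mirrors the filter order of the iq_text_stem analyzer shown above, with the English stop and stemmer filters swapped for Romanian ones. The sample text veacuri is the plural form used later in this post.

```console
# Ad-hoc analyzer: same filter order as iq_text_stem, Romanian stop words and stemmer
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "cjk_width",
    "lowercase",
    "asciifolding",
    { "type": "stop", "stopwords": "_romanian_" },
    { "type": "stemmer", "name": "romanian" }
  ],
  "text": "veacuri"
}
```

For another language, swap the two language-specific values (for example "_finnish_" and "finnish"). The exact names available are listed in the Elasticsearch documentation for the stop and stemmer token filters.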

Going back to our initial books index, let’s add the right analyzers.

A quick word of caution here. For existing indices, analyzers are a type of Elasticsearch setting that can only be changed when an index is closed. In this approach, we start with an existing index and therefore need to close the index, add analyzers, and then reopen the index.

Note: As an alternative, you could also recreate the index from scratch with the right mappings and then index all the documents. If that is better for your use case, feel free to skip the parts of this guide that discuss opening and closing the index, and reindexing.
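A minimal sketch of that alternative follows. The target index name books-ro is hypothetical, and the settings are deliberately reduced to a single analyzer and a single subfield; an index meant for App Search would need the full set of analyzers and subfields shown in this guide. The _reindex call is one way to copy the existing documents across, assumed here for convenience.

```console
# Hypothetical target index; only the stem analyzer and subfield are shown for brevity
PUT books-ro
{
  "settings": {
    "analysis": {
      "filter": {
        "ro-stem-filter": { "type": "stemmer", "name": "romanian" },
        "ro-stop-words-filter": { "type": "stop", "stopwords": "_romanian_" }
      },
      "analyzer": {
        "iq_text_stem": {
          "tokenizer": "standard",
          "filter": [
            "cjk_width",
            "lowercase",
            "asciifolding",
            "ro-stop-words-filter",
            "ro-stem-filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "stem": { "type": "text", "analyzer": "iq_text_stem" }
        }
      }
    }
  }
}

# Copy the existing documents into the new index
POST _reindex
{
  "source": { "index": "books" },
  "dest": { "index": "books-ro" }
}
```

Because the analyzers exist before any document arrives, this route needs no close and reopen step and no in-place reindex.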

First, close the index by running POST books/_close. Then add the analyzers:

PUT books/_settings
{
  "analysis": {
    "filter": {
      "front_ngram": {
        "type": "edge_ngram",
        "min_gram": "1",
        "max_gram": "12"
      },
      "bigram_joiner": {
        "max_shingle_size": "2",
        "token_separator": "",
        "output_unigrams": "false",
        "type": "shingle"
      },
      "bigram_max_size": {
        "type": "length",
        "max": "16",
        "min": "0"
      },
      "ro-stem-filter": {
        "name": "romanian",
        "type": "stemmer"
      },
      "bigram_joiner_unigrams": {
        "max_shingle_size": "2",
        "token_separator": "",
        "output_unigrams": "true",
        "type": "shingle"
      },
      "delimiter": {
        "split_on_numerics": "true",
        "generate_word_parts": "true",
        "preserve_original": "false",
        "catenate_words": "true",
        "generate_number_parts": "true",
        "catenate_all": "true",
        "split_on_case_change": "true",
        "type": "word_delimiter_graph",
        "catenate_numbers": "true"
      },
      "ro-stop-words-filter": {
        "type": "stop",
        "stopwords": "_romanian_"
      }
    },
    "analyzer": {
      "i_prefix": {
        "filter": [
          "cjk_width",
          "lowercase",
          "asciifolding",
          "front_ngram"
        ],
        "tokenizer": "standard"
      },
      "iq_text_delimiter": {
        "filter": [
          "delimiter",
          "cjk_width",
          "lowercase",
          "asciifolding",
          "ro-stop-words-filter",
          "ro-stem-filter"
        ],
        "tokenizer": "whitespace"
      },
      "q_prefix": {
        "filter": [
          "cjk_width",
          "lowercase",
          "asciifolding"
        ],
        "tokenizer": "standard"
      },
      "iq_text_base": {
        "filter": [
          "cjk_width",
          "lowercase",
          "asciifolding",
          "ro-stop-words-filter"
        ],
        "tokenizer": "standard"
      },
      "iq_text_stem": {
        "filter": [
          "cjk_width",
          "lowercase",
          "asciifolding",
          "ro-stop-words-filter",
          "ro-stem-filter"
        ],
        "tokenizer": "standard"
      },
      "i_text_bigram": {
        "filter": [
          "cjk_width",
          "lowercase",
          "asciifolding",
          "ro-stem-filter",
          "bigram_joiner",
          "bigram_max_size"
        ],
        "tokenizer": "standard"
      },
      "q_text_bigram": {
        "filter": [
          "cjk_width",
          "lowercase",
          "asciifolding",
          "ro-stem-filter",
          "bigram_joiner_unigrams",
          "bigram_max_size"
        ],
        "tokenizer": "standard"
      }
    }
  }
}

You can see that we’re adding the ro-stem-filter for stemming in Romanian, which will improve search relevance for word variations that are specific to Romanian. We’re including the Romanian stop words filter (ro-stop-words-filter) to make sure Romanian stop words are not considered for searching purposes.

And now we will reopen the index by executing POST books/_open.
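Once the index is open again, it is worth checking that the settings were saved and that the new analyzers can be called by name. A short sketch, reusing the settings request from earlier in this post and the index-level _analyze API:

```console
# Confirm the analysis settings were saved on the books index
GET books/_settings/index.analysis*

# Run one of the new analyzers against the plural form used later in this post
POST books/_analyze
{
  "analyzer": "iq_text_stem",
  "text": "veacuri"
}
```

If the second request fails because the analyzer cannot be found, the settings update was probably not applied, for example because the index was not closed at the time.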

3. Update the index mapping to use analyzers

Once we have the analysis settings in place, we can modify the index mapping. App Search uses dynamic templates to ensure that new fields have the right subfields and analyzers. For our example, we will only add the subfields to the existing title and author fields:

PUT books/_mapping
{
  "properties": {
    "author": {
      "type": "text",
      "fields": {
        "delimiter": {
          "type": "text",
          "index_options": "freqs",
          "analyzer": "iq_text_delimiter"
        },
        "enum": {
          "type": "keyword",
          "ignore_above": 2048
        },
        "joined": {
          "type": "text",
          "index_options": "freqs",
          "analyzer": "i_text_bigram",
          "search_analyzer": "q_text_bigram"
        },
        "prefix": {
          "type": "text",
          "index_options": "docs",
          "analyzer": "i_prefix",
          "search_analyzer": "q_prefix"
        },
        "stem": {
          "type": "text",
          "analyzer": "iq_text_stem"
        }
      }
    },
    "title": {
      "type": "text",
      "fields": {
        "delimiter": {
          "type": "text",
          "index_options": "freqs",
          "analyzer": "iq_text_delimiter"
        },
        "enum": {
          "type": "keyword",
          "ignore_above": 2048
        },
        "joined": {
          "type": "text",
          "index_options": "freqs",
          "analyzer": "i_text_bigram",
          "search_analyzer": "q_text_bigram"
        },
        "prefix": {
          "type": "text",
          "index_options": "docs",
          "analyzer": "i_prefix",
          "search_analyzer": "q_prefix"
        },
        "stem": {
          "type": "text",
          "analyzer": "iq_text_stem"
        }
      }
    }
  }
}
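If you expect new text fields to show up later, a dynamic template can give them subfields automatically. The request below is only a sketch of the idea and not the template App Search itself uses: the template name text_with_stem is hypothetical, and only two of the subfields are included. Since these are new fields, the top-level analyzer can be set as well.

```console
# Hypothetical dynamic template: any new string field becomes a text field
# with a stemmed subfield and a keyword subfield
PUT books/_mapping
{
  "dynamic_templates": [
    {
      "text_with_stem": {
        "match_mapping_type": "string",
        "mapping": {
          "type": "text",
          "index_options": "freqs",
          "analyzer": "iq_text_base",
          "fields": {
            "stem": { "type": "text", "analyzer": "iq_text_stem" },
            "enum": { "type": "keyword", "ignore_above": 2048 }
          }
        }
      }
    }
  ]
}
```

Note that sending dynamic_templates in a mapping update replaces any dynamic templates already defined on the index.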

The books index is now almost ready to be used in App Search!

4. Reindex the documents

We just need to ensure that the documents we indexed before we modified the mapping have all the right subfields. To do this, we can run a reindex in place using update_by_query:

POST books/_update_by_query?refresh
{
  "query": {
    "match_all": {
    }
  }
}

Since we are using a match_all query, all existing documents will be updated.

An update by query request can also include a script parameter that defines how to update each document. We do not need one here: we are not changing the documents, only reindexing them as they are so that the text fields author and title get the right subfields.
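To check that the existing documents now have the new subfields populated, you can query one of them directly. A minimal sketch, searching the stem subfield with the plural form veacuri:

```console
# Search the stemmed subfield that was added in the mapping update
GET books/_search
{
  "query": {
    "match": { "title.stem": "veacuri" }
  }
}
```

Running the same query against the top-level title field, which still uses the standard analyzer, is a quick way to compare the two behaviors.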

We now have a language-optimized index we can use in App Search with Elasticsearch engines! You’ll see the benefits in action in the following screenshots.

We’ll use the book title One Hundred Years of Solitude as reference. The translated title in Romanian is Un veac de singurătate. Pay attention to the word veac, which is the Romanian word for “century.” We’ll run a search with the plural form of veac, which is veacuri. We ingested this data record in both the examples we’re about to look at:

{
  "title": "Un veac de singurătate",
  "author": "Gabriel García Márquez"
}

When an index is not optimized for a language, the Romanian book title Un veac de singurătate is indexed with the standard analyzer, which works well for most languages but might not always match on relevant documents. Searching for veacuri does not show any results, as this search input does not match any plain text in the data record.

[Screenshot: Relevance Tuning view, manage fields]

With the language-optimized index, however, a search for veacuri is matched to the Romanian word veac and returns the record we’re looking for. Precision tuning fields are also available within the Relevance Tuning view! Look at all the highlighted bits in this image:

[Screenshot: Relevance Tuning view, precision tuning]

So, with that, we added support in Elastic Enterprise Search for Romanian, which is my language! The process used in this guide can be replicated to create indices optimized for any other language supported by Elasticsearch. For the full list of supported language analyzers in Elasticsearch, have a look at this documentation page.

Analyzers in Elasticsearch are a fascinating topic. If you’re interested in learning more, the text analysis section of the Elasticsearch documentation is a good next step.
