Improving information retrieval in the Elastic Stack: Steps to improve search relevance

Feb 19, 2023 by iHash

Since 8.0 and the release of third-party natural language processing (NLP) models for text embeddings, users of the Elastic Stack have access to a wide variety of models to embed their text documents and perform query-based information retrieval using vector search.

Given all these components and their parameters, and depending on the text corpus you want to search in, it can be overwhelming to choose which settings will give the best search relevance.

In this series of blog posts, we will introduce a number of tests we ran using various publicly available data sets and information retrieval techniques that are available in the Elastic Stack. We’ll then provide recommendations of the best techniques to use depending on the setup.

To kick off this series, we want to set the stage by describing the problem we are addressing and some of the methods we will dig into further in subsequent posts.

Background and terminology

BM25: A sparse, unsupervised model for lexical search

The classic way Elasticsearch ranks documents for relevance to a text query is with the Lucene implementation of the Okapi BM25 model. Although a few hyperparameters of this model were tuned to give good results in most scenarios, the technique is considered unsupervised because it needs no labeled queries or documents: the model is very likely to perform reasonably well on any corpus of text, without relying on annotated data. BM25 is known to be a strong baseline in zero-shot retrieval settings.

Under the hood, this kind of model builds a matrix of term frequencies (how many times a term appears in each document) and inverse document frequencies (inverse of how many documents contain each term). It then scores each query term for each document that was indexed based on those frequencies. Because each document typically contains a small fraction of all words used in the corpus, the matrix contains a lot of zeros. This is why this type of representation is called sparse.
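As a rough illustration of the term-frequency matrix described above, here is a minimal Python sketch. The three toy documents, the whitespace tokenizer, and the helper name build_term_matrix are assumptions made for this example, and the plain log(N / df) form of inverse document frequency is a textbook simplification, not necessarily what Lucene computes.

```python
import math
from collections import Counter

# Toy corpus, invented for this illustration only.
docs = [
    "the quick brown fox",
    "the lazy dog sleeps",
    "a fox chases the dog",
]

def build_term_matrix(docs):
    """Return the vocabulary, a document-by-term count matrix, and per-term IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    counts = [Counter(doc) for doc in tokenized]
    # Rows are documents, columns are vocabulary terms.
    tf = [[count[term] for term in vocab] for count in counts]
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = {term: sum(1 for count in counts if term in count) for term in vocab}
    # Textbook inverse document frequency (a simplification).
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    return vocab, tf, idf

vocab, tf, idf = build_term_matrix(docs)

# Fraction of matrix cells that are zero: this is what "sparse" refers to.
zero_cells = sum(row.count(0) for row in tf)
sparsity = zero_cells / (len(tf) * len(vocab))
print(f"vocabulary size: {len(vocab)}, share of zero cells: {sparsity:.2f}")
```

Even with three short documents, more than half of the cells are zero. On a realistic corpus with a vocabulary of many thousands of terms, the share of zeros is far higher, which is why an inverted index is typically used instead of a literal matrix.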

Also, this model sums the relevance scores of the individual query terms for a document, without taking into account any semantic knowledge (synonyms, context, etc.). This is called lexical search (as opposed to semantic search). Its main shortcoming is the so-called vocabulary mismatch problem: the vocabulary of the query often differs from the vocabulary of the relevant documents, so a relevant document that uses different words is not matched. This motivates other scoring models that try to incorporate semantic knowledge to avoid this problem.
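The per-term summation and the vocabulary mismatch problem can be seen in a small sketch. The code below uses a common textbook formulation of BM25 with the widely used defaults k1 = 1.2 and b = 0.75; it is not a copy of Lucene's implementation, and the function name bm25_scores, the whitespace tokenizer, and the example documents are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency of each term across the corpus.
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        # The document score is the sum of independent per-term scores.
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy documents, invented for this illustration only.
docs = [
    "how to repair a car engine",
    "automobile maintenance guide",
    "baking bread at home",
]
scores = bm25_scores("car repair", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The second document is arguably relevant to the query, but it shares no exact term with it, so it receives the same zero score as the unrelated third document. That is the vocabulary mismatch problem in miniature.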

Dense models: A dense, supervised model for semantic search

More recently, transformer-based models have made it possible to build dense, context-aware representations of text, addressing the principal shortcomings mentioned above.
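To make the idea of a dense representation concrete, here is a small sketch that compares texts by the cosine similarity of their vectors. The three-dimensional vectors are hand-written toy values chosen for this example, not the output of any real model; real embeddings usually have hundreds of dimensions and come from a trained network.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (an assumption for illustration, not model output).
toy_embeddings = {
    "car repair": [0.9, 0.1, 0.0],
    "automobile maintenance guide": [0.8, 0.2, 0.1],
    "baking bread at home": [0.0, 0.1, 0.9],
}

query_vector = toy_embeddings["car repair"]
sim_automobile = cosine_similarity(
    query_vector, toy_embeddings["automobile maintenance guide"]
)
sim_bread = cosine_similarity(query_vector, toy_embeddings["baking bread at home"])
print(f"automobile: {sim_automobile:.2f}, bread: {sim_bread:.2f}")
```

Unlike a sparse term vector, every dimension of a dense vector typically carries a non-zero value, and two texts can be close to each other without sharing a single word.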

To build such models, the following steps are required:

1. Pre-training
We first need to train a neural network to understand the basic syntax of natural language.

Using a huge corpus of text, the model learns semantic knowledge by training on unsupervised tasks (like Masked Word Prediction or Next Sentence Prediction).
BERT is probably the best-known example of these models: it was trained on Wikipedia (2.5B words) and BookCorpus (800M words) using Masked Word Prediction.

This is called pre-training. The model learns vector representations of language tokens, which can be adapted for other tasks with much less training.
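As a rough sketch of how a Masked Word Prediction training example can be prepared, the code below hides a random subset of tokens and keeps the originals as prediction targets. It is deliberately simplified: the helper name mask_tokens, the whitespace tokenizer, and the fixed seed are assumptions for this example, and the 15% default masking rate follows the BERT paper while the paper's additional replacement rules are omitted.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a list of labels: the original token at
    masked positions, None elsewhere (nothing to predict there).
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

tokens = "the model learns vector representations of language tokens".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

During pre-training, the network sees the masked sequence and is trained to predict the hidden tokens from their context. No human annotation is needed, because the labels come from the text itself.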
