How we implemented frequent item set mining in Elasticsearch

Apr 4, 2023 by iHash

Choosing the base algorithm

The best-known algorithm is Apriori. Apriori builds candidate item sets breadth-first. It starts with sets containing only one item and then expands those sets by one more item in every iteration. After the sets of an iteration have been generated, they are tested against the data. Infrequent sets (those that do not reach a minimum support defined upfront) are pruned before the next iteration. Pruning might remove a lot of candidates, but the biggest weakness of this approach remains: it has to keep a lot of item set candidates in memory.
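To make the breadth-first idea concrete, here is a minimal Python sketch of Apriori. It is an illustration only, not the prototype code: the function name apriori and the min_count parameter (an absolute support threshold) are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Breadth-first frequent item set mining (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: candidate sets containing a single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while candidates:
        # Test every candidate of the current level against the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Prune candidates that do not reach the minimum count.
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Expand the surviving sets by one item to form the next level.
        candidates = {
            a | b
            for a, b in combinations(survivors, 2)
            if len(a | b) == len(a) + 1
        }
    return frequent
```

Note that every level holds all of its candidates in memory at once, which is the weakness described above.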

Although the first prototypes of the aggregation used Apriori, it was clear from the beginning that we wanted to switch the algorithm later. We looked for one that scales better in runtime and memory. We decided on Eclat; the alternatives were FP-Growth and LCM. All three use a depth-first approach, which fits our resource model much better. Christian Borgelt’s overview paper has details about the various approaches and their differences.
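For comparison, the following is a textbook-style Python sketch of the depth-first Eclat idea, which works on sets of transaction ids instead of generating whole levels of candidates. It is not the code used in the aggregation; the names eclat, extend, and min_count are assumptions made for this example.

```python
def eclat(transactions, min_count):
    """Depth-first mining over transaction id sets (illustrative sketch)."""
    # Vertical layout: map each item to the ids of the transactions containing it.
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: list of (item, tidset) pairs that may extend the prefix.
        for i, (item, tids) in enumerate(candidates):
            if len(tids) < min_count:
                continue  # infrequent, so no superset can be frequent either
            item_set = prefix | {item}
            frequent[item_set] = len(tids)
            # Intersect with the remaining items, then descend before moving on.
            narrower = [
                (other, tids & other_tids)
                for other, other_tids in candidates[i + 1:]
            ]
            extend(item_set, narrower)

    extend(frozenset(), sorted(tidsets.items(), key=lambda pair: pair[0]))
    return frequent
```

Only the current branch of the search has to be kept in memory, which hints at why a depth-first approach can fit a constrained resource model better.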

Fields and values

An Elasticsearch index consists of documents with fields and values. Values have different types, and each field can be an array of values. Translated to frequent item sets, a single item consists of exactly one field and one value. If a field stores an array of values, frequent_item_sets treats every value in the array as a single item. In other words, a document is a set of items. Not all fields are of interest, though: only the subset of fields selected for frequent_item_sets forms a transaction.
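The mapping from a document to a transaction can be sketched in a few lines of Python. This is a simplified illustration that treats a document as a plain dictionary; the helper name to_transaction is hypothetical.

```python
def to_transaction(document, fields):
    """Turn a document into a set of (field, value) items for the chosen fields."""
    items = set()
    for field in fields:
        if field not in document:
            continue  # a document may lack some of the selected fields
        value = document[field]
        # An array contributes one item per value; a scalar contributes one item.
        values = value if isinstance(value, list) else [value]
        for v in values:
            items.add((field, v))
    return frozenset(items)
```

For example, a document with a user field and a two-element tags array yields three items when both fields are selected, and any other field of the document is ignored.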

Dealing with distributed storage

Beyond choosing the main algorithm, other details required attention. The input data for an aggregation can live in one or many indices, which are further split into shards. In other words, data isn’t stored in one central place. This sounds like a weakness at first, but it has an advantage: at the shard level, execution happens in parallel, so it makes sense to put as much work as possible into the mapping phase.

Data preparation and mining basics

During mapping, items and transactions get de-duplicated. To reduce size, we encode items and transactions in big tables together with a counter. That counter later helps us to reduce runtime.
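As a rough sketch of this map step, the following Python function de-duplicates the transactions of one shard and keeps a counter per item and per distinct transaction. The real implementation uses its own encoded tables; map_shard and the use of Counter are assumptions made for illustration.

```python
from collections import Counter


def map_shard(transactions):
    """Count items and de-duplicated transactions seen on one shard."""
    item_counts = Counter()
    transaction_counts = Counter()
    for transaction in transactions:
        transaction = frozenset(transaction)
        item_counts.update(transaction)  # one occurrence per item in this transaction
        transaction_counts[transaction] += 1  # identical transactions share one entry
    return item_counts, transaction_counts
```

Identical transactions are stored once, together with how often they occurred.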

Once all shards have sent their data to the coordinating node, the reduce phase starts by merging all shard results. In contrast to other aggregations, this is where the main work of frequent_item_sets begins: most of the runtime is spent on generating and testing sets.
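A merge of per-shard results could look like the following Python sketch. It assumes each shard sends a pair of counters (items and distinct transactions); merge_shards and support_count are hypothetical helper names, not part of Elasticsearch.

```python
from collections import Counter


def merge_shards(shard_results):
    """Combine per-shard (item_counts, transaction_counts) pairs into global counts."""
    total_items, total_transactions = Counter(), Counter()
    for item_counts, transaction_counts in shard_results:
        total_items.update(item_counts)  # Counter.update adds counts
        total_transactions.update(transaction_counts)
    return total_items, total_transactions


def support_count(item_set, transaction_counts):
    """Test a candidate set against distinct transactions only."""
    item_set = frozenset(item_set)
    # Each distinct transaction is checked once and weighted by its counter.
    return sum(n for t, n in transaction_counts.items() if item_set <= t)
```

Because duplicates were collapsed earlier, testing a candidate touches each distinct transaction once, which is one way a counter can save runtime.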

After the results are merged, we have a global view and can prune items. An item with a count lower than the minimum count gets dropped. Transactions might collapse as a result of item pruning, because transactions that differed only in pruned items become identical. We calculate the minimum count from the minimum support parameter and the total document count.
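The pruning step can be sketched in Python as follows. The exact formula is not reproduced in this text, so the sketch assumes the minimum count is the minimum support (a fraction between 0 and 1) multiplied by the total document count, rounded up. The function name prune and its parameters are hypothetical.

```python
import math
from collections import Counter


def prune(item_counts, transaction_counts, min_support, doc_count):
    """Drop infrequent items and collapse transactions that become identical."""
    # Assumption: minimum count = ceil(minimum support * total document count).
    min_count = math.ceil(min_support * doc_count)
    kept_items = {item for item, n in item_counts.items() if n >= min_count}
    pruned = Counter()
    for transaction, n in transaction_counts.items():
        reduced = transaction & kept_items  # remove pruned items
        if reduced:
            pruned[reduced] += n  # identical remainders share one entry
    return min_count, kept_items, pruned
```

With three documents and a minimum support of 0.5, this sketch gives a minimum count of 2, so an item seen only once is dropped and the transactions that contained it fold into their smaller counterparts.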
