
iHash

News and How to's


How We Trained Overfit Models to Identify Malicious Activity

Jul 30, 2020 by iHash


In this blog post, we present the results of some preliminary experiments with training highly “overfit” (interpolated) models to identify malicious activity based on behavioral data. These experiments were inspired by an expanding literature that questions the traditional approach to machine learning, which has sought to avoid overfitting in order to encourage model generalization. 

The results discussed here, although preliminary, suggest that the insights of this new literature apply to our problem domain, and that training highly overfit models on our behavioral data may yield the best possible performance. 

Table of Contents

  • The Traditional Approach to Generalization
  • Avoiding Overfitting May Not Always Help
  • Interpolation and Our Data
  • The Problem Space
    • Our Experiments
  • The Surprising (or Not?) Performance of Our Interpolated Models
  • Results
  • Comparison to Other Models
  • Conclusions and Next Steps
    • Citations and Sources
    • Additional Resources

The Traditional Approach to Generalization

At CrowdStrike, we are constantly training, experimenting on and deploying new machine learning models as part of our efforts to identify and prevent malicious activity. A critical concern in these efforts is training models that generalize well: models that perform well not just on the data they are trained on but also on data not included in the training set. In the cybersecurity setting, generalization is what allows a model to detect never-before-seen malware and learn latent characteristics that identify malicious activity.

Encouraging generalization is often thought of in terms of its (almost) converse: overfitting. A model is thought of as overfit when it learns spurious rules that leverage the data’s high dimensionality to perfectly fit its training labels (note this definition is inherently imprecise, as whether a learned rule is overfit or “true” is in many ways based on interpretation). The more complexity and degrees of freedom a model has, the more ability it has to overfit the data. 

Figure 1: Overfit model (green line) vs generalizable model (black line) in a binary classification problem (Image source: https://en.wikipedia.org/wiki/Overfitting)

Figure 1 is a classic example of overfitting. The green and black lines each represent a model trained to classify the red and blue data points. The green line, while perfectly fitting the data, is thought to be overfit, while the smoother black line presumably has better generalization performance. 

Traditionally, statistics and machine learning think of improving generalization (given a fixed data set) and avoiding overfitting as one and the same. Techniques to avoid overfitting are broadly known as regularization, because often they operate by enforcing simpler models or penalizing complexity in the course of training. The relationship between overfitting and generalization is often thought of in terms of the bias/variance trade-off:

Figure 2. The classic bias-variance trade-off: As model complexity increases, performance on the training data continues to improve, but is eventually overshadowed by degradation in generalization performance. (Source: Dankers, Traverso, Wee et al. [1])
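For readers who want the trade-off in symbols, the textbook decomposition for squared-error regression is sketched below. It is an illustration under stated assumptions, not something taken from the experiments in this post: we assume data generated as y = f(x) + ε with zero-mean noise of variance σ², and a predictor f̂_D fit on a random training set D. The classification setting with log loss discussed later does not decompose this cleanly.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat f_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

In the classical picture, added model complexity lowers the bias term while raising the variance term, which produces the U-shaped test error curve.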

Avoiding Overfitting May Not Always Help

A quickly expanding literature has raised many questions about the universality of the bias-variance trade-off and the traditional understanding that preventing overfitting is a requirement for good out-of-sample performance. While this literature originated in deep learning for image classification, its findings have been shown to apply to an ever-wider range of problems. We won’t attempt a full review of the literature here, but we will summarize some key points:

  1. Across many problem domains, models that heavily overfit the training data perform better than the best models that do not. This observation has been replicated across many model architectures, including the kinds of boosted tree models experimented with here. ([2],[3],[4],[5],[6])
  2. Performance on held-out data (aka generalization performance) seems to exhibit a double dip (called “double descent” in the literature), wherein the generalization performance degrades as a model becomes overfit (the usual bias-variance trade-off), but improves again as the model continues to overfit the training data (the interpolating regime).

    Figure 3. Curves for training risk (dashed line) and test risk (solid line). (a) The classical U-shaped risk curve arising from the bias-variance trade-off. (b) The double descent risk curve, which incorporates the U-shaped risk curve (i.e., the “classical” regime) together with the observed behavior from using high-capacity function classes (i.e., the “modern” interpolating regime), separated by the interpolation threshold. The predictors to the right of the interpolation threshold have zero training risk. (Source: Belkin, Hsu, Ma, and Mandal [3])

  3. The extent of these effects is driven in a complex and not well-understood way by the underlying complexity of the data, capacity of the model and relative size of the training data set. This has resulted in seemingly counterintuitive results such as better models being trained from smaller data sets.
  4. The gains from training overfit models may require that the true data distribution is long-tailed and the trained models are sufficiently complex. ([6])

Interpolation and Our Data

Of course, our first thought is, how does this new literature apply to our problem? Cybersecurity data exhibits many of the characteristics that may be important in determining the effectiveness of regularized vs. overfit models: 

  1. Cybersecurity data is extremely “long-tailed”: The distribution of behaviors, files and other data has immense variety, and it is common to encounter rare or even unique data points.
  2. For many cybersecurity problems, it is hard to know if even extremely large data sets are truly large relative to the inherent data complexity.
  3. Many of our models have the capacity to perfectly fit training data.

However, because this literature is so new and still quite active, it is far from clear whether generalization in our setting can benefit from training highly overfit models. Our goal with the experiments discussed next is to begin to answer this question. 

The Problem Space

We focus here on the specific task of training a boosted tree model on process-level behavioral data to identify malicious activity. This behavioral data consists of a huge variety of events attempting to capture everything that a process did on a system. These descriptions of behaviors can be aggregated together for a single process and turned into numeric features. The result is a feature vector where each numeric value represents a separate feature. As an example, a feature vector like (5.7, 1, 0, 3, 4, 1, …, 56.789, …) might mean that the process ran for 5.7 seconds of kernel time, created a user account, did not write any files, made three DNS requests, etc.
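The aggregation step described above can be sketched roughly as follows. This is a minimal illustration only: the event types, field names and the four-feature layout are hypothetical, invented to mirror the example in the text, and say nothing about the actual event schema or feature set.

```python
from collections import Counter

# Hypothetical feature layout, chosen only to mirror the example in the text.
FEATURE_ORDER = ["kernel_seconds", "user_account_created", "files_written", "dns_requests"]


def build_feature_vector(events):
    """Aggregate one process's raw events into a fixed-order numeric vector."""
    totals = Counter()
    for event in events:
        kind = event["type"]
        if kind == "kernel_time":
            # Durations are summed across events.
            totals["kernel_seconds"] += event["seconds"]
        elif kind == "user_account_created":
            # Binary indicator: did this behavior occur at all?
            totals["user_account_created"] = 1
        elif kind == "file_write":
            totals["files_written"] += 1
        elif kind == "dns_request":
            totals["dns_requests"] += 1
    # Features with no matching events default to 0.
    return [totals[name] for name in FEATURE_ORDER]


# Made-up events for a single process.
events = [
    {"type": "kernel_time", "seconds": 5.7},
    {"type": "user_account_created"},
    {"type": "dns_request"},
    {"type": "dns_request"},
    {"type": "dns_request"},
]
vector = build_feature_vector(events)
print(vector)
```

Keeping the feature order fixed is what lets every process map to a vector of the same length, whatever mix of events it produced.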

In our specific case, our feature vector has on the order of 55,000 features. As a result, even though the space of possible behaviors is vast, a priori it appears likely that our model is sufficiently complex to perfectly fit our training data. 

We were motivated to experiment with these particular models primarily because they exhibit some of the characteristics that the literature has suggested play a role in determining whether generalization can be improved in the interpolating regime:

  1. Behavioral data is long-tailed.
  2. Our training data set is necessarily small, relative to the extremely long tail of possible behavior.
  3. We train complex models with relatively high dimensions of features.

Our Experiments

We experiment with training memorized/overfit versions of a boosted tree model on the behavioral data described above. In our typical training regime, we actively attempt to avoid overfitting. For the models here, that is primarily accomplished by including regularization terms in the loss function, restricting the depth of trees, and early stopping (carefully monitoring model performance on held-out data, and ending training once the model is no longer improving on the holdout set).

For this experiment:

  1. We remove all regularizing parameters.
  2. We increase the maximum depth of the trees trained at each round.
  3. Most importantly, we continue training for many iterations without regard to the models’ performance on the test set. 

We also compare the model performance to a “traditionally” trained model with regularization and early stopping. 
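The difference between the two training regimes can be sketched in a few lines. This is a simplified illustration rather than the training code used here: the configuration keys, their values and the holdout losses are all hypothetical, and the patience-based rule is just one common way to implement early stopping.

```python
def early_stopping_round(holdout_losses, patience):
    """Return the 1-based round at which training would stop.

    Training halts once the holdout loss has failed to improve on its best
    value for `patience` consecutive rounds.
    """
    best_loss = float("inf")
    rounds_since_best = 0
    for round_number, loss in enumerate(holdout_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            rounds_since_best = 0
        else:
            rounds_since_best += 1
            if rounds_since_best >= patience:
                return round_number
    return len(holdout_losses)


def rounds_trained(config, holdout_losses):
    """Number of boosting rounds run under a given (hypothetical) config."""
    if config["patience"] is None:
        # Interpolating regime: ignore holdout performance and run every round.
        return len(holdout_losses)
    return early_stopping_round(holdout_losses, config["patience"])


# Hypothetical settings; names and values are illustrative only.
regularized_config = {"l2_penalty": 1.0, "max_depth": 6, "patience": 3}
interpolating_config = {"l2_penalty": 0.0, "max_depth": 16, "patience": None}

# Made-up holdout losses that fall, then rise as the model overfits.
holdout_losses = [0.60, 0.45, 0.40, 0.41, 0.43, 0.46, 0.50, 0.55]

print(rounds_trained(regularized_config, holdout_losses))
print(rounds_trained(interpolating_config, holdout_losses))
```

With these made-up numbers the regularized run stops a few rounds after its best holdout loss, while the interpolating run keeps going to the final round, which is the behavior the third step above calls for.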

So, how do our interpolated models perform?

The Surprising (or Not?) Performance of Our Interpolated Models

It turns out even with these limited initial experiments, we generate some compelling results and are able to observe many of the characteristic features noted in the interpolation literature. In particular, we observe:

  1. The “double dip” phenomenon in model performance.
  2. Model performance continues to improve well after the model has fully memorized the training data. 

Results

The first thing to note is that after 500 rounds of boosting with no regularization and deep trees, we do indeed appear to have a model that is solidly in the interpolation regime. This can be seen in the model’s performance on the training set as we add trees (i.e., iterate through boosting rounds).

In fact, the model is able to perfectly classify the training data well before the hundredth round of boosting. Unfortunately, the observed log loss on the held-out data does not obviously exhibit the “double dip” described in the literature. Rather, the log loss on the holdout (test) data seems to imply that the classical view holds: The best performance in terms of log loss appears to be reached early on and then degrades as the model continues to fit the training data, as seen in Figure 4.

Figure 4. Log loss on training data vs holdout (test) data as trees are added and model complexity is increased
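For reference, log loss here means mean binary cross-entropy. The sketch below uses made-up labels and scores, not data from these experiments, to show how heavily it penalizes confident mistakes: a single positive example scored near zero can dominate the average even when the ordering of examples by score is unchanged.

```python
import math


def log_loss(labels, scores):
    """Mean binary cross-entropy; labels are 0/1, scores are P(label == 1)."""
    total = 0.0
    for label, score in zip(labels, scores):
        total += -math.log(score if label == 1 else 1.0 - score)
    return total / len(labels)


def ranking(scores):
    """Indices of the examples ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Made-up holdout labels and two sets of scores, for illustration only.
labels = [1, 1, 0, 0, 1]
moderate_scores = [0.9, 0.8, 0.2, 0.1, 0.4]
confident_scores = [0.999, 0.99, 0.001, 0.0001, 0.01]

moderate_loss = log_loss(labels, moderate_scores)
confident_loss = log_loss(labels, confident_scores)
same_ranking = ranking(moderate_scores) == ranking(confident_scores)

print(round(moderate_loss, 3), round(confident_loss, 3), same_ranking)
```

The second score list is more extreme but ranks the examples identically, and its log loss is higher because the last positive is scored at 0.01. Whether this is what drives the holdout curve in Figure 4 is not something the sketch can establish; it only shows that the effect is possible.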

But wait! Although we use log loss as the objective to minimize in training, we do not really care about it in practice. Instead, we care how well our model can identify malicious activity (true positive rate) without generating too many false positives. In other words, better models achieve higher true positive rates (TPRs) at any given false positive rate (FPR).

When we apply this reasoning and visualize how the model’s TPR for a fixed FPR evolves as we add trees to the model, we immediately see the characteristic double dip: The TPR increases, reaches a peak, and then falls, consistent with the traditional bias-variance trade-off story. However, after the model has already reached perfect classification on the training data (0 log loss), the TPR on the holdout data continues to rise. This rise continues for hundreds of additional boosting rounds, even though the model has seemingly trained to completion on the training data. It should be stressed how surprising this is from the perspective of the traditional bias-variance trade-off, and it supports the view that the literature on the performance of interpolated models applies to our domain.
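The metric itself, TPR at a fixed FPR, can be computed as in the sketch below. The function name, the tie-handling rule (flag only scores strictly above the threshold) and the sample data are our own choices for illustration, not the evaluation code behind the figures.

```python
def tpr_at_fpr(labels, scores, target_fpr):
    """True positive rate at the lowest threshold that keeps FPR <= target_fpr.

    An example is flagged when its score is strictly greater than the threshold.
    """
    negatives = sorted((s for l, s in zip(labels, scores) if l == 0), reverse=True)
    positives = [s for l, s in zip(labels, scores) if l == 1]
    # Number of benign examples we may flag without exceeding the target FPR.
    allowed_false_positives = int(target_fpr * len(negatives))
    if allowed_false_positives >= len(negatives):
        threshold = float("-inf")
    else:
        # Flag only scores above the first negative we are not allowed to flag.
        threshold = negatives[allowed_false_positives]
    true_positives = sum(1 for s in positives if s > threshold)
    return true_positives / len(positives)


# Made-up scores: 10 benign (label 0) and 5 malicious (label 1) examples.
benign_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.5, 0.7, 0.95]
malicious_scores = [0.99, 0.9, 0.8, 0.6, 0.4]
labels = [0] * len(benign_scores) + [1] * len(malicious_scores)
scores = benign_scores + malicious_scores

print(tpr_at_fpr(labels, scores, 0.1))
print(tpr_at_fpr(labels, scores, 0.2))
```

Note that very low targets such as 0.0001 only allow a nonzero number of false positives when the holdout set contains at least 10,000 benign examples; with fewer, this sketch falls back to the threshold that flags no benign example at all.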

Figure 5. Evolution of TPR at a fixed FPR of 0.0001 as trees are added to the model to increase complexity

 

Figure 6. Evolution of TPR at a fixed FPR of 0.001 as trees are added to the model to increase complexity

More generally, we can observe the evolution of model performance across the entire range of FPRs by visualizing how the model’s receiver operating characteristic (ROC) curves (graphs of TPR vs. FPR) shift as we iterate over the number of trees. 

Figure 7. Evolution of ROC curves as trees are added to the model
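A ROC curve for a single model snapshot can be traced by sweeping the decision threshold over the observed scores, as in this minimal sketch. The data is made up, and the quadratic-time loop is chosen for readability rather than efficiency.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per distinct score threshold."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, flagging score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
        fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
        points.append((fp / num_neg, tp / num_pos))
    return points


# Made-up labels (1 = malicious) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = roc_points(labels, scores)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Repeating this for the model truncated at different numbers of trees would give one curve per snapshot, which is the kind of comparison Figure 7 describes.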

Comparison to Other Models

Unfortunately, our interpolated model does not perform as well as our benchmark model trained using regularization and early stopping. However, this is likely simply a reflection of the limited nature of our experimentation, rather than a repudiation of the performance of an interpolated model. Specifically:

  1. Our benchmark model was trained with near-optimal hyper-parameters resulting from extensive hyper-parameter search, whereas we have done no hyper-parameter tuning on the interpolated model.
  2. Our interpolated model was still improving in the latest rounds of training. It is likely that with additional trees, it would yield an even higher performance.

As a result, given our observations here, it appears likely that with some additional experimentation, an interpolated model can be trained that outperforms our benchmark.

Conclusions and Next Steps

What can we take away from this small bit of experimentation? And what are the limitations of what we have done here?

  1. We know: In our domain, interpolation really does result in improved performance beyond the traditional bias-variance performance peak. Models continue to improve well into the interpolating regime.
  2. We do not know: Whether interpolated models are the best, because we have not exhaustively searched through possible hyper-parametrizations or benchmarked against all possible regularized models. 

While these results are very preliminary, they are compelling enough to warrant further investigation. In particular, it appears plausible that an appropriately tuned interpolated model may achieve the best generalization performance of any model we could train. 

Citations and Sources

  1. Dankers FJWM, Traverso A, Wee L, et al. Prediction Modeling Methodology. 2018 Dec 22. In: Kubben P, Dumontier M, Dekker A, editors. Fundamentals of Clinical Data Science [Internet]. Cham (CH): Springer; 2019. Fig. 8.3, [The bias-variance tradeoff. With increased…]. Available from: https://www.ncbi.nlm.nih.gov/books/NBK543534/figure/ch8.Fig3/ doi: 10.1007/978-3-319-99713-1_8
  2. Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov. Does Data Interpolation Contradict Statistical Optimality? https://arxiv.org/abs/1806.09471
  3. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning practice and bias-variance trade-off. https://arxiv.org/pdf/1812.11118.pdf
  4. Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. https://arxiv.org/pdf/1903.08560.pdf
  5. Adam J Wyner, Matthew Olson, Justin Bleich. Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers. https://arxiv.org/pdf/1504.07676.pdf
  6. Vitaly Feldman. Does Learning Require Memorization? A Short Tale about a Long Tail. https://arxiv.org/pdf/1906.05271.pdf

Additional Resources


