Los Alamos AI Identifies 29 Times More Types of Malware

Los Alamos National Laboratory is using artificial intelligence to address several critical shortcomings in large-scale malware analysis, making significant advancements in the classification of Microsoft Windows malware and paving the way for enhanced cybersecurity measures. Using their approach, the team set a new world record in classifying malware families.

“Artificial intelligence methods developed for cyber-defense systems, including systems for large-scale malware analysis, need to consider real-world challenges,” said Maksim Eren, a scientist in Advanced Research in Cyber Systems at Los Alamos. “Our method addresses several of them.”

Cyber defense teams need to quickly identify infected machines and malicious programs. These malicious programs can be uniquely crafted for their victims, which makes gathering large numbers of samples for traditional machine learning methods difficult.

The new method remains accurate when some malware families are represented by many samples and others by only a few, a condition known as class imbalance, which lets it detect both rare and prominent malware families. It can also decline to make a prediction when it is not confident in its answer. This could give security analysts the confidence to apply these techniques in practical, high-stakes settings such as cyber defense, where novel threats must be detected. Distinguishing novel threats from known types of malware is essential for developing mitigation strategies. The method also maintains its performance when only limited data is available for training.
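The idea of declining to predict when confidence is low can be illustrated with a minimal sketch. This is not the Los Alamos team's implementation: the function name, the score dictionary, the family names and the 0.7 threshold are all assumptions chosen for illustration.

```python
from typing import Dict, Optional


def predict_with_rejection(scores: Dict[str, float],
                           threshold: float = 0.7) -> Optional[str]:
    """Return the best-scoring family, or None to abstain.

    `scores` maps a (hypothetical) family name to a confidence in [0, 1].
    The threshold of 0.7 is an arbitrary illustrative value.
    """
    if not scores:
        return None  # nothing to choose from, so abstain
    family, confidence = max(scores.items(), key=lambda kv: kv[1])
    # Abstain when even the best candidate is not convincing enough.
    return family if confidence >= threshold else None


# A confident case and an ambiguous case (family names are made up).
confident = predict_with_rejection({"family_a": 0.92, "family_b": 0.08})
ambiguous = predict_with_rejection({"family_a": 0.40, "family_b": 0.35,
                                    "family_c": 0.25})
```

In a workflow like the one described, a sample for which the function returns `None` would be passed to a human analyst as a possible novel threat rather than being forced into a known family.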

The paper sets a new world record by simultaneously classifying an unprecedented number of malware families, surpassing prior work by a factor of 29, while operating under difficult real-world conditions: limited data, extreme class imbalance and the presence of novel malware families.

ACM Transactions on Privacy and Security – Semi-Supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection

Abstract
Identification of the family to which a malware specimen belongs is essential in understanding the behavior of the malware and developing mitigation strategies. Solutions proposed by prior work, however, are often not practicable due to the lack of realistic evaluation factors. These factors include learning under class imbalance, the ability to identify new malware, and the cost of production-quality labeled data. In practice, deployed models face prominent, rare, and new malware families. At the same time, obtaining a large quantity of up-to-date labeled malware for training a model can be expensive. In this article, we address these problems and propose a novel hierarchical semi-supervised algorithm, which we call the HNMFk Classifier, that can be used in the early stages of the malware family labeling process. Our method is based on non-negative matrix factorization with automatic model selection, that is, with an estimation of the number of clusters. With HNMFk Classifier, we exploit the hierarchical structure of the malware data together with a semi-supervised setup, which enables us to classify malware families under conditions of extreme class imbalance. Our solution can perform abstaining predictions, or rejection option, which yields promising results in the identification of novel malware families and helps with maintaining the performance of the model when a low quantity of labeled data is used. We perform bulk classification of nearly 2,900 both rare and prominent malware families, through static analysis, using nearly 388,000 samples from the EMBER-2018 corpus. In our experiments, we surpass both supervised and semi-supervised baseline models with an F1 score of 0.80.
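As a rough illustration of the factorization that the abstract builds on, the sketch below implements plain non-negative matrix factorization with the standard Lee-Seung multiplicative updates and assigns each sample to its strongest component. It is not the HNMFk Classifier: the number of clusters `k` is fixed by hand rather than estimated, and there is no hierarchy, no semi-supervised labeling and no rejection option. All function names and the toy matrix are hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(X, W, H):
    """Frobenius norm of X - WH."""
    WH = matmul(W, H)
    return sum((x - y) ** 2
               for rx, ry in zip(X, WH) for x, y in zip(rx, ry)) ** 0.5


def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


def cluster_of(W):
    """Assign each sample (row of W) to its largest component."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]


# Toy feature matrix: rows are samples, columns are non-negative features.
X = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 4, 3]]
W, H = nmf(X, k=2)
clusters = cluster_of(W)
```

In a semi-supervised setting like the one the abstract describes, the few samples with known family labels in each cluster could then be used to label their unlabeled neighbors. How the paper does this, and how it estimates `k`, is not covered by this sketch.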