Improving malicious PDF classifier with feature engineering: A data-driven approach

Falah, A; Pan, Lei; Huda, Shamsul; Pokhrel, Shiva; Anwar, Adnan

journal contribution

posted on 2024-06-05, 04:14 authored by A Falah, Lei Pan, Shamsul Huda, Shiva Pokhrel, Adnan Anwar

Several approaches and tools have been developed to analyse and detect malicious content within PDF documents; however, the fundamental approach to designing these tools and techniques has not been thoroughly considered. Existing tools are based on the available datasets and on observations made during manual maldoc analysis, which makes them susceptible to attacks such as mimicry and parser confusion. We aim to enhance PDF maldoc classification by identifying the most conclusive feature set required to classify PDF maldocs accurately. We extract features using two popular PDF analysis tools and derive an additional set of data-backed features that complements classification. We then evaluate all features through a wrapper function. The features with the highest importance values are used to construct a classifier that outperforms the baseline models in both classification accuracy and efficiency. The proposed method identifies a useful set of tool-independent features that prolongs the lifespan and usability of current tools, and it provides an in-depth understanding of how the chosen features cumulatively affect classification. In addition, we evaluate our findings using real-world samples from VirusTotal. With the proposed technique, we reduced the size of the feature set by more than 60% while increasing classification accuracy by around 2%.
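The abstract does not say which wrapper function or classifier the authors used, so the following is only a minimal sketch of the general idea of wrapper-based feature selection: candidate feature subsets are scored by training and validating a classifier on them. It assumes greedy forward selection with a nearest-centroid classifier, and the feature names (`js_count`, `page_count`, `obj_count`) and the tiny dataset are hypothetical, invented purely for illustration.

```python
# Illustrative sketch only: greedy forward (wrapper) feature selection.
# The classifier, feature names, and data are assumptions, not the paper's.

FEATURES = ["js_count", "page_count", "obj_count"]  # hypothetical names

train_rows = [[0, 5, 10], [0, 1, 12], [1, 9, 11], [0, 3, 9],
              [4, 5, 10], [3, 1, 11], [5, 9, 12], [4, 3, 9]]
train_labels = ["benign"] * 4 + ["malicious"] * 4
val_rows = [[0, 9, 9], [1, 1, 12], [4, 1, 12], [3, 9, 9]]
val_labels = ["benign", "benign", "malicious", "malicious"]


def centroids(rows, labels, cols):
    """Per-class mean vector, restricted to the selected columns."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(cols))
        for k, c in enumerate(cols):
            acc[k] += row[c]
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def accuracy(cols):
    """Validation accuracy of a nearest-centroid classifier on `cols`."""
    cents = centroids(train_rows, train_labels, cols)
    hits = 0
    for row, label in zip(val_rows, val_labels):
        pred = min(cents, key=lambda lab: sum(
            (row[c] - m) ** 2 for c, m in zip(cols, cents[lab])))
        hits += pred == label
    return hits / len(val_rows)


def forward_select(n_features):
    """Add one feature per round while validation accuracy strictly improves."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        round_best, round_col = best, None
        for c in range(n_features):
            if c in selected:
                continue
            score = accuracy(selected + [c])
            if score > round_best:
                round_best, round_col = score, c
        if round_col is None:  # no candidate improves the score: stop
            break
        selected.append(round_col)
        best = round_best
    return selected, best


selected, best_score = forward_select(len(FEATURES))
selected_names = [FEATURES[c] for c in selected]
print(selected_names, best_score)  # ['js_count'] 1.0
```

On this toy data only the first feature separates the two classes, so the search keeps it and discards the other two. A real study would use cross-validation and a stronger classifier rather than a single validation split.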

History

Journal

Future Generation Computer Systems

Volume

115

Pagination

314-326

Location

Amsterdam, The Netherlands

Publisher DOI

https://doi.org/10.1016/j.future.2020.09.015

ISSN

0167-739X

eISSN

1872-7115

Language

English

Publication classification

C1 Refereed article in a scholarly journal

Publisher

ELSEVIER

Keywords

Science & Technology; Technology; Computer Science, Theory & Methods; Computer Science; Feature engineering; Feature aggregation; Machine learning; Malicious PDF; Malware analysis; 4604 Cybersecurity and privacy
