Deakin University

Energy-based anomaly detection for mixed data

Version 2 2024-06-04, 12:04
Version 1 2018-07-09, 14:00
journal contribution
posted on 2024-06-04, 12:04 authored by K Do, Truyen Tran, Svetha Venkatesh
Anomalies are data points that deviate significantly from the norm. Thus, anomaly detection amounts to finding data points located far away from their neighbors, i.e., those lying in low-density regions. Classic anomaly detection methods are largely designed for a single data type, either continuous or discrete. However, real-world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Mixed data poses multiple challenges, including (a) capturing the inter-type correlation structures and (b) measuring deviation from the norm under multiple types. These challenges are exacerbated under (c) high-dimensional regimes. In this paper, we propose a new scalable unsupervised anomaly detection method for mixed data based on the Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that estimates the density of mixed data. We propose to use the free energy derived from Mv.RBM as the anomaly score, as it is identical to the negative log-density of the data up to an additive constant. We then extend this method to detect anomalies across multiple levels of data abstraction, an effective approach to deal with high-dimensional settings. The extension is dubbed MIXMAD, which stands for MIXed data Multilevel Anomaly Detection. In MIXMAD, we sequentially construct an ensemble of mixed-data Deep Belief Nets (DBNs) with varying depths. Each DBN is an energy-based detector at a predefined abstraction level. Predictions across the ensemble are finally combined via a simple rank aggregation method. The proposed methods are evaluated on a comprehensive suite of synthetic and real high-dimensional datasets. The results demonstrate that for anomaly detection, (a) proper handling of mixed types is necessary, (b) free energy is a powerful anomaly scoring method, (c) multilevel abstraction of data is important for high-dimensional data, and (d) empirically, Mv.RBM and MIXMAD are superior to popular unsupervised detection methods for both homogeneous and mixed data.
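
The two scoring ideas in the abstract, free energy as an anomaly score and rank aggregation across detectors, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy mixed RBM with binary hidden units, one-hot encoded discrete attributes, and unit-variance Gaussian continuous attributes; the parameters are untrained placeholders, and training is not shown. The function names `free_energy` and `aggregate_ranks` are hypothetical, and averaging ranks is an assumed stand-in for the "simple rank aggregation method", which the abstract does not specify.

```python
import numpy as np


def free_energy(x_bin, x_cont, params):
    """Free energy of a toy mixed RBM (assumed form, not the paper's code).

    x_bin:  (n, d_bin) array of 0/1 values (one-hot encoded discrete attributes).
    x_cont: (n, d_cont) array of standardized continuous attributes.
    params: (a_bin, a_cont, b, W_bin, W_cont) visible biases, hidden bias, weights.
    Higher free energy means lower model density, so more anomalous.
    """
    a_bin, a_cont, b, W_bin, W_cont = params
    # Visible-only energy: linear for binary units, quadratic for Gaussian units.
    e_vis = -x_bin @ a_bin + 0.5 * np.sum((x_cont - a_cont) ** 2, axis=1)
    # Each hidden unit pools evidence from both data types.
    pre = b + x_bin @ W_bin + x_cont @ W_cont
    # Summing out the binary hidden units leaves one softplus term per unit.
    return e_vis - np.sum(np.logaddexp(0.0, pre), axis=1)


def aggregate_ranks(score_lists):
    """Average per-detector ranks (rank 0 = lowest score); an assumed scheme."""
    ranks = [np.argsort(np.argsort(s)) for s in score_lists]
    return np.mean(ranks, axis=0)


# Placeholder parameters: 2 binary inputs, 2 continuous inputs, 3 hidden units.
params = (np.zeros(2), np.zeros(2), np.zeros(3), np.zeros((2, 3)), np.zeros((2, 3)))
x_bin = np.array([[1.0, 0.0], [0.0, 1.0]])
x_cont = np.array([[0.0, 0.0], [3.0, 3.0]])
scores = free_energy(x_bin, x_cont, params)  # second point scores higher
```

With all-zero weights the score reduces to the squared distance from the Gaussian mean minus a constant, which makes the "negative log-density up to an additive constant" reading easy to check by hand. In a multilevel setting, one score vector per detector would be passed to `aggregate_ranks` to obtain a combined ordering.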

History

Journal

Knowledge and Information Systems

Volume

57

Pagination

413-435

Location

London, Eng.

ISSN

0219-1377

eISSN

0219-3116

Language

English

Publication classification

C1 Refereed article in a scholarly journal

Copyright notice

2018, Springer-Verlag London Ltd., part of Springer Nature

Issue

2

Publisher

SPRINGER LONDON LTD