MEG: Masked Ensemble Tabular Data Generator

conference contribution
posted on 2024-03-12, 02:43 authored by Y Zhang, Nayyar Zaidi, Gang Li, W Buntine
Tabular data generation has seen renewed interest with the advent of Generative Adversarial Networks (GAN). Recently, it has been shown that one can use a Bayesian network as either the generator or the discriminator in the GAN framework, resulting in an algorithm known as GANBLR, which gives state-of-the-art results for tabular data generation. However, the model has one limitation: it uses class attributes during model training, i.e., a supervised Bayesian network is needed as the generator at training time. This makes GANBLR inapplicable to cases where class information is unavailable. Addressing this shortcoming of GANBLR is the main motivation of this work. We propose a new model for tabular data generation, the Masked Ensemble Tabular Generator (MEG), which does not require class labels to generate tabular data. The proposed model relies on a novel strategy of using a collection of Bayesian networks as the generator, and on masking operations to train the generator efficiently. It also uses a group-based similarity measure to adjust the number of samples generated from each Bayesian network in the collection. We perform extensive experiments on a variety of datasets and demonstrate that MEG not only outperforms baselines that do not use class information during training, such as CTGAN and TVAE, but also outperforms baselines that do have access to class information during training, such as TableGAN and CTAB-GAN. Its machine-learning utility is nearly on par with that of GANBLR, with the considerable advantage of being truly unsupervised, which we highlight by demonstrating its applicability to a clustering task. We also investigate the privacy-preserving capabilities of MEG and demonstrate its superior performance compared to other baselines.
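
The abstract describes three ingredients: an ensemble of Bayesian-network generators, attribute masking during training, and a group-based similarity measure that decides how many rows each ensemble member contributes. The Python sketch below is a minimal illustration of that allocation idea, not the authors' implementation: it stands in simple independent per-column categorical models for the Bayesian networks, applies a random attribute mask to each ensemble member, and uses a total-variation-style similarity score as a stand-in for the paper's group-based measure. All function names and modelling choices here are simplifying assumptions.

# Illustrative sketch only; the per-column categorical "members", the random
# masks, and the similarity score are all placeholders for MEG's actual
# Bayesian networks, masking scheme, and group-based similarity measure.
import numpy as np

rng = np.random.default_rng(0)

def fit_member(data, mask):
    # Stand-in for one Bayesian network: fit a categorical distribution per
    # visible column; columns hidden by the mask fall back to uniform.
    dists = []
    for j in range(data.shape[1]):
        k = data[:, j].max() + 1
        counts = (np.bincount(data[:, j], minlength=k).astype(float)
                  if mask[j] else np.ones(k))
        dists.append(counts / counts.sum())
    return dists

def sample_member(dists, n):
    # Draw n rows, sampling each column from its fitted distribution.
    return np.column_stack([rng.choice(len(p), size=n, p=p) for p in dists])

def similarity(real, fake):
    # 1 minus the mean per-column total-variation distance between the
    # real and generated marginals (a crude similarity proxy).
    tv = 0.0
    for j in range(real.shape[1]):
        k = max(real[:, j].max(), fake[:, j].max()) + 1
        p = np.bincount(real[:, j], minlength=k) / len(real)
        q = np.bincount(fake[:, j], minlength=k) / len(fake)
        tv += 0.5 * np.abs(p - q).sum()
    return 1.0 - tv / real.shape[1]

# Toy discrete dataset: 3 categorical columns, no class label required.
real = rng.integers(0, 3, size=(500, 3))

# Train an ensemble of masked members.
members = [fit_member(real, rng.random(real.shape[1]) > 0.3)
           for _ in range(4)]

# Score each member on a probe sample, then allocate the output budget
# proportionally to similarity, so better members generate more rows.
scores = np.array([similarity(real, sample_member(m, 200)) for m in members])
budget = (scores / scores.sum() * 1000).astype(int)

synthetic = np.vstack([sample_member(m, n) for m, n in zip(members, budget)])
print(synthetic.shape, budget)

The key design point the sketch tries to convey is the feedback loop between similarity and allocation: members whose output distribution is closer to the real data are given a larger share of the generation budget, without any class label ever entering training.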

History

Volume

00

Pagination

838-847

Location

Shanghai, China

Start date

2023-12-01

End date

2023-12-04

ISSN

1550-4786

ISBN-13

9798350307887

Language

eng

Title of proceedings

Proceedings - IEEE International Conference on Data Mining, ICDM

Event

2023 IEEE International Conference on Data Mining (ICDM)

Publisher

IEEE

Place of publication

Piscataway, N.J.
