Deakin University
File(s) under permanent embargo

Amalgams: data-driven amalgamation for the dimensionality reduction of compositional data

Version 3 2024-06-19, 01:25
Version 2 2024-06-06, 12:16
Version 1 2021-02-18, 08:17
journal contribution
posted on 2024-06-19, 01:25 authored by Thomas P Quinn, Ionas Erb
Abstract

Many next-generation sequencing datasets contain only relative information because of biological and technical factors that limit the total number of transcripts observed for a given sample. It is not possible to interpret any one component in isolation. The field of compositional data analysis has emerged with alternative methods for relative data based on log-ratio transforms. However, these data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method called data-driven amalgamation. Our new method, implemented in the user-friendly R package amalgam, can reduce the dimensionality of compositional data by finding amalgamations that optimally (i) preserve the distance between samples, or (ii) classify samples as diseased or not. Our benchmark on 13 real datasets confirms that these amalgamations compete with state-of-the-art methods in terms of performance, but result in new features that are easily understood: they are groups of parts added together.
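As an illustration of what the abstract means by amalgamation, "groups of parts added together", the following is a minimal Python sketch of the summation step only. It is not the API of the R package amalgam, and it does not show the data-driven search for an optimal grouping. The function names `closure` and `amalgamate`, the five-part sample, and the group assignment are all hypothetical and chosen purely for illustration.

```python
from math import isclose

def closure(parts):
    """Rescale non-negative parts so they sum to 1 (a composition)."""
    total = sum(parts)
    return [p / total for p in parts]

def amalgamate(composition, groups):
    """Sum the parts assigned to each amalgam.

    groups[i] is the index of the amalgam that part i is added to.
    """
    n_amalgams = max(groups) + 1
    amalgams = [0.0] * n_amalgams
    for value, g in zip(composition, groups):
        amalgams[g] += value
    return amalgams

# Hypothetical sample with five parts (for example, transcript counts).
sample = closure([10, 30, 20, 25, 15])

# Hypothetical assignment: parts 0 and 1 form amalgam 0,
# parts 2, 3 and 4 form amalgam 1.
assignment = [0, 0, 1, 1, 1]

reduced = amalgamate(sample, assignment)

# The reduced vector is still a composition: its parts sum to 1.
assert isclose(sum(reduced), 1.0)
```

Here five parts are reduced to two amalgams, each of which is simply the sum of the parts assigned to it. In the method described in the abstract, the assignment would be chosen by optimising an objective such as preserving between-sample distances or classification performance, which this sketch leaves out.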

History

Journal

NAR GENOMICS AND BIOINFORMATICS

Volume

2

Article number

ARTN lqaa076

Pagination

lqaa076 - ?

Location

England

ISSN

2631-9268

eISSN

2631-9268

Language

English

Publication classification

C2 Other contribution to refereed journal

Issue

4

Publisher

OXFORD UNIV PRESS
