File(s) under permanent embargo
Learning Sparse Log-Ratios for High-Throughput Sequencing Data
journal contribution
posted on 2021-02-12, 00:00 authored by Elliott Gordon-Rodriguez, Thomas Quinn, John P CunninghamAbstractThe automatic discovery of interpretable features that are associated to an outcome of interest is a central goal of bioinformatics. In the context of high-throughput genetic sequencing data, and Compositional Data more generally, an important class of features are the log-ratios between subsets of the input variables. However, the space of these log-ratios grows combinatorially with the dimension of the input, and as a result, existing learning algorithms do not scale to increasingly common high-dimensional datasets. Building on recent literature on continuous relaxations of discrete latent variables, we design a novel learning algorithm that identifies sparse log-ratios several orders of magnitude faster than competing methods. As well as dramatically reducing runtime, our method outperforms its competitors in terms of sparsity and predictive accuracy, as measured across a wide range of benchmark datasets.