Deakin University

Learning biological sequence types using the literature

conference contribution
posted on 2017-01-01, 00:00 authored by Mohamed Reda Bouadjenek, K Verspoor, J Zobel
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM.

In this paper, we explore automatic biological sequence type classification for records in biological sequence databases. The sequence type attribute provides important information about the nature of the sequence represented in a record, and is often used in search to filter out irrelevant sequences. However, the sequence type attribute is generally a non-mandatory free-text field, and is thus subject to many errors, including typos, mis-assignment, and non-assignment. In GenBank, this problem concerns roughly 18% of records, an alarming number that should worry the biocuration community. To address this problem, we propose using the literature associated with sequence records as an external source of knowledge that can be leveraged for the classification task. We define a set of literature-based features and train a machine learning algorithm to classify a record into one of six primary sequence types. The main intuition behind using the literature for this task is that sequences appear to be discussed differently in scientific articles, depending on their type. The experiments we conducted on the PubMed Central collection show that the literature is indeed an effective way to address sequence type classification. Our classification method reached an accuracy of 92.7% and substantially outperformed the two baseline approaches used for comparison.
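
The intuition stated in the abstract, that articles discuss sequences differently depending on their type, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the features or the learning algorithm, so a bag-of-words multinomial naive Bayes classifier stands in for them here, and the class name, the training sentences, and the two labels ("DNA", "mRNA") are hypothetical toy data rather than the paper's six sequence types or its corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:()") for word in text.lower().split())
    return [t for t in tokens if t]


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one smoothing over word counts.

    Hypothetical stand-in for the (unspecified) learner in the abstract.
    """

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of texts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy sentences standing in for article text that mentions a sequence record.
train_texts = [
    "the genomic locus was amplified from chromosomal DNA",
    "genomic DNA was extracted and the locus sequenced",
    "the transcript was reverse transcribed and expression measured",
    "expression of the transcript was quantified by RT-PCR",
]
train_labels = ["DNA", "DNA", "mRNA", "mRNA"]  # illustrative labels only

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("transcript expression was measured in liver tissue")
print(prediction)
```

On this toy data the unseen sentence is assigned the label whose training sentences share its vocabulary, which is the effect the abstract attributes to literature-based features; a real system would need far more text and a properly evaluated model.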

History

Pagination

1991-1994

Location

Singapore

Start date

2017-11-06

End date

2017-11-10

ISBN-13

9781450349185

Language

eng

Publication classification

E1.1 Full written paper - refereed

Title of proceedings

CIKM 2017 : Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Event

Information and Knowledge Management. Conference (2017 : Singapore)

Publisher

ACM

Place of publication

New York, N.Y.
