Deakin University

Stable feature selection for clinical prediction: Exploiting ICD tree structure using Tree-Lasso

journal contribution
posted on 2015-02-01, 00:00 authored by Iman Kamkar, Sunil Gupta, Quoc-Dinh Phung, Svetha Venkatesh
Modern healthcare is being reshaped by the growth of Electronic Medical Records (EMR). These records have recently been shown to be of great value for building clinical prediction models. In EMR data, patients' diseases and hospital interventions are captured through a set of diagnosis and procedure codes. These codes are usually organized as a tree (e.g. the ICD-10 tree), and the codes within a tree branch may be highly correlated. The codes can be used as features to build a prediction model, and appropriate feature selection can inform a clinician about important risk factors for a disease. Traditional feature selection methods (e.g. Information Gain, T-test) consider each variable independently and usually produce a long feature list. Recently, Lasso and related l1-penalty based feature selection methods have become popular because they select features jointly. However, Lasso is known to pick one feature at random from among many correlated features. This prevents clinicians from arriving at a stable feature set, which is crucial for the clinical decision-making process. In this paper, we address this problem using the recently proposed Tree-Lasso model. Since the stability behavior of Tree-Lasso is not well understood, we study it and compare it with that of other feature selection methods. Using one synthetic and two real-world datasets (Cancer and Acute Myocardial Infarction), we show that Tree-Lasso based feature selection is significantly more stable than Lasso and comparable to other methods, e.g. Information Gain, ReliefF and T-test. We further show that, across different types of classifiers such as logistic regression, naive Bayes, support vector machines, decision trees and Random Forest, the classification performance of Tree-Lasso is comparable to that of Lasso and better than that of the other methods. Our results have implications for identifying stable risk factors in many healthcare problems and can therefore potentially assist clinical decision making for accurate medical prognosis.
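
To make the notion of a "stable feature set" concrete, the following minimal sketch scores stability as the mean pairwise Jaccard similarity between the feature sets selected on different resamples of the data. The choice of Jaccard similarity, the function names, and the example code sets are assumptions made for illustration; they are not taken from the paper, and the selections shown are hypothetical rather than real results.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (defined as 1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(feature_sets):
    """Mean pairwise Jaccard similarity across the selections from each resample."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three resamples (illustrative only).
# I21.0 and I21.1 stand in for correlated sibling codes under one ICD-10 parent.
# A selector that keeps only one of the correlated siblings, varying by resample:
unstable = [{"I21.0", "E11"}, {"I21.1", "E11"}, {"I21.0", "E11"}]
# A selector that keeps the correlated siblings together on every resample:
stable = [{"I21.0", "I21.1", "E11"}] * 3

print(selection_stability(unstable))
print(selection_stability(stable))
```

A score of 1.0 means every resample produced the same feature set; lower scores mean the selected set changes from one resample to the next, which is the behavior the abstract attributes to Lasso when features are correlated.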

History

Journal

Journal of Biomedical Informatics

Volume

53

Pagination

277 - 290

Publisher

Elsevier

Location

Amsterdam, The Netherlands

eISSN

1532-0480

Language

eng

Publication classification

C Journal article; C1 Refereed article in a scholarly journal

Copyright notice

2014, Elsevier