Stable feature selection for clinical prediction: Exploiting ICD tree structure using Tree-Lasso

Kamkar,I, Gupta,SK, Phung,D and Venkatesh,S 2015, Stable feature selection for clinical prediction: Exploiting ICD tree structure using Tree-Lasso, Journal of biomedical informatics, vol. 53, pp. 277-290, doi: 10.1016/j.jbi.2014.11.013.

Title Stable feature selection for clinical prediction: Exploiting ICD tree structure using Tree-Lasso
Author(s) Kamkar,I
Gupta,SK
Phung,D
Venkatesh,S
Journal name Journal of biomedical informatics
Volume number 53
Start page 277
End page 290
Total pages 14
Publisher Elsevier
Place of publication Amsterdam, The Netherlands
Publication date 2015-02
ISSN 1532-0480
Keyword(s) Classification
Feature selection
Feature stability
Summary Modern healthcare is being reshaped by the growth of Electronic Medical Records (EMR). These records have recently been shown to be of great value for building clinical prediction models. In EMR data, patients' diseases and hospital interventions are captured through a set of diagnosis and procedure codes. These codes are usually organized as a tree (e.g. the ICD-10 tree), and codes within a tree branch may be highly correlated. The codes can be used as features to build a prediction model, and appropriate feature selection can inform a clinician about important risk factors for a disease. Traditional feature selection methods (e.g. Information Gain, T-test) consider each variable independently and usually produce a long feature list. Lasso and related l1-penalty based feature selection methods have recently become popular because they select features jointly. However, when many features are correlated, Lasso is known to select one of them essentially at random. This prevents clinicians from arriving at a stable feature set, which is crucial for clinical decision making. In this paper, we address this problem using a recently proposed Tree-Lasso model. Since the stability behavior of Tree-Lasso is not well understood, we study it and compare it with other feature selection methods. Using one synthetic and two real-world datasets (Cancer and Acute Myocardial Infarction), we show that Tree-Lasso based feature selection is significantly more stable than Lasso and comparable to other methods, e.g. Information Gain, ReliefF and T-test. We further show that, with different types of classifiers such as logistic regression, naive Bayes, support vector machines, decision trees and Random Forest, the classification performance of Tree-Lasso is comparable to Lasso and better than that of the other methods.
Our results have implications for identifying stable risk factors for many healthcare problems and can therefore potentially assist clinical decision making for accurate medical prognosis.
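The summary does not say how feature stability is quantified. As an illustration only, one common choice is the mean pairwise Jaccard similarity between the feature sets selected on different resamples of the training data. The sketch below assumes that measure; the function names and the ICD-10 codes in the example are hypothetical and are not results from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(feature_sets):
    """Mean pairwise Jaccard similarity across feature sets from resampled runs."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three resampled training sets (illustrative
# ICD-10 codes only). In the first case a different sibling code from the
# same branch is picked on each run; in the second the same codes recur.
unstable_runs = [{"I21.0", "E11"}, {"I21.1", "E11"}, {"I21.2", "E11"}]
stable_runs = [{"I21", "E11"}, {"I21", "E11"}, {"I21", "E11"}]

print(selection_stability(unstable_runs))
print(selection_stability(stable_runs))
```

A score near 1 means nearly the same features are selected on every resample. A lower score, as in the first case, corresponds to the behavior the summary attributes to Lasso on correlated codes.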
Language eng
DOI 10.1016/j.jbi.2014.11.013
Field of Research 080109 Pattern Recognition and Data Mining
Socio Economic Objective 970108 Expanding Knowledge in the Information and Computing Sciences
HERDC Research category C1 Refereed article in a scholarly journal
ERA Research output type C Journal article
Copyright notice ©2014, Elsevier
Unless expressly stated otherwise, the copyright for items in DRO is owned by the author, with all rights reserved.

Citation counts: Cited 19 times in TR Web of Science
Cited 28 times in Scopus
Access Statistics: 497 Abstract Views, 3 File Downloads
Created: Fri, 17 Apr 2015, 10:04:09 EST