Joyce Ho's Term Project Page

From Computational Statistics (CSE383M and CS395T)
Revision as of 03:35, 21 April 2012 by Joyce Whang (talk | contribs)

Clustering Diseases


Significant effort has gone into understanding the relationships between diseases. Research on network models<ref>Goh, K.I. et al. 2007. The human disease network. Proceedings of the National Academy of Sciences. 104, 21 (2007), 8685. </ref><ref>Hidalgo, C.A. et al. 2009. A Dynamic Network Approach for the Study of Human Phenotypes. PLoS Computational Biology. 5, 4 (Apr. 2009), e1000353.</ref><ref>Barabási, A.-L. et al. 2011. Network medicine: a network-based approach to human disease. Nature Reviews Genetics. 12, 1 (Jan. 2011), 56–68.</ref>, collaborative filtering<ref>Davis, D.A. et al. 2008. Predicting individual disease risk based on medical history. Proceeding of the 17th ACM conference on Information and knowledge management. (2008), 769–778. </ref>, and text mining<ref>Roque, F.S. et al. 2011. Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts. PLoS Computational Biology. 7, 8 (Aug. 2011), e1002141.</ref> has been used to better understand illness progression, disease prediction, and correlations between diseases. These works have focused primarily on the elderly population or on a small subset of diseases. The relationships between diseases in a broader population have not been studied, even though hospitals and insurance companies have access to patients' medical histories in the form of International Classification of Diseases, Ninth Revision (ICD-9) claim codes. Given a dataset of ICD-9 claim codes from adult (aged 18 and over) Intensive Care Unit (ICU) patients, I plan to perform hierarchical clustering on the codes to group diagnoses that commonly occur together.


The Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) Clinical Database<ref>Saeed, M. et al. 2011. Multiparameter Intelligent Monitoring in Intensive Care II: A public-access intensive care unit database. Critical Care Medicine. 39, 5 (May. 2011), 952–960.</ref>, a public and freely available dataset, contains data collected over a 7-year period from various Intensive Care Units (ICUs) at Boston's Beth Israel Deaconess Medical Center. The dataset contains 23,697 adult patients with 5,203 unique ICD-9 codes. The plots below show an initial exploration of the data.

Figures: ICU stays per patient (Stays-per-patient.png), ICD-9 codes per patient (Code-per-patient.png), ICD-9 codes per ICU stay (Code-per-stay.png).

The following table summarizes the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for the number of ICU stays per patient, ICD-9 codes per ICU stay, and ICD-9 codes per patient.

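{| class="wikitable"
! !! Min. !! Q1 !! Median !! Q3 !! Max.
|-
| ICU Stays Per Patient || 1 || 1 || 1 || 1 || 32
|-
| ICD-9 Codes Per ICU Stay || 1 || 7 || 9 || 11 || 39
|-
| ICD-9 Codes Per Patient || 1 || 7 || 9 || 13 || 308
|}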
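Summaries of this kind can be computed from a flat list of diagnosis records. The following is only a minimal sketch in Python on made-up records, not the code used to produce the numbers above. The record layout `(patient_id, icu_stay_id, icd9_code)`, the toy values, and the helper name `five_number_summary` are all assumptions for illustration. It also assumes one record per code per stay, and that "codes per patient" counts code assignments across all of a patient's stays.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical toy records: (patient_id, icu_stay_id, icd9_code).
# Assumption: one record per code per stay.
records = [
    ("p1", "s1", "428.0"), ("p1", "s1", "427.31"), ("p1", "s2", "428.0"),
    ("p2", "s3", "038.9"), ("p2", "s3", "584.9"), ("p2", "s3", "428.0"),
    ("p3", "s4", "410.71"),
]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two counts."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Distinct ICU stays for each patient.
stays_per_patient = Counter(
    patient for patient, _stay in {(p, s) for p, s, _ in records}
)
# Code records for each stay and for each patient.
codes_per_stay = Counter(stay for _, stay, _ in records)
codes_per_patient = Counter(patient for patient, _, _ in records)

summary = {
    "ICU stays per patient": five_number_summary(list(stays_per_patient.values())),
    "ICD-9 codes per ICU stay": five_number_summary(list(codes_per_stay.values())),
    "ICD-9 codes per patient": five_number_summary(list(codes_per_patient.values())),
}

for name, stats in summary.items():
    print(name, stats)
```

Different quartile conventions give slightly different Q1 and Q3 values on small samples, so the `method="inclusive"` choice here is itself an assumption.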

Hierarchical Clustering

Hierarchical clustering is commonly used to summarize, visualize, and discover relationships. There are two paradigms in hierarchical clustering: agglomerative and divisive. Agglomerative hierarchical clustering starts with each item in its own cluster and recursively merges a pair of clusters into a single cluster, while divisive clustering starts with a single cluster and recursively splits a cluster in two. Agglomerative hierarchical clustering is often displayed graphically as a dendrogram, a tree-like diagram. The dendrogram shows the cluster relationships as well as the merge order. The common merging schemes for agglomerative clustering are single linkage (nearest neighbor), complete linkage (furthest neighbor), median linkage, group average, and Ward's criterion. Agglomerative clustering has been used on microarray data to organize, visualize, and characterize the data<ref>Hastie, T. et al. 2008. The elements of statistical learning. Springer Verlag.</ref><ref>Eisen, M.B. et al. 1998. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences. 95, 25 (Dec. 1998), 14863–14868.</ref><ref>Jiang, D. et al. 2004. Cluster Analysis for Gene Expression Data: A Survey. IEEE Transactions on Knowledge and Data Engineering. 16, 11 (Nov. 2004).</ref>.
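To make the merging procedure concrete, here is a minimal, deliberately naive sketch of agglomerative clustering in Python. It covers only single, complete, and group-average linkage, and it recomputes cluster distances from scratch at every step, so it suits toy inputs only. The function name `agglomerative` and the 4-item distance matrix are made up for illustration.

```python
def agglomerative(dist, linkage="single"):
    """Naive agglomerative clustering over a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of item indices. The merge distances are
    the heights at which the branches of a dendrogram would join.
    """
    combine = {
        "single": min,                            # nearest neighbor
        "complete": max,                          # furthest neighbor
        "average": lambda ds: sum(ds) / len(ds),  # group average
    }[linkage]

    clusters = [(i,) for i in range(len(dist))]   # every item starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical 4-item distance matrix.
dist = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]

for scheme in ("single", "complete", "average"):
    print(scheme, agglomerative(dist, scheme))
```

On this toy matrix, single linkage chains item 2 onto the cluster {0, 1} before item 3 joins, while complete linkage pairs items 2 and 3 first. This shows how the choice of merging scheme can change the resulting tree.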

Computer Code

There are existing hierarchical clustering software packages in R. The base R installation contains basic hierarchical clustering methods that are capable of using a user-defined distance matrix. Additionally, a software package (fastcluster) has been developed that efficiently implements seven clustering schemes (single, complete, average, weighted, Ward, centroid, and median linkage) for agglomerative clustering.

Clustering the MIMIC Data

I will run various agglomerative hierarchical clustering merging schemes on the ICD-9 codes from the MIMIC database. The distance metric for the merging schemes will be a function of the number of times two diseases co-occur during a patient's ICU stay. Although ICD-9 codes already have a hierarchical structure, the hope is that the clustering will reveal relationships that extend beyond this existing hierarchy.
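The exact form of the distance function is left open above. As one possible illustration, the sketch below turns co-occurrence counts into a distance using one minus the Jaccard index of the sets of ICU stays in which each code appears. This particular function is an assumption chosen for the example, not the metric fixed by this proposal. The stay identifiers, the toy code sets, and the name `cooccurrence_distance` are likewise made up.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: ICU stay id -> set of ICD-9 codes recorded for that stay.
stays = {
    "s1": {"428.0", "427.31", "584.9"},
    "s2": {"428.0", "427.31"},
    "s3": {"038.9", "584.9"},
    "s4": {"428.0"},
}

def cooccurrence_distance(stays):
    """Return (codes, dist), where dist[(a, b)] = 1 - Jaccard(a, b).

    Jaccard(a, b) is the number of stays containing both codes divided by
    the number of stays containing either code. This is an assumed choice
    of distance, used here only for illustration.
    """
    stays_with = defaultdict(set)  # code -> set of stay ids containing it
    for stay_id, codes in stays.items():
        for code in codes:
            stays_with[code].add(stay_id)

    codes = sorted(stays_with)
    dist = {}
    for a, b in combinations(codes, 2):
        both = len(stays_with[a] & stays_with[b])    # co-occurrence count
        either = len(stays_with[a] | stays_with[b])
        dist[(a, b)] = 1.0 - both / either
    return codes, dist

codes, dist = cooccurrence_distance(stays)
for pair, d in sorted(dist.items(), key=lambda item: item[1]):
    print(pair, round(d, 3))
```

Codes that never share a stay get the maximum distance of 1, and codes that always appear together get 0. Normalizing by the union keeps very frequent codes from appearing close to everything simply because they are common.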


<references />

It seems that you're using <math>L_1</math> distance (<math>l_1</math> norm of the )