Difference between revisions of "Joyce Ho's Term Project Page"
Revision as of 03:44, 21 April 2012
Significant effort has gone into understanding relationships between diseases. Research in the area of network models<ref>Goh, K.I. et al. 2007. The human disease network. Proceedings of the National Academy of Sciences. 104, 21 (2007), 8685. </ref><ref>Hidalgo, C.A. et al. 2009. A Dynamic Network Approach for the Study of Human Phenotypes. PLoS Computational Biology. 5, 4 (Apr. 2009), e1000353.</ref><ref>Barabási, A.-L. et al. 2011. Network medicine: a network-based approach to human disease. Nature Reviews Genetics. 12, 1 (Jan. 2011), 56–68.</ref>, collaborative filtering<ref>Davis, D.A. et al. 2008. Predicting individual disease risk based on medical history. Proceeding of the 17th ACM conference on Information and knowledge management. (2008), 769–778. </ref>, and text mining<ref>Roque, F.S. et al. 2011. Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts. PLoS Computational Biology. 7, 8 (Aug. 2011), e1002141.</ref> has been used to better understand illness progression, disease prediction, and correlation between diseases. These works have focused primarily on the elderly population or a small subset of diseases. A study of the relationships between diseases for a broader population has not been done, even though hospitals and insurance companies have access to patients' medical histories through International Classification of Diseases, Ninth Revision (ICD-9) claim codes. Given a dataset of ICD-9 claim codes from adult (aged 18+) Intensive Care Unit (ICU) patients, I plan to perform hierarchical clustering on the codes to group common diagnoses.
The Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) Clinical Database<ref>Saeed, M. et al. 2011. Multiparameter Intelligent Monitoring in Intensive Care II: A public-access intensive care unit database. Critical Care Medicine. 39, 5 (May. 2011), 952–960.</ref>, a public and freely available dataset, contains data collected over a 7-year period from various Intensive Care Units (ICUs) at Boston's Beth Israel Deaconess Medical Center. The dataset contains 23,697 adult patients with 5,203 unique ICD-9 codes. The plots below show an initial exploration of the data.
The following table summarizes the minimum, first quartile (Q1), median, third quartile (Q3), and maximum for the number of ICU stays per patient, ICD-9 codes per ICU stay, and ICD-9 codes per patient.
|Statistic||Min||Q1||Median||Q3||Max|
|ICU Stays Per Patient||1||1||1||1||32|
|ICD-9 Codes Per ICU Stay||1||7||9||11||39|
|ICD-9 Codes Per Patient||1||7||9||13||308|
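A summary like the one in the table can be computed along the following lines. This is only a sketch: the `rows` layout, the placeholder codes, and the helper name `five_number_summary` are assumptions made for illustration, and the quartile convention used to build the table above is not stated, so the `inclusive` method here is a guess.

```python
from collections import defaultdict
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of counts."""
    # method="inclusive" treats the data as the whole population;
    # other quartile conventions can give slightly different Q1/Q3.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical (patient_id, stay_id, code) rows, not MIMIC data.
rows = [
    (1, 10, "A"), (1, 10, "B"), (1, 11, "A"),
    (2, 20, "C"),
    (3, 30, "A"), (3, 30, "B"), (3, 30, "C"),
]

# Collect the distinct codes recorded for each ICU stay.
codes_per_stay = defaultdict(set)
for _patient, stay, code in rows:
    codes_per_stay[stay].add(code)

counts = [len(codes) for codes in codes_per_stay.values()]
print(five_number_summary(counts))
```

The same helper could be applied to stays per patient or codes per patient by grouping on `patient_id` instead.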
Hierarchical clustering is commonly used to summarize, visualize, and discover relationships. There are two paradigms in hierarchical clustering: agglomerative and divisive. Agglomerative hierarchical clustering recursively merges a pair of clusters into a single cluster, while divisive clustering recursively splits a single cluster into two. Agglomerative hierarchical clustering is often displayed graphically as a dendrogram, a tree-like diagram that shows the cluster relationships as well as the merge order. The common merging schemes for agglomerative clustering are single linkage (nearest neighbor), complete linkage (farthest neighbor), median linkage, group average, and Ward's criterion. Agglomerative clustering has been used on microarray data to organize, visualize, and characterize the data<ref>Hastie, T. et al. 2008. The elements of statistical learning. Springer Verlag.</ref><ref>Eisen, M.B. et al. 1998. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences. 95, 25 (Dec. 1998), 14863–14868.</ref><ref>Jiang, D. et al. 2004. Cluster Analysis for Gene Expression Data: A Survey. IEEE Transactions on Knowledge and Data Engineering. 16, 11 (Nov. 2004).</ref>.
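To make the merging step concrete, here is a minimal, deliberately naive sketch of agglomerative clustering on a precomputed distance matrix. It covers only single, complete, and group-average linkage, and the function name `agglomerative` and the toy points are assumptions for illustration. It is not how R or fastcluster implement the schemes, and it is far too slow for thousands of codes.

```python
from itertools import combinations


def agglomerative(dist, linkage="single"):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a sorted tuple of item indices.
    """
    # How the pairwise item distances between two clusters are combined.
    combine = {
        "single": min,                            # nearest neighbor
        "complete": max,                          # farthest neighbor
        "average": lambda ds: sum(ds) / len(ds),  # group average
    }[linkage]

    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a, b in combinations(clusters, 2):
            d = combine([dist[i][j] for i in a for j in b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two clusters with their union and record the merge.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Toy example: four points on a line.
points = [0, 1, 5, 6]
dist = [[abs(p - q) for q in points] for p in points]
for a, b, d in agglomerative(dist, "complete"):
    print(a, "+", b, "at distance", d)
```

The merge history is the information a dendrogram draws: each tuple is one join, and the distance gives the height of that join.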
Hierarchical clustering software packages already exist in R. The base R installation includes basic hierarchical clustering methods that can use a user-defined distance matrix. Additionally, a software package (fastcluster) has been developed that efficiently implements seven clustering schemes (single, complete, average, weighted, Ward, centroid, and median linkage) for agglomerative clustering.
Clustering the MIMIC Data
I will run various agglomerative hierarchical clustering merging schemes on the ICD-9 codes from the MIMIC database. The distance metric for the merging schemes will be a function of the number of instances of co-occurrence between two diseases during a patient's ICU stay. Although ICD-9 codes already have a hierarchical structure, the hope is to uncover relationships that extend beyond this existing hierarchy.
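The exact function of the co-occurrence counts is not specified here. As one possible choice, the sketch below counts how many ICU stays list both codes of a pair and converts that count into a Jaccard distance. Both the Jaccard form and the names `cooccurrence_counts` and `jaccard_distance` are assumptions for illustration, and the stays are made up rather than taken from MIMIC.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(stays):
    """Count stays per code and stays per (sorted) code pair."""
    code_counts = Counter()
    pair_counts = Counter()
    for codes in stays:
        unique = sorted(set(codes))  # ignore repeats within one stay
        code_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return code_counts, pair_counts


def jaccard_distance(a, b, code_counts, pair_counts):
    """Assumed distance: 1 - (stays with both) / (stays with either)."""
    if a == b:
        return 0.0
    both = pair_counts[tuple(sorted((a, b)))]
    either = code_counts[a] + code_counts[b] - both
    return 1.0 - both / either if either else 1.0


# Made-up ICU stays, each a list of ICD-9 code strings.
stays = [
    ["428.0", "427.31", "584.9"],
    ["428.0", "427.31"],
    ["584.9", "038.9"],
    ["428.0"],
]
code_counts, pair_counts = cooccurrence_counts(stays)
print(jaccard_distance("428.0", "427.31", code_counts, pair_counts))
```

Codes that never share a stay get distance 1, and codes that always appear together get distance 0, so frequently co-occurring diagnoses merge early under any of the linkage schemes.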
It seems that your current distance metric is similar to using <math>L_1</math> distance (<math>l_1</math> norm) between two disease vectors (as you mentioned that you're using the number of instances of co-occurrence between two diseases). I'm not sure if I thoroughly understand your setting, but I think you can apply other distance metrics such as Euclidean distance and cosine similarity once you represent each disease as a feature vector. Joyce Whang 02:38, 21 April 2012 (CDT)
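For reference, the three metrics named in the comment above can be sketched as follows. The feature-vector representation is an assumption: here each disease is a hypothetical vector with one entry per ICU stay (1 if the code was recorded, 0 otherwise), which is only one of several ways to build such vectors.

```python
import math


def l1_distance(u, v):
    """Sum of absolute coordinate differences (l_1 norm of u - v)."""
    return sum(abs(x - y) for x, y in zip(u, v))


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical presence/absence vectors over five ICU stays.
disease_a = [1, 1, 0, 1, 0]
disease_b = [1, 0, 0, 1, 1]
print(l1_distance(disease_a, disease_b))
print(euclidean_distance(disease_a, disease_b))
print(cosine_similarity(disease_a, disease_b))
```

Cosine similarity is a similarity rather than a distance, so it would need converting (for example to 1 minus the similarity) before being passed to a clustering routine that expects distances.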