Visual Analysis of Bayesian Networks for Electronic Health Records
Worldwide, the amount of data generated by the medical community is staggering and increasing dramatically. Applying analytics and machine learning to this data to improve patient care is a huge and largely untapped opportunity. The most important medical data are captured in patients' electronic health records (EHRs), which are maintained and utilized by health care providers. EHRs consist of rich and comprehensive patient-specific information from a large number of sources, in different formats and with heterogeneous data types. Applying existing analytic tools and methodologies to this data poses numerous challenges. Many features extracted from EHRs are mutually dependent; for example, “flu” and “high body temperature”. Bayesian networks, as one of the few modeling methodologies that capture feature dependence rather than assuming independence, provide a flexible foundation for modeling EHRs. However, existing Bayesian network learning methodologies produce models whose complexity makes them difficult for clinicians to utilize or even interpret. Better model visualization methodologies, as well as learning methods that produce models more amenable to simplification and summarization, are thus critical to making these models interpretable and useful to clinicians, and in turn to improving patient care. In this dissertation, I present a framework for predictive analysis of patient clinical data, from feature extraction to model analysis. I first study straightforward machine learning approaches on extracted EHR features and find that incorporating diagnosis features improves the area under the ROC curve (AUC) by 10% compared to a baseline. Because of the many dependencies between features extracted from EHRs, I next investigate Bayesian network models, in which my clinician collaborators have identified known and suspected high pressure ulcer risk factors. 
The models also substantially increase the sensitivity of the predictions, to nearly three times that of logistic regression models, without sacrificing overall accuracy. However, interpreting these models involves a significant cognitive burden, motivating my investigation of visual analytic techniques. To this end, I develop an interactive tool for visualizing Bayesian networks to improve clinicians’ insight into and interpretation of the models. I perform a user study to assess the impact of the tool and its features. The results show quantitatively that users complete tasks more efficiently when using the tool, and qualitatively that they find it useful. Bayesian networks containing natural groupings or “clusters” are better suited to visualization and summarization. Since existing Bayesian network learning methods do not naturally yield such groupings, I alter the Bayesian network learning process to learn structures that are optimized not only for representing dependency relationships but simultaneously for clusterability measures. My results show that the augmented Bayesian network learning process can find structures with much larger clusterability measures, with only a small decrease in their standard scoring measure. Visualizations of the learned clustered Bayesian networks show that the algorithm cohesively groups related features, making the networks easier to interpret.
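
As a rough illustration of what optimizing a structure for both dependency fit and clusterability could look like, the sketch below adds a weighted clusterability term to a standard structure score. It is not the dissertation's algorithm. The abstract does not name the clusterability measure or say how the two objectives are combined, so Newman modularity, the weighted sum, the `weight` parameter, the function names, and the toy edge list and score are all assumptions made for illustration.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Newman modularity of the graph, ignoring edge direction.

    edges: list of (parent, child) pairs from a candidate structure.
    cluster_of: dict mapping each node to a cluster label.
    Modularity is an assumed stand-in for a "clusterability measure".
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)   # edges with both endpoints in the cluster
    degree = defaultdict(int)  # edge endpoints that fall in the cluster
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def combined_score(structure_score, edges, cluster_of, weight=1.0):
    """Hypothetical objective: standard score plus weighted clusterability."""
    return structure_score + weight * modularity(edges, cluster_of)


# Toy candidate structure; node names are illustrative only.
edges = [
    ("flu", "fever"),
    ("flu", "cough"),
    ("immobility", "pressure_ulcer"),
    ("fever", "immobility"),
]
cluster_of = {
    "flu": 0, "fever": 0, "cough": 0,
    "immobility": 1, "pressure_ulcer": 1,
}

q = modularity(edges, cluster_of)
# -100.0 is a made-up placeholder for a standard score such as BIC.
total = combined_score(-100.0, edges, cluster_of, weight=10.0)
print(q, total)
```

In a formulation like this, a structure search would compare candidate graphs by `combined_score` rather than by the standard score alone, and `weight` would control how much standard score is traded for better-separated clusters.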