Research – BRIC Big Data

[vc_row][vc_column width=”1/3″][vc_column_text]

Research Topics

4. Note Ranking Using NLP [/vc_column_text][/vc_column]

[vc_column width=”2/3″][vc_column_text]

1. High-Throughput Phenotyping

A major aim of this group is to phenotype patients accurately and efficiently from electronic medical record (EMR) data using supervised methods (described here) and unsupervised methods (described in the Unsupervised Phenotyping section). In the supervised setting, our group has developed a high-throughput pipeline for classifying a specific phenotype. The method creates features from groupings of ICD codes and from UMLS groupings of NLP terms in doctors’ notes, automatically selects a set of candidate features using SAFE [Yu, 2016], and then selects the final features and their coefficients using a regression (lasso?) trained on doctor-labeled patients. The method has been shown to be ___% accurate on rheumatoid arthritis (RA), ___% on myocardial infarction (MI), and ___% on acute ischemic stroke (AIS).
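The final regression step can be illustrated with a minimal Python sketch. It assumes the regression is an L1-penalized (lasso) logistic regression, which the description above leaves open, and fits it by proximal gradient descent on made-up data. The feature layout, the toy patients, and the `penalty`, `step`, and `iterations` values are all hypothetical, and the SAFE candidate-selection step is not shown.

```python
import math

def soft_threshold(value, penalty):
    """Shrink a value toward zero by `penalty` (the proximal step for an L1 penalty)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_lasso_logistic(rows, labels, penalty=0.05, step=0.5, iterations=2000):
    """Fit an L1-penalized logistic regression by proximal gradient descent.

    The intercept is left unpenalized; only the feature weights are shrunk.
    """
    n = len(rows)
    weights = [0.0] * len(rows[0])
    intercept = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = intercept + sum(w * v for w, v in zip(weights, x))
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        intercept -= step * grad_b / n
        # Gradient step followed by soft-thresholding, which can set weights to exactly 0.
        weights = [soft_threshold(w - step * g / n, step * penalty)
                   for w, g in zip(weights, grad_w)]
    return intercept, weights

# Hypothetical toy data, one row per chart-reviewed patient:
# [log(1 + ICD code count), log(1 + NLP term mentions), a feature unrelated to the label]
rows = [
    [2.0, 1.5, 1.0], [2.0, 1.5, -1.0],
    [1.5, 2.0, 1.0], [1.5, 2.0, -1.0],
    [0.5, 0.0, 1.0], [0.5, 0.0, -1.0],
    [0.0, 0.5, 1.0], [0.0, 0.5, -1.0],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = clinician labeled the patient as a case

intercept, weights = fit_lasso_logistic(rows, labels)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("weights:", weights)
print("selected feature indices:", selected)
```

In this toy setup the L1 penalty leaves the unrelated feature with a weight of exactly zero, which is how a lasso-type fit can select features and estimate coefficients in a single step.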

[1] Liao, Katherine P., et al. “Development of phenotype algorithms using electronic medical records and incorporating natural language processing.” BMJ 350 (2015): h1885.

[2] Yu, Sheng, et al. “Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources.” Journal of the American Medical Informatics Association 22.5 (2015): 993-1000.

[3] Yu, Sheng, et al. “Surrogate-assisted feature extraction for high-throughput phenotyping.” Journal of the American Medical Informatics Association 24.e1 (2016): e143-e149.

2. Phenome-Wide Association Studies (PheWAS)

A PheWAS studies the relationship between one or a few genes and a long list of phenotypes (around 1800 in our case). PheWAS has the potential to be a cheap and effective way to explore options for drug repurposing, the practice of re-using an existing drug for a different endpoint. As a proof of concept, our group performed a PheWAS on IL6R, a gene whose expression is known to have effects similar to those of the existing anti-rheumatic drugs Tocilizumab and Sarilumab. In this PheWAS, IL6R expression replicated known off-target effects of Tocilizumab, notably a reduced risk of aortic aneurysms and coronary heart disease [Cai, 2018].
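The basic shape of a PheWAS scan can be sketched in a few lines of Python. This is a simplified illustration, not the analysis used in the study: it assumes each phenotype is summarized as a 2x2 table of variant carriers versus non-carriers, uses a plain chi-square test where a real analysis would more likely use covariate-adjusted regression, and applies a Bonferroni correction for the number of phenotypes tested. The phenotype names and counts are made up.

```python
import math

def chi_square_p_value(table):
    """P-value of a 2x2 chi-square test (1 degree of freedom, no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))

def phewas_scan(tables, alpha=0.05):
    """Test every phenotype and flag those below a Bonferroni-corrected threshold."""
    threshold = alpha / len(tables)
    results = []
    for phenotype, table in tables.items():
        (a, b), (c, d) = table
        odds_ratio = (a * d) / (b * c)  # assumes no zero cells in the table
        p_value = chi_square_p_value(table)
        results.append((phenotype, odds_ratio, p_value, p_value < threshold))
    return sorted(results, key=lambda row: row[2])  # most significant first

# Made-up counts: ((carrier cases, carrier controls), (non-carrier cases, non-carrier controls))
tables = {
    "phenotype_A": ((30, 970), (60, 940)),
    "phenotype_B": ((50, 950), (50, 950)),
}

results = phewas_scan(tables)
for phenotype, odds_ratio, p_value, significant in results:
    print(phenotype, round(odds_ratio, 3), p_value, significant)
```

With around 1800 phenotypes, the same correction would put the per-phenotype threshold at 0.05 / 1800, which is why large cohorts are needed to detect modest effects.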

[1] Cai, Tianxi, et al. “Association of Interleukin 6 Receptor Variant With Cardiovascular Disease Effects of Interleukin 6 Receptor Blocking Therapy: A Phenome-Wide Association Study.” JAMA Cardiology (2018).

3. Unsupervised Phenotyping

For this study, as for other PheWASs, each patient’s phenotypes are determined by whether the count of the relevant ICD codes is above a certain threshold. However, this method can be unreliable and inaccurate for some diseases. We developed the methods PheNorm and MAP, which fit mixture models to each patient’s total count of diagnosis codes and total count of related terms appearing in doctors’ notes in order to estimate the probability that the patient has each phenotype. PheNorm and MAP have been shown to improve upon the standard method in __ of ___ cases tested. We reran the IL6R PheWAS with MAP-derived phenotypes and found stronger results for expected gene-phenotype associations than with the standard phenotyping method. For more information on PheNorm and MAP, refer to papers [Yu, 2017] and [] respectively.
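The contrast between a hard threshold and a mixture-model probability can be illustrated with a short Python sketch. It is not PheNorm or MAP: it assumes a single count per patient, a log(1 + count) transform, and a two-component Gaussian mixture fitted by EM, with the higher component read as "has the phenotype". The counts, the threshold of 3, and the `min_sd` floor are made up for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def weighted_mean_sd(values, weights, min_sd=0.25):
    """Weighted mean and standard deviation, with a floor on the standard deviation."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / total
    return mean, max(math.sqrt(variance), min_sd)

def case_probabilities(counts, iterations=200):
    """Fit a two-component Gaussian mixture to log(1 + count) with EM.

    Returns each patient's probability of belonging to the high-count
    component, taken from the final E-step.
    """
    values = [math.log1p(c) for c in counts]
    low_mean, high_mean = min(values), max(values)
    low_sd = high_sd = max((high_mean - low_mean) / 4.0, 0.25)
    high_weight = 0.5
    probs = []
    for _ in range(iterations):
        # E-step: posterior probability of the high-count component for each patient.
        probs = []
        for x in values:
            high = high_weight * normal_pdf(x, high_mean, high_sd)
            low = (1.0 - high_weight) * normal_pdf(x, low_mean, low_sd)
            probs.append(high / (high + low))
        # M-step: re-estimate each component from its weighted patients.
        high_mean, high_sd = weighted_mean_sd(values, probs)
        low_mean, low_sd = weighted_mean_sd(values, [1.0 - p for p in probs])
        high_weight = sum(probs) / len(values)
    return probs

# Made-up diagnosis-code counts for 13 patients.
counts = [0, 0, 1, 0, 2, 1, 0, 1, 12, 18, 25, 15, 3]

threshold_calls = [c >= 3 for c in counts]   # standard approach with a hypothetical cutoff
probabilities = case_probabilities(counts)   # mixture-model alternative

for count, call, prob in zip(counts, threshold_calls, probabilities):
    print(count, call, round(prob, 3))
```

The threshold returns a yes/no call that depends entirely on the chosen cutoff, whereas the mixture assigns each patient a probability that downstream analyses can use directly or threshold later.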

[1] Yu, Sheng, et al. “Enabling phenotypic big data with PheNorm.” Journal of the American Medical Informatics Association 25.1 (2017): 54-60.

4. Note Ranking Using NLP

A major bottleneck in the high-throughput phenotyping process is the need for clinician chart review. To create gold-standard labels for training and assessing high-throughput algorithms, clinicians need to review charts for hundreds of patients, and each patient may have hundreds or even thousands of doctors’ notes to sort through. Austin Cai has developed an algorithm that ranks doctors’ notes by their informativeness so that chart reviewers can look at the most useful notes first. The algorithm scans through five common source articles about a target disease (Wikipedia, etc.) and ranks notes based on their similarity to these articles (this could use a better explanation). Results?[/vc_column_text][/vc_column]
[/vc_row]
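One plausible reading of the note-ranking idea in topic 4 is sketched below in Python. The similarity measure is an assumption, since the description does not specify one: notes are scored by TF-IDF cosine similarity to a reference text about the target disease. The reference sentence stands in for the source articles, and both it and the notes are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(documents):
    """Turn each document into a sparse TF-IDF vector (a dict of term weights)."""
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(documents)
    return [
        {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
         for term, count in counts.items()}
        for counts in term_counts
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_notes(reference_text, notes):
    """Return note indices ordered from most to least similar to the reference text."""
    vectors = tfidf_vectors([reference_text] + notes)
    reference, note_vectors = vectors[0], vectors[1:]
    scores = [cosine(reference, vector) for vector in note_vectors]
    return sorted(range(len(notes)), key=lambda i: scores[i], reverse=True)

# Made-up stand-in for the source articles about the target disease.
reference_text = (
    "Rheumatoid arthritis is a chronic autoimmune disease causing "
    "joint pain, swelling, and stiffness."
)
# Made-up notes for one patient.
notes = [
    "Patient here for flu shot. No complaints today.",
    "Joint swelling and morning stiffness in both hands; suspect rheumatoid arthritis.",
    "Follow-up for ankle sprain, pain improving.",
]

ranking = rank_notes(reference_text, notes)
for index in ranking:
    print(index, notes[index])
```

A chart reviewer would then read the notes in the order given by `ranking`, so that the notes sharing the most disease-specific vocabulary with the reference text come first.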