Data Science Research

DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries

by Jayetri Bardhan, February 20, 2023

Introduction

Electronic Health Records (EHRs) are digitized records of patients’ medical information, containing details about their demographics, diagnoses, medications, symptoms, laboratory results, and immunization records. EHRs help doctors make better clinical decisions and help patients obtain answers to patient-specific questions. EHRs may be structured or unstructured: structured EHRs are mostly in the form of relational databases, while unstructured EHRs are in the form of clinical notes. We developed DrugEHRQA, the first question answering (QA) dataset on multi-modal EHRs, containing QA pairs from structured tables and unstructured clinical notes from a popular EHR database, MIMIC-III [1]. The dataset contains natural language questions, their corresponding SQL queries for querying multi-relational tables in MIMIC-III, the answer(s) retrieved from one or both modalities, and the combined multi-modal answer. Our dataset of medication-related queries contains over 70,000 question-answer pairs.

Dataset Generation

We introduce an automated, template-based method to generate the DrugEHRQA dataset. Figure 1 shows the dataset generation framework. We annotated nine natural language medication-related question templates along with their corresponding SQL query templates. The drug attribute annotations of the “2018 Adverse Drug Event (ADE) dataset and Medical Extraction Challenge dataset” [2] (hereafter referred to as the “challenge dataset”) were used to generate QA pairs on unstructured EHR clinical notes. Six drug-related attributes (Strength-Drug, Form-Drug, Route-Drug, Dosage-Drug, Frequency-Drug, and Reason-Drug) were extracted from the challenge dataset to generate the nine types of natural language question templates. For example, the Dosage-Drug annotation for a certain admission ID was used to answer the question “What is the dosage of |drug| prescribed to the patient with admission id = |hadm_id|?”, where |drug| is the name of the medicine and |hadm_id| is the admission ID of the patient. The names of medicines, drug attributes, and admission IDs of patients (extracted from the challenge dataset) were then slot-filled in place of the placeholders in the question templates to obtain question-answer pairs on unstructured clinical notes.
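The slot-filling step can be illustrated with a minimal Python sketch. The template string is the one quoted above; the `fill_template` helper and the annotation records are hypothetical stand-ins invented for illustration, not the authors' code and not real challenge-dataset entries.

```python
# Minimal sketch of slot-filling a question template.
# The template text is quoted from the post; everything else is hypothetical.

TEMPLATE = (
    "What is the dosage of |drug| prescribed to the patient "
    "with admission id = |hadm_id|?"
)


def fill_template(template: str, slots: dict) -> str:
    """Replace each |name| placeholder with its value from `slots`."""
    question = template
    for name, value in slots.items():
        question = question.replace(f"|{name}|", str(value))
    return question


# Invented stand-ins for Dosage-Drug annotations (not real patient data).
annotations = [
    {"hadm_id": "100001", "drug": "aspirin", "dosage": "one tablet"},
    {"hadm_id": "100002", "drug": "metformin", "dosage": "two tablets"},
]

# Each annotation yields one question-answer pair.
qa_pairs = [
    {
        "question": fill_template(
            TEMPLATE, {"drug": a["drug"], "hadm_id": a["hadm_id"]}
        ),
        "answer": a["dosage"],
    }
    for a in annotations
]

for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
```

In this sketch the annotated attribute value becomes the answer, while the drug name and admission ID fill the placeholders in the question.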

Next, to obtain QA pairs from the structured tables of MIMIC-III, the same slot-filling process was used to generate the SQL queries that retrieve answers from MIMIC-III’s structured database. Then, to make the dataset more realistic, we added a paraphrasing step that created three additional paraphrases for every question template. In the final step, rules were developed to generate multimodal answers based on the answers retrieved from structured and unstructured EHR data.

Figure 1: Dataset generation framework of DrugEHRQA
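The SQL side of the slot-filling process might look roughly like the following sketch. It runs against an in-memory SQLite database; the `prescriptions` table, its columns, the SQL template, and the rows are simplified assumptions made up for this example and do not reproduce the MIMIC-III schema or the paper's actual query templates.

```python
import sqlite3

# Hypothetical SQL template; table and column names are simplified stand-ins.
SQL_TEMPLATE = (
    "SELECT dose FROM prescriptions "
    "WHERE hadm_id = |hadm_id| AND drug = '|drug|'"
)


def fill_sql(template: str, slots: dict) -> str:
    """Replace each |name| placeholder with its slot value."""
    query = template
    for name, value in slots.items():
        query = query.replace(f"|{name}|", str(value))
    return query


# Toy database with invented rows (not real patient data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prescriptions (hadm_id INTEGER, drug TEXT, dose TEXT)"
)
conn.executemany(
    "INSERT INTO prescriptions VALUES (?, ?, ?)",
    [(100001, "aspirin", "81 mg"), (100002, "metformin", "500 mg")],
)

# Fill the template, then execute the generated query to retrieve the answer.
sql = fill_sql(SQL_TEMPLATE, {"hadm_id": 100001, "drug": "aspirin"})
answers = [row[0] for row in conn.execute(sql)]

print(sql)
print(answers)
```

The filled query string is kept alongside the question because the dataset pairs each natural language question with its SQL query; the retrieved rows form the structured-modality answer.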

Proposed Multimodal Baseline Pipeline

Our proposed pipeline consists of a modality selection network (shown in Figure 2) that predicts the modality (i.e., table or text) for a given question. When the selection network selects “text” as the modality, QA is carried out as a reading comprehension task using BERT [3] and ClinicalBERT [4] to determine the answer span in the text. If the modality selection network predicts “table” as the modality, TREQS [5] is used to perform a text-to-SQL task.
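The routing logic of such a pipeline can be sketched in a few lines of Python. This is only a structural illustration: `answer_question` and the three toy callables are hypothetical names, and the keyword-based selector is a placeholder for the learned modality selection network, not a description of it. In a real system the two answerers would wrap a reading comprehension model and a text-to-SQL model.

```python
from typing import Callable


def answer_question(
    question: str,
    select_modality: Callable[[str], str],
    text_qa: Callable[[str], str],
    table_qa: Callable[[str], str],
) -> dict:
    """Route a question to the text or table answerer based on the selector."""
    modality = select_modality(question)
    if modality == "text":
        answer = text_qa(question)
    elif modality == "table":
        answer = table_qa(question)
    else:
        raise ValueError(f"unknown modality: {modality!r}")
    return {"modality": modality, "answer": answer}


# Toy stand-ins (assumptions): a keyword rule replaces the learned selector,
# and the answerers return fixed placeholder strings.
def toy_selector(question: str) -> str:
    return "text" if question.lower().startswith("why") else "table"


def toy_text_qa(question: str) -> str:
    return "<answer span from clinical note>"


def toy_table_qa(question: str) -> str:
    return "<rows returned by generated SQL>"


result = answer_question(
    "Why is the patient with admission id = 100001 taking aspirin?",
    toy_selector,
    toy_text_qa,
    toy_table_qa,
)
print(result)
```

Passing the selector and answerers as callables keeps the routing independent of any particular model, so each component can be swapped without touching the dispatch code.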

Please refer to our paper [6] for more details.

Figure 2: Modality Selection Network

References

[1] Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9.

[2] Henry, S., Buchan, K., Filannino, M., Stubbs, A., and Uzuner, O. (2020). 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association, 27(1):3–12.

[3] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

[4] Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi, D., Naumann, T., and McDermott, M. (2019). Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78.

[5] Wang, P., Shi, T., and Reddy, C. K. (2020). Text-to-SQL generation for question answering on electronic medical records. In Proceedings of The Web Conference 2020, pages 350–361.

[6] Bardhan, J., Colas, A., Roberts, K., and Wang, D. Z. (2022). DrugEHRQA: A question answering dataset on structured and unstructured electronic health records for medicine related queries. In Proceedings of the 13th Language Resources and Evaluation Conference, pages 1083–1097.
