Natural Language to SQL Generation for Observational Study Designs: Current Challenges and Possible Directions (Preprint)
Electronic Health Record (EHR) systems used in hospitals and healthcare institutes generate vast amounts of data stored in relational databases. Structured Query Language (SQL) is a common language used to update, extract, and pre-process data in EHR databases. Pre-processing is a necessary step before statistical modeling and causal inference can be carried out in observational studies. Data extraction and pre-processing using SQL require a collaborative effort between data engineers and researchers such as clinicians or biostatisticians. Natural Language to SQL (NL2SQL) models convert study designs expressed in natural language into SQL queries that retrieve the desired cohort and risk factors. While they cannot completely replace the need for cross-disciplinary collaboration, they have the potential to enable clinicians and biostatisticians who are not trained in SQL to explore EHR databases on their own, and to reduce the burden placed on data engineers by automating less complex tasks. There has been substantial research on NL2SQL tasks for general-knowledge databases, but their application to EHR databases, which contain domain-specific knowledge, is not well studied. In this paper, we introduce the general NL2SQL task and discuss in depth the potential challenges in developing NL2SQL tools for EHR databases.
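To make the NL2SQL task concrete, the following minimal sketch pairs a study design written in natural language with a hand-written SQL query of the kind an NL2SQL model would be expected to produce. Everything in it is an illustrative assumption rather than part of any real EHR system or model: the two-table schema (`patients`, `diagnoses`), the column names, the ICD-10 codes chosen for the example, and the helper functions `build_toy_ehr` and `run_cohort_query` are all hypothetical, and the query is written by hand, not generated.

```python
import sqlite3

# Hypothetical study design, phrased as a researcher might write it.
STUDY_DESIGN = (
    "Patients aged 65 or older when diagnosed with type 2 diabetes, "
    "with hypertension recorded by the time of that diagnosis as a risk factor."
)

# Hand-written target SQL for the design above (assumed schema, not model output).
# The cohort is defined by the WHERE clause; the risk factor is a 0/1 flag.
TARGET_SQL = """
SELECT p.patient_id,
       EXISTS (
           SELECT 1 FROM diagnoses h
           WHERE h.patient_id = p.patient_id
             AND h.icd_code = 'I10'
             AND h.diagnosis_year <= d.diagnosis_year
       ) AS has_hypertension
FROM patients p
JOIN diagnoses d ON d.patient_id = p.patient_id
WHERE d.icd_code = 'E11'
  AND d.diagnosis_year - p.birth_year >= 65
ORDER BY p.patient_id
"""


def build_toy_ehr():
    """Create an in-memory toy database with made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, birth_year INTEGER);
        CREATE TABLE diagnoses (patient_id INTEGER, icd_code TEXT, diagnosis_year INTEGER);
    """)
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?)",
        [(1, 1940), (2, 1950), (3, 1990)],
    )
    conn.executemany(
        "INSERT INTO diagnoses VALUES (?, ?, ?)",
        [
            (1, "E11", 2010), (1, "I10", 2005),  # eligible, has risk factor
            (2, "E11", 2018),                    # eligible, no risk factor
            (3, "E11", 2015), (3, "I10", 2012),  # too young, excluded
        ],
    )
    return conn


def run_cohort_query(conn):
    """Return (patient_id, has_hypertension) rows for the cohort."""
    return conn.execute(TARGET_SQL).fetchall()


if __name__ == "__main__":
    print(STUDY_DESIGN)
    print(run_cohort_query(build_toy_ehr()))
```

Even in this toy setting, the mapping requires knowledge that is absent from the natural language text, such as which diagnosis codes correspond to each condition and how age at diagnosis is derived from the stored columns.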