SemEHR: A General-purpose Semantic Search System to Surface Semantic Data from Clinical Notes for Tailored Care, Trial Recruitment and Clinical Research

2017 ◽  
Author(s):  
Honghan Wu ◽  
Giulia Toti ◽  
Katherine I. Morley ◽  
Zina M. Ibrahim ◽  
Amos Folarin ◽  
...  

Abstract Objective Unlocking the data contained within both structured and unstructured components of Electronic Health Records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs. Methods SemEHR implements a generic information extraction (IE) and retrieval infrastructure by identifying contextualised mentions of a wide range of biomedical concepts within EHRs. Natural Language Processing (NLP) annotations are further assembled at patient level and extended with EHR-specific knowledge to generate a timeline for each patient. The semantic data are serviced via ontology-based search and analytics interfaces. Results SemEHR has been deployed to a number of UK hospitals, including the Clinical Record Interactive Search (CRIS), an anonymised replica of the EHR of the UK South London and Maudsley (SLaM) NHS Foundation Trust, one of Europe's largest providers of mental health services. In two CRIS-based studies, SemEHR achieved 93% (hepatitis C case) and 99% (HIV case) F-measure results in identifying true positive patients. At King's College Hospital in London, as part of the CogStack programme (github.com/cogstack), SemEHR is being used to recruit patients into the UK Department of Health 100,000 Genomes Project (genomicsengland.co.uk). The validation study suggests that the tool can validate previously recruited cases and is very fast in searching phenotypes; time for recruitment criteria checking was reduced from days to minutes. Validated on open intensive care EHR data (MIMIC-III), the vital signs extracted by SemEHR achieve around 97% accuracy. Conclusion Results from the multiple case studies demonstrate SemEHR's efficiency: weeks or months of work can be done within hours or minutes in some cases. SemEHR provides a more comprehensive view of a patient, bringing in more and unexpected insight compared to study-oriented bespoke information extraction systems. SemEHR is open source, available at https://github.com/CogStack/SemEHR.

2018 ◽  
Vol 25 (5) ◽  
pp. 530-537 ◽  
Author(s):  
Honghan Wu ◽  
Giulia Toti ◽  
Katherine I Morley ◽  
Zina M Ibrahim ◽  
Amos Folarin ◽  
...  

Abstract Objective Unlocking the data contained within both structured and unstructured components of electronic health records (EHRs) has the potential to provide a step change in data available for secondary research use, generation of actionable medical insights, hospital management, and trial recruitment. To achieve this, we implemented SemEHR, an open source semantic search and analytics tool for EHRs. Methods SemEHR implements a generic information extraction (IE) and retrieval infrastructure by identifying contextualized mentions of a wide range of biomedical concepts within EHRs. Natural language processing annotations are further assembled at the patient level and extended with EHR-specific knowledge to generate a timeline for each patient. The semantic data are serviced via ontology-based search and analytics interfaces. Results SemEHR has been deployed at a number of UK hospitals, including the Clinical Record Interactive Search, an anonymized replica of the EHR of the UK South London and Maudsley National Health Service Foundation Trust, one of Europe’s largest providers of mental health services. In 2 Clinical Record Interactive Search–based studies, SemEHR achieved 93% (hepatitis C) and 99% (HIV) F-measure results in identifying true positive patients. At King’s College Hospital in London, as part of the CogStack program (github.com/cogstack), SemEHR is being used to recruit patients into the UK Department of Health 100 000 Genomes Project (genomicsengland.co.uk). The validation study suggests that the tool can validate previously recruited cases and is very fast at searching phenotypes; time for recruitment criteria checking was reduced from days to minutes. Validated on open intensive care EHR data, Medical Information Mart for Intensive Care III, the vital signs extracted by SemEHR can achieve around 97% accuracy. Conclusion Results from the multiple case studies demonstrate SemEHR’s efficiency: weeks or months of work can be done within hours or minutes in some cases. SemEHR provides a more comprehensive view of patients, bringing in more and unexpected insight compared to study-oriented bespoke IE systems. SemEHR is open source, available at https://github.com/CogStack/SemEHR.
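
To make the patient-level assembly concrete, here is a minimal Python sketch of the idea: contextualized NLP mentions are rolled up into per-patient timelines and queried for a phenotype concept plus its ontology descendants. The data model, concept IDs, and function names are hypothetical illustrations, not SemEHR's actual API.

from collections import defaultdict
from datetime import date

# Hypothetical contextualized mentions: (patient_id, concept_id, negated, doc_date)
mentions = [
    ("p1", "C0019196", False, date(2016, 3, 2)),   # hepatitis C, affirmed
    ("p2", "C0019196", True,  date(2016, 5, 9)),   # hepatitis C, negated
]

# Hypothetical ontology closure: a concept maps to itself plus all descendants
descendants = {"C0019196": {"C0019196", "C0220847"}}

def patient_timelines(mentions):
    timelines = defaultdict(list)
    for pid, cid, negated, when in mentions:
        timelines[pid].append((when, cid, negated))
    for events in timelines.values():
        events.sort()  # chronological per-patient timeline
    return timelines

def find_cases(timelines, concept):
    wanted = descendants.get(concept, {concept})
    # a patient counts as a case if any affirmed mention matches the concept set
    return {pid for pid, events in timelines.items()
            if any(cid in wanted and not neg for _, cid, neg in events)}

print(find_cases(patient_timelines(mentions), "C0019196"))  # {'p1'}

Negation handling is what separates a true positive cohort from a naive keyword search: patient p2 mentions the concept but is excluded because the mention is negated.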


2009 ◽  
Vol 15 (2) ◽  
pp. 241-271 ◽  
Author(s):  
YAOYONG LI ◽  
KALINA BONTCHEVA ◽  
HAMISH CUNNINGHAM

Abstract Support Vector Machines (SVMs) have been used successfully in many Natural Language Processing (NLP) tasks. The novel contribution of this paper is in investigating two techniques for making SVMs more suitable for language learning tasks. Firstly, we propose an SVM with uneven margins (SVMUM) model to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks, where they achieved better performance than the standard SVM and the SVM with passive learning, respectively. Moreover, by combining SVMUM with the active learning algorithm, we achieve the best reported results on the seminars and jobs corpora, which are benchmark data sets used for evaluation and comparison of machine learning algorithms for IE. In addition, we also evaluate the token-based classification framework for IE with three different entity tagging schemes. In comparison to previous methods dealing with the same problems, our methods are both effective and efficient, which are valuable features for real-world applications. Due to the similarity in the formulation of the learning problem for IE and for other NLP tasks, the two techniques are likely to be beneficial in a wide range of applications.
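
The paper's SVMUM formulation is not reproduced here; as a rough stand-in for the same two problems, the sketch below pairs class weighting (a common remedy for imbalanced training data) with margin-based uncertainty sampling (a standard active-learning query strategy), using scikit-learn. It illustrates the workflow, not the authors' algorithm.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Imbalanced toy data: ~90% negative, mimicking IE's rare positive class
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
labeled, pool = list(range(100)), list(range(100, 2000))

for _ in range(20):                                       # 20 active-learning rounds
    clf = LinearSVC(class_weight="balanced", dual=False)  # reweight the rare class
    clf.fit(X[labeled], y[labeled])
    margins = np.abs(clf.decision_function(X[pool]))
    pick = pool.pop(int(np.argmin(margins)))              # query least-certain example
    labeled.append(pick)

Querying the example closest to the decision boundary is what makes active learning cheaper than passive (random) labelling: each label purchased is the one the current model is least sure about.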


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Daniel N. Baker ◽  
Ben Langmead

Abstract Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.
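
For intuition, here is a textbook HyperLogLog in Python, with union by register-wise maximum and intersection by inclusion-exclusion. Dashing itself uses specialized, more accurate estimators (and is far faster), so this is illustrative only; it also omits the small-range correction, so estimates on tiny inputs are rough.

import hashlib

class HLL:
    def __init__(self, p=14):
        self.p, self.m = p, 1 << p
        self.reg = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = h >> (64 - self.p)                     # first p bits select a register
        w = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rho = (64 - self.p) - w.bit_length() + 1   # leading-zero run length + 1
        self.reg[j] = max(self.reg[j], rho)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for large m
        return alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg)

    def union(self, other):
        u = HLL(self.p)
        u.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]
        return u

a, b = HLL(), HLL()
for kmer in ("ACGT", "CGTA", "GTAC"): a.add(kmer)
for kmer in ("CGTA", "GTAC", "TACG"): b.add(kmer)
inter = a.cardinality() + b.cardinality() - a.union(b).cardinality()
print("Jaccard ~", inter / a.union(b).cardinality())  # inclusion-exclusion

The appeal for genomics is that the union of two sketches is just the element-wise maximum of their registers, so pairwise distances over tens of thousands of genomes never require touching the raw k-mer sets.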


2019 ◽  
Author(s):  
David Meunier ◽  
Annalisa Pascarella ◽  
Dmitrii Altukhov ◽  
Mainak Jas ◽  
Etienne Combrisson ◽  
...  

Abstract Recent years have witnessed a massive push towards reproducible research in neuroscience. Unfortunately, this endeavor is often challenged by the large diversity of tools used, project-specific custom code and the difficulty of tracking all user-defined parameters. NeuroPycon is an open-source multi-modal brain data analysis toolkit which provides Python-based template pipelines for advanced multi-processing of MEG, EEG, functional and anatomical MRI data, with a focus on connectivity and graph theoretical analyses. Importantly, it provides shareable parameter files to facilitate replication of all analysis steps. NeuroPycon is based on the NiPype framework, which facilitates data analyses by wrapping many commonly used neuroimaging software tools into a common Python environment. In other words, rather than being a brain imaging software with its own implementation of standard algorithms for brain signal processing, NeuroPycon seamlessly integrates existing packages (coded in Python, MATLAB or other languages) into a unified Python framework. Importantly, thanks to the multi-threaded processing and computational efficiency afforded by NiPype, NeuroPycon provides an easy option for fast parallel processing, which is critical when handling large sets of multi-dimensional brain data. Moreover, its flexible design allows users to easily configure analysis pipelines by connecting distinct nodes to each other. Each node can be a Python-wrapped module, a user-defined function or a well-established tool (e.g. MNE-Python for MEG analysis, Radatools for graph theoretical metrics, etc.). Last but not least, the ability to use NeuroPycon parameter files to fully describe any pipeline is an important feature for reproducibility, as they can be shared and used for easy replication by others. The current implementation of NeuroPycon contains two complementary packages. The first, called ephypype, includes pipelines for electrophysiology analysis and a command-line interface for on-the-fly pipeline creation. Current implementations allow for MEG/EEG data import, pre-processing and cleaning by automatic removal of ocular and cardiac artefacts, in addition to sensor- or source-level connectivity analyses. The second package, called graphpype, is designed to investigate functional connectivity via a wide range of graph-theoretical metrics, including modular partitions. The present article describes the philosophy, architecture and functionalities of the toolkit and provides illustrative examples through interactive notebooks. NeuroPycon is available for download via GitHub (https://github.com/neuropycon) and the two principal packages are documented online (https://neuropycon.github.io/ephypype/index.html and https://neuropycon.github.io/graphpype/index.html). Future developments include fusion of multi-modal data (e.g. MEG and fMRI, or intracranial EEG and fMRI). We hope that the release of NeuroPycon will attract many users and new contributors, and facilitate the efforts of our community towards open source tool sharing and development, as well as scientific reproducibility.
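
The node-connection model NeuroPycon builds on can be seen in a toy Nipype workflow like the one below. The two node functions are hypothetical stand-ins (they are not ephypype's actual interfaces); only the Nipype wiring (Node, Workflow, connect, MultiProc plugin) is the real mechanism being illustrated.

from nipype import Node, Workflow
from nipype.interfaces.utility import Function

def bandpass(in_file, low, high):
    # hypothetical stand-in: a real node would write a filtered file
    # and return its path
    return in_file

def connectivity(in_file, method):
    # hypothetical stand-in for a sensor-level connectivity step
    return in_file

filt = Node(Function(input_names=["in_file", "low", "high"],
                     output_names=["out_file"], function=bandpass),
            name="bandpass")
filt.inputs.in_file = "subject01_meg.fif"
filt.inputs.low, filt.inputs.high = 1.0, 40.0

conn = Node(Function(input_names=["in_file", "method"],
                     output_names=["out_file"], function=connectivity),
            name="connectivity")
conn.inputs.method = "coh"

wf = Workflow(name="toy_pipeline", base_dir="nipype_out")
wf.connect(filt, "out_file", conn, "in_file")   # wire nodes into a pipeline
# wf.run(plugin="MultiProc", plugin_args={"n_procs": 4})  # parallel execution

Because the graph and its parameters are declared rather than hard-coded, the same description can be saved, shared and re-run, which is the reproducibility property the abstract emphasizes.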


2021 ◽  
Vol 17 (2) ◽  
pp. e1008697
Author(s):  
Benjamin Gallois ◽  
Raphaël Candelier

Analyzing the dynamical properties of mobile objects requires extracting trajectories from recordings, which is often done by tracking movies. We compiled a database of two-dimensional movies for very different biological and physical systems spanning a wide range of length scales and developed a general-purpose, optimized, open-source, cross-platform, easy to install and use, self-updating software called FastTrack. It can handle a changing number of deformable objects in a region of interest, and is particularly suitable for animal and cell tracking in two dimensions. Furthermore, we introduce the probability of incursions as a new measure of a movie's trackability that does not require knowledge of ground truth trajectories, since it is resilient to small amounts of errors and can be computed on the basis of an ad hoc tracking. We also leveraged the versatility and speed of FastTrack to implement an iterative algorithm determining a set of nearly-optimized tracking parameters, further reducing the amount of human intervention, and demonstrate that FastTrack can be used to explore the space of tracking parameters to minimize the number of swaps for a batch of similar movies. A benchmark shows that FastTrack is orders of magnitude faster than state-of-the-art tracking algorithms, with comparable tracking accuracy. The source code is available under the GNU GPLv3 at https://github.com/FastTrackOrg/FastTrack and pre-compiled binaries for Windows, Mac and Linux are available at http://www.fasttrack.sh.
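
FastTrack is a C++/Qt application, so the Python sketch below is not its code; it only illustrates the generic frame-to-frame step at the heart of such trackers: build a cost matrix between detections in consecutive frames and solve the assignment optimally, rejecting matches beyond a distance gate (an object that left or entered the region of interest).

import numpy as np
from scipy.optimize import linear_sum_assignment

def match(prev_xy, curr_xy, max_dist=50.0):
    # cost matrix: Euclidean distance between objects in consecutive frames
    cost = np.linalg.norm(prev_xy[:, None, :] - curr_xy[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)       # globally optimal pairing
    # drop pairs beyond max_dist: objects that left or entered the frame
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

prev = np.array([[10.0, 10.0], [80.0, 40.0]])
curr = np.array([[82.0, 41.0], [12.0, 11.0], [200.0, 5.0]])
print(match(prev, curr))  # [(0, 1), (1, 0)]; the third detection is a new object

A swap happens when two nearby objects get their identities exchanged by this assignment, which is why the paper's incursion probability and parameter search both target exactly that failure mode.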


2021 ◽  
Author(s):  
Duc Thuan Vo

Information Extraction (IE) is one of the challenging tasks in natural language processing. The goal of relation extraction is to discover the relevant segments of information in large numbers of textual documents such that they can be used for structuring data. IE aims at discovering various semantic relations in natural language text and has a wide range of applications such as question answering, information retrieval and knowledge presentation, among others. This thesis proposes approaches for relation extraction with clause-based Open Information Extraction that use linguistic knowledge to capture a variety of information, including semantic concepts, words, POS tags, shallow and full syntax, and dependency parsing in rich syntactic and semantic structures.

Within the plethora of Open Information Extraction systems that focus on the use of syntactic and dependency parsing for the purposes of detecting relations, incoherent and uninformative relation extractions can still be found. The extracted relations can be erroneous at times and fail to have a meaningful interpretation. As such, we first propose refinements to the grammatical structure of syntactic and dependency parsing with clause structures and clause types in an effort to generate propositions that can be deemed as meaningful extractable relations. Second, considering that choosing the most efficient seeds is pivotal to the success of the bootstrapping process when extracting relations, we propose an extended clause-based pattern extraction method with self-training for unsupervised relation extraction. The proposed self-training algorithm relies on the clause-based approach to extract a small set of seed instances in order to identify and derive new patterns. Third, we employ matrix factorization and collaborative filtering for relation extraction. To avoid the need for manually predefined schemas, we employ the notion of universal schemas, formed as a collection of patterns derived from Open Information Extraction tools as well as from relation schemas of pre-existing datasets. While previous systems have trained relations only for entities, we exploit advanced features from relation characteristics such as clause types and semantic topics for predicting new relation instances. Finally, we present an event network representation for temporal and causal event relation extraction that benefits from existing Open IE systems to generate a set of triple relations that are then used to build an event network. The event network is bootstrapped by labeling the temporal and causal disposition of events that are directly linked to each other. The event network can be systematically traversed to identify temporal and causal relations between indirectly connected events.
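
A minimal clause-style triple extractor over a dependency parse, in the spirit of (but much simpler than) the thesis's approach, can be written with spaCy: for each verb, pair its subject and object dependents into a (subject, relation, object) proposition. This toy version keeps only head tokens and ignores clause typing entirely.

import spacy

nlp = spacy.load("en_core_web_sm")

def triples(text):
    out = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ == "VERB":
                subj = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
                obj = [c for c in tok.children if c.dep_ in ("dobj", "attr", "pobj")]
                if subj and obj:
                    out.append((subj[0].text, tok.lemma_, obj[0].text))
    return out

print(triples("Marie Curie discovered polonium."))
# [('Curie', 'discover', 'polonium')] -- head tokens only in this toy version

Uninformative or incoherent extractions in real Open IE systems typically arise exactly where this sketch is naive: multi-word arguments, clause boundaries and non-verbal relations, which is what the thesis's clause structures and clause types are meant to repair.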


2020 ◽  
Author(s):  
Isabell Kiral ◽  
Nathalie Willems ◽  
Benjamin Goudey

Abstract Summary The UK Biobank (UKB) has quickly become a critical resource for researchers conducting a wide range of biomedical studies (Bycroft et al., 2018). The database is constructed from heterogeneous data sources, employs several different encoding schemes, and is disparately distributed throughout UKB servers. Consequently, querying these data remains complicated, making it difficult to quickly identify participants who meet a given set of criteria. We have developed UK Biobank Cohort Curator (UKBCC), a Python tool that allows researchers to rapidly construct cohorts based on a set of search terms. Here, we describe the UKBCC implementation, critical sub-modules and functions, and outline its usage through an example use case for replicable cohort creation. Availability UKBCC is available through PyPI (https://pypi.org/project/ukbcc) and as open source code on GitHub (https://github.com/tool-bin/ukbcc). Contact [email protected]
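
Conceptually, cohort construction of this kind reduces to filtering the participant table on criteria resolved from search terms. The pandas sketch below shows that end result only; the column names are hypothetical (real UKB main datasets use numeric field IDs such as "41270-0.0" for ICD-10 diagnoses, which tools like UKBCC resolve for you), and none of this is UKBCC's actual API.

import pandas as pd

# Hypothetical flat export of the UKB main dataset with readable column names
df = pd.read_csv("ukb_main.csv")
cohort = df[df["diagnosis_icd10"].str.contains("G40", na=False)   # e.g. epilepsy codes
            & (df["genetic_data_available"] == 1)]
cohort["eid"].to_csv("cohort_ids.txt", index=False)               # participant IDs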


2018 ◽  
Author(s):  
Daniel N Baker ◽  
Ben Langmead

Abstract Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that are specialized for set unions and intersections. Dashing summarizes genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.


2019 ◽  
Author(s):  
R. Preste ◽  
R. Clima ◽  
M. Attimonelli

Abstract HmtNote is a Python package to annotate human mitochondrial variants from VCF files. Variants are annotated using a wide range of information, grouped into basic, cross-reference, variability and prediction subsets, so that users can either select specific annotations of interest or use them altogether. Annotations are performed using data from HmtVar, a recently published database of human mitochondrial variations, which collects information from several online resources as well as offering in-house pathogenicity predictions. HmtNote also allows users to download a local annotation database that can be used to annotate variants offline, without having to rely on an internet connection. HmtNote is a free and open source package, and can be downloaded and installed from PyPI (https://pypi.org/project/hmtnote) or GitHub (https://github.com/robertopreste/HmtNote).
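
At its core, VCF annotation of this kind means matching each variant's (CHROM, POS, REF, ALT) key against a lookup table and appending fields to the INFO column. The sketch below illustrates that mechanic with a hypothetical one-entry table; it is not HmtNote's implementation, and real tools write matching header lines and use the full HmtVar database.

# Minimal illustration of VCF INFO-field annotation (hypothetical lookup data)
annotations = {("MT", 3243, "A", "G"): "disease_score=0.92"}

with open("input.vcf") as vcf, open("annotated.vcf", "w") as out:
    for line in vcf:
        if line.startswith("#"):
            out.write(line)                       # pass headers through unchanged
            continue
        f = line.rstrip("\n").split("\t")
        key = (f[0], int(f[1]), f[3], f[4])       # CHROM, POS, REF, ALT
        if key in annotations:
            f[7] = f[7] + ";" + annotations[key]  # append to the INFO column
        out.write("\t".join(f) + "\n")

Shipping the lookup table as a downloadable local database is what lets a tool like HmtNote run this step entirely offline.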

