An end-to-end joint model for evidence information extraction from court record document

DeHyFoNet: Deformable Hybrid Network for Formula Detection in Scanned Document Images

10.20944/preprints202201.0090.v1, 2022

Author(s): Muhammad Zeshan Afzal, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker

Keyword(s): Information Extraction, Reduction Rate, Error Reduction, Hybrid Network, Document Images, Mathematical Formulas, End To End

This work presents an end-to-end trainable approach for detecting mathematical formulas in scanned document images. Since many OCR engines cannot reliably handle formulas, it is essential to isolate them to obtain clean text for information extraction from the document. Our proposed pipeline comprises a hybrid task cascade network with deformable convolutions and a ResNeXt-101 backbone; both modifications improve detection. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets and achieve an overall accuracy of 96% on the ICDAR-2017 POD dataset. We achieve an overall error reduction of 13%. Furthermore, results on the Marmot dataset improve for both isolated and embedded formulas: we achieve an accuracy of 98.78% for isolated formulas and 90.21% for embedded formulas, which corresponds to error reduction rates of 43% and 17.9%, respectively.
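
The abstract reports both accuracies and error reduction rates. As a minimal sketch of how the two relate, assuming the common definition of error reduction as the relative drop in error (1 minus accuracy) against a baseline, the calculation could look like this. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def error_reduction_rate(baseline_accuracy: float, new_accuracy: float) -> float:
    """Relative drop in error (1 - accuracy) when moving from a baseline to a new model."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0:
        raise ValueError("baseline accuracy must be below 1.0")
    return (baseline_error - new_error) / baseline_error


# Hypothetical accuracies for illustration only (not results from the paper):
# error falls from 10% to 5%, so half of the baseline error is removed.
rate = error_reduction_rate(0.90, 0.95)
print(f"error reduction rate: {rate:.2f}")
```

Under this definition, a small gain in accuracy near the top of the scale can still be a large relative reduction in error.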

A Robust End-To-End Information Extraction System for Vietnamese Identity Cards

2019 6th NAFOSTED Conference on Information and Computer Science (NICS), 10.1109/nics48868.2019.9023853, 2019

Author(s): Hoan Tran Viet, Quang Hieu Dang, Tuan Anh Vu

Keyword(s): Information Extraction, Extraction System, End To End, Identity Cards, Information Extraction System

Unified Embedding Model over Heterogeneous Information Network for Personalized Recommendation

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 10.24963/ijcai.2019/529, 2019

Cited By ~ 3

Author(s): Zekai Wang, Hongzhi Liu, Yingpeng Du, Zhonghai Wu, Xing Zhang

Keyword(s): Information Extraction, State Of The Art, Structural Features, Personalized Recommendation, Information Network, Data Sparsity, Heterogeneous Information Network, Heterogeneous Information, Meta Path, End To End

Most heterogeneous information network (HIN) based recommendation models rely on modeling users and items with meta-paths. However, they always model users and items in isolation under each meta-path, which may mislead information extraction. In addition, they consider only the structural features of HINs when modeling users and items while exploring HINs, which may cause information useful for recommendation to be irreversibly lost. To address these problems, we propose a HIN-based unified embedding model for recommendation, called HueRec. We assume that each user or item has some common characteristics across different meta-paths, and we use data from all meta-paths to learn unified user and item representations. The interrelations between meta-paths are thus used to alleviate data sparsity and noise on any single meta-path. Unlike existing models, which first explore HINs and then make recommendations, we combine these two parts into an end-to-end model to avoid losing useful information in the initial phases. In addition, we embed all users, items and meta-paths into related latent spaces, so that we can measure users’ preferences on meta-paths to improve the performance of personalized recommendation. Extensive experiments show that HueRec consistently outperforms state-of-the-art methods.

A Low-Cost Smart Sensor Network for Catchment Monitoring

Sensors, 10.3390/s19102278, 2019, Vol 19 (10), pp. 2278

Cited By ~ 2

Author(s): Dian Zhang, Brendan Heery, Maria O’Neil, Suzanne Little, Noel E. O’Connor, ...

Keyword(s): Data Collection, Information Extraction, Water Level, Low Cost, Open Water, Data Driven, Hydrological Processes, Storm Events, Catchment Scale, End To End

Understanding hydrological processes in large, open areas, such as catchments, and further modelling these processes are still open research questions. The system proposed in this work provides an automatic end-to-end pipeline from data collection to information extraction that can potentially help hydrologists better understand hydrological processes using a data-driven approach. In this work, the performance of a low-cost, off-the-shelf, self-contained sensor unit, which was originally designed and used to monitor levels of liquids such as AdBlue, fuel and lubricants in a sealed tank environment, is first examined. This process validates that the sensor provides accurate water level information for open water level monitoring tasks. Utilising the dataset collected from eight sensor units, an end-to-end pipeline that automates the data collection, data processing and information extraction processes is proposed. Within the pipeline, a data-driven anomaly detection method automatically extracts rapid changes in measurement trends at a catchment scale. The lag-time of the test site (Dodder catchment, Dublin, Ireland) is also analyzed. Subsequently, the water level response in the catchment due to storm events during the 27-month deployment period is illustrated. To support reproducible and collaborative research, the collected dataset and the source code of this work will be publicly available for research purposes.
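
The abstract does not describe how rapid changes are detected. As a rough illustration only, and not the authors' method, a simple stand-in is to flag readings whose change from the previous sample exceeds a fixed threshold. The function name, the threshold and the sample readings below are all hypothetical.

```python
from typing import List


def rapid_changes(levels: List[float], threshold: float) -> List[int]:
    """Return indices i where |levels[i] - levels[i - 1]| exceeds the threshold."""
    return [
        i
        for i in range(1, len(levels))
        if abs(levels[i] - levels[i - 1]) > threshold
    ]


# Hypothetical water levels in metres, sampled at a fixed interval.
readings = [0.50, 0.51, 0.50, 0.92, 1.30, 1.28, 1.27]
print(rapid_changes(readings, threshold=0.2))  # prints [3, 4]
```

A fixed threshold is the simplest possible choice; a real deployment would likely need to account for sensor noise and differing sampling intervals.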

Extracting Family History of Patients From Clinical Narratives: Exploring an End-to-End Solution With Deep Learning Models

JMIR Medical Informatics, 10.2196/22982, 2020, Vol 8 (12), pp. e22982

Author(s): Xi Yang, Hansi Zhang, Xing He, Jiang Bian, Yonghui Wu

Keyword(s): Deep Learning, Family History, Information Extraction, Language Processing, Conditional Random Fields, Short Term Memory, Majority Voting, Learning Models, Concept Extraction, End To End

Background: Patients’ family history (FH) is a critical risk factor associated with numerous diseases. However, FH information is not well captured in structured databases and is instead often documented in clinical narratives. Natural language processing (NLP) is the key technology for extracting patients’ FH from clinical narratives. In 2019, the National NLP Clinical Challenge (n2c2) organized shared tasks to solicit NLP methods for FH information extraction.

Objective: This study presents our end-to-end FH extraction system developed during the 2019 n2c2 open shared task, as well as the new transformer-based models that we developed after the challenge. We seek to develop a machine learning–based solution for FH information extraction without task-specific rules created by hand.

Methods: We developed deep learning–based systems for FH concept extraction and relation identification. We explored deep learning models, including long short-term memory-conditional random fields and bidirectional encoder representations from transformers (BERT), and developed ensemble models using a majority voting strategy. To further optimize performance, we systematically compared 3 different strategies for using BERT output representations for relation identification.

Results: Our system was among the top-ranked systems (3 out of 21) in the challenge. Our best system achieved micro-averaged F1 scores of 0.7944 and 0.6544 for concept extraction and relation identification, respectively. After the challenge, we further explored new transformer-based models and improved the performance on the two subtasks to 0.8249 and 0.6775, respectively. For relation identification, our system achieved performance comparable to that of the best system (0.6810) reported in the challenge.

Conclusions: This study demonstrated the feasibility of utilizing deep learning methods to extract FH information from clinical narratives.
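
The abstract mentions ensembles built with majority voting and results reported as micro-averaged F1. The sketch below illustrates both ideas in general terms. It is not the authors' implementation: the label names, the tie-breaking rule (earliest model wins) and the counts are assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token labels from several models.

    Ties are broken in favour of the earliest model in the list (an assumption).
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == best))
    return voted


def micro_f1(counts: Sequence[Tuple[int, int, int]]) -> float:
    """Micro-averaged F1: pool (tp, fp, fn) over all classes, then score once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-token outputs from three models (label names are made up).
model_a = ["B-FM", "O", "B-OBS"]
model_b = ["B-FM", "B-OBS", "B-OBS"]
model_c = ["O", "O", "B-OBS"]
print(majority_vote([model_a, model_b, model_c]))  # prints ['B-FM', 'O', 'B-OBS']

# Hypothetical (tp, fp, fn) counts for two classes.
print(round(micro_f1([(8, 2, 1), (5, 1, 3)]), 4))  # prints 0.7879
```

Micro-averaging pools counts before scoring, so frequent classes weigh more than rare ones. This differs from macro-averaging, which averages per-class F1 scores.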
