Advanced Data Analytics for Clinical Research Part II: Application to Cardiothoracic Surgery

In the first part of this series, we introduced the tools of Big Data, including Not Only Structured Query Language (NoSQL) data warehouses, natural language processing (NLP), optical character recognition (OCR), and the Internet of Things (IoT). There are nuances to the utilization of these analytics tools, which must be well understood by clinicians seeking to take advantage of these innovative research strategies. One must recognize the technical challenges of NLP, such as unintended search outcomes and variability in the expression of human-written texts. Other caveats include dealing with written texts in image formats, which may ultimately be handled by transformation to text format with OCR, though this technology is still under development. IoT is beginning to be used in cardiac monitoring, medication adherence alerts, lifestyle monitoring, and saving traditional labs from equipment failure catastrophes. These technologies will become more prevalent in the future research landscape, and cardiothoracic surgeons should understand their advantages to propel our research to the next level. Experience and an understanding of the technology are needed to build a robust NLP search result, and effective communication with the data management team is a crucial step in the successful utilization of these technologies. In this second installment of the series, we provide examples of published investigations utilizing the advanced analytic tools introduced in Part I. We explain our processes in developing the research question, the barriers to achieving the research goals using traditional research methods, the tools used to overcome those barriers, and the research findings.

ARGO, Automatic Record Generator for Oncology: a natural language process-based tool to capture pathology features from onco-hematological reports (Preprint)

10.2196/preprints.27295 ◽

2021 ◽

Author(s):

Gian Maria Zaccaria ◽

Vito Colella ◽

Simona Colucci ◽

Felice Clemente ◽

Fabio Pavone ◽

...

Keyword(s):

Natural Language ◽

Translational Research ◽

Language Processing ◽

Character Recognition ◽

Web Application ◽

Optical Character Recognition ◽

Anatomical Site ◽

Cell Of Origin ◽

Molecular Features ◽

Clinical And Translational Research

BACKGROUND The unstructured nature of medical data from Real-World (RW) patients and the scarce accessibility for researchers to integrated systems restrain the use of RW information for clinical and translational research purposes. Natural Language Processing (NLP) might help in transposing unstructured reports in electronic health records (EHR), thus prompting their standardization and sharing. OBJECTIVE We aimed to design a tool to capture pathological features directly from hemo-lymphopathology reports and automatically record them into electronic case report forms (eCRFs). METHODS We exploited Optical Character Recognition and NLP techniques to develop a web application, named ARGO (Automatic Record Generator for Oncology), that recognizes unstructured information from diagnostic paper-based reports of diffuse large B-cell lymphomas (DLBCL), follicular lymphomas (FL), and mantle cell lymphomas (MCL). ARGO was programmed to match data with standard diagnostic criteria of the National Institute of Health, automatically assign a diagnosis and, via an Application Programming Interface, populate specific eCRFs on the REDCap platform, according to the College of American Pathologists templates. A selection of 239 reports (n = 106 DLBCL, n = 79 FL, and n = 54 MCL) from the Pathology Unit at the IRCCS - Istituto Tumori “Giovanni Paolo II” of Bari (Italy) was used to assess ARGO performance in terms of accuracy, precision, recall and F1-score. RESULTS By applying our workflow, we successfully converted 233 paper-based reports into corresponding eCRFs incorporating structured information about diagnosis, tissue of origin and anatomical site of the sample, major molecular markers and cell-of-origin subtype. Overall, ARGO showed high performance (nearly 90% accuracy, precision, recall and F1-score) in capturing identification report number, biopsy date, specimen type, diagnosis, and additional molecular features. CONCLUSIONS We developed and validated an easy-to-use tool that converts RW paper-based diagnostic reports of major lymphoma subtypes into structured eCRFs. ARGO is cheap, feasible, and easily transferable into daily practice to generate REDCap-based EHR for clinical and translational research purposes.
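
The four metrics named in this abstract can be illustrated with a small sketch. It is not taken from ARGO; the function name, the labels and the sample data below are assumptions chosen for illustration, and the abstract does not say how the scores were averaged across fields or classes.

```python
def classification_scores(gold, predicted, positive):
    """Accuracy over all items, plus precision, recall and F1 for one label.

    `gold` and `predicted` are equal-length lists of labels;
    `positive` is the label treated as the class of interest.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical diagnoses: manual annotation vs. automatically extracted values.
gold = ["DLBCL", "FL", "MCL", "DLBCL"]
predicted = ["DLBCL", "DLBCL", "MCL", "FL"]
scores = classification_scores(gold, predicted, positive="DLBCL")
```

In a multi-field setting such as the one described, a function like this would typically be called once per captured field (diagnosis, specimen type, and so on) and the results summarised.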

A Finding Aid for The Equity

Inquiry@Queen's Undergraduate Research Conference Proceedings ◽

10.24908/iqurcp.11597 ◽

2018 ◽

Author(s):

Jeff Blackadar

Keyword(s):

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Large Collection ◽

Digital History ◽

R Language ◽

Potential Value ◽

Optical Character ◽

Text Searching ◽

Person Location

Bibliothèque et Archives Nationales du Québec digitally scanned and converted to text a large collection of newspapers to create a resource of tremendous potential value to historians. Unfortunately, the text files are difficult to search reliably due to many errors caused by the optical character recognition (OCR) text conversion process. This digital history project applied natural language processing in an R language computer program to create a new and useful index of this corpus of digitized content despite OCR-related errors. The project used editions of The Equity, published in Shawville, Quebec since 1883. The program extracted the names of all the person, location and organization entities that appeared in each edition. Each entity was cataloged in a database and related to the edition of the newspaper in which it appeared. The database was published to a public website to allow other researchers to use it. The resulting index, or finding aid, allows researchers to access The Equity in a different way than just full-text searching. People, locations and organizations appearing in The Equity are listed on the website, and each entity links to a page that lists all of the issues that entity appeared in as well as the other entities that may be related to it. Rendering the text files of each scanned newspaper into entities and indexing them in a database allows the content of the newspaper to be interacted with by entity name and type rather than just as a set of large text files. Website: http://www.jeffblackadar.ca/graham_fellowship/corpus_entities_equity/
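
The indexing idea described in this abstract, relating each entity to the editions it appears in, can be sketched as an inverted index. The project itself used R and a database; the Python below is only an illustrative sketch, it assumes the entities have already been extracted by some NLP step (not shown), and the function names, edition dates and person name are hypothetical.

```python
from collections import defaultdict


def build_finding_aid(editions):
    """Map each (name, entity_type) pair to the sorted editions it appears in.

    `editions` maps an edition identifier to the list of
    (name, entity_type) tuples extracted from that edition.
    """
    index = defaultdict(set)
    for edition_id, entities in editions.items():
        for entity in entities:
            index[entity].add(edition_id)
    return {entity: sorted(ids) for entity, ids in index.items()}


def related_entities(editions, target):
    """Return the other entities that share at least one edition with `target`."""
    related = set()
    for entities in editions.values():
        if target in entities:
            related.update(e for e in entities if e != target)
    return sorted(related)


# Hypothetical extracted entities for two editions.
editions = {
    "1883-06-07": [("Shawville", "LOCATION"), ("John Smith", "PERSON")],
    "1883-06-14": [("Shawville", "LOCATION")],
}
finding_aid = build_finding_aid(editions)
neighbours = related_entities(editions, ("Shawville", "LOCATION"))
```

Keying the index on the entity rather than on the text means a lookup still works when the surrounding OCR text is noisy, provided the entity itself was recognised.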

MELHISSA: a multilingual entity linking architecture for historical press articles

International Journal on Digital Libraries ◽

10.1007/s00799-021-00319-6 ◽

2021 ◽

Author(s):

Elvys Linhares Pontes ◽

Luis Adrián Cabrera-Diego ◽

Jose G. Moreno ◽

Emanuela Boros ◽

Ahmed Hamdi ◽

...

Keyword(s):

Language Processing ◽

Digital Libraries ◽

Character Recognition ◽

Optical Character Recognition ◽

Historical Documents ◽

Entity Linking ◽

Named Entities ◽

European Languages ◽

Meta Information ◽

The Impact

Digital libraries have a key role in cultural heritage as they provide access to our culture and history by indexing books and historical documents (newspapers and letters). Digital libraries use natural language processing (NLP) tools to process these documents and enrich them with meta-information, such as named entities. Despite recent advances in these NLP models, most of them are built for specific languages and contemporary documents that are not optimized for handling historical material that may for instance contain language variations and optical character recognition (OCR) errors. In this work, we focused on the entity linking (EL) task that is fundamental to the indexation of documents in digital libraries. We developed a Multilingual Entity Linking architecture for HIstorical preSS Articles that is composed of multilingual analysis, OCR correction, and filter analysis to alleviate the impact of historical documents in the EL task. The source code is publicly available. Experimentation has been done over two historical documents covering five European languages (English, Finnish, French, German, and Swedish). Results have shown that our system improved the global performance for all languages and datasets by achieving an F-score@1 of up to 0.681 and an F-score@5 of up to 0.787.

Which OCR toolset is good and why? A comparative study

Kuwait Journal of Science ◽

10.48129/kjs.v48i2.9589 ◽

2021 ◽

Vol 48 (2) ◽

Author(s):

Pooja Jain ◽

Dr. Kavita Taneja ◽

Dr. Harmunish Taneja ◽

...

Keyword(s):

Comparative Study ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Research Area ◽

Real World Applications ◽

Banking Education ◽

Active Research ◽

Active Research Area ◽

Computational Technology

Optical Character Recognition (OCR) is a very active research area in many challenging fields like pattern recognition, natural language processing (NLP), computer vision, biomedical informatics, machine learning (ML), and artificial intelligence (AI). This computational technology extracts text in an editable format (MS Word/Excel, text files, etc.) from PDF files, scanned or hand-written documents, and images (photographs, advertisements, and the like) for further processing, and has been utilized in many real-world applications including banking, education, insurance, finance, healthcare and keyword-based search in documents. Many OCR toolsets are available under various categories, including open-source, proprietary, and online services. This research paper provides a comparative study of various OCR toolsets considering a variety of parameters.

Optical character recognition errors and their effects on natural language processing

International Journal on Document Analysis and Recognition (IJDAR) ◽

10.1007/s10032-009-0094-8 ◽

2009 ◽

Vol 12 (3) ◽

pp. 141-151 ◽

Cited By ~ 26

Author(s):

Daniel Lopresti

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character ◽

Recognition Errors

Machine Learning Techniques Application

Research Anthology on Architectures, Frameworks, and Integration Strategies for Distributed and Cloud Computing ◽

10.4018/978-1-7998-5339-8.ch068 ◽

2021 ◽

pp. 1396-1417

Author(s):

Karthikeyan P. ◽

Karunakaran Velswamy ◽

Pon Harshavardhanan ◽

Rajagopal R. ◽

JeyaKrishnan V. ◽

...

Keyword(s):

Machine Learning ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Modern World ◽

Interdisciplinary Field ◽

Sound Image ◽

Learning Techniques

Machine learning is the part of artificial intelligence that enables machines to learn without being explicitly programmed. Machine learning applications have shaped the modern world. Machine learning techniques are mainly classified into three categories: supervised, unsupervised, and semi-supervised. Machine learning is an interdisciplinary field that can be applied in different areas including science, business, and research. Supervised techniques are applied in agriculture, email spam and malware filtering, online fraud detection, optical character recognition, natural language processing, and face detection. Unsupervised techniques are applied in market segmentation, sentiment analysis, and anomaly detection. Deep learning is being utilized for sound, image, video, time series, and text. This chapter covers applications of various machine learning techniques, social media, agriculture, and task scheduling in a distributed system.

Optical character recognition errors and their effects on natural language processing

Proceedings of the second workshop on Analytics for noisy unstructured text data - AND '08 ◽

10.1145/1390749.1390753 ◽

2008 ◽

Cited By ~ 10

Author(s):

Daniel Lopresti

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Optical Character ◽

Recognition Errors

Mobile-Based Word Matching Detection using Intelligent Predictive Algorithm

International Journal of Interactive Mobile Technologies (iJIM) ◽

10.3991/ijim.v13i09.10848 ◽

2019 ◽

Vol 13 (09) ◽

pp. 140

Author(s):

Hamidah Jantan ◽

Nurul Aisyiah Baharudin

Keyword(s):

Language Processing ◽

Execution Time ◽

Character Recognition ◽

Optical Character Recognition ◽

Design Algorithm ◽

Predictive Algorithm ◽

String Searching ◽

Result Analysis ◽

Future Work ◽

Algorithm Implementation

Word matching is a string searching technique for information retrieval in Natural Language Processing (NLP). Several algorithms have been used for string searching and matching, such as Knuth-Morris-Pratt, Boyer-Moore, Horspool, Intelligent Predictive and many others. However, some issues need to be considered in measuring the performance of these algorithms, such as efficiency when searching small alphabets, the time taken to process the pattern, and the extra space needed to support a huge table or state machine. The Intelligent Predictive (IP) algorithm is capable of solving several word matching issues found in other string searching algorithms: it skips pre-processing of the pattern, uses simple rules during the matching process and does not involve complex computations. For these reasons, and because of its ability to produce good results in string searching, the IP algorithm is used in this study. This article aims to apply the IP algorithm together with an Optical Character Recognition (OCR) tool for mobile-based word matching detection. The study consists of four phases: data preparation, mobile-based system design, algorithm implementation and result analysis. The efficiency of the proposed algorithm was evaluated based on the execution time of the searching process among the selected algorithms. The results show that the IP algorithm is more efficient in execution time than a well-known algorithm, i.e. the Boyer-Moore algorithm. In future work, the performance of the string searching process could be enhanced by using other suitable optimization techniques such as Genetic Algorithm, Particle Swarm Optimization, Ant Colony Optimization and many others.
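
To make the trade-off mentioned in this abstract concrete (pattern pre-processing and an extra shift table), here is a minimal sketch of the Horspool variant of Boyer-Moore, one of the baseline algorithms the abstract names. It is not the Intelligent Predictive algorithm, whose rules are not given in the abstract; the function name and the sample strings are assumptions for illustration.

```python
def horspool_search(text, pattern):
    """Return the index of the first occurrence of `pattern` in `text`, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Pre-processing step: for every character of the pattern except the last,
    # store how far the window may shift when that character is aligned with
    # the window's final position. Later (rightmost) occurrences overwrite
    # earlier ones, which yields the smallest safe shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters absent from the table allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return -1


# Hypothetical OCR output and search word.
ocr_text = "optical character recognition"
found_at = horspool_search(ocr_text, "character")
```

The `shift` dictionary is the "extra space" cost the abstract refers to, and building it is the pre-processing that an algorithm without a pre-processing phase would avoid.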

Image Spam Detection Using Machine Learning and Natural Language Processing

Journal of Southwest Jiaotong University ◽

10.35741/issn.0258-2724.55.2.41 ◽

2020 ◽

Vol 55 (2) ◽

Author(s):

Yaseen Khather Yaseen ◽

Alaa Khudhair Abbas ◽

Ahmed M. Sana

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Learning Algorithm ◽

Machine Learning Algorithm ◽

Optical Character ◽

Harmful Content

Today, images are a part of communication between people. However, images are also being used to share information by hiding and embedding messages within them, and images received through social media or email can contain harmful content that users are not able to see and are therefore not aware of. This paper presents a model for detecting spam in images. The model is a combination of optical character recognition, natural language processing, and a machine learning algorithm. Optical character recognition extracts the text from images, and natural language processing uses linguistic capabilities to detect and classify the language and to distinguish between normal text and slang. The features for selected images are then extracted using the bag-of-words model, and the machine learning algorithm is run to detect any kind of spam they may contain. Finally, the model can predict whether or not an image contains any harmful content. The results show that the proposed method, using a combination of the machine learning algorithm, optical character recognition, and natural language processing, provides high detection accuracy compared to using machine learning alone.
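
The bag-of-words step mentioned in this abstract can be sketched as follows. This is only an illustration of the general technique, assuming the text has already been extracted from the image by OCR; the tokenisation rule, the function names and the sample strings are assumptions, and the classifier that would consume the vectors is not specified in the abstract and is not shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents):
    """Assign a fixed column index to every distinct token in the corpus."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector for `text`; unseen tokens are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, count in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


# Hypothetical strings standing in for OCR output from two images.
ocr_texts = ["win a free prize", "free free offer"]
vocabulary = build_vocabulary(ocr_texts)
features = [bag_of_words(text, vocabulary) for text in ocr_texts]
```

Each image is thereby reduced to a fixed-length numeric vector, which is the form most supervised learning algorithms expect as input.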

APPLICATION OF ZONAL AND CURVATURE FEATURES TO NUMERALS RECOGNITION

International Journal of Students Research in Technology & Management ◽

10.18510/ijsrtm.2021.922 ◽

2021 ◽

Vol 9 (2) ◽

pp. 7-12

Author(s):

Binod Kumar Prasad

Keyword(s):

Language Processing ◽

Character Recognition ◽

Optical Character Recognition ◽

Recognition Rate ◽

Recognition System ◽

Signature Verification ◽

Optical Character ◽

Knn Classifier ◽

Average Recognition Rate ◽

Distance Coding

Purpose of the study: The purpose of this work is to present an offline Optical Character Recognition system that recognises handwritten English numerals to help automate document reading. It helps to avoid the tedious and time-consuming manual typing needed to key important information into a computer system for long-term preservation. Methodology: This work applies Curvature Features of English numeral images by encoding them in terms of distance and slope. The finer local details of images have been extracted by using Zonal features. The feature vectors obtained from the combination of these features have been fed to the KNN classifier. The whole work has been executed using the MATLAB Image Processing Toolbox. Main Findings: The system produces an average recognition rate of 96.67% with K=1, whereas with K=3 the rate increased to 97%, with corresponding errors of 3.33% and 3% respectively. Out of all ten numerals, some numerals like ‘3’ and ‘8’ have shown comparatively lower recognition rates because of the similarity between their structures. Applications of this study: The proposed work is related to the recognition of English numerals. The model can be used widely for recognition of any pattern, such as signature verification, face recognition, and character or word recognition in another language under Natural Language Processing. Novelty/Originality of this study: The novelty of the work lies in the process of feature extraction. Curves present in the structure of a numeral sample have been encoded based on distance and slope, thereby yielding Distance features and Slope features. Vertical Delta Distance Coding (VDDC) and Horizontal Delta Distance Coding (HDDC) encode a curve from vertical and horizontal directions to reveal concavity and convexity from different angles.
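
The combination of zonal features and a KNN classifier described in this abstract can be sketched in a few lines. The study itself used MATLAB and additional curvature features (VDDC and HDDC) that are not reproduced here; the Python below is an assumption-laden illustration in which the zone grid, the foreground-density feature, the Euclidean distance and the toy data are all choices made for the example.

```python
import math
from collections import Counter


def zonal_features(image, zones=2):
    """Split a binary image into a zones x zones grid and return the
    fraction of foreground (1) pixels in each zone, row by row."""
    height, width = len(image), len(image[0])
    features = []
    for zr in range(zones):
        for zc in range(zones):
            r0, r1 = zr * height // zones, (zr + 1) * height // zones
            c0, c1 = zc * width // zones, (zc + 1) * width // zones
            cells = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(cells) / len(cells) if cells else 0.0)
    return features


def knn_predict(train, query, k=1):
    """Return the majority label among the k training vectors nearest to
    `query`. `train` is a list of (feature_vector, label) pairs."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 4x4 binary image and a toy two-dimensional training set.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
zone_vector = zonal_features(image, zones=2)
train = [([0.0, 0.0], "0"), ([1.0, 1.0], "1"), ([0.9, 1.0], "1")]
label_k3 = knn_predict(train, [0.8, 0.9], k=3)
```

Changing `k` from 1 to 3, as in the reported experiment, only changes how many neighbours vote; the feature extraction is unaffected.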
