Similarity-based Unsupervised Spelling Correction Using BioWordVec for Bacteria Culture Reports (Preprint)

Mapping Intimacies ◽

10.2196/preprints.25530 ◽

2020 ◽

Author(s):

Tae Hyeong Kim ◽

Min Ji Kang ◽

Se Ha Lee ◽

Jong-Ho Kim ◽

Hyung Joon Joo ◽

...

Keyword(s):

Infectious Diseases ◽

High Performance ◽

Edit Distance ◽

Word Embedding ◽

Medical Terminology ◽

Correction Algorithm ◽

Spelling Errors ◽

Spelling Correction ◽

Terminology Extraction ◽

Bacteria Culture

BACKGROUND Existing bacterial culture test results for infectious diseases are written in unrefined text, resulting in many problems including typographical errors and stop words. Effective spelling correction processes are needed to ensure the accuracy and reliability of data for the study of infectious diseases, including medical terminology extraction. If a dictionary is established, spelling algorithms using edit distance are efficient. However, in the absence of a dictionary, traditional spelling correction algorithms that rely only on edit distance have limitations. OBJECTIVE In this research, we propose a similarity-based spelling correction algorithm using pre-trained word embedding with the BioWordVec technique. This method uses a character-level N-gram-based distributed representation obtained through unsupervised learning rather than the existing rule-based method. In other words, we propose a framework that detects and corrects typographical errors when a dictionary is not in place. METHODS For detected typographical errors not mapped to SNOMED clinical terms, a correction candidate group with high similarity, taking edit distance into account, was generated using pre-trained word embedding from the clinical database. From the embedding matrix in which the vocabulary is arranged in descending order of frequency, a grid search was used to find candidate groups of similar words. The correction candidate words were then ranked by word frequency, and the typographical errors were corrected according to the ranking. RESULTS Bacteria identification words were extracted from 27,544 bacteria culture reports, and 914 spelling errors of 16 types were found. The similarity-based spelling correction algorithm using BioWordVec proposed in this research corrected 12 types of typographical errors and showed very high performance, correcting 99.45% of all spelling errors. CONCLUSIONS This tool corrected spelling errors effectively in the absence of a dictionary, based on bacterial identification words in the bacteria culture reports. This method will help build a high-quality refined database of vast text data for electronic health records.
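
The pipeline in the abstract above (similarity-based candidate generation, an edit distance constraint, then ranking by frequency) can be illustrated with a minimal sketch. This is not the authors' implementation: cosine similarity over character trigram counts stands in for the pretrained BioWordVec embeddings, and the function names, the thresholds `min_sim` and `max_dist`, and the toy vocabulary are assumptions made for illustration only.

```python
from collections import Counter


def char_ngrams(word, n=3):
    """Return the multiset of character n-grams of a word padded with '<' and '>'."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_cosine(a, b, n=3):
    """Cosine similarity between the character n-gram count vectors of two words.

    Assumption: this is a stand-in for similarity in a pretrained embedding space.
    """
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sum(c * c for c in va.values()) * sum(c * c for c in vb.values())) ** 0.5
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def correct(typo, vocab_by_freq, min_sim=0.5, max_dist=2):
    """Return the most frequent vocabulary word that is close to the typo.

    vocab_by_freq is a list of (word, count) pairs; thresholds are hypothetical.
    The typo is returned unchanged when no candidate passes both filters.
    """
    candidates = [
        (word, count)
        for word, count in vocab_by_freq
        if word != typo
        and ngram_cosine(typo, word) >= min_sim
        and edit_distance(typo, word) <= max_dist
    ]
    if not candidates:
        return typo
    return max(candidates, key=lambda wc: wc[1])[0]


# Toy vocabulary with made-up counts, used only to exercise the functions.
VOCAB = [("staphylococcus", 120), ("streptococcus", 80), ("escherichia", 60)]
```

Calling `correct("staphylococus", VOCAB)` selects `staphylococcus`, the only entry that passes both the similarity and the edit distance filter. Ranking the surviving candidates by frequency follows the abstract; how the original work combines the two filters is not stated there, so the conjunction used here is a design assumption.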

Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports

JMIR Medical Informatics ◽

10.2196/25530 ◽

2021 ◽

Vol 9 (2) ◽

pp. e25530

Author(s):

Taehyeong Kim ◽

Sung Won Han ◽

Minji Kang ◽

Se Ha Lee ◽

Jong-Ho Kim ◽

...

Keyword(s):

Infectious Diseases ◽

Antimicrobial Susceptibility ◽

Edit Distance ◽

Bacterial Culture ◽

Word Embedding ◽

Bacterial Identification ◽

Medical Terminology ◽

Correction Algorithm ◽

Spelling Errors ◽

Spelling Correction

Background Existing bacterial culture test results for infectious diseases are written in unrefined text, resulting in many problems, including typographical errors and stop words. Effective spelling correction processes are needed to ensure the accuracy and reliability of data for the study of infectious diseases, including medical terminology extraction. If a dictionary is established, spelling algorithms using edit distance are efficient. However, in the absence of a dictionary, traditional spelling correction algorithms that utilize only edit distances have limitations. Objective In this research, we proposed a similarity-based spelling correction algorithm using pretrained word embedding with the BioWordVec technique. This method uses a character-level N-gram-based distributed representation through unsupervised learning rather than the existing rule-based method. In other words, we propose a framework that detects and corrects typographical errors when a dictionary is not in place. Methods For detected typographical errors not mapped to Systematized Nomenclature of Medicine (SNOMED) clinical terms, a correction candidate group with high similarity considering the edit distance was generated using pretrained word embedding from the clinical database. From the embedding matrix in which the vocabulary is arranged in descending order according to frequency, a grid search was used to search for candidate groups of similar words. Thereafter, the correction candidate words were ranked in consideration of the frequency of the words, and the typographical errors were finally corrected according to the ranking. Results Bacterial identification words were extracted from 27,544 bacterial culture and antimicrobial susceptibility reports, and 16 types of spelling errors and 914 misspelled words were found. The similarity-based spelling correction algorithm using BioWordVec proposed in this research corrected 12 types of typographical errors and showed very high performance in correcting 97.48% (based on F1 score) of all spelling errors. Conclusions This tool corrected spelling errors effectively in the absence of a dictionary based on bacterial identification words in bacterial culture and antimicrobial susceptibility reports. This method will help build a high-quality refined database of vast text data for electronic health records.
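
Because this version of the abstract reports performance as an F1 score, a short sketch of how F1 is derived from counts may help. The abstract does not say how true positives, false positives, and false negatives were defined for the correction task, so the mapping in the comments is an assumption and the example counts are purely hypothetical.

```python
def f1_score(true_positive, false_positive, false_negative):
    """Harmonic mean of precision and recall, or 0.0 when it is undefined.

    Assumed mapping for spelling correction (not taken from the paper):
      true_positive  = misspelled words changed to the right word
      false_positive = words changed to a wrong word
      false_negative = misspelled words left uncorrected
    """
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positive / predicted
    recall = true_positive / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so `f1_score(8, 2, 2)` is also 0.8.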

Spelling Errors in Korean Students’ Constructed Responses and the Efficacy of Automatic Spelling Correction on Automated Computer Scoring

Technology Knowledge and Learning ◽

10.1007/s10758-021-09568-5 ◽

2021 ◽

Author(s):

Hyeonju Lee ◽

Minsu Ha ◽

Jurim Lee ◽

Rahmi Qurota Aini ◽

Ai Nurlaelasari Rusmana ◽

...

Keyword(s):

Korean Students ◽

Spelling Errors ◽

Spelling Correction ◽

Computer Scoring ◽

Constructed Responses

The Trade-off between Quantity and Quality. Comparing a Large Crawled Corpus and a Small Focused Corpus for Medical Terminology Extraction

Across Languages and Cultures ◽

10.1556/084.2019.20.2.3 ◽

2019 ◽

Vol 20 (2) ◽

pp. 197-211

Author(s):

Veronique Hoste ◽

Klaar Vanopstal ◽

Ayla Rigouts Terryn ◽

Els Lefever

Keyword(s):

Medical Terminology ◽

Trade Off ◽

Terminology Extraction

Pengembangan Modul Preprocessing Teks untuk Kasus Formalisasi dan Pengecekan Ejaan Bahasa Indonesia pada Aplikasi Web Mining Simple Solution (WMSS)

Jurnal Matematika Statistika dan Komputasi ◽

10.20956/jmsk.v15i2.5574 ◽

2018 ◽

Vol 15 (2) ◽

pp. 92

Author(s):

Umi Chuzaimah Zulkifli

Keyword(s):

Web Mining ◽

Edit Distance ◽

Simple Solution ◽

Spelling Correction ◽

Bahasa Indonesia

Social media data is now widely used for analysis, including sentiment analysis and other related analyses. In practice, data obtained from social media generally contains errors that affect the results of the analysis. These errors take the form of non-standard word usage and spelling mistakes. The proposed solution is word formalization and spell checking. Based on this problem, a preprocessing module is built to address these two kinds of errors. The formalization method converts words to their formal form based on KBBI (Kamus Besar Bahasa Indonesia), while the spell-checking method is spelling correction. Three spelling correction methods are considered: edit distance, bigram, and edit distance + rule. In addition to applying formalization and spelling correction, this study compares the results of the three spelling correction methods. The analysis shows that edit distance + rule achieves higher accuracy, 83.39%, than the other two methods, edit distance and bigram. The edit distance + rule method is also the fastest of the three. Overall, converting words to their formal form based on KBBI and applying spelling correction addressed the problems in both cases and can therefore improve the accuracy of analysis results.
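
Two of the three methods named in this abstract, edit distance and bigram, can be sketched as follows. The abstract does not define the bigram method or the rule component, so the Dice coefficient over character bigrams used here is an assumption, the rule-based variant is omitted, and the helper names and word list are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Levenshtein distance via memoized recursion (adequate for short words)."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    return min(
        edit_distance(a[1:], b) + 1,                   # delete first char of a
        edit_distance(a, b[1:]) + 1,                   # insert first char of b
        edit_distance(a[1:], b[1:]) + (a[0] != b[0]),  # substitute or match
    )


def bigrams(word):
    """Set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def bigram_similarity(a, b):
    """Dice coefficient over character bigram sets, in the range [0, 1]."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0 if a == b else 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def suggest_by_edit_distance(word, dictionary):
    """Dictionary entry with the smallest edit distance to the word."""
    return min(dictionary, key=lambda entry: edit_distance(word, entry))


def suggest_by_bigram(word, dictionary):
    """Dictionary entry with the highest bigram similarity to the word."""
    return max(dictionary, key=lambda entry: bigram_similarity(word, entry))


# Small illustrative word list; a real system would use a full dictionary.
WORDS = ["makan", "minum", "jalan"]
```

For the misspelling `makn`, both `suggest_by_edit_distance("makn", WORDS)` and `suggest_by_bigram("makn", WORDS)` pick `makan`. The two methods can disagree on other inputs, which is the kind of difference an accuracy comparison like the one in the abstract would measure.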

Integrating Terminology Extraction and Word Embedding for Unsupervised Aspect Based Sentiment Analysis

Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 ◽

10.4000/books.aaccademia.3297 ◽

2018 ◽

pp. 176-181

Author(s):

Luca Dini ◽

Paolo Curtoni ◽

Elena Melnikova

Keyword(s):

Sentiment Analysis ◽

Word Embedding ◽

Terminology Extraction

Neopterin - a potential diagnostic and prognostic marker in infection diseases

Kazan medical journal ◽

10.17816/kmj2009 ◽

2014 ◽

Vol 95 (6) ◽

pp. 938-943 ◽

Cited By ~ 1

Author(s):

K R Dudina ◽

M M Kutateladze ◽

O O Znoiko ◽

N O Bokova ◽

S A Shutko ◽

...

Keyword(s):

Immune Response ◽

Infectious Diseases ◽

High Performance ◽

Biological Fluids ◽

Transplant Rejection ◽

Interferon γ ◽

Pathological Process ◽

Enzyme Linked Immunosorbent Assay ◽

Specific Marker ◽

Vector Borne Diseases

The clinical significance of determining the neopterin concentration in body fluids is reviewed. The results of studies conducted over the past 2 years on neopterin concentrations in various infectious diseases (vector-borne diseases, herpes, respiratory and intestinal infections, as well as human immunodeficiency virus infection) are discussed. Neopterin is a biologically stable metabolite, which makes its detection advantageous for assessing the activity of the immune response. Previously, neopterin was determined mainly by high-performance liquid chromatography. In recent years, enzyme-linked immunosorbent assay has been introduced and is frequently used for determining neopterin concentrations. It was shown that neopterin concentrations can also vary in the absence of a pathological process. In particular, general factors such as race, age, body mass index, smoking, and arterial pressure may influence the concentration of neopterin in the human body. Increased neopterin levels in biological fluids and the kynurenine/tryptophan ratio are measured in diseases involving interferon-γ-mediated immune response activation. In this regard, the highest neopterin concentrations and an increased kynurenine/tryptophan ratio are observed in infectious diseases, malignancies, transplant rejection, and a number of cardiovascular and autoimmune diseases. It was shown that neopterin can be regarded as a highly specific marker of viral infection and that its blood concentration reflects the prognosis of the disease. Monitoring the neopterin level may be useful for assessing the severity and activity of an infectious disease and its clinical course, and for monitoring the effectiveness of etiological treatment for many infectious diseases.

Use of bacteriophage for discovery of therapeutically relevant antibodies against infectious diseases

Microbiology Australia ◽

10.1071/ma19007 ◽

2019 ◽

Vol 40 (1) ◽

pp. 33

Author(s):

Martina L Jones

Keyword(s):

Monoclonal Antibodies ◽

Molecular Biology ◽

Infectious Diseases ◽

Phage Display ◽

Clinical Use ◽

Diagnostic Use ◽

Inflammatory Conditions ◽

Wide Range ◽

Antibody Libraries ◽

Bacteria Culture

Scientists George P Smith and Gregory Winter were recently awarded half of the 2018 Nobel Prize for Chemistry for developing a technology to display exogenous peptides and proteins on the surface of bacteriophage. 'Phage display' has revolutionised the development of monoclonal antibodies, allowing fully human-derived antibodies to be isolated from large antibody libraries. It has been used for the discovery of many blockbuster drugs, including Humira (adalimumab), the highest selling drug yearly since 2012, with US$18.4b in sales globally in 2017. Phage display can be used to isolate antibodies to almost any antigen for a wide range of applications including clinical use (for cancer, inflammatory conditions and infectious diseases), diagnostic use or as research tools. The technology is accessible to any laboratory equipped for molecular biology and bacteria culture.

A high performance systolic chip for spelling correction

Proceedings Euro ASIC '92 ◽

10.1109/euasic.1992.227996 ◽

2003 ◽

Cited By ~ 2

Author(s):

D. Lavenier

Keyword(s):

High Performance ◽

Spelling Correction

Pengembangan Modul Preprocessing Teks untuk Kasus Formalisasi dan Pengecekan Ejaan Bahasa Indonesia pada Aplikasi Web Mining Simple Solution (WMSS)

Jurnal Matematika Statistika dan Komputasi ◽

10.20956/jmsk.v15i2.5718 ◽

2018 ◽

Vol 15 (2) ◽

pp. 95

Author(s):

Umi Chuzaimah Zulkifli

Keyword(s):

Social Media ◽

Sentiment Analysis ◽

Web Mining ◽

Edit Distance ◽

Correction Method ◽

Spelling Correction ◽

The Social ◽

Abstract Data ◽

Rule Method ◽

Bahasa Indonesia

Abstract Social media data is now widely used for analysis, including sentiment analysis and other related analyses. In practice, data obtained from social media generally contains errors that affect the results of the analysis. These errors take the form of non-standard word usage and spelling mistakes. The proposed solution is word formalization and spell checking. Based on this problem, a preprocessing module is built to address these two kinds of errors. The formalization method converts words to their formal form based on KBBI, while the spell-checking method is spelling correction. Three spelling correction methods are considered: edit distance, bigram, and edit distance + rule. In addition to applying formalization and spelling correction, this study compares the results of the three spelling correction methods. The analysis shows that edit distance + rule achieves higher accuracy, 83.39%, than the other two methods, edit distance and bigram. The edit distance + rule method is also the fastest of the three. Overall, converting words to their formal form based on KBBI and applying spelling correction addressed the problems in both cases and can therefore improve the accuracy of analysis results. Keywords: preprocessing, spelling correction, edit distance, bigram
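
The formalization step described in this abstract, converting non-standard words to their formal form, can be sketched as a table lookup. The mapping below is a tiny hypothetical sample and the function name is an assumption; the abstract does not say how the KBBI-based table is built or applied, and a real table would need to be validated against KBBI.

```python
import re

# Hypothetical informal-to-formal mapping for illustration only.
FORMAL_FORMS = {
    "gak": "tidak",
    "yg": "yang",
    "dgn": "dengan",
}


def formalize(text, mapping=FORMAL_FORMS):
    """Lowercase the text, split it into word tokens, and replace known informal tokens.

    Tokens that are not in the mapping are passed through unchanged.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [mapping.get(token, token) for token in tokens]
```

For the input `"Saya gak suka yg itu"`, the function returns the tokens `saya`, `tidak`, `suka`, `yang`, `itu`. Running formalization before spelling correction keeps common abbreviations from being treated as misspellings, although the abstract does not state the order the module actually uses.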

Ontology-Based Spelling Correction for Searching Medical Information

Semantic Web Technologies and E-Business ◽

10.4018/978-1-59904-192-6.ch016 ◽

2007 ◽

pp. 384-404

Author(s):

Jane Moon ◽

Frada Burstein

Keyword(s):

Medical Practice ◽

Paradigm Shift ◽

Medical Information ◽

Irrelevant Information ◽

The Internet ◽

Spelling Errors ◽

Spelling Correction ◽

Current Technology ◽

Search Terms ◽

Medical Terms

There has been a paradigm shift in medical practice. More and more consumers are using the Internet as a source for medical information even before seeing a doctor. The well known fact is that medical terms are often hard to spell. Despite advances in technology, the Internet is still producing futile searches when the search terms are misspelled. Often consumers are frustrated with irrelevant information they retrieve as a result of misspelling. An ontology-based search is one way of assisting users in correcting their spelling errors when searching for medical information. This chapter reviews the types of spelling errors that adults make and identifies current technology available to overcome the problem.
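
The idea of helping users fix misspelled medical search terms against a controlled vocabulary can be sketched with Python's standard `difflib` module. The miniature ontology, its concept labels, and the `suggest_terms` helper are hypothetical, and plain string similarity is only a simple stand-in for whatever matching the chapter reviews.

```python
import difflib

# Hypothetical miniature ontology: preferred term -> broader concept.
ONTOLOGY = {
    "diabetes": "endocrine disorder",
    "diarrhea": "digestive symptom",
    "pneumonia": "respiratory infection",
    "psoriasis": "skin condition",
}


def suggest_terms(query, ontology=ONTOLOGY, limit=3, cutoff=0.6):
    """Return (term, concept) pairs whose spelling is close to the query.

    Matching uses difflib.get_close_matches, which ranks candidates by
    SequenceMatcher ratio and drops those scoring below the cutoff.
    """
    matches = difflib.get_close_matches(
        query.lower(), list(ontology), n=limit, cutoff=cutoff
    )
    return [(term, ontology[term]) for term in matches]
```

A query such as `suggest_terms("diabetis")` returns `diabetes` with its concept label, while a query with no close term returns an empty list. Returning the concept alongside the term is a design choice that lets a search interface show the user what a suggestion means before they accept it.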