A hybrid model for spelling error detection and correction for Urdu language

Author(s):  
Romila Aziz ◽  
Muhammad Waqas Anwar ◽  
Muhammad Hasan Jamal ◽  
Usama Ijaz Bajwa


2015 ◽
Vol 764-765 ◽  
pp. 955-959
Author(s):  
Jui Feng Yeh ◽  
Cheng Hsien Lee ◽  
Yun Yun Lu ◽  
Guan Huei Wu ◽  
Yao Yi Wang

This paper proposes spelling error detection and correction using linguistic features and a knowledge resource. The linguistic features mainly come from a language model that describes the probability of a sentence. In practice, a formal document containing typos is defective and falls short of specifications; since typos and errors hidden in printed documents are frequent, rework wastes paper and ink. This paper proposes an approach that addresses spelling errors before printing. In this method, the linguistic features are compared and augmented with an additional feature: an Internet search function based on knowledge bases. By combining these approaches, this paper aims to improve the detection rate of typos and to reduce the waste of resources. Experimental results show that the proposed method is practical and efficient for users detecting typos in printed documents.
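The abstract's core idea, that a language model assigns a probability to a sentence and improbable words signal typos, can be illustrated with a minimal sketch. The toy corpus, smoothing constant, and threshold below are all hypothetical stand-ins, not the paper's actual model or data:

```python
from collections import Counter

# Toy corpus standing in for the language model's training data (hypothetical).
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word, alpha=1.0):
    """Add-alpha smoothed P(word | prev) from the toy corpus."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)

def flag_unlikely(sentence, threshold=0.1):
    """Flag words whose conditional probability falls below the threshold."""
    words = sentence.split()
    return [w for prev, w in zip(words, words[1:])
            if bigram_prob(prev, w) < threshold]
```

On this toy data, a misspelling such as "mta" after "the" receives a near-zero bigram count and is flagged, while the in-vocabulary "mat" passes; a real system would add the knowledge-based lookup the abstract describes as a second signal.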


2019 ◽  
Vol 26 (3) ◽  
pp. 211-218 ◽  
Author(s):  
Chris J Lu ◽  
Alan R Aronson ◽  
Sonya E Shooshan ◽  
Dina Demner-Fushman

Abstract
Objective Automated understanding of consumer health inquiries might be hindered by misspellings. To detect and correct various types of spelling errors in consumer health questions, we developed a distributable spell-checking tool, CSpell, that handles nonword errors, real-word errors, word boundary infractions, punctuation errors, and combinations of the above.
Methods We developed a novel approach of using dual embedding within Word2vec for context-dependent corrections. This technique was used in combination with dictionary-based corrections in a 2-stage ranking system. We also developed various splitters and handlers to correct word boundary infractions. All correction approaches are integrated to handle errors in consumer health questions.
Results Our approach achieves an F1 score of 80.93% and 69.17% for spelling error detection and correction, respectively.
Discussion The dual-embedding model shows a significant improvement (9.13%) in F1 score compared with the general practice of using cosine similarity with word vectors in Word2vec for context ranking. Our 2-stage ranking system shows a 4.94% improvement in F1 score compared with the best 1-stage ranking system.
Conclusion CSpell improves over the state of the art and provides near real-time automatic misspelling detection and correction in consumer health questions. The software and the CSpell test set are available at https://umlslex.nlm.nih.gov/cSpell.
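The 2-stage design, first generating dictionary candidates, then re-ranking them by context, can be sketched in miniature. This is not CSpell's implementation: the dictionary, word frequencies, and 2-dimensional "embeddings" below are hypothetical toy values, and plain cosine similarity stands in for the paper's dual-embedding scoring:

```python
import math

# Hypothetical toy dictionary: word -> corpus frequency.
DICTIONARY = {"headache": 50, "heartache": 10, "head": 80}

# Toy 2-d context vectors standing in for Word2vec embeddings (hypothetical).
VEC = {"headache": [0.9, 0.1], "heartache": [0.2, 0.9], "pain": [0.8, 0.2]}

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance (rolling array)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def cosine(u, v):
    """Cosine similarity; returns 0 for a zero vector."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0 or nv == 0 else sum(a * b for a, b in zip(u, v)) / (nu * nv)

def correct(word, context_word, max_dist=2):
    # Stage 1: dictionary candidates within the edit-distance budget.
    candidates = [w for w in DICTIONARY if edit_distance(word, w) <= max_dist]
    # Stage 2: re-rank candidates by similarity to the surrounding context.
    return max(candidates,
               key=lambda w: cosine(VEC.get(w, [0.0, 0.0]), VEC[context_word]))
```

Here "heatache" is one edit away from both "headache" and "heartache", so stage 1 alone cannot decide; a context word like "pain" lets stage 2 pick "headache", which is the kind of disambiguation the context-ranking stage provides.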


2021 ◽  
Author(s):  
Jonas Sjöbergh ◽  
Viggo Kann

We present an online API to access a number of Natural Language Processing services developed at KTH. The services work on Swedish text. They include tokenization, part-of-speech tagging, shallow parsing, compound word analysis, word inflection, lemmatization, spelling error detection and correction, grammar checking, and more. The services can be accessed in several ways, including a RESTful interface, direct socket communication, and premade Web forms. The services are open to anyone. The source code is also freely available, making it possible to set up another server or run the tools locally. We have also evaluated the performance of several of the services and compared them to other available systems. Both precision and recall for the Granska grammar checker are higher than for Microsoft Word and Google Docs. The evaluation also shows that recall is greatly improved when combining all the grammar checking services in the API, compared to any one method, and combining services is made easy by the API.
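A client for such a RESTful service typically just composes a query URL and parses a JSON reply. The endpoint URL and the response shape below are hypothetical placeholders for illustration, not the actual KTH API:

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; the real service URL would come from the API docs.
BASE_URL = "https://example.org/api/spellcheck"

def build_request_url(text, lang="sv"):
    """Compose the query URL for a spell-check request."""
    return BASE_URL + "?" + urlencode({"text": text, "lang": lang})

def parse_response(body):
    """Extract (misspelling, suggestions) pairs from a hypothetical JSON reply."""
    payload = json.loads(body)
    return [(e["word"], e["suggestions"]) for e in payload.get("errors", [])]
```

A reply such as {"errors": [{"word": "huns", "suggestions": ["hund", "hus"]}]} would then yield the flagged word together with its ranked suggestions, which a client could feed into the combined grammar-checking pipeline the abstract describes.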


2015 ◽  
Vol 22 (5) ◽  
pp. 751-773 ◽  
Author(s):  
Mohammed Attia ◽  
Pavel Pecina ◽  
Younes Samih ◽  
Khaled Shaalan ◽  
Josef van Genabith

Abstract
A spelling error detection and correction application is typically based on three main components: a dictionary (or reference word list), an error model and a language model. While most of the attention in the literature has been directed to the language model, we show how improvements in any of the three components can lead to significant cumulative improvements in the overall performance of the system. We develop our dictionary of 9.2 million fully inflected Arabic words (types) from a morphological transducer and a large corpus, validated and manually revised. We improve the error model by analyzing error types and creating an edit distance re-ranker. We also improve the language model by analyzing the level of noise in different data sources and selecting an optimal subset to train the system on. Testing and evaluation experiments show that our system significantly outperforms Microsoft Word 2013, OpenOffice Ayaspell 3.4 and Google Docs.
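The three components named above fit naturally into a noisy-channel score: the dictionary supplies candidates, the error model penalizes edits, and the language model favors frequent words. The sketch below illustrates that decomposition with toy words, counts, and a hypothetical penalty weight, not the paper's actual models:

```python
import math

# Hypothetical dictionary with toy corpus counts (the language model here is
# just unigram frequency; the paper's LM is far richer).
DICTIONARY = {"receive": 120, "recipe": 40, "relieve": 30}
TOTAL = sum(DICTIONARY.values())

def edit_distance(a, b):
    """Levenshtein distance, used as a stand-in error model."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def correct(typo, lam=1.0):
    """Pick the candidate maximizing log P(word) - lam * edit_distance."""
    return max(DICTIONARY,
               key=lambda w: math.log(DICTIONARY[w] / TOTAL)
                             - lam * edit_distance(typo, w))
```

For "recieve", the candidate "relieve" is actually closer in raw edit distance (1 vs 2), but the frequency term outweighs the extra edit and "receive" wins; this interplay is why improving any one of the three components, as the abstract argues, shifts the final ranking.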

