text processing
Recently Published Documents


Total documents: 1044 (five years: 284)
H-index: 41 (five years: 4)

Author(s):  
Neha Garg ◽  
Kamlesh Sharma

Sentiment analysis (SA) is an enduring area of research, especially in the field of text analysis. Text pre-processing is an important step in performing SA accurately. This paper presents a text processing model for SA that applies natural language processing techniques to Twitter data. The basic phases for machine learning are text collection, text cleaning, pre-processing, feature extraction, and categorization of the data according to the SA techniques. Keeping the focus on Twitter data, the data is extracted in a domain-specific manner. The data cleaning phase considers noisy data, missing data, punctuation, tags and emoticons. For pre-processing, tokenization is performed, followed by stop word removal (SWR). The article provides insight into the techniques used for text pre-processing and the impact of their presence on the dataset. After text pre-processing was applied, the accuracy of the classification techniques improved and dimensionality was reduced. The proposed corpus can be utilized in market analysis, customer behaviour, polling analysis, and brand monitoring. The text pre-processing process can serve as a baseline for applying predictive analysis, machine learning and deep learning algorithms, and can be extended according to the problem definition.
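The cleaning, tokenization and stop word removal phases named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the regular expressions, the tiny stop-word list and the function names are all assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "this", "it"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")


def clean(tweet):
    """Lower-case the tweet and strip URLs, mentions, punctuation and emoticons."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")      # keep the hashtag word, drop the marker
    text = NON_WORD_RE.sub(" ", text)  # removes punctuation and ASCII emoticons
    return text


def tokenize(text):
    """Whitespace tokenization of already-cleaned text."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(tweet):
    """Run cleaning, tokenization and stop word removal in sequence."""
    return remove_stop_words(tokenize(clean(tweet)))


print(preprocess("@shop The new phone is AMAZING :) #happy https://t.co/xyz"))
# ['new', 'phone', 'amazing', 'happy']
```

Dropping stop words and noise tokens before feature extraction is one way the vocabulary, and hence the feature dimensionality, shrinks.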


Author(s):  
Erhan Turan ◽  
Umut Orhan

In this study, a novel confidence indexing algorithm is proposed to minimize the human labor needed to check the reliability of synsets extracted automatically from a non-machine-readable monolingual dictionary. The Contemporary Turkish Dictionary of the Turkish Language Association is used as the monolingual dictionary data. First, synonym relations are extracted from dictionary definitions by traditional text processing methods, and a graph is built in a Lemma-Sense network architecture. After each synonym relation is labeled with a confidence index, synonym pairs with the desired confidence indexes are analyzed to detect synsets with a spanning tree-based method. This approach can label synsets with one of three cumulative confidence levels (CL-1, CL-2, and CL-3). At each confidence level, the synsets are compared with KeNet, the only open-access Turkish WordNet. Most matches with KeNet synsets are found at the CL-1 and CL-2 confidence levels, while the synsets found at the CL-3 level reveal errors in the dictionary definitions. The approach therefore not only assesses the reliability of automatically detected synsets but can also point out errors in the dictionary behind them.
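The idea of grouping synonym pairs into synsets at cumulative confidence levels can be illustrated with a small sketch. It is a simplification under stated assumptions, not the paper's method: plain connected components (via union-find) stand in for the spanning tree-based detection, each edge is assumed to carry an integer level from 1 (most reliable) to 3, and the English word pairs are hypothetical.

```python
def synsets_at_level(edges, max_level):
    """Group words into candidate synsets using only edges with level <= max_level.

    edges: iterable of (word_a, word_b, confidence_level) tuples.
    Returns a sorted list of sorted word lists (groups of size > 1).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, level in edges:
        if level <= max_level:             # cumulative: level k includes levels below it
            parent[find(a)] = find(b)

    groups = {}
    for word in list(parent):
        groups.setdefault(find(word), set()).add(word)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical synonym pairs with assumed confidence levels.
EDGES = [
    ("big", "large", 1),
    ("large", "huge", 2),
    ("quick", "fast", 1),
    ("fast", "firm", 3),   # a doubtful link, e.g. from a misleading definition
]

for level in (1, 2, 3):
    print(level, synsets_at_level(EDGES, level))
```

In this toy data, the group that only appears at level 3 is the one that pulls in a questionable word, which mirrors the abstract's observation that CL-3 synsets tend to expose errors.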


Author(s):  
Meftah Mohammed Charaf Eddine

In machine translation of texts, ambiguity in both lexical (dictionary) and structural aspects remains one of the difficult problems. Researchers in this field use different approaches, the most important of which is machine learning in its various forms. The goal of the approach proposed in this article is to define a new concept of electronic text that is free from any lexical or structural ambiguity. We used a semantic coding system that attaches to the original electronic text (via the text editor interface) the meanings intended by the author. The author defines the desired meaning for each word that can be a source of ambiguity. The proposed approach can be used with any type of electronic text (text processing applications, web pages, email text, etc.). In the experiments we conducted with this approach, we obtained a very high accuracy rate, and we argue that the problem of lexical and structural ambiguity can be completely solved. With this new concept of electronic text, the text file contains not only the text but also, in the form of symbols, the exact meaning intended by the writer. These semantic symbols are used during machine translation to obtain a translated text free of lexical and structural ambiguity.
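A toy sketch of the underlying idea follows: each potentially ambiguous word carries an author-supplied sense tag, and translation looks up the tagged sense instead of guessing. The sense identifiers, the miniature English-to-French lexicon and the function name are all hypothetical; the article's actual coding system is not described in the abstract.

```python
# Hypothetical sense-tagged lexicon: word -> {sense id -> French translation}.
SENSES = {
    "bank": {"FIN": "banque", "RIVER": "rive"},
}


def translate_token(token, sense=None):
    """Translate one token, using its sense tag when the word is ambiguous."""
    options = SENSES.get(token)
    if options is None:
        return token  # not ambiguous in this toy lexicon: pass through unchanged
    if sense is None:
        raise ValueError(f"ambiguous word without a sense tag: {token!r}")
    return options[sense]


# Annotated text: (token, sense tag or None), as an author might supply it.
annotated = [("the", None), ("bank", "RIVER"), ("flooded", None)]
print([translate_token(tok, sense) for tok, sense in annotated])
# ['the', 'rive', 'flooded']
```

Raising an error for an untagged ambiguous word makes the design choice explicit: disambiguation is the author's job at writing time, not the translator's at translation time.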


2022 ◽  
pp. 1-17
Author(s):  
Mike Thelwall ◽  
Pardeep Sud

Scientometric research often relies on large-scale bibliometric databases of academic journal articles. Long-term and longitudinal research can be affected if the composition of a database varies over time, and text processing research can be affected if the percentage of articles with abstracts changes. This article therefore assesses changes in the magnitude of the coverage of a major citation index, Scopus, over 121 years from 1900. The results show sustained exponential growth from 1900, except for dips during both world wars, and with increased growth after 2004. Over the same period, the percentage of articles with 500+ character abstracts increased from 1% to 95%. The number of different journals in Scopus also increased exponentially, but slowed down from 2010, with the number of articles per journal being approximately constant until 1980, then tripling due to megajournals and online-only publishing. The breadth of Scopus, in terms of the number of narrow fields with substantial numbers of articles, simultaneously increased from one field having 1000 articles in 1945 to 308 in 2020. Scopus’s international character also radically changed, from 68% of first authors from Germany and the USA in 1900 to just 17% in 2020, with China dominating (25%). Peer Review https://publons.com/publon/10.1162/qss_a_00177
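The "percentage of articles with 500+ character abstracts" statistic is simple to compute once records are available. The sketch below assumes a hypothetical in-memory list of `(year, abstract)` pairs; it does not use the Scopus API or the article's data.

```python
from collections import defaultdict


def abstract_share_by_year(records, min_chars=500):
    """Return {year: percentage of records whose abstract has >= min_chars characters}."""
    totals = defaultdict(int)
    with_abstract = defaultdict(int)
    for year, abstract in records:
        totals[year] += 1
        if abstract is not None and len(abstract) >= min_chars:
            with_abstract[year] += 1
    return {year: 100.0 * with_abstract[year] / totals[year] for year in sorted(totals)}


# Hypothetical records: (publication year, abstract text or None).
records = [
    (1900, None),
    (1900, "short"),
    (2020, "x" * 500),
    (2020, "y" * 700),
    (2020, "short"),
    (2020, "z" * 501),
]
print(abstract_share_by_year(records))
# {1900: 0.0, 2020: 75.0}
```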


2022 ◽  
Vol 23 (1) ◽  
pp. 82-94
Author(s):  
Febiarty Wulan Suci ◽  
Nur Hayatin ◽  
Yuda Munarko

Stemming plays an important role in text processing. Stemming differs from language to language and depends strongly on the language of the text. In addition, each language has different rules for the use of affixed words. A large number of the words used in Indonesian are formed by combining root words with affixes and other combining forms. One of the problems in Indonesian stemming is that the language has different types of affixes, including some prefixes that change according to the first letter of the root word. Applying the Idris stemmer to Indonesian text is of interest because Indonesian and Malay share the same language root. However, the results do not always produce the actual root word, because the Idris algorithm first removes the prefix according to Rule 2. This removal directly affects the Idris stemmer's results when it is applied to Indonesian text. In this study, we focus on modifying the Idris stemmer (for Malay) into IN-Idris for the Indonesian context. To test the proposed modification of the original algorithm, excerpts from Indonesian online novels are used to measure the performance of IN-Idris. A test was conducted to compare the proposed algorithm with other stemmers. In the experiment, IN-Idris had an accuracy of approximately 82.81%, an increase of up to 5.25% over the accuracy of Idris. Moreover, the proposed stemmer also runs faster than Idris, by around 0.25 seconds.
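The prefix behaviour described above, where the prefix changes shape with the first letter of the root, can be illustrated with a small dictionary-checked affix stripper. This is neither the Idris nor the IN-Idris algorithm (their rules, including Rule 2, are not given in the abstract); the root list, rule table and function name are assumptions for illustration.

```python
# Tiny illustrative root dictionary (an assumption; a real one has thousands of entries).
ROOTS = {"baca", "tulis", "sapu", "pukul", "kirim", "makan"}

# (prefix, letters that may have been dropped from the start of the root).
# For example "menulis" = men + (t)ulis and "menyapu" = meny + (s)apu.
PREFIX_RULES = [
    ("meny", ["s"]),
    ("meng", ["", "k"]),
    ("mem", ["", "p"]),
    ("men", ["", "t"]),
    ("me", [""]),
    ("di", [""]),
]
SUFFIXES = ["kan", "an", "i"]


def stem(word):
    """Return the dictionary root of word, or word unchanged if none is found."""
    if word in ROOTS:
        return word  # checking first avoids over-stemming roots such as "makan"
    candidates = [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidates.append(word[: -len(suffix)])
    for cand in candidates:
        if cand in ROOTS:
            return cand
        for prefix, restored in PREFIX_RULES:
            if cand.startswith(prefix):
                rest = cand[len(prefix):]
                for letter in restored:
                    if letter + rest in ROOTS:
                        return letter + rest
    return word


for w in ["membaca", "menulis", "menyapu", "memukul", "mengirimkan", "makanan"]:
    print(w, "->", stem(w))
```

Checking every candidate against a root dictionary is one common way to avoid producing non-words; the order in which prefixes and suffixes are stripped is exactly the kind of design decision the abstract says affects accuracy.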


2022 ◽  
Vol 126 ◽  
pp. 107016
Author(s):  
Valeria A. Pfeifer ◽  
Emma L. Armstrong ◽  
Vicky Tzuyin Lai

Author(s):  
Ling Wang ◽  
Minglei Shan ◽  
Tong Li ◽  
Yingxuan Tang ◽  
Tiehua Zhou

Author(s):  
Б.Х. Борлыкова ◽  
Б.В. Меняев ◽  
Т.В. Басанова

The present article is the first to consider the onomasticon of the Sart-Kalmyk version of the Jangar epic on the basis of the methodology developed by O. D. Surikova (2020). The authors propose a systematization of the onomasticon of the epic, give the etymology of the onyms, indicate their frequency, and identify variants of the proper names in other versions of the Jangar epic and in folklore samples of the Sart-Kalmyks. The study uses a combined method of linguistic analysis, including a descriptive method and methods of contextual, comparative and statistical analysis. The material for the analysis was the text of the Jangar manuscript, written down in 1929 by A. V. Burdukov from Bakhi Sarpekov (born 1872) in the village of Chelpek, Issyk-Kul region, Kyrgyzstan. Published songs of the Kalmyk and Xinjiang Oirat versions of the Jangar epic, lexicographic sources, and the authors' personal field notes were used as additional material. As a result of text processing, three groups of proper names were identified: toponyms, anthroponyms, and hipponyms (names of horses). The most frequent names in the Sart-Kalmyk version of the epic are Buddhist anthroponyms, which is evidently connected with the strong influence of Buddhism on this archaic genre, the epic. The presence of the names of the water bodies Ili and Tekes indicates the historical homeland of the Sart-Kalmyks, Dzungaria. In general, the study of the proper names extracted by the authors from the text of A. V. Burdukov's manuscript is useful from the point of view of linguistic geography and helps to identify patterns of naming processes in the Sart-Kalmyk language.


Neofilolog ◽  
2021 ◽  
pp. 265-279
Author(s):  
Marta Gierzyńska

Refining reading comprehension skills is one of the basic tasks that students of foreign languages and their teachers face. Some of the texts offered in students' books do not interest pupils in the topic, or do not motivate or encourage them enough to develop their reading comprehension ability. The reason is that these texts are compulsory: students have to read them, which makes the readings less interesting. Learning foreign languages should essentially be focused on their later use in extracurricular, personal and then occupational settings. Since reading is necessary, choosing simple books in a foreign language is worth considering. Students may perceive such texts differently from the compulsory readings in class. What is more, reading longer literary texts that have been didactically adapted, and understanding them properly, will certainly be satisfying for the student. This article gives examples of using simplified readings in global and detailed text processing, which in turn takes into account techniques such as pre-reading, while-reading and post-reading. Furthermore, the article presents ideas for including Information and Communication Technologies in teaching, which influences students' engagement and allows the teacher to monitor their work.


Author(s):  
Harsh Goyal ◽  
Piyush Piyush ◽  
Ravinder Ravinder ◽  
Pooja Gupta

Medication side effects are a major problem worldwide, and thousands of people die every year because of wrong prescriptions. Most of these mistakes are due to illegible handwriting, which leads to patients taking the wrong medicine or dosage. To solve this issue, voice-based prescription has been introduced: the prescription is taken as voice input, and a PDF file is generated and then emailed to the patient. This method can save money and lives throughout the world, particularly in developing countries where prescriptions are generally paper-based. The system proposed in this paper is for doctors and hospitals that still use paper-based handwritten prescriptions. Keywords: Healthcare, Voice-based, Python, Natural Language Processing (NLP), Electronic Prescription, Text Processing, Electronic Health Record (EHR).

