Statistics-based Evaluation of English Multi-Word Expressions

The extraction of linguistic and statistical information is an important aspect of text processing. The extraction of Multi-Word Expressions (MWEs) plays a key role in text processing, as MWEs are used to find the correct meaning of a text phrase. MWEs are lexical phrases consisting of two or more words that together convey a meaning different from that of their constituent words. The linguistic side of MWE extraction mainly concerns textual information, including Part-of-Speech (POS) tags, grammar rules, related literature, and so on. It is important to extract the correct MWEs for a particular language, as languages exhibit great variety. The selection of MWEs is based on statistical analysis of the extraction process. In the proposed work, MWE extraction is performed on an English dataset. Along with the existing statistical measures, i.e. Pointwise Mutual Information (PMI), the Dice Coefficient (DC), and the Modified Dice Coefficient (MDC), three additional measures are evaluated: Lexical Fixedness (LF), Syntactic Fixedness (SF), and the Relevance Measure (RM). The results are compared with other existing approaches applied to English MWEs, and they show that the proposed measures LF, SF, and RM are more significant than the existing measures for finding the best statistics for the MWE extraction process. The process model is generic in nature and not tied to a particular language; it can be applied to other languages by selecting the POS tags of that language.
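Two of the association measures named above, PMI and the Dice Coefficient, can be sketched in a few lines of Python. The toy corpus and the restriction to bigram candidates are assumptions made for the example, not the paper's data or setup.

```python
import math
from collections import Counter

# Illustrative toy corpus; the paper uses a real English dataset.
tokens = ("kick the bucket spill the beans kick the bucket "
          "kick the habit spill the beans").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(x, y):
    # Pointwise Mutual Information: log2( p(x,y) / (p(x) * p(y)) )
    p_xy = bigrams[(x, y)] / (n - 1)
    return math.log2(p_xy / ((unigrams[x] / n) * (unigrams[y] / n)))

def dice(x, y):
    # Dice Coefficient: 2 * f(x,y) / (f(x) + f(y))
    return 2 * bigrams[(x, y)] / (unigrams[x] + unigrams[y])

for (x, y), f in bigrams.most_common(3):
    print(f"{x} {y}: freq={f} PMI={pmi(x, y):.2f} Dice={dice(x, y):.2f}")
```

Candidates scoring above a chosen threshold on such measures are retained as likely MWEs; the fixedness and relevance measures proposed in the paper would be computed over the same candidate lists.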

TAPPI Journal ◽  
2012 ◽  
Vol 11 (8) ◽  
pp. 17-24 ◽  
Author(s):  
HAKIM GHEZZAZ ◽  
LUC PELLETIER ◽  
PAUL R. STUART

The evaluation and process risk assessment of (a) lignin precipitation from black liquor, and (b) near-neutral hemicellulose pre-extraction for recovery boiler debottlenecking in an existing pulp mill, are presented in Part I of this paper, which was published in the July 2012 issue of TAPPI Journal. In Part II, the economic assessment of the two biorefinery process options is presented and interpreted. A mill process model was developed using WinGEMS software and used for calculating the mass and energy balances. Investment costs, operating costs, and profitability of the two biorefinery options have been calculated using standard cost estimation methods. The results show that the two biorefinery options are profitable for the case study mill and effective at process debottlenecking. The after-tax internal rate of return (IRR) of the lignin precipitation process option was estimated to be 95%, while that of the hemicellulose pre-extraction process option was 28%. Sensitivity analysis showed that the after-tax IRR of the lignin precipitation process remains higher than that of the hemicellulose pre-extraction process option for all changes in the selected sensitivity parameters. If we consider the after-tax IRR, as well as capital cost, as selection criteria, the results show that for the case study mill, the lignin precipitation process is more promising than the near-neutral hemicellulose pre-extraction process. However, the comparison between the two biorefinery options should include long-term evaluation criteria. The potential of high value-added products that could be produced from lignin in the case of the lignin precipitation process, or from ethanol and acetic acid in the case of the hemicellulose pre-extraction process, should also be considered in the selection of the most promising process option.


Author(s):  
Nindian Puspa Dewi ◽  
Ubaidi Ubaidi

POS tagging is the foundation for developing text processing for a language. In this study, we examine the effect of using a lexicon, and of changes in word morphology, on determining the correct tag for a word. Rules based on word morphology, such as prefixes, suffixes, and infixes, are commonly called lexical rules. This study applies the lexical rules produced by a learner using the Brill Tagger algorithm. Madurese is a regional language spoken on Madura Island and several other islands in East Java. The object of this research is Madurese, which has far more affixation variants than Indonesian. In this study, the lexicon is used not only to look up Madurese root words but also as one stage of POS tag assignment. Experiments using the lexicon achieved an accuracy of 86.61%, whereas without the lexicon the accuracy was only 28.95%. It can be concluded that the lexicon has a strong influence on POS tagging.
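The lexicon-first, lexical-rule-fallback idea described above can be sketched as follows. The Madurese entries, the affix rules, and the default tag are invented placeholders for illustration, not the study's actual lexicon or its learned Brill rules.

```python
# Hypothetical lexicon: root word -> POS tag (placeholder entries).
lexicon = {"oreng": "NOUN", "mangan": "VERB", "bagus": "ADJ"}

# Toy lexical (affix) rules in the spirit of Brill's prefix/suffix
# transformations: (affix, position, tag). These rules are invented.
affix_rules = [
    ("a", "prefix", "VERB"),   # hypothetical: a- prefix marks verbs
    ("an", "suffix", "NOUN"),  # hypothetical: -an suffix marks nouns
]

def tag(word, default="NOUN"):
    if word in lexicon:                       # 1. lexicon lookup
        return lexicon[word]
    for affix, pos, t in affix_rules:         # 2. lexical (affix) rules
        if pos == "prefix" and word.startswith(affix):
            return t
        if pos == "suffix" and word.endswith(affix):
            return t
    return default                            # 3. fallback tag

print([(w, tag(w)) for w in ["oreng", "akalak", "makanan"]])
```

The large accuracy gap reported (86.61% with the lexicon versus 28.95% without) corresponds to how often step 1 succeeds before the weaker affix heuristics are consulted.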


2021 ◽  
Vol 2021 (2) ◽  
pp. 19-23
Author(s):  
Anastasiya Ivanova ◽  
Aleksandr Kuz'menko ◽  
Rodion Filippov ◽  
Lyudmila Filippova ◽  
Anna Sazonova ◽  
...  

The task of producing a chatbot based on a neural network involves machine processing of text, which in turn requires various methods and techniques for analyzing phrases and sentences. The article considers the most popular solutions and models for analyzing data in text format: lemmatization, vectorization, and machine learning methods. Particular attention is paid to text processing techniques; after analyzing them, the best method was identified and tested.


Author(s):  
Ayush Srivastav ◽  
Hera Khan ◽  
Amit Kumar Mishra

The chapter provides an eloquent account of the major methodologies and advances in the field of Natural Language Processing. The most popular models that have been used over time for Natural Language Processing are discussed along with their applications to specific tasks. The chapter begins with the fundamental concepts of regex and tokenization. It provides an insight into text preprocessing and its methodologies, such as Stemming and Lemmatization and Stop Word Removal, followed by Part-of-Speech tagging and Named Entity Recognition. Further, the chapter elaborates on the concept of Word Embedding, its various types, and some common frameworks such as word2vec, GloVe, and fastText. A brief description of classification algorithms used in Natural Language Processing is provided next, followed by Neural Networks and their advanced forms, such as Recursive Neural Networks and Seq2seq models, which are used in Computational Linguistics. A brief description of chatbots and Memory Networks concludes the chapter.
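A minimal sketch of the preprocessing steps the chapter opens with (regex tokenization, stop-word removal, and stemming) might look like this. The stop-word list and the naive suffix rules are toy stand-ins for real tools such as the Porter stemmer, not anything specified in the chapter.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to"}

def tokenize(text):
    # Lowercase and extract alphabetic runs (the "regex" step).
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Naive suffix stripping, far weaker than a real stemmer:
    # crudely removes -ing, -ed, -s from sufficiently long tokens.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(preprocess("The dogs are running to the parks"))
```

Real pipelines would swap in a proper stemmer or lemmatizer and a curated stop-word list, but the stage order shown here matches the chapter's progression.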


Author(s):  
Marina Sokolova ◽  
Stan Szpakowicz

This chapter presents applications of machine learning techniques to traditional problems in natural language processing, including part-of-speech tagging, entity recognition and word-sense disambiguation. People usually solve such problems without difficulty, or at least do a very good job. Linguistics may suggest labour-intensive ways of manually constructing rule-based systems; it is, however, the easy availability of large collections of texts that has made machine learning the method of choice for processing volumes of data well above human capacity. One of the main purposes of text processing is all manner of information and knowledge extraction from such large text collections. The machine learning methods discussed in this chapter have stimulated wide-ranging research in natural language processing and helped build applications with serious deployment potential.


2018 ◽  
Vol 14 (3) ◽  
pp. 167-183
Author(s):  
Ahmed Ktob ◽  
Zhoujun Li

This article describes how many new technologies have recently been introduced to the web, linked data probably being the most important. Individuals and organizations have started publishing their data on the web, adhering to a set of best practices. This data is published mostly in English; hence, only English agents can consume it. Meanwhile, although the number of Arabic users on the web is immense, few Arabic datasets are published. Publication catalogs are one of the primary sources of Arabic data that is not being exploited. Arabic catalogs provide a significant amount of meaningful data and metadata, commonly stored in Excel sheets. In this article, an effort has been made to help publishers easily and efficiently share their catalogs' data as linked data. Marefa is the first implemented tool that automatically extracts RDF triples from Arabic catalogs, aligns them to the BIBO ontology, and links them with the Arabic chapter of DBpedia. An evaluation of the framework was conducted, and statistical measures were generated during the different phases of the extraction process.


Author(s):  
Amir Adel Mabrouk Eldeib, Moulay Ibrahim El-Khalil Ghembaza

The science of diacritical marks is closely tied to the Holy Quran, where it was used to remove confusion and error from the reader's pronunciation. Introducing any technique into the processing of Quranic texts therefore facilitates the tasks of researchers in the field of Quranic studies, whether the reader of the Quran, who is helped to recite accurately and correctly, or the tutor, who is helped to compile a suitable set of training examples. The importance of this research lies in employing automated text-processing algorithms to determine the locations of the Nunation vowelization types in the Holy Quran, and in the possibility of computerizing them, in order to facilitate accurate recitation of the Holy Quran and, at the same time, to collect training examples in a database or corpus for future use in research and software applications for the Holy Quran and its sciences. This research proposes a framework architecture that automatically identifies the locations and types of Nunation in the Holy Quran. It is based on a part-of-speech tagging algorithm for Arabic that determines the type of each word; a knowledge base is then used to discover the appropriate Nunation words and their locations; and finally the type of Nunation is identified, so as to determine the vowelization of the last letter of each Nunation word according to the science of Quranic diacritical marks. A further benefit is linking search over Quranic texts to the extraction of the composition Nunation and the sequence Nunations in the Holy Quran, which emerge from the science of Quranic diacritical marks, and displaying them as data according to a set of options selected by the user through suitable application interfaces.
The basic elements that the results of searching Quranic texts should display are highlighted, in order to extract the positions and types of Nunation vowelizations. A template for the results of searching all types of Nunation in a specific Quranic chapter is also given, with several options to retrieve all data in detail.
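The core detection step, locating tanween (nunation) marks in vocalized Arabic text, can be sketched using the Unicode code points of the three nunation diacritics. The sample phrase is illustrative, and the POS-tagging and knowledge-base stages of the proposed framework are omitted here.

```python
# The three Arabic tanween (nunation) diacritics and their Unicode
# code points: U+064B fathatan, U+064C dammatan, U+064D kasratan.
TANWEEN = {
    "\u064B": "fathatan",   # accusative nunation
    "\u064C": "dammatan",   # nominative nunation
    "\u064D": "kasratan",   # genitive nunation
}

def find_nunation(text):
    """Return (word_index, word, tanween_type) for each nunated word."""
    hits = []
    for i, word in enumerate(text.split()):
        for ch in word:
            if ch in TANWEEN:
                hits.append((i, word, TANWEEN[ch]))
    return hits

# Illustrative vocalized phrase with two dammatan marks.
sample = "جاء رجلٌ كريمٌ"
print(find_nunation(sample))
```

The full framework would feed these candidate positions to the POS tagger and knowledge base to decide which marks constitute the Nunation types of interest.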


2017 ◽  
Vol 2017 ◽  
pp. 1-5
Author(s):  
Yunyu Shi ◽  
Jianfang Shan ◽  
Xiang Liu ◽  
Yongxiang Xia

Text representation is a basic issue in text information processing, and events play an important role in text understanding; both attract the attention of scholars. An event network encodes lexical relations within events, and its edges express logical relations between events in a document. However, the events and relations are extracted from event-annotated text, which makes large-scale automatic text processing difficult. In this paper, taking the expanded CEC (Chinese Event Corpus) as the data source and prior knowledge of the manifestation rules of events and relations as a guide, we propose an event extraction method based on knowledge-based rules of event manifestation, to achieve automatic construction of event networks and improve text processing performance.


2020 ◽  
pp. 45-58
Author(s):  
Oleksandr Ishchenko ◽  

The study analyzes speech pauses in Ukrainian. The research material consists of audio texts of spontaneous conversational speech with customary pronunciation and intonation, as well as non-spontaneous (read) speech with clear pronunciation and expressive intonation. We show a robust tendency toward a high frequency of pauses after nouns, which suggests that nouns act as predictors of pausing. The frequency of pausing after verbs is slightly lower, and the probability of a pause after any other part of speech is much lower still, although pauses can occur after words of any grammatical category. These findings hold virtually equally for both spontaneous conversational speech and non-spontaneous speech (clearly intonated reading). The effect of nouns on pause occurrence may reflect a universal property of human language: it has recently been shown that nouns slow down speech across structurally and culturally diverse languages, because nouns load the cognitive processes of speech-production planning more heavily than verbs and other parts of speech do. At the same time, some features of Ukrainian, also characteristic of other Slavic languages, affect pausing after nouns: in Ukrainian prosodic phrasing, interpausal utterances are usually finalized by nouns (more rarely by verbs or other principal parts of speech), which carry the greatest semantic load. Pauses do not follow every noun, because their use in speech segmentation depends on linguistic factors (the linguistic structure of speech), physiological factors (individual speech production and breathing), and psycholingual factors. We suggest that the priming effect, as a noun- and verb-induced psycholingual factor, can significantly affect pausing in spoken language.
Statistical measures show the following: the average pause duration is 430 ms ±60% in non-spontaneous clear expressive speech and 355 ms ±50% in spontaneous customary speech. Thus, pauses in non-spontaneous speech are longer than in spontaneous speech, as indicated both by the mean pause durations (ms) and by the relative standard deviations of pause durations (±%). Keywords: expressive speech, spontaneous speech, phonetics, prosody, speech pauses, pausing, prepausal words, nouns, verbs.
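The two statistics quoted above, the mean pause duration (ms) and the relative standard deviation (±%), can be reproduced on sample data with the standard library. The pause durations below are invented for illustration, not the study's measurements.

```python
import statistics

# Hypothetical pause durations in milliseconds (invented sample).
pauses_ms = [200, 350, 500, 280, 445]

mean = statistics.mean(pauses_ms)
# Relative standard deviation: population std dev as a percentage
# of the mean, matching the abstract's "±%" notation.
rsd = statistics.pstdev(pauses_ms) / mean * 100

print(f"average pause: {mean:.0f} ms ±{rsd:.0f}%")
```

On the study's data this computation yields 430 ms ±60% for non-spontaneous expressive speech and 355 ms ±50% for spontaneous speech.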


2021 ◽  
Vol 6 (2) ◽  
pp. 110-122
Author(s):  
Siti Drivoka Sulistyaningrum ◽  
Trisya Avianka

Background: Machine translation has proved to be a favourable tool for students. However, existing research focuses on students' difficulties in academic writing rather than on how machine translation can help overcome them. This research therefore aims to determine how machine translation might be optimized to help mechanical engineering vocational education students with academic writing difficulties. Methodology: The data was collected from 27 second-semester mechanical engineering vocational education students enrolled in an English college course at one of the universities in Jakarta. Online questionnaires were used to obtain the data, which was analyzed and interpreted descriptively. Questionnaire 1 was used to determine whether or not the subjects utilized machine translation and, if so, which type of machine translation they used most frequently. Questionnaire 2 was split into two sections: Part A, adapted from Xiao & Chen (2015), describes students' challenges with academic writing and comprises 12 items delivered to the 27 students via Google Form, while Part B was adapted from the findings of Lee (2020). Findings: The results revealed that the 27 students encountered several academic writing difficulties, such as grammar (constructing grammatically correct sentences, using appropriate tenses), expressions (discourse markers, parts of speech), and vocabulary (making proper vocabulary choices and finding synonyms). Grammar problems were the most challenging, followed by vocabulary and expressions. Optimizing machine translation was also found to be most effective for overcoming vocabulary issues, followed by grammar and expressions. Conclusion: Academic writing issues emerge in the classroom, and most of the difficulties students encountered fell into the grammar aspect.
On the other hand, the students considered machine translation most helpful for overcoming their vocabulary challenges. Although machine translation helps with academic writing difficulties such as developing vocabulary skills, increasing knowledge of grammar rules in context, and finding more authentic expressions, teachers should also guide students in academic writing. Keywords: optimizing machine translation; academic writing difficulties; grammar; vocabulary; expressions.

