STEMUR: An Automated Word Conflation Algorithm for the Urdu Language

Author(s):  
Tayyaba Fatima ◽  
Raees Ul Islam ◽  
Muhammad Waqas Anwar ◽  
M. Hasan Jamal ◽  
M. Tayyab Chaudhry ◽  
...  

Stemming is a common word conflation method that identifies the stem embedded in a word and reduces the word to that stem (root), conflating all morphologically related terms into a single term without performing a complete morphological analysis. This article presents STEMUR, an enhanced stemming algorithm for automatic word conflation for the Urdu language. In addition to handling words with prefixes and suffixes, STEMUR also handles words with infixes. Rather than using a totally unsupervised approach, we used linguistic knowledge to develop a collection of patterns for Urdu infixes, which improves the accuracy of the stems and affixes acquired during the training process. STEMUR also handles English loan words and words with more than one affix. STEMUR is compared with four existing Urdu stemmers, including Assas-Band and the template-based stemmer, which are also implemented in this study. Results are computed on two corpora containing 89,437 and 30,907 words, respectively. They show clear improvements in the strength and accuracy of STEMUR. The use of the maximum possible number of infix rules boosted our stemmer's accuracy up to 93.1% and helped us achieve a precision of 98.9%.
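The general idea of rule-based stemming with prefixes, suffixes, and internal patterns can be illustrated with a minimal sketch. This is not the STEMUR algorithm: the rule tables, the `MIN_STEM_LEN` guard, and the function names are hypothetical, and English placeholder rules (including a vowel-change pattern as a rough stand-in for an infix rule) are used only so the example is easy to read.

```python
import re

# Hypothetical placeholder rules, NOT the rules used by STEMUR.
MIN_STEM_LEN = 3  # assumed guard against over-stemming
PREFIXES = ("un", "re")
SUFFIXES = ("ness", "ing", "ers", "er", "s")
# Internal-change patterns: a regex plus a replacement template.
INFIX_RULES = [(re.compile(r"^(f|t)ee(t|th)$"), r"\1oo\2")]


def strip_affixes(word):
    """Strip prefixes and suffixes repeatedly, so multiple affixes are handled."""
    changed = True
    while changed:
        changed = False
        for prefix in PREFIXES:
            if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
                word = word[len(prefix):]
                changed = True
                break
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[:-len(suffix)]
                changed = True
                break
    return word


def stem(word):
    """Apply an internal-pattern rule if one matches, else strip affixes."""
    word = word.lower()
    for pattern, template in INFIX_RULES:
        if pattern.match(word):
            return pattern.sub(template, word)
    return strip_affixes(word)


for w in ("unhelpers", "rereading", "feet", "keep"):
    print(w, "->", stem(w))
```

Checking the internal patterns before affix stripping is a design choice of this sketch; a real stemmer would need language-specific ordering and exception lists.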

Werkwinkel ◽  
2015 ◽  
Vol 10 (2) ◽  
pp. 155-166
Author(s):  
Agata Kowalska-Szubert

The Polish language contains hundreds of loan words from Dutch. They are rooted so firmly that they are capable of creating new words. This article presents the most common word-formation phenomena involving Dutch loan words. It also highlights their ability to form phrasemes and transfer meanings.


2021 ◽  
Vol 13 (21) ◽  
pp. 12210
Author(s):  
Manel Elmsalmi ◽  
Wafik Hachicha ◽  
Awad M. Aljuaid

Supply chain risk management (SCRM) is critical to strategically supporting firms' continued success. The SCRM process has at least three basic steps: risk identification, risk evaluation, and risk mitigation (treatment). In every case, the main step is risk mitigation (RM), and in particular sustainable RM, because every risk must be eliminated or controlled as far as possible. The purpose of this paper is to elaborate and evaluate various RM scenarios starting from an initial risk identification and prioritization solution. The proposed scenario modeling technique is based on morphological analysis (MA) as an explorative scenario tool for RM. MA is used to develop a framework to proactively assess critical risk variables: first, MA is employed to exhaustively create possible RM scenarios and, second, to assess the likelihood of each scenario. The proposed approach addresses the need for a basic rubric to help identify and choose RM approaches. A real case study from the food industry illustrates the application of the proposed approach. To handle all possible MA strategies, a dedicated MORPHOL software package is used. In addition, RM strategies are selected based on sustainability indicators. The case study results show that MA has considerable value for SCRM: firms can adopt multiple robust strategies in the form of a scenario that describes all stages of SCRM in an integrated representation.
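The exhaustive scenario generation step of morphological analysis can be sketched as a cross product over the values of each variable, filtered by pairwise incompatibilities. This is an illustrative sketch only: the variables, values, and the incompatible pair below are invented for the example and are not taken from the case study or from the MORPHOL package.

```python
from itertools import combinations, product

# Hypothetical morphological field: each variable with its possible values.
field = {
    "sourcing": ["single source", "dual source"],
    "inventory": ["lean", "buffer stock"],
    "transport": ["road", "rail"],
}

# Hypothetical pairs of values judged mutually inconsistent.
incompatible = {frozenset({"single source", "lean"})}


def scenarios(field, incompatible):
    """Yield every combination of values that contains no incompatible pair."""
    names = list(field)
    for combo in product(*(field[name] for name in names)):
        pairs = (frozenset(pair) for pair in combinations(combo, 2))
        if not any(pair in incompatible for pair in pairs):
            yield dict(zip(names, combo))


feasible = list(scenarios(field, incompatible))
print(len(feasible), "feasible scenarios out of", 2 * 2 * 2)
```

Assessing the likelihood of each surviving scenario, the second step mentioned above, would need additional inputs that this sketch does not model.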


2019 ◽  
Author(s):  
Francis M. Tyers ◽  
Jonathan N. Washington ◽  
Darya Kavitskaya ◽  
Memduh Gökırmak

This paper describes a weighted finite-state morphological transducer for Crimean Tatar able to analyse and generate in both Latin and Cyrillic orthographies. This transducer was developed by a team including a community member and language expert, a field linguist who works with the community, a Turkologist with computational linguistics expertise, and an experienced computational linguist with Turkic expertise. Dealing with two orthographic systems in the same transducer is challenging as they employ different strategies to deal with the spelling of loan words and encode the full range of the language's phonemes and their interaction. We develop the core transducer using the Latin orthography and then design a separate transliteration transducer to map the surface forms to Cyrillic. To help control the non-determinism in the orthographic mapping, we use weights to prioritise forms seen in the corpus. We perform an evaluation of all components of the system, finding an accuracy above 90% for morphological analysis and near 90% for orthographic conversion. This represents the state of the art for Crimean Tatar morphological modelling and, to our knowledge, is the first biscriptual single morphological transducer for any language.


1996 ◽  
Vol 2 (4) ◽  
pp. 367-368
Author(s):  
B. SRINIVAS

There are currently two philosophies for building grammars and parsers: hand-crafted, wide coverage grammars; and statistically induced grammars and parsers. Aside from the methodological differences in grammar construction, the linguistic knowledge which is overt in the rules of handcrafted grammars is hidden in the statistics derived by probabilistic methods, which means that generalizations are also hidden and the full training process must be repeated for each domain. Although handcrafted wide coverage grammars are portable, they can be made more efficient when applied to limited domains, if it is recognized that language in limited domains is usually well constrained and certain linguistic constructions are more frequent than others. We view a domain-independent grammar as a repository of portable grammatical structures whose combinations are to be specialized for a given domain. We use Explanation-Based Learning (EBL) to identify the relevant subset of a handcrafted general purpose grammar (XTAG) needed to parse in a given domain (ATIS). We exploit the key properties of Lexicalized Tree-Adjoining Grammars to view parsing in a limited domain as finite state transduction from strings to their dependency structures.


2020 ◽  
Vol 7 (2) ◽  
Author(s):  
Itzhak OMER ◽  
Orna ZAFRIR-REUVEN

Street patterns of Israeli cities were investigated by comparing three time periods of urban development: (I) the late 19th century until the establishment of the state of Israel in 1948; (II) 1948 until the 1980s; and (III) the late 1980s until the present. These time periods are related respectively to the pre-modern, modern and late-modern urban planning approach. Representative urban street networks were examined in selected cities by means of morphological analysis of typical street pattern properties: curvature, fragmentation, connectivity, continuity and differentiation. The study results reveal significant differences between the street patterns of the three examined periods in the development of cities in Israel. The results show clearly the gradual trends in the intensification of curvature, fragmentation, complexity and hierarchical organization of street networks as well as the weakening of the network's internal and external connectivity. The implications of these changes on connectivity and spatial integration are discussed with respect to planning approaches.


2020 ◽  
Author(s):  
Shao Yang ◽  
Tsair-Wei Chien

BACKGROUND A recent article published on November 27, 2020, is well written but leaves several questionable issues that require further clarification, particularly for readers who hope to replicate the study over a period of months rather than the original days from March 11 to May 19, 2020.
OBJECTIVE To redo the study over a longer period, examine how the results differ from and resemble those of the previous study, and present the results using visual representations.
METHODS Similar search schemes were compared to the gold standard (LitCovid) using three metrics: sensitivity, precision, and F-score. We applied the search schemes to extract publications related to COVID-19 from January to November in PubMed Central (PMC). The Kano model was applied to present the study results in three groups: high sensitivity, high precision, and neutral. Publication counts were compared using a line plot.
RESULTS We observed that the comprehensive search scheme recommended by the original authors ranked third in this study rather than first. Schemes that extracted only a small number of articles from PMC did so because they (1) used coronavirus as the only keyword, (2) were constrained entirely by the term Wuhan, or (3) misused hyphens and spaces in keyword terms. Scheme 2, authored by Shokraneh in BMJ, ranked first, followed by Schemes 9 and 1, with F-scores of 97.9, 90.2, and 87.3, respectively. The single-term search COVID-19 performed best in terms of precision (99.9%) but not well in terms of sensitivity (76.6%) and F-score (86.7%). The term Wuhan virus performed the worst: 24.2% for sensitivity, 90.9% for precision, and 38.2% for F-score, because of the AND condition in the search string. All 32 schemes were compared and displayed on the Kano diagram.
CONCLUSIONS Similar search schemes produced different results over the longer period from January to November 2020. Scheme 2 is recommended for future bibliometric studies related to COVID-19. The Kano diagram can serve as a visual display for comparing search schemes on a dashboard by precision (x-axis), sensitivity (y-axis), and F-score (bubble size).
CLINICALTRIAL Nil
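The relationship between the three metrics can be checked with a short calculation. The sketch below assumes the F-score is the usual harmonic mean of precision and sensitivity (the abstract does not state the formula); the function name `f_score` is chosen here for illustration, and the input figures are the precision and sensitivity values quoted in the abstract.

```python
def f_score(precision, sensitivity):
    """Harmonic mean of precision and sensitivity, both given in percent."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# Figures quoted in the abstract: (precision %, sensitivity %)
reported = {
    "COVID-19": (99.9, 76.6),
    "Wuhan virus": (90.9, 24.2),
}

for term, (p, s) in reported.items():
    print(f"{term}: F = {f_score(p, s):.1f}%")
# prints:
# COVID-19: F = 86.7%
# Wuhan virus: F = 38.2%
```

Because a harmonic mean always lies between its two inputs, an F-score computed this way can never exceed 100%.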


1987 ◽  
Vol 10 (1) ◽  
pp. 1-34
Author(s):  
Mats Eeg-Olofsson

Representative sets of software systems for computational morphology are evaluated as candidates for a general morphological program module in the context of computer-aided word class tagging. They are considered as both programming tools and representations of linguistic knowledge. The systems, which are found to be relatively neutral with respect to linguistic theory, can be grouped into a general-purpose and a special-purpose type. Pattern matching in them is described as a high-level feature applied to the computational treatment of phenomena characteristic of morphological analysis: lexical lookup, morphotactics, and morphophonemic alternation. The systems are found to perform similarly in simple applications, but significantly differently in more complicated ones where integrated and well-structured solutions are sought.


Languages ◽  
2021 ◽  
Vol 6 (3) ◽  
pp. 131
Author(s):  
Begoña Arechabaleta Regulez ◽  
Silvina Montrul

Spanish marks animate and specific direct objects overtly with the preposition a, an instance of Differential Object Marking (DOM). However, in some varieties of Spanish, DOM is advancing to inanimate objects. Language change starts at the individual level, but how does it start? What manifestation of linguistic knowledge does it affect? This study traced this innovative use of DOM in oral production, grammaticality judgments and on-line comprehension (reading task with eye-tracking) in the Spanish of Mexico. Thirty-four native speakers (ages 18–22) from the southeast of Mexico participated in the study. Results showed that the incidence of the innovative use of DOM with inanimate objects varied by task: DOM innovations were detected in on-line processing more than in grammaticality judgments and oral production. Our results support the hypothesis that language variation and change may start with on-line comprehension.


2021 ◽  
Author(s):  
Cailing Lu

This research investigates the nature of vocabulary, especially technical vocabulary, in the specialized discipline of Traditional Chinese medicine (TCM), which is an important area of higher education. It consists of three linked studies corresponding to three research aims and uses a combination of quantitative and qualitative methods. Study 1 addressed the questions of what kinds of words constitute TCM lexis given its origin, and what the vocabulary load of English-medium texts in this discipline is. To answer these questions, a series of lexical analyses was conducted on three corpora: theory-based and practice-based textbook corpora and a journal article corpus, which reflect the main areas of reading for TCM students. The results showed that while high, mid and low-frequency vocabulary make up a fairly large proportion of these texts, other lexical items such as abbreviations, loan words, medical words, proper nouns, and compounds also feature in them, but in differing proportions depending on the text type. Further, this study found that a large vocabulary of 13,000 word families plus four supplementary lists and two TCM-specific lists is needed. This is the point at which most TCM learners can read TCM textbooks and journal articles without vocabulary being a handicap.

Study 2 looked more closely at the technical vocabulary in TCM. The nature of technical vocabulary was explored and TCM technical word lists of both single and multiword units were developed for learners and teachers in this discipline. A total of 2,778 word types were selected for the TCM technical word list based on the criteria of relative keyness in the TCM Corpora compared to a general written English corpus, meaningfulness, and frequency. The list provided 36.65% coverage of the corpora from which it was developed. In addition, a TCM technical lexical bundle list with 898 bundles was developed to supplement the technical word list. The findings suggested that lexical bundles play an essential role in creating the meaning and structure of TCM discourse. Thus, they should be regarded as a basic linguistic construct, since some technical vocabulary needs to be seen in bundles rather than in single words.

The last study bridged the gap between corpus-based word lists and the actual ESP vocabulary learning context by investigating learners’ understanding of the technical words from the technical word list generated in the second study. Results suggested that learners faced different challenges in technical vocabulary learning depending on their linguistic backgrounds. Specifically, Chinese learners had great difficulty with technical words from the lower-frequency bands of BNC/COCA word lists, while Western learners encountered challenges with loan words borrowed from Chinese. As a result, a certain divergence between the Western and Chinese TCM learners’ understanding of technical words was manifested. These findings indicate that a pedagogically useful word list should be adaptable to learners from different linguistic backgrounds.

Drawing on these findings, this thesis also provides methodological, theoretical, and pedagogical implications so that TCM learners can gain better support in their specialized English vocabulary learning. These implications can also enable teachers and course designers to better scaffold their students’ vocabulary development.


Linguistica ◽  
2014 ◽  
Vol 54 (1) ◽  
pp. 471-484
Author(s):  
Viet anh Nguyen

In today’s globalized world, it seems necessary, or even indispensable, for the teaching/learning of foreign languages to be based on the international standards proposed by the Common European Framework of Reference for Languages (CEFRL). The present article deals with issues of integration of the CEFRL in the Vietnamese context by analyzing the results of a study of training programs at six universities specializing in foreign languages, which are based in three regions of the country (Northern, Central and Southern Vietnam). Despite some positive changes and the dynamism characteristic of the approach, a mechanical and rigid introduction of the CEFRL in foreign-language universities in Vietnam has actually caused several problems. These include (1) the inconsistency between the levels established by the CEFRL and the organization of teaching/learning; (2) the risk of teaching/learning becoming too “utilitarian” and too function-oriented; and (3) excessive attention given to the evaluation and assessment of linguistic knowledge and of performance level rather than to the ability to use various resources and to the long-term process of competence development. The study results show some possible ways for the development of a referential frame for learning/teaching French in Vietnam.

