Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool

The Aranea Project offers a set of comparable corpora for two dozens of (mostly European) languages providing a convenient dataset for nLP applications that require training on large amounts of data. The article presents word embedding models trained on the Aranea corpora and an online interface to query the models and visualize the results. The implementation is aimed towards lexicographic use but can be also useful in other fields of linguistic study since the vector space is a plausible model of semantic space of word meanings. Three different models are available – one for a combination of part of speech and lemma, one for raw word forms, and one based on fastText algorithm uses subword vectors and is not limited to whole or known words in finding their semantic relations. The article is describing the interface and major modes of its functionality; it does not try to perform detailed linguistic analysis of presented examples.

Download Full-text

The forms and meanings of grammatical markers support efficient communication

10.31234/osf.io/vf3zj ◽

2021 ◽

Author(s):

Francis Mollica ◽

Geoffrey Iain Bacon ◽

Noga Zaslavsky ◽

Yang Xu ◽

Terry Regier ◽

...

Keyword(s):

Semantic Space ◽

General Information ◽

Efficient Production ◽

Theoretic Analysis ◽

Word Meanings ◽

Information Theoretic ◽

Efficient Communication ◽

Word Forms ◽

Feature Values ◽

Semantic Typology

Functionalist accounts of language suggest that forms are paired with meanings in ways that support efficient communication. Previous work on grammatical marking suggests that word forms have lengths that enable efficient production, and work on the semantic typology of the lexicon suggests that word meanings represent efficient partitions of semantic space. Here we establish a theoretical link between these two lines of work and present an information-theoretic analysis that captures how communicative pressures influence both form and meaning. We apply our approach to the grammatical features of number, tense, and evidentiality, and show that the approach explains both which systems of feature values are attested across languages and the relative lengths of the forms for those feature values. Our approach shows that general information-theoretic principles can capture variation in both form and meaning, across both grammar and the lexicon.

Download Full-text

Analysis of Lecxico-Semantic Relations of Punjabi Shahmukhi Nouns: A Corpus Based Study

International Journal of English Linguistics ◽

10.5539/ijel.v9n3p357 ◽

2019 ◽

Vol 9 (3) ◽

pp. 357

Author(s):

Muhammad Ahmad Hashmi ◽

Muhammad Asim Mahmood ◽

Muhammad Ilyas Mahmood

Keyword(s):

Automatic Speech Recognition ◽

Semantic Relations ◽

Read Aloud ◽

Report Generation ◽

Word Meanings ◽

The World ◽

Word Forms ◽

Open Class ◽

International Status ◽

Different Sources

The current study is an effort in the development of Lecxico-semantic relations among Punjabi Shahmukhi nouns. Semantic relations are those nets, which are found among nouns on the bases of word meanings. Development of semantic nets is taken as a key part while developing WordNet of any language. The WordNet of Punjabi Shahmukhi is not developed yet. The digital exposure and progress of Punjabi Shahmukhi is very slow in comparison to other languages of the world. The present study explores the kind of semantic relations found among the nouns of Punjabi Shahmukhi. WordNet organizes words on the basis of word meanings rather than word forms. WordNet of English includes four open class categories including; nouns, verbs, adverbs and adjectives, but present study is limited to the analysis of nouns. A corpus of 2 million words of Punjabi Shahmukhi was taken from different sources. Then, it was POS tagged and a list of 846 nouns was generated. Then, each noun was analyzed individually to develop its Lecxico-semantic relations including: synonymy, antonymy, meronyms, holonymy, hyponymy, hypernymy, singular, plural, masculine, feminine and HAS a part. The present research is significant and useful in the development of WordNet for Punjabi Shahmukhi. With the development of WordNet, it will be possible to run digital applications in Punjabi Shahmukhi including: machine translation, information retrieval, querying archive and report generation to automatic speech recognition, data mining, read aloud, robotics and many more. On the other hand, WordNet will help to maintain an international status for Punjabi Shahmukhi.

Download Full-text

The forms and meanings of grammatical markers support efficient communication

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2025993118 ◽

2021 ◽

Vol 118 (49) ◽

pp. e2025993118

Author(s):

Francis Mollica ◽

Geoff Bacon ◽

Noga Zaslavsky ◽

Yang Xu ◽

Terry Regier ◽

...

Keyword(s):

Semantic Space ◽

General Information ◽

Efficient Production ◽

Theoretic Analysis ◽

Word Meanings ◽

Information Theoretic ◽

Efficient Communication ◽

Word Forms ◽

Feature Values ◽

Semantic Typology

Functionalist accounts of language suggest that forms are paired with meanings in ways that support efficient communication. Previous work on grammatical marking suggests that word forms have lengths that enable efficient production, and work on the semantic typology of the lexicon suggests that word meanings represent efficient partitions of semantic space. Here we establish a theoretical link between these two lines of work and present an information-theoretic analysis that captures how communicative pressures influence both form and meaning. We apply our approach to the grammatical features of number, tense, and evidentiality and show that the approach explains both which systems of feature values are attested across languages and the relative lengths of the forms for those feature values. Our approach shows that general information-theoretic principles can capture variation in both form and meaning across languages.

Download Full-text

Part-of-speech tagging for web search queries using a large-scale web corpus

Proceedings of the Symposium on Applied Computing - SAC '17 ◽

10.1145/3019612.3019694 ◽

2017 ◽

Cited By ~ 1

Author(s):

Atsushi Keyaki ◽

Jun Miyazaki

Keyword(s):

Large Scale ◽

Web Search ◽

Search Queries ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

Download Full-text

Why Words Are Hard for Adults With Developmental Language Impairments

Journal of Speech Language and Hearing Research ◽

10.1044/1092-4388(2013/12-0233) ◽

2013 ◽

Vol 56 (6) ◽

pp. 1845-1856 ◽

Cited By ~ 33

Author(s):

Karla K. McGregor ◽

Ulla Licandro ◽

Richard Arenas ◽

Nichole Eden ◽

Derek Stiles ◽

...

Keyword(s):

Language Impairment ◽

Normal Development ◽

Declarative Memory ◽

Poor Performance ◽

Language Impairments ◽

Learning Problems ◽

Word Form ◽

Word Meanings ◽

Word Forms ◽

Similarities And Differences

Purpose To determine whether word learning problems associated with developmental language impairment (LI) reflect deficits in encoding or subsequent remembering of forms and meanings. Method Sixty-nine 18- to 25-year-olds with LI or without (the normal development [ND] group) took tests to measure learning of 16 word forms and meanings immediately after training (encoding) and 12 hr, 24 hr, and 1 week later (remembering). Half of the participants trained in the morning, and half trained in the evening. Results At immediate posttest, participants with LI performed more poorly on form and meaning than those with ND. Poor performance was more likely among those with more severe LI. The LI–ND gap for word form recall widened over 1 week. In contrast, the LI and ND groups demonstrated no difference in remembering word meanings over the week. In both groups, participants who trained in the evening, and therefore slept shortly after training, demonstrated greater gains in meaning recall than those who trained in the morning. Conclusions Some adults with LI have encoding deficits that limit the addition of word forms and meanings to the lexicon. Similarities and differences in patterns of remembering in the LI and ND groups motivate the hypothesis that consolidation of declarative memory is a strength for adults with LI.

Download Full-text

Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way

Electronics ◽

10.3390/electronics10121372 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1372

Author(s):

Sanjanasri JP ◽

Vijay Krishna Menon ◽

Soman KP ◽

Rajendran S ◽

Agnieszka Wolk

Keyword(s):

Deep Learning ◽

Language Processing ◽

Semantic Space ◽

Semantic Interpretation ◽

Learning Approaches ◽

Qualitative Comparison ◽

Bilingual Dictionary ◽

Pos Tagging ◽

Part Of Speech ◽

Cross Lingual

Linguists have been focused on a qualitative comparison of the semantics from different languages. Evaluation of the semantic interpretation among disparate language pairs like English and Tamil is an even more formidable task than for Slavic languages. The concept of word embedding in Natural Language Processing (NLP) has enabled a felicitous opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised for the generation of embeddings to assess their effectiveness, using the original embeddings as ground truths. Transferability across other target languages of the proposed model was assessed via pre-trained Word2Vec embeddings from Hindi and Chinese languages. We empirically prove that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of generated target embeddings in a few NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI), bearing in mind that those are not the only possible applications.

Download Full-text

Adaptive Word Embedding Module for Semantic Reasoning in Large-scale Detection

2020 25th International Conference on Pattern Recognition (ICPR) ◽

10.1109/icpr48806.2021.9412094 ◽

2021 ◽

Author(s):

Yu Zhang ◽

Xiaoyu Wu ◽

Ruolin Zhu

Keyword(s):

Large Scale ◽

Word Embedding ◽

Semantic Reasoning ◽

Scale Detection

Download Full-text

Augmenting Bug Localization with Part-of-Speech and Invocation

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194017500346 ◽

2017 ◽

Vol 27 (06) ◽

pp. 925-949 ◽

Cited By ~ 5

Author(s):

Yu Zhou ◽

Yanxiang Tong ◽

Taolue Chen ◽

Jin Han

Keyword(s):

Software Maintenance ◽

Large Scale ◽

Bug Localization ◽

Bug Reports ◽

Part Of Speech ◽

Adaptive Technique ◽

Bug Report ◽

Software Maintenance And Evolution ◽

Speech Features ◽

Localization Approach

Bug localization represents one of the most expensive, as well as time-consuming, activities during software maintenance and evolution. To alleviate the workload of developers, numerous methods have been proposed to automate this process and narrow down the scope of reviewing buggy files. In this paper, we present a novel buggy source-file localization approach, using the information from both the bug reports and the source files. We leverage the part-of-speech features of bug reports and the invocation relationship among source files. We also integrate an adaptive technique to further optimize the performance of the approach. The adaptive technique discriminates Top 1 and Top N recommendations for a given bug report and consists of two modules. One module is to maximize the accuracy of the first recommended file, and the other one aims at improving the accuracy of the fixed defect file list. We evaluate our approach on six large-scale open source projects, i.e. ASpectJ, Eclipse, SWT, Zxing, Birt and Tomcat. Compared to the previous work, empirical results show that our approach can improve the overall prediction performance in all of these cases. Particularly, in terms of the Top 1 recommendation accuracy, our approach achieves an enhancement from 22.73% to 39.86% for ASpectJ, from 24.36% to 30.76% for Eclipse, from 31.63% to 46.94% for SWT, from 40% to 55% for ZXing, from 7.97% to 21.99% for Birt, and from 33.37% to 38.90% for Tomcat.

Download Full-text

Towards a Historical Treebank of Middle and Early Modern Welsh, Part I: Workflow and POS Tagging

Journal of Celtic Linguistics ◽

10.16922/jcl.22.6 ◽

2021 ◽

Vol 22 (1) ◽

pp. 125-154

Author(s):

Marieke Meelen ◽

David Willis

Keyword(s):

Early Modern ◽

Pos Tagging ◽

Part Of Speech ◽

European Languages ◽

Welsh Language ◽

Word Boundaries ◽

Exhaustive Extraction ◽

Grammatical Structures

This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to provide researchers with a tool for automatic exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts in a way comparable to similar tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. In this paper, the two first stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging are discussed in some detail, focusing in particular on major issues involved in defining word boundaries and in defining a robust and useful tagset.

Download Full-text

Coordinating nominal compounds: Universal vs. areal tendencies

Linguistics ◽

10.1515/ling-2018-0025 ◽

2018 ◽

Vol 56 (6) ◽

pp. 1197-1243 ◽

Cited By ~ 3

Author(s):

Giorgio Francesco Arcodia

Keyword(s):

Cultural Context ◽

The Other ◽

Irian Jaya ◽

Oxford University ◽

European Languages ◽

Word Forms ◽

Nominal Compounds ◽

Noun Compounds ◽

The One ◽

Referential Function

AbstractCoordinating compounds, i.e. complex word forms in which the constituent lexemes are in a coordination relation, may be divided into two classes: hyperonymic, in which the referent of the whole compound is the “sum” of the meanings of the constituent lexemes (Korowaiyumdefól‘(her) husband-wife, couple’; van Enk, Gerrit J., & Lourens de Vries. 1997.The Korowai of Irian Jaya: Their language in its cultural context. Oxford: Oxford University Press: 66), and hyponymic, where the compound designates a single referent having features of all the constituents (Englishactor-director). It has been proposed that languages choose either type as the one with the “tightest” marking pattern; whereas the crosslinguistic tendency is to have tighter hyperonymic compounds, most languages of Europe rather have tighter hyponymic compounds (Arcodia, Giorgio Francesco, Nicola Grandi, & Bernhard Wälchli 2010. Coordination in compounding. In Sergio Scalise & Irene Vogel (eds.),Cross-disciplinary issues in compounding, 177–198. Amsterdam & Philadelphia: John Benjamins). In this paper, we will test this assumption on noun-noun compounds in a sample of 20 Standard Average European languages and in a balanced sample of 60 non-SAE languages, arguing that the preference for hyperonymic compounds is best explained by the default referential function of nouns; in hyponymic compounds, on the other hand, nouns are used to indicate properties. We will then compare nominal and adjectival coordinating compounds, showing that for the latter the hyponymic compounding pattern is the dominant one, as adjectives are prototypical property-denoting words.

Download Full-text