Distributional lexical semantics: Toward uniform representation paradigms for advanced acquisition and processing tasks

2010 ◽  
Vol 16 (4) ◽  
pp. 347-358 ◽  
Author(s):  
R. BASILI ◽  
M. PENNACCHIOTTI

The distributional hypothesis states that words with similar distributional properties have similar semantic properties (Harris 1968). This perspective on word semantics was discussed early in linguistics (Firth 1957; Harris 1968) and then successfully applied to Information Retrieval (Salton, Wong and Yang 1975). In Information Retrieval, distributional notions (e.g. document frequency and word co-occurrence counts) have proved a key factor of success, as opposed to early logic-based approaches to relevance modeling (van Rijsbergen 1986; Chiaramella and Chevallet 1992; van Rijsbergen and Lalmas 1996).

2020 ◽  
Vol 73 (3) ◽  
pp. 363-401
Author(s):  
Francesca Di Garbo

Number systems can be morphosemantic or morphosyntactic, based on whether number marking is restricted to nouns or also extends to noun-associated forms, such as adnominal modifiers, predicates, and pronouns. While it is well known that asymmetries in the distribution of plural marking on nouns can be due to lexico-semantic properties such as animacy and/or inherent number, the question of whether these properties also affect patterns of plural agreement has been less broadly investigated. This paper examines the distribution of plural agreement in 24 Cushitic (Afro-Asiatic) languages. The number systems of the languages of the sample are classified into three types, ranging from radically morphosemantic (Type 1) to radically morphosyntactic (Type 2). A subset of languages displays a combination of morphosemantic and morphosyntactic strategies, and thus qualifies as a mixed type (Type 3). In these languages, the distribution of plural agreement is largely lexically specified: nouns denoting groups, masses, and collections are more likely to trigger plural agreement than other types of nouns. These results thus show that, similarly to the nominal domain, the lexical semantics of nouns may also affect plural marking on noun-associated forms. Furthermore, in Cushitic, radically morphosemantic and radically morphosyntactic number systems appear to be diachronically connected to each other, with the latter seemingly evolving from the former, as testified by ongoing variation and change in some of the sampled languages. The relevance of these findings for understanding the typology and evolution of number systems is discussed.


Author(s):  
Mariani Widia Putri ◽  
Achmad Muchayan ◽  
Made Kamisutara

Recommendation systems are currently a trend, as people increasingly rely on online transactions for various personal reasons. Recommendation systems offer an easier and faster route, so that users do not need to spend much time finding the items they want. Competition among businesses has changed as well, so businesses must change their approach in order to reach prospective customers, and a system that supports this is needed. In this study, the authors build a product recommendation system using the Content-Based Filtering method and Term Frequency Inverse Document Frequency (TF-IDF) from the Information Retrieval (IR) model, in order to obtain results that are efficient and that meet the need for a solution for improving Customer Relationship Management (CRM). The recommendation system is built and applied as a solution to increase customers' brand awareness and to minimize failed transactions caused by a lack of information that can be conveyed directly or offline. The data consist of 258 product codes, each of which has eight categories and 33 constituent keywords in accordance with the company's product knowledge. The TF-IDF calculation yields a weight value of 13.854 when displaying the first best product recommendation, and achieves an accuracy of 96.5% in giving pen recommendations.
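As a rough illustration of content-based filtering with TF-IDF, the sketch below weights the keywords of each product and ranks the other products by cosine similarity. It is a minimal example under stated assumptions, not the authors' implementation: the catalog, the keyword lists, the use of cosine similarity, and the function names `tfidf_vectors`, `cosine` and `recommend` are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each product name to a {keyword: tf-idf weight} dictionary.

    docs: dict of product name -> list of keywords (hypothetical data).
    Uses relative term frequency and idf = log(N / df).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each keyword once per product
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query_name, docs, top_k=2):
    """Return the top_k products most similar to query_name."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_name]
    scored = [
        (cosine(query, vec), name)
        for name, vec in vectors.items()
        if name != query_name
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [name for _, name in scored[:top_k]]


# Hypothetical catalog: product code -> descriptive keywords.
catalog = {
    "pen_a": ["ballpoint", "blue", "plastic", "retractable"],
    "pen_b": ["ballpoint", "black", "plastic", "retractable"],
    "pen_c": ["fountain", "black", "metal", "luxury"],
    "pen_d": ["gel", "red", "plastic", "capped"],
}

print(recommend("pen_a", catalog, top_k=2))
```

Here `pen_b` ranks first for `pen_a` because the two share three keywords, two of which are rare enough in the catalog to carry a substantial idf weight.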


Author(s):  
Eugene Santos Jr. ◽  
Hien Nguyen

In this chapter, we study and present our results on the problem of employing a cognitive user model for Information Retrieval (IR) in which a user's intent is captured and used for improving his/her effectiveness in an information seeking task. The user intent is captured by analyzing the commonality of the retrieved relevant documents. The effectiveness of our user model is evaluated with regard to retrieval performance using an evaluation methodology which allows us to compare with the existing approaches from the information retrieval community while assessing the new features offered by our user model. We compare our approach with the Ide dec-hi approach using term frequency-inverse document frequency weighting, which is considered to be the best traditional approach to relevance feedback. We use the CRANFIELD, CACM and MEDLINE collections, which are very popular in the information retrieval community, to evaluate relevance feedback techniques. The results show that our approach performs better in the initial runs and works competitively with Ide dec-hi in the feedback runs. Additionally, we evaluate the effects of our user modeling approach with human analysts. The results show that our approach retrieves more documents relevant to a specific analyst than a keyword-based information retrieval application called Verity Query Language.


2014 ◽  
Vol 11 (2) ◽  
pp. 24-45 ◽  
Author(s):  
Banage T. G. S. Kumara ◽  
Incheon Paik ◽  
Wuhui Chen ◽  
Keun Ho Ryu

Clustering Web services into functionally similar clusters is a very efficient approach to service discovery. A principal issue for clustering is computing the semantic similarity between services. Current approaches use similarity-distance measurement methods such as keyword-based, information-retrieval-based or ontology-based methods. These approaches have problems that include discovering semantic characteristics, loss of semantic information and a shortage of high-quality ontologies. In this paper, the authors present a method that first adopts ontology learning to generate ontologies via the hidden semantic patterns existing within complex terms. If calculating similarity using the generated ontology fails, it then applies an information-retrieval-based method. Another important issue is identifying the most suitable cluster representative. This paper proposes an approach to identifying the cluster center by combining service similarity with term frequency–inverse document frequency values of service names. Experimental results show that our term-similarity approach outperforms comparable existing approaches. They also demonstrate the positive effects of our cluster-center identification approach.


Author(s):  
Aakanksha Sharaff ◽  
Jitesh Kumar Dewangan ◽  
Dilip Singh Sisodia

Enormous volumes of records and data are gathered every day, and organizing this data is a challenging task. Topic modeling provides a way to categorize these documents, but the high dimensionality of the corpus affects the result of the topic model, making it important to apply a feature selection or information retrieval process for dimensionality reduction. Efficient topic modeling requires the removal of unrelated words that might otherwise lead to specious coexistence of unrelated words. This paper proposes an efficient framework for generating better topic coherence, in which term frequency-inverse document frequency (TF-IDF) and a parsimonious language model (PLM) are used for the information retrieval task. PLM extracts the important information and expels the general words from the corpus, whereas TF-IDF re-estimates the weightage of each word in the corpus. The work carried out in this paper improved the topic coherence measure to provide a better correlation between the actual topics and the topics generated from PLM.


2011 ◽  
Vol 2 (1) ◽  
Author(s):  
Yusuf Durachman

It is known that there are many alternatives in designing an IR system. How do we know which of these techniques are effective in which applications? Should we use stop lists? Should we stem? Should we use inverse document frequency weighting? Information retrieval has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections. This research presents the common (although numerous) measures that are widely used to evaluate the effectiveness of IR systems, and the test collections that are most often used for this purpose. It then presents the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results. This includes explaining the kinds of evaluation measures that are standardly used for document retrieval and related tasks like text classification, and why they are appropriate. This research can be valuable for those who want to do research in the field of IR. Keywords: Information Retrieval, evaluation & measurement, Precision & Recall
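For unranked retrieval results, precision and recall can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below is a minimal illustration; the document identifiers and the function name `precision_recall` are hypothetical and not taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for an unranked result set.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    Returns 0.0 for a measure whose denominator would be zero.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three judged relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall of about 0.667).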


1989 ◽  
Vol 20 (1) ◽  
pp. 30-62
Author(s):  
Linda Schwartz

This paper investigates "asymmetric coordination" in Nigerian Hausa. A range of constructions is presented in which asymmetric coordination occurs, and their syntactic and semantic properties are established. A "regularizing" analysis is considered, in which asymmetric coordination is represented as a symmetric coordination headed by an empty category, but this is rejected due to the exceptional distributional properties which would have to be assumed for the construction. An interpretive analysis is proposed which has the effect of incorporating the feature information of an independent NP marked by a "linker" into a dependent plural argument, and symmetric and asymmetric coordination are distinguished as involving interpretive operations of set union and set unification, respectively.


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0254937
Author(s):  
Serhad Sarica ◽  
Jianxi Luo

There are increasing applications of natural language processing techniques for information retrieval, indexing, topic modelling and text classification in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopwords lists that are derived from non-technical resources, the technical jargon of engineering fields contains its own highly frequent and uninformative words, and there exists no standard stopwords list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative statistical measures such as term frequency, inverse document frequency, and entropy, and curating a stopwords dataset ready for technical language processing applications.
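The three statistics named above can be computed per term as in the sketch below. This is only an assumed formulation for illustration: the abstract does not specify how the authors define or combine the measures, so the entropy definition (Shannon entropy of a term's distribution across documents), the toy corpus, and the function name `term_statistics` are hypothetical.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Compute term frequency, idf and cross-document entropy per term.

    docs: list of token lists (hypothetical corpus).
    tf      = total occurrences of the term in the corpus
    idf     = log(N / number of documents containing the term)
    entropy = -sum_d p(d|t) * log2 p(d|t), where p(d|t) is the share of
              the term's occurrences that fall in document d
    """
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    stats = {}
    for term, tf in totals.items():
        per_doc = [c[term] for c in counts if term in c]
        idf = math.log(n / len(per_doc))
        entropy = sum(-(k / tf) * math.log2(k / tf) for k in per_doc)
        stats[term] = {"tf": tf, "idf": idf, "entropy": entropy}
    return stats


# Hypothetical toy corpus of three tokenized engineering sentences.
corpus = [
    ["the", "system", "uses", "a", "valve"],
    ["the", "system", "has", "a", "sensor"],
    ["the", "system", "drives", "a", "motor"],
]

stats = term_statistics(corpus)
for term in ("system", "valve"):
    print(term, stats[term])
```

Under these assumptions, a term such as "system" that appears evenly in every document has an idf of zero and maximal entropy, which marks it as a stopword candidate, while "valve" appears in a single document and has zero entropy.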

