Semantic Relatedness Estimation using the Layout Information of Wikipedia Articles

Author(s):  
Patrick Chan ◽  
Yoshinori Hijikata ◽  
Toshiya Kuramochi ◽  
Shogo Nishida

Computing the semantic relatedness between two words or phrases is an important problem in fields such as information retrieval and natural language processing. Explicit Semantic Analysis (ESA), a state-of-the-art approach to this problem, uses word frequency to estimate relevance, so the relevance of low-frequency words cannot always be estimated well. To improve the relevance estimates for low-frequency words and concepts, the authors apply regression to word frequency, a word's location in an article, and its text style to calculate relevance. The relevance value is then used to compute semantic relatedness. Empirical evaluation shows that, for low-frequency words, the authors' method yields better estimates of semantic relatedness than ESA. Furthermore, when all words in the dataset are considered, the combination of the proposed method and the conventional approach outperforms the conventional approach alone.
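
A minimal sketch of the idea, assuming a linear regression over hypothetical layout features (term frequency, presence in the title or first paragraph, bold styling); the data layout, feature names, and weights below are illustrative and not the authors' implementation:

```python
# Sketch (not the authors' code): replace raw term frequency in an ESA-style
# concept vector with a regression-based relevance estimate that also uses
# layout features of the Wikipedia article. Feature set and weights are assumed.
import numpy as np

def relevance(tf, in_title, in_first_paragraph, is_bold, weights):
    """Linear-regression-style relevance of a word within one article."""
    x = np.array([tf, in_title, in_first_paragraph, is_bold], dtype=float)
    return float(weights @ x)

def concept_vector(word, articles, weights):
    """ESA-style interpretation vector: one relevance score per Wikipedia concept.
    `articles` maps article title -> {word: (tf, in_title, in_first_para, is_bold)}."""
    return np.array([relevance(*articles[a].get(word, (0, 0, 0, 0)), weights)
                     for a in sorted(articles)])

def semantic_relatedness(w1, w2, articles, weights):
    """Cosine similarity of the two words' concept vectors."""
    v1 = concept_vector(w1, articles, weights)
    v2 = concept_vector(w2, articles, weights)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1 @ v2 / denom) if denom else 0.0
```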

Author(s):  
Khaoula Mrhar ◽  
Mounia Abik

Explicit Semantic Analysis (ESA) is an approach for measuring the semantic relatedness between terms or documents based on their similarity to documents of a reference corpus, usually Wikipedia. ESA has received tremendous attention in natural language processing (NLP) and information retrieval. However, ESA relies on a huge Wikipedia index matrix in its interpretation step, multiplying a large matrix by a term vector to produce a high-dimensional concept vector. Consequently, the interpretation and similarity steps are expensive, and much time is lost in unnecessary operations. This paper proposes an enhancement of ESA, called optimize-ESA, that reduces the dimensionality at the interpretation stage by computing semantic similarity within a specific domain. The experimental results show clearly that our method correlates much better with human judgement than the full ESA approach.
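
The interpretation step and the proposed domain restriction can be sketched roughly as follows; the matrix layout and variable names are assumptions made for illustration, not the paper's code:

```python
# Sketch of the idea behind domain-restricted ESA interpretation: keep only the
# concept columns of the Wikipedia index matrix that belong to the target domain,
# then interpret and compare as usual. Shapes and names are assumptions.
import numpy as np
from scipy.sparse import csr_matrix

def interpret(term_vector, index_matrix):
    """Full ESA: concept vector = (terms x concepts)^T . term vector."""
    return index_matrix.T @ term_vector

def interpret_domain(term_vector, index_matrix, domain_concept_ids):
    """Optimized variant: multiply against the domain's concept columns only."""
    return index_matrix[:, domain_concept_ids].T @ term_vector

def relatedness(u, v):
    """Cosine similarity between two concept vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```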


Author(s):  
Radha Guha

Background: In the era of information overload, it is very difficult for a human reader to quickly make sense of the vast information available on the internet. Even for a specific domain, such as a college or university website, it may be difficult for a user to browse through all the links to find relevant answers quickly. Objective: In this scenario, the design of a chat-bot that can answer questions about college information and compare colleges is both useful and novel. Methods: This paper designs and implements a novel conversational chat-bot application with information retrieval and text summarization skills. First, the chat-bot has a simple dialogue skill: when it understands the user's query intent, it responds from a stored collection of answers. Second, for unknown queries, the chat-bot can search the internet and then perform text summarization using advanced techniques of natural language processing (NLP) and text mining (TM). Results: The NLP capabilities for information retrieval and text summarization using the machine learning techniques Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Word2Vec, Global Vectors (GloVe), and TextRank are reviewed and compared before they are implemented in the chat-bot. The chat-bot improves user experience considerably by answering specific queries concisely, which takes less time than reading the entire document. Students, parents, and faculty can obtain a variety of information, such as admission criteria, fees, course offerings, notice board, attendance, grades, placements, faculty profiles, research papers, and patents, more efficiently. Conclusion: The purpose of this paper is to follow the advancement of NLP technologies and implement them in a novel application.
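
As one concrete example of the summarization techniques reviewed above, a minimal TextRank-style extractive summarizer can be sketched as follows; this is a generic illustration, not the authors' implementation, and the sentence splitting is deliberately naive:

```python
# Minimal TextRank-style extractive summarizer: score sentences by PageRank over
# a sentence-to-sentence TF-IDF similarity graph and keep the top-ranked ones.
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text, n_sentences=3):
    sentences = [s.strip() for s in text.split('.') if s.strip()]  # naive splitting
    if len(sentences) <= n_sentences:
        return '. '.join(sentences) + '.'
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)            # similarity graph over sentences
    np.fill_diagonal(sim, 0.0)                # no self-loops
    scores = nx.pagerank(nx.from_numpy_array(sim))
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n_sentences])
    return '. '.join(sentences[i] for i in top) + '.'
```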


2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated different paper sessions and the 5th edition of the CL-SciSumm Shared Task.


2021 ◽  
Vol 47 (05) ◽  
Author(s):  
NGUYỄN CHÍ HIẾU

Knowledge graphs have been applied in many fields in recent years, such as search engines, semantic analysis, and question answering. However, building knowledge graphs still faces many obstacles in terms of methodologies, data, and tools. This paper introduces a novel methodology for building a knowledge graph from heterogeneous documents. We use natural language processing and deep learning methodologies to build this graph. The knowledge graph can be used in question answering systems and information retrieval, especially in the computing domain.
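
One common way to realise such a pipeline, sketched here purely for illustration and not taken from the paper, is to extract subject-verb-object triples from dependency parses and store them as edges of a graph; the spaCy model name below is an assumption:

```python
# Rough sketch of a document-to-knowledge-graph pipeline (not the paper's method):
# extract subject-verb-object triples from spaCy dependency parses and store them
# as labelled edges in a directed multigraph.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def extract_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "dative")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.lemma_, token.lemma_, o.lemma_))
    return triples

def build_graph(documents):
    graph = nx.MultiDiGraph()
    for doc in documents:
        for subj, rel, obj in extract_triples(doc):
            graph.add_edge(subj, obj, relation=rel)
    return graph
```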


2011 ◽  
Vol 29 (2) ◽  
pp. 1-34 ◽  
Author(s):  
Ofer Egozi ◽  
Shaul Markovitch ◽  
Evgeniy Gabrilovich

2020 ◽  
Author(s):  
Yuqi Kong ◽  
Fanchao Meng ◽  
Ben Carterette

Comparing document semantics is one of the toughest tasks in both natural language processing and information retrieval. To date, tools for this task remain rare, and most relevant methods are devised from the statistical or vector-space-model perspectives, while nearly none takes a topological perspective. In this paper, we hope to offer a different perspective. We propose a novel algorithm based on topological persistence for comparing the semantic similarity between two documents. Our experiments are conducted on a document dataset with human judges' results, and a collection of state-of-the-art methods is selected for comparison. The experimental results show that our algorithm produces highly human-consistent results and beats most state-of-the-art methods, though it ties with NLTK.
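
A heavily simplified illustration of the general idea, not the paper's algorithm: the 0-dimensional persistence lifetimes of a point cloud coincide with single-linkage merge distances, so a rough document comparison can be sketched with SciPy alone. The padding-based comparison below is an assumption for illustration, not a standard persistence-diagram distance:

```python
# Simplified topological-persistence sketch: treat each document as a point cloud
# of word vectors, read off 0-dimensional persistence (single-linkage merge
# heights), and compare the sorted persistence values of two documents.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def persistence_0d(word_vectors):
    """Death times of 0-dim features for the document's word-vector point cloud."""
    merges = linkage(pdist(word_vectors), method="single")
    return np.sort(merges[:, 2])             # merge heights = feature death times

def document_distance(vectors_a, vectors_b):
    """Euclidean distance between the two (zero-padded) persistence profiles."""
    pa, pb = persistence_0d(vectors_a), persistence_0d(vectors_b)
    n = max(len(pa), len(pb))
    pa = np.pad(pa, (0, n - len(pa)))
    pb = np.pad(pb, (0, n - len(pb)))
    return float(np.linalg.norm(pa - pb))
```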


1999 ◽  
Vol 26 (2) ◽  
pp. 261-294 ◽  
Author(s):  
JUDITH A. GIERUT ◽  
MICHELE L. MORRISETTE ◽  
ANNETTE HUST CHAMPION

Lexical diffusion, as characterized by interword variation in production, was examined in phonological acquisition. The lexical variables of word frequency and neighbourhood density were hypothesized to facilitate sound change to varying degrees. Twelve children with functional phonological delays, aged 3;0 to 7;4, participated in an alternating treatments experiment to promote sound change. Independent variables were crossed to yield all logically possible combinations of high/low frequency and high/low density in treatment; the dependent measure was generalization accuracy in production. Results indicated that word frequency was most facilitative of sound change, whereas dense neighbourhood structure was least facilitative. The salience of frequency and avoidance of high density are discussed relative to the type of phonological change being induced in children's grammars, either phonetic or phonemic, and to the nature of children's representations. Results are further interpreted with reference to interactive models of language processing and optimality theoretic accounts of linguistic structure.


2011 ◽  
Vol 23 (9) ◽  
pp. 2432-2446 ◽  
Author(s):  
Paul Hoffman ◽  
Timothy T. Rogers ◽  
Matthew A. Lambon Ralph

Word frequency is a powerful predictor of language processing efficiency in healthy individuals and in computational models. Puzzlingly, frequency effects are often absent in stroke aphasia, challenging the assumption that word frequency influences the behavior of any computational system. To address this conundrum, we investigated divergent effects of frequency in two comprehension-impaired patient groups. Patients with semantic dementia have degraded conceptual knowledge as a consequence of anterior temporal lobe atrophy and show strong frequency effects. Patients with multimodal semantic impairments following stroke (semantic aphasia [SA]), in contrast, show little or no frequency effect. Their deficits arise from impaired control processes that bias activation toward task-relevant aspects of knowledge. We hypothesized that high-frequency words exert greater demands on cognitive control because they are more semantically diverse: they tend to appear in a broader range of linguistic contexts and have more variable meanings. Using latent semantic analysis, we developed a new measure of semantic diversity that reflects the variability of a word's meaning across different contexts. Frequency, but not diversity, was a significant predictor of comprehension in semantic dementia, whereas diversity was the best predictor of performance in SA. Most importantly, SA patients did show typical frequency effects, but only when the influence of diversity was taken into account. These results are consistent with the view that higher-frequency words place higher demands on control processes, so that when control processes are damaged the intrinsic processing advantages associated with higher-frequency words are masked.
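
A rough sketch of how such an LSA-based diversity measure could be computed; the corpus handling, dimensionality, and the negative-log formula below are assumptions made for illustration, not the authors' exact procedure:

```python
# Sketch of an LSA-based semantic diversity score: a word that appears in many
# mutually dissimilar contexts receives a high diversity value.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def semantic_diversity(word, contexts, n_components=100):
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(contexts)           # contexts x terms
    n_components = min(n_components, min(counts.shape) - 1)
    lsa = TruncatedSVD(n_components=n_components).fit_transform(counts)
    col = vectorizer.vocabulary_.get(word)
    if col is None:
        return None                                       # word not in corpus
    rows = counts[:, col].nonzero()[0]                    # contexts containing the word
    if len(rows) < 2:
        return None                                       # too few contexts to compare
    sims = cosine_similarity(lsa[rows])
    mean_sim = sims[np.triu_indices_from(sims, k=1)].mean()
    return float(-np.log(max(mean_sim, 1e-6)))            # higher = more diverse contexts
```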

