Quantitative Semantics and Soft Computing Methods for the Web
Latest Publications


Total documents: 11 (five years: 0)

H-index: 0 (five years: 0)

Published by: IGI Global

ISBN: 9781609608811, 9781609608828

Author(s): Anna Lisa Gentile, Ziqi Zhang, Fabio Ciravegna

This chapter proposes a novel Semantic Relatedness (SR) measure that exploits diverse features extracted from a knowledge resource. Computing SR is a crucial technique for many complex Natural Language Processing (NLP) and Semantic Web tasks. Typically, semantic relatedness measures make use of only a limited number of features, without considering diverse feature sets or the different contributions of features to the accuracy of a method. This chapter proposes a method based on a random graph walk model that naturally combines diverse features extracted from a knowledge resource, in a balanced way, for the task of computing semantic relatedness. A set of experiments is carefully designed to investigate the effects of choosing different features and altering their weights on the accuracy of the system. Next, using the derived feature sets and feature weights, the authors evaluate the proposed method against state-of-the-art semantic relatedness measures and show that it obtains higher accuracy on many benchmark datasets. Additionally, the authors demonstrate the usefulness of the proposed method in a practical NLP task, namely Named Entity Disambiguation.
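As an illustration only (not the authors' implementation), the following minimal Python sketch runs a fixed-length random walk over a toy feature graph and reads off a relatedness score; the node names, edge weights, and walk lengths are all assumptions.

```python
from collections import defaultdict

# Toy feature graph: nodes are concepts, weighted edges encode how strongly a
# feature (e.g. category overlap, gloss overlap) links two nodes.  All values
# below are illustrative assumptions.
edges = {
    ("cat", "feline"): 0.9,
    ("cat", "pet"): 0.6,
    ("dog", "pet"): 0.7,
    ("dog", "canine"): 0.9,
    ("feline", "mammal"): 0.8,
    ("canine", "mammal"): 0.8,
}

graph = defaultdict(dict)
for (a, b), w in edges.items():      # treat the graph as undirected
    graph[a][b] = w
    graph[b][a] = w

def walk_distribution(start, steps=2):
    """Probability of reaching each node from `start` after a fixed number of
    random-walk steps, moving proportionally to edge weights."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in dist.items():
            total = sum(graph[node].values())
            for nbr, w in graph[node].items():
                nxt[nbr] += p * w / total
        dist = dict(nxt)
    return dist

def relatedness(a, b, steps=2):
    """Symmetric relatedness: how much probability mass each word's walk
    places on the other word."""
    return (walk_distribution(a, steps).get(b, 0.0)
            + walk_distribution(b, steps).get(a, 0.0))

print(relatedness("cat", "dog"))              # related through the shared "pet" node
print(relatedness("cat", "canine", steps=3))  # indirect relation needs a longer walk
```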


Author(s): Eduardo H. Ramírez, Ramón F. Brena

Finally, in order to compare the QTM results with models generated by other methods, the authors developed probabilistic metrics that formalize the notion of semantic coherence and can be used to validate overlapping and incomplete clusterings on multi-labeled corpora. They show that the proposed method can produce models of quality comparable to, or even better than, those produced with state-of-the-art probabilistic methods.


Author(s): René Arnulfo García-Hernández, J. Fco. Martínez-Trinidad, J. Ariel Carrasco-Ochoa

This chapter introduces maximal sequential patterns, how to extract them, and some applications of maximal sequential patterns to document processing and web content mining. The main objective of this chapter is to show that maximal sequential patterns preserve document semantics and can therefore be a good alternative to word and n-gram models. First, the chapter introduces the problem of maximal sequential pattern mining when the data are sequential chains of words. Next, it defines several basic concepts and the problem of maximal sequential pattern mining in text documents. Then, it presents two algorithms proposed by the authors for efficiently finding maximal sequential patterns in text documents. Additionally, it describes the use of maximal sequential patterns as a quantitative semantic tool for solving different problems related to document processing and web content mining. Finally, it presents some future research directions and conclusions.
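As an informal illustration (not the two algorithms presented in the chapter), the sketch below enumerates frequent contiguous word sequences in a tiny corpus and keeps only the maximal ones, i.e. those not contained in any longer frequent sequence; the corpus and the support threshold are assumptions.

```python
from collections import Counter

# Tiny illustrative corpus; each document is a sequence (chain) of words.
docs = [
    "the big data mining task".split(),
    "sequential pattern mining for big data".split(),
    "big data mining is a mining task".split(),
]

MIN_SUPPORT = 2  # a pattern must occur in at least this many documents

def frequent_sequences(docs, min_support):
    """All contiguous word sequences whose document frequency >= min_support."""
    frequent = {}
    max_len = max(len(d) for d in docs)
    for n in range(1, max_len + 1):
        support = Counter()
        for doc in docs:
            seen = {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}
            support.update(seen)          # count each pattern once per document
        level = {p: c for p, c in support.items() if c >= min_support}
        if not level:
            break                         # longer sequences cannot be frequent
        frequent.update(level)
    return frequent

def maximal_patterns(frequent):
    """Keep only patterns that are not a contiguous subsequence of a longer one."""
    def contained(p, q):
        return len(p) < len(q) and any(q[i:i + len(p)] == p
                                       for i in range(len(q) - len(p) + 1))
    return {p: c for p, c in frequent.items()
            if not any(contained(p, q) for q in frequent)}

freq = frequent_sequences(docs, MIN_SUPPORT)
for pattern, support in sorted(maximal_patterns(freq).items(),
                               key=lambda it: len(it[0]), reverse=True):
    print(" ".join(pattern), "-> support", support)
```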


Author(s): Gloria Bordogna, Alessandro Campi, Giuseppe Psaila, Stefania Ronchi

In this chapter, the authors propose a novel multi-granular framework for visualizing and exploring the results of a complex search process, performed by a user who submits several queries to possibly distinct search engines. The primary aim of the approach is to supply users with summaries, at distinct levels of detail, of the results of a search process. It applies dynamic clustering to the results in each ordered list retrieved by a search engine evaluating a user's query. The single retrieved items, the clusters so identified, and the single retrieved lists are considered as dealing with topics at distinct levels of granularity, from the finest level to the coarsest one, respectively. Implicit topics are revealed by associating labels with the retrieved items, the clusters, and the retrieved lists. Then, some manipulation operators, defined in this chapter, are applied to each pair of retrieved lists, clusters, and single items to reveal their implicit relationships. These relationships have a semantic nature, since they are labeled so as to approximately represent the shared documents and the shared sub-topics between each pair of combined elements. Finally, both the topics retrieved by the distinct searches and their relationships are represented through multi-granular graphs, which represent the retrieved topics at three distinct levels of granularity. The results can be explored by expanding the graph nodes to see their contents, and by expanding the edges to see their shared contents and common sub-topics.
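As a toy illustration of the kind of manipulation operator described (an assumption, not the chapter's actual operators), the sketch below intersects two labeled result groups and returns their shared documents together with a label built from their shared terms.

```python
# Each result group (a retrieved list or a cluster) is a set of document ids
# plus a bag of label terms; both groups below are illustrative assumptions.
group_a = {"docs": {"d1", "d2", "d3", "d5"},
           "label": {"semantic", "web", "search"}}
group_b = {"docs": {"d2", "d3", "d4"},
           "label": {"web", "search", "clustering"}}

def intersect(a, b):
    """A simple 'shared content' operator: documents common to both groups,
    labeled with the terms the two group labels have in common."""
    return {"docs": a["docs"] & b["docs"],
            "label": a["label"] & b["label"]}

shared = intersect(group_a, group_b)
print("shared documents:", sorted(shared["docs"]))          # -> ['d2', 'd3']
print("shared sub-topic label:", sorted(shared["label"]))   # -> ['search', 'web']
```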


Author(s): Davide Magatti, Fabio Stella

A software system for topic discovery and document tagging is described. The system discovers the topics hidden in a given document collection, labels them according to a user-supplied taxonomy, and tags new documents. It implements an information-processing pipeline consisting of document preprocessing, topic extraction, automatic labeling of topics, and multi-label document classification. The preprocessing module allows several kinds of documents to be imported and offers different document representations: binary, term frequency, and term frequency-inverse document frequency. The topic extraction module is implemented through a proprietary version of the Latent Dirichlet Allocation model; the optimal number of topics is selected through hierarchical clustering. The topic labeling module optimizes a set of similarity measures defined over the user-supplied taxonomy and is implemented through an algorithm over a topic tree. The document tagging module solves a multi-label classification problem through multi-net Naïve Bayes without the need to perform any additional learning tasks.
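Since the chapter's topic model and labeling algorithm are proprietary, the following is only a rough sketch of a comparable pipeline built from off-the-shelf scikit-learn components; the corpus, taxonomy labels, and number of topics are placeholders, and plain LDA plus one-vs-rest Naive Bayes stand in for the proprietary LDA variant and the multi-net classifier.

```python
# Rough stand-in for the described pipeline: term-frequency / tf-idf
# preprocessing, LDA topic extraction, and multi-label Naive Bayes tagging.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["stock markets fell sharply today",
        "the team won the championship game",
        "new vaccine trial shows promising results",
        "central bank raises interest rates"]
tags = [["finance"], ["sports"], ["health"], ["finance", "policy"]]  # placeholder taxonomy labels

# Preprocessing: term-frequency counts for LDA.
tf = CountVectorizer(stop_words="english")
X_tf = tf.fit_transform(docs)

# Topic extraction: plain LDA with a hand-fixed number of topics, whereas the
# chapter selects the number of topics via hierarchical clustering.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X_tf)

# Document tagging: one-vs-rest Naive Bayes over tf-idf features as a
# stand-in for the chapter's multi-net Naive Bayes classifier.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)
clf = OneVsRestClassifier(MultinomialNB()).fit(X_tfidf, y)

new_doc = ["interest rates and markets"]
print(mlb.inverse_transform(clf.predict(tfidf.transform(new_doc))))
```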


Author(s): Pavel Makagonov, Celia B. Reyes E., Grigori Sidorov

The main idea of the authors' research is to perform quantitative analysis of a text collection during the process of its preparation and transformation into a digital library for a website. As a case study, they use the digital library of the website on Mixtec culture that they maintain. The authors propose using the concept of the text document search image (TDSI). To create TDSIs, they analyze word frequencies in the documents and distinguish between the Zipf distribution that is typical for meaningful words and distributions approximated by an ellipse that are typical for auxiliary words. The authors also describe some analogies of these distributions in architecture and urban planning. They describe a toolkit, DDL, that supports TDSI creation, and show its application to the mentioned website and to a corpus of dialogs with a railway office information system.
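As a toy illustration of the frequency analysis described (not the DDL toolkit itself), the sketch below computes the rank-frequency curve of a small sample text and fits the slope in log-log space to see how closely it follows Zipf's law; the sample text is an assumption.

```python
import math
from collections import Counter

# Placeholder text; in practice this would be a document from the collection.
text = ("the mixtec culture the library the documents of the culture "
        "and the words of the documents and the frequencies")

freqs = Counter(text.split())
ranked = [count for _, count in freqs.most_common()]  # frequencies by rank

# Zipf's law predicts log(freq) ~ log(C) - s * log(rank) with s close to 1.
# Estimate the exponent s with an ordinary least-squares fit in log-log space.
xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
ys = [math.log(count) for count in ranked]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))

print("rank-frequency list:", ranked)
print("fitted Zipf exponent:", round(-slope, 2))
```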


Author(s): Adolfo Guzman-Arenas, Alma-Delia Cuevas

All observers are equally credible, so differences in their findings arise from perception errors.


Author(s): Hiram Calvo, Kentaro Inui, Yuji Matsumoto

Learning verb argument preferences has been approached as a verb-argument problem, or at most as a ternary relationship between subject, verb, and object. However, the simultaneous correlation of all arguments in a sentence has not been explored thoroughly for measuring sentence plausibility because of the increased number of potential combinations and data sparseness. In this work, the authors present a review of some common methods for learning argument preferences, beginning with the simplest case of binary co-relations, then comparing with ternary co-relations, and finally considering all arguments. For the latter, the authors use an ensemble machine learning model that combines discriminative and generative models, using co-occurrence features and semantic features in different arrangements. They seek to answer questions about the optimal number of topics required for PLSI and LDA models, as well as the number of co-occurrences that should be required to improve performance. They explore the implications of using different ways of projecting co-relations, i.e., into a word space or directly into a co-occurrence feature space. The authors conducted tests using a pseudo-disambiguation task, learning from large corpora extracted from the Internet.
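In a pseudo-disambiguation test, a model must prefer an attested argument over a randomly substituted confounder. The sketch below (an assumption, not the authors' ensemble model) scores verb-object pairs with raw co-occurrence counts and measures accuracy on a few hand-made test triples.

```python
from collections import Counter
import random

# Pretend training data: (verb, object) pairs observed in a corpus (assumed).
train_pairs = [("drink", "water"), ("drink", "coffee"), ("drink", "tea"),
               ("drive", "car"), ("drive", "truck"), ("eat", "bread"),
               ("eat", "apple"), ("read", "book"), ("read", "paper")]

counts = Counter(train_pairs)
nouns = sorted({obj for _, obj in train_pairs})

def score(verb, obj):
    """Plausibility score: raw co-occurrence count (a real model would smooth
    this, e.g. with PLSI/LDA topics or semantic classes)."""
    return counts[(verb, obj)]

# Pseudo-disambiguation: for each test (verb, true_obj), pick a random
# confounder noun and check whether the true object scores higher.
random.seed(0)
test = [("drink", "tea"), ("drive", "car"), ("eat", "apple"), ("read", "book")]
correct = 0
for verb, true_obj in test:
    confounder = random.choice([n for n in nouns if n != true_obj])
    if score(verb, true_obj) > score(verb, confounder):
        correct += 1
print("pseudo-disambiguation accuracy:", correct / len(test))
```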


Author(s): Zhihua Wei, Duoqian Miao, Ruizhi Wang, Zhifei Zhang

Text representation is the prerequisite of various document processing tasks, such as information retrieval, text classification, and text clustering. It has been studied intensively in recent years, and many excellent models have been designed. However, the performance of these models is affected by data sparseness. Existing smoothing techniques usually use statistical theory or linguistic information to assign a uniform distribution to absent words; they do not consider the real word distribution or distinguish between words. In this chapter, a method based on a soft computing theory, Tolerance Rough Set theory, is proposed: it uses the upper and lower approximations of Rough Set theory to assign different values to absent words in different approximation regions. Theoretically, the algorithms can estimate a smoothing value for absent words according to their relation to existing words. Text classification experiments using the Vector Space Model (VSM) and the Latent Dirichlet Allocation (LDA) model on public corpora show that the algorithms greatly improve the performance of the text representation model, especially on unbalanced corpora.
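A minimal sketch of the idea as the authors describe it (an interpretation, not the chapter's algorithm): build tolerance classes from term co-occurrence, take the upper approximation of a document's term set, and give absent terms in that approximation a small smoothed weight based on their overlap with the document.

```python
from collections import defaultdict

# Toy corpus of tokenized documents (assumed).
docs = [["stock", "market", "price"],
        ["market", "price", "trade"],
        ["trade", "stock", "bank"],
        ["football", "match", "goal"]]

CO_THRESHOLD = 2  # terms co-occurring in >= this many docs are "tolerant"

# Tolerance classes: for each term, the set of terms it tolerates (including itself).
cooc = defaultdict(int)
for doc in docs:
    terms = set(doc)
    for a in terms:
        for b in terms:
            if a != b:
                cooc[(a, b)] += 1
vocab = {t for doc in docs for t in doc}
tolerance = {t: {t} | {u for u in vocab if cooc[(t, u)] >= CO_THRESHOLD}
             for t in vocab}

def smoothed_vector(doc):
    """Present terms keep weight 1.0; absent terms in the upper approximation
    (i.e. tolerant to some present term) get a fractional weight proportional
    to how many of the document's terms they tolerate."""
    present = set(doc)
    vec = {t: 1.0 for t in present}
    for t in vocab - present:
        overlap = len(tolerance[t] & present)
        if overlap:                       # t is in the upper approximation
            vec[t] = overlap / len(tolerance[t])
    return vec

print(smoothed_vector(["stock", "market"]))  # "price" receives a smoothed weight
```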


Author(s): Sara Elena Garza Villarreal, Ramón Brena

An extensive review of existing literature on clustering and topic mining is given throughout the chapter as well.

