Quantitative Semantics and Soft Computing Methods for the Web
Latest Publications


Total documents: 11 (five years: 0)

H-index: 0 (five years: 0)

Published by: IGI Global

ISBN: 9781609608811, 9781609608828

Author(s): Anna Lisa Gentile, Ziqi Zhang, Fabio Ciravegna

This chapter proposes a novel Semantic Relatedness (SR) measure that exploits diverse features extracted from a knowledge resource. Computing SR is a crucial technique for many complex Natural Language Processing (NLP) and Semantic Web tasks. Typically, semantic relatedness measures make use of only a limited number of features, without considering diverse feature sets or the different contributions of features to the accuracy of a method. This chapter proposes a method based on a random graph walk model that naturally combines diverse features extracted from a knowledge resource, in a balanced way, for the task of computing semantic relatedness. A set of experiments is carefully designed to investigate the effects of choosing different features and altering their weights on the accuracy of the system. Next, using the derived feature sets and feature weights, the authors evaluate the proposed method against state-of-the-art semantic relatedness measures and show that it obtains higher accuracy on many benchmark datasets. Additionally, the authors demonstrate the usefulness of the proposed method in a practical NLP task, namely Named Entity Disambiguation.
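As an illustration only (not the authors' implementation), the following minimal Python sketch runs a fixed-length random walk over a toy feature graph and reads off a relatedness score; the node names, edge weights, and walk lengths are all assumptions.

```python
from collections import defaultdict

# Toy feature graph: nodes are concepts, weighted edges encode how strongly a
# feature (e.g. category overlap, gloss overlap) links two nodes.  All values
# below are illustrative assumptions.
edges = {
    ("cat", "feline"): 0.9,
    ("cat", "pet"): 0.6,
    ("dog", "pet"): 0.7,
    ("dog", "canine"): 0.9,
    ("feline", "mammal"): 0.8,
    ("canine", "mammal"): 0.8,
}

graph = defaultdict(dict)
for (a, b), w in edges.items():      # treat the graph as undirected
    graph[a][b] = w
    graph[b][a] = w

def walk_distribution(start, steps=2):
    """Probability of reaching each node from `start` after a fixed number of
    random-walk steps, moving proportionally to edge weights."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in dist.items():
            total = sum(graph[node].values())
            for nbr, w in graph[node].items():
                nxt[nbr] += p * w / total
        dist = dict(nxt)
    return dist

def relatedness(a, b, steps=2):
    """Symmetric relatedness: how much probability mass each word's walk
    places on the other word."""
    return (walk_distribution(a, steps).get(b, 0.0)
            + walk_distribution(b, steps).get(a, 0.0))

print(relatedness("cat", "dog"))              # related through the shared "pet" node
print(relatedness("cat", "canine", steps=3))  # indirect relation needs a longer walk
```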


Author(s): Eduardo H. Ramírez, Ramón F. Brena

Finally, in order to compare the QTM results with models generated by other methods, the authors developed probabilistic metrics that formalize the notion of semantic coherence and can be used to validate overlapping and incomplete clusterings on multi-labeled corpora. They show that the proposed method can produce models of quality comparable to, or even better than, those produced with state-of-the-art probabilistic methods.


Author(s): René Arnulfo García-Hernández, J. Fco. Martínez-Trinidad, J. Ariel Carrasco-Ochoa

This chapter introduces maximal sequential patterns, how to extract them, and some applications of maximal sequential patterns to document processing and web content mining. The main objective of this chapter is to show that maximal sequential patterns preserve document semantics and can therefore be a good alternative to word and n-gram models. First, the chapter introduces the problem of maximal sequential pattern mining when the data are sequential chains of words. Next, it defines several basic concepts and the problem of maximal sequential pattern mining in text documents. Then, it presents two algorithms proposed by the authors for efficiently finding maximal sequential patterns in text documents. Additionally, it describes the use of maximal sequential patterns as a quantitative semantic tool for solving different problems related to document processing and web content mining. Finally, it presents some future research directions and conclusions.
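As an informal illustration (not the two algorithms presented in the chapter), the sketch below enumerates frequent contiguous word sequences in a tiny corpus and keeps only the maximal ones, i.e. those not contained in any longer frequent sequence; the corpus and the support threshold are assumptions.

```python
from collections import Counter

# Tiny illustrative corpus; each document is a sequence (chain) of words.
docs = [
    "the big data mining task".split(),
    "sequential pattern mining for big data".split(),
    "big data mining is a mining task".split(),
]

MIN_SUPPORT = 2  # a pattern must occur in at least this many documents

def frequent_sequences(docs, min_support):
    """All contiguous word sequences whose document frequency >= min_support."""
    frequent = {}
    max_len = max(len(d) for d in docs)
    for n in range(1, max_len + 1):
        support = Counter()
        for doc in docs:
            seen = {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}
            support.update(seen)          # count each pattern once per document
        level = {p: c for p, c in support.items() if c >= min_support}
        if not level:
            break                         # longer sequences cannot be frequent
        frequent.update(level)
    return frequent

def maximal_patterns(frequent):
    """Keep only patterns that are not a contiguous subsequence of a longer one."""
    def contained(p, q):
        return len(p) < len(q) and any(q[i:i + len(p)] == p
                                       for i in range(len(q) - len(p) + 1))
    return {p: c for p, c in frequent.items()
            if not any(contained(p, q) for q in frequent)}

freq = frequent_sequences(docs, MIN_SUPPORT)
for pattern, support in sorted(maximal_patterns(freq).items(),
                               key=lambda it: len(it[0]), reverse=True):
    print(" ".join(pattern), "-> support", support)
```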


Author(s): Gloria Bordogna, Alessandro Campi, Giuseppe Psaila, Stefania Ronchi

In this chapter, the authors propose a novel multi-granular framework for visualizing and exploring the results of a complex search process, performed by a user who submits several queries to possibly distinct search engines. The primary aim of the approach is to supply users with summaries, at distinct levels of detail, of the results of a search process. It applies dynamic clustering to the results in each ordered list retrieved by a search engine evaluating a user's query. The single retrieved items, the clusters so identified, and the single retrieved lists are considered as dealing with topics at distinct levels of granularity, from the finest level to the coarsest one, respectively. Implicit topics are revealed by associating labels with the retrieved items, the clusters, and the retrieved lists. Then, some manipulation operators, defined in this chapter, are applied to each pair of retrieved lists, clusters, and single items to reveal their implicit relationships. These relationships have a semantic nature, since they are labeled so as to approximately represent the shared documents and the shared sub-topics between each pair of combined elements. Finally, both the topics retrieved by the distinct searches and their relationships are represented through multi-granular graphs, which represent the retrieved topics at three distinct levels of granularity. The results can be explored by expanding the graph nodes to see their contents, and by expanding the edges to see their shared contents and common sub-topics.
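As a toy illustration of the kind of manipulation operator described (an assumption, not the chapter's actual operators), the sketch below intersects two labeled result groups and returns their shared documents together with a label built from their shared terms.

```python
# Each result group (a retrieved list or a cluster) is a set of document ids
# plus a bag of label terms; both groups below are illustrative assumptions.
group_a = {"docs": {"d1", "d2", "d3", "d5"},
           "label": {"semantic", "web", "search"}}
group_b = {"docs": {"d2", "d3", "d4"},
           "label": {"web", "search", "clustering"}}

def intersect(a, b):
    """A simple 'shared content' operator: documents common to both groups,
    labeled with the terms the two group labels have in common."""
    return {"docs": a["docs"] & b["docs"],
            "label": a["label"] & b["label"]}

shared = intersect(group_a, group_b)
print("shared documents:", sorted(shared["docs"]))          # -> ['d2', 'd3']
print("shared sub-topic label:", sorted(shared["label"]))   # -> ['search', 'web']
```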


Author(s): Davide Magatti, Fabio Stella

A software system for topic discovery and document tagging is described. The system discovers the topics hidden in a given document collection, labels them according to a user-supplied taxonomy, and tags new documents. It implements an information-processing pipeline consisting of document preprocessing, topic extraction, automatic labeling of topics, and multi-label document classification. The preprocessing module allows several kinds of documents to be imported and offers different document representations: binary, term frequency, and term frequency-inverse document frequency. The topic extraction module is implemented through a proprietary version of the Latent Dirichlet Allocation model; the optimal number of topics is selected through hierarchical clustering. The topic labeling module optimizes a set of similarity measures defined over the user-supplied taxonomy and is implemented through an algorithm over a topic tree. The document tagging module solves a multi-label classification problem through multi-net Naïve Bayes without the need to perform any additional learning tasks.
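Since the chapter's topic model and labeling algorithm are proprietary, the following is only a rough sketch of a comparable pipeline built from off-the-shelf scikit-learn components; the corpus, taxonomy labels, and number of topics are placeholders, and plain LDA plus one-vs-rest Naive Bayes stand in for the proprietary LDA variant and the multi-net classifier.

```python
# Rough stand-in for the described pipeline: term-frequency / tf-idf
# preprocessing, LDA topic extraction, and multi-label Naive Bayes tagging.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["stock markets fell sharply today",
        "the team won the championship game",
        "new vaccine trial shows promising results",
        "central bank raises interest rates"]
tags = [["finance"], ["sports"], ["health"], ["finance", "policy"]]  # placeholder taxonomy labels

# Preprocessing: term-frequency counts for LDA.
tf = CountVectorizer(stop_words="english")
X_tf = tf.fit_transform(docs)

# Topic extraction: plain LDA with a hand-fixed number of topics, whereas the
# chapter selects the number of topics via hierarchical clustering.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X_tf)

# Document tagging: one-vs-rest Naive Bayes over tf-idf features as a
# stand-in for the chapter's multi-net Naive Bayes classifier.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)
clf = OneVsRestClassifier(MultinomialNB()).fit(X_tfidf, y)

new_doc = ["interest rates and markets"]
print(mlb.inverse_transform(clf.predict(tfidf.transform(new_doc))))
```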


Author(s): Pavel Makagonov, Celia B. Reyes E., Grigori Sidorov

The main idea of the authors' research is to perform quantitative analysis of a text collection during the process of its preparation and transformation into a digital library for a website. As a case study, they use the digital library of the website on Mixtec culture that they maintain. The authors propose using the concept of the text document search image (TDSI). To create TDSIs, they analyze word frequencies in the documents and distinguish between the Zipf distribution that is typical for meaningful words and distributions approximated by an ellipse that are typical for auxiliary words. The authors also describe some analogies of these distributions in architecture and urban planning. They describe a toolkit, DDL, that supports TDSI creation, and show its application to the mentioned website and to a corpus of dialogs with a railway office information system.
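As a toy illustration of the frequency analysis described (not the DDL toolkit itself), the sketch below computes the rank-frequency curve of a small sample text and fits the slope in log-log space to see how closely it follows Zipf's law; the sample text is an assumption.

```python
import math
from collections import Counter

# Placeholder text; in practice this would be a document from the collection.
text = ("the mixtec culture the library the documents of the culture "
        "and the words of the documents and the frequencies")

freqs = Counter(text.split())
ranked = [count for _, count in freqs.most_common()]  # frequencies by rank

# Zipf's law predicts log(freq) ~ log(C) - s * log(rank) with s close to 1.
# Estimate the exponent s with an ordinary least-squares fit in log-log space.
xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
ys = [math.log(count) for count in ranked]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))

print("rank-frequency list:", ranked)
print("fitted Zipf exponent:", round(-slope, 2))
```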


Author(s): Adolfo Guzman-Arenas, Alma-Delia Cuevas

All observers are equally credible, so differences in their findings arise from perception errors.


Author(s): Hiram Calvo, Kentaro Inui, Yuji Matsumoto

Learning verb argument preferences has been approached as a verb-argument problem, or at most as a ternary relationship between subject, verb, and object. However, the simultaneous correlation of all arguments in a sentence has not been explored thoroughly for measuring sentence plausibility because of the increased number of potential combinations and data sparseness. In this work, the authors present a review of some common methods for learning argument preferences, beginning with the simplest case of binary co-relations, then comparing with ternary co-relations, and finally considering all arguments. For the latter, the authors use an ensemble machine learning model that combines discriminative and generative models, using co-occurrence features and semantic features in different arrangements. They seek to answer questions about the optimal number of topics required for PLSI and LDA models, as well as the number of co-occurrences that should be required to improve performance. They explore the implications of using different ways of projecting co-relations, i.e., into a word space or directly into a co-occurrence feature space. The authors conducted tests using a pseudo-disambiguation task, learning from large corpora extracted from the Internet.
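In a pseudo-disambiguation test, a model must prefer an attested argument over a randomly substituted confounder. The sketch below (an assumption, not the authors' ensemble model) scores verb-object pairs with raw co-occurrence counts and measures accuracy on a few hand-made test triples.

```python
from collections import Counter
import random

# Pretend training data: (verb, object) pairs observed in a corpus (assumed).
train_pairs = [("drink", "water"), ("drink", "coffee"), ("drink", "tea"),
               ("drive", "car"), ("drive", "truck"), ("eat", "bread"),
               ("eat", "apple"), ("read", "book"), ("read", "paper")]

counts = Counter(train_pairs)
nouns = sorted({obj for _, obj in train_pairs})

def score(verb, obj):
    """Plausibility score: raw co-occurrence count (a real model would smooth
    this, e.g. with PLSI/LDA topics or semantic classes)."""
    return counts[(verb, obj)]

# Pseudo-disambiguation: for each test (verb, true_obj), pick a random
# confounder noun and check whether the true object scores higher.
random.seed(0)
test = [("drink", "tea"), ("drive", "car"), ("eat", "apple"), ("read", "book")]
correct = 0
for verb, true_obj in test:
    confounder = random.choice([n for n in nouns if n != true_obj])
    if score(verb, true_obj) > score(verb, confounder):
        correct += 1
print("pseudo-disambiguation accuracy:", correct / len(test))
```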


Author(s): Zhihua Wei, Duoqian Miao, Ruizhi Wang, Zhifei Zhang

Text representation is the prerequisite of various document processing tasks, such as information retrieval, text classification, and text clustering. It has been studied intensively in recent years, and many excellent models have been designed. However, the performance of these models is affected by data sparseness. Existing smoothing techniques usually use statistical theory or linguistic information to assign a uniform distribution to absent words; they do not consider the real word distribution or distinguish between words. In this chapter, a method based on a soft computing theory, Tolerance Rough Set theory, is proposed: it uses the upper and lower approximations of Rough Set theory to assign different values to absent words in different approximation regions. Theoretically, the algorithms can estimate a smoothing value for absent words according to their relation to existing words. Text classification experiments using the Vector Space Model (VSM) and the Latent Dirichlet Allocation (LDA) model on public corpora show that the algorithms greatly improve the performance of the text representation model, especially on unbalanced corpora.
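A minimal sketch of the idea as the authors describe it (an interpretation, not the chapter's algorithm): build tolerance classes from term co-occurrence, take the upper approximation of a document's term set, and give absent terms in that approximation a small smoothed weight based on their overlap with the document.

```python
from collections import defaultdict

# Toy corpus of tokenized documents (assumed).
docs = [["stock", "market", "price"],
        ["market", "price", "trade"],
        ["trade", "stock", "bank"],
        ["football", "match", "goal"]]

CO_THRESHOLD = 2  # terms co-occurring in >= this many docs are "tolerant"

# Tolerance classes: for each term, the set of terms it tolerates (including itself).
cooc = defaultdict(int)
for doc in docs:
    terms = set(doc)
    for a in terms:
        for b in terms:
            if a != b:
                cooc[(a, b)] += 1
vocab = {t for doc in docs for t in doc}
tolerance = {t: {t} | {u for u in vocab if cooc[(t, u)] >= CO_THRESHOLD}
             for t in vocab}

def smoothed_vector(doc):
    """Present terms keep weight 1.0; absent terms in the upper approximation
    (i.e. tolerant to some present term) get a fractional weight proportional
    to how many of the document's terms they tolerate."""
    present = set(doc)
    vec = {t: 1.0 for t in present}
    for t in vocab - present:
        overlap = len(tolerance[t] & present)
        if overlap:                       # t is in the upper approximation
            vec[t] = overlap / len(tolerance[t])
    return vec

print(smoothed_vector(["stock", "market"]))  # "price" receives a smoothed weight
```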


Author(s): Sara Elena Garza Villarreal, Ramón Brena

An extensive review of existing literature on clustering and topic mining is given throughout the chapter as well.

