Emerging Technologies of Text Mining
Latest Publications


TOTAL DOCUMENTS

14
(FIVE YEARS 0)

H-INDEX

3
(FIVE YEARS 0)

Published By IGI Global

ISBN: 9781599043739, 9781599043753

Author(s):  
Wagner Francisco Castilho ◽  
Gentil José de Lucena Filho ◽  
Hércules Antonio do Prado ◽  
Edilson Ferneda

Cluster analysis (CA) techniques consist in estimating, for a given set of objects, dense regions of points separated by sparse regions, according to the dimensions that describe those objects. Regardless of the nature of the data, structured or unstructured, we look for homogeneous clouds of points that define clusters, from which we want to extract some meaning. In other words, when performing CA, the analyst searches a multidimensional space for underlying structures to which some meaning could be assigned. Broadly, a CA application involves two main activities: generating cluster configurations by means of an algorithm, and interpreting those configurations in order to approximate a solution that contributes to the application's objective. Generating a cluster configuration is typically a computational task, while the interpretation task carries a strong burden of subjectivity. Many approaches for generating cluster configurations are presented in the literature. Unfortunately, the interpretation task has not received as much attention, possibly due to the difficulty of modeling something that is subjective in nature. In this chapter, a method to guide the interpretation of a cluster configuration is proposed. The inherent subjectivity is addressed directly by describing the process with the apparatus of the Ontology of Language. The main contribution of this chapter is to provide a sound conceptual basis to guide the analyst in extracting meaning from the patterns found in a set of data, whether that data comes from databases, free texts, or web pages.
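The "dense regions separated by sparse regions" idea the abstract describes can be sketched with a minimal density-based clustering routine in the style of DBSCAN; the `eps` and `min_pts` parameters below are illustrative choices, not values from the chapter.

```python
import math

def density_cluster(points, eps=1.0, min_pts=2):
    """Group points in dense regions; isolated points are labeled -1 (noise)."""
    def neighbors(i):
        # standard DBSCAN convention: a point counts as its own neighbor
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = {}          # point index -> cluster id (-1 = noise)
    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1               # sparse region: mark as noise for now
            continue
        labels[i] = cluster_id           # dense core point starts a new cluster
        frontier = list(nbrs)
        while frontier:
            j = frontier.pop()
            if j in labels:
                if labels[j] == -1:
                    labels[j] = cluster_id   # noise absorbed as a border point
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:       # j is also a core point: expand
                frontier.extend(j_nbrs)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
labels = density_cluster(pts, eps=1.5, min_pts=2)
```

Running this yields two dense clusters and one noise point; assigning meaning to those clusters is the interpretation task the chapter focuses on.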


Author(s):  
Leandro Krug Wives ◽  
José Palazzo Moreira de Oliveira ◽  
Stanley Loh

This chapter introduces a technique for clustering textual documents using concepts. Document clustering organizes large collections of documents into clusters of related information, which helps users locate relevant information. Traditional document clustering techniques use words to represent the contents of documents, and the use of words may cause semantic mistakes. Concepts, instead, represent real-world events and objects, and people employ them to express ideas, thoughts, opinions, and intentions. Thus, concepts are more appropriate for representing the contents of a document, and their use aids the comprehension of large document collections, since each cluster can be summarized and its contents (i.e., its concepts) rapidly identified. To perform this task, the chapter presents a methodology for clustering documents using concepts, along with practical experiments in a case study demonstrating that the proposed approach achieves better results than the use of words.
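To illustrate why concepts avoid the "semantic mistakes" of word matching, here is a toy sketch in which words are mapped to concepts through a hand-built lexicon and documents are compared on concept overlap; the lexicon and the Jaccard measure are assumptions for the example, not the chapter's actual resources.

```python
# Hypothetical word -> concept lexicon; a real system would use a much
# richer resource.
CONCEPT_LEXICON = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "dog": "animal", "cat": "animal", "puppy": "animal",
}

def concept_profile(text):
    """Bag of concepts for a document (words outside the lexicon are ignored)."""
    words = text.lower().split()
    return {CONCEPT_LEXICON[w] for w in words if w in CONCEPT_LEXICON}

def concept_similarity(a, b):
    """Jaccard overlap of two concept profiles."""
    pa, pb = concept_profile(a), concept_profile(b)
    if not pa or not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)

# Two documents with no content words in common still match at the concept level.
sim = concept_similarity("the car and the truck", "an automobile passed by")
```

A word-based comparison of these two texts would find no meaningful overlap, while the concept profiles match exactly.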


Author(s):  
Li Weigang ◽  
Wu Man Qi

This chapter presents a study of Ant Colony Optimization (ACO) applied to the Interlegis Web portal, a Brazilian legislation Website. The AntWeb approach is inspired by the foraging behavior of ant colonies: it adaptively marks the most significant link by means of the shortest route to the target pages. The system treats the users of the Web portal as artificial ants and the links among its pages as the search network. To identify groups of visitors, Web mining is applied to extract knowledge from preprocessed Web log files. The chapter describes the theory, model, main utilities, and implementation of the AntWeb prototype in the Interlegis Web portal. The case study covers off-line Web mining; simulations with and without the use of AntWeb; and testing under modified parameters. The results demonstrate the sensitivity and accessibility of AntWeb and its benefits for Interlegis Web users.
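The core ACO mechanism, users-as-ants depositing pheromone on links while old trails evaporate, can be sketched as follows; the page names, deposit and evaporation rates are hypothetical, not the chapter's settings.

```python
def update_pheromone(pheromone, visited_links, deposit=1.0, evaporation=0.1):
    """One step: evaporate pheromone everywhere, then reinforce used links."""
    for link in pheromone:
        pheromone[link] *= (1.0 - evaporation)
    for link in visited_links:
        pheromone[link] = pheromone.get(link, 0.0) + deposit
    return pheromone

def best_link(pheromone):
    """The most significant link to highlight on the portal page."""
    return max(pheromone, key=pheromone.get)

trails = {("home", "laws"): 0.0, ("home", "news"): 0.0}
# Simulate sessions: most users go from the home page to the laws page.
for session in [("home", "laws"), ("home", "laws"), ("home", "news")]:
    update_pheromone(trails, [session])
```

Evaporation is what makes the marking adaptive: if traffic patterns shift, the previously strongest link fades instead of staying highlighted forever.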


Author(s):  
Patricia Bintzler Cerrito

The purpose of this chapter is to demonstrate how text mining can be used to reduce the number of levels in a categorical variable so that the variable can then be used in a predictive model. The method works particularly well when several levels of the variable share the same identifier, so that they can be combined into a text string of values. The stemming property of the linked words is used to create clusters of these strings. In this chapter, we validate the technique through kernel density estimation and compare it to other techniques used to reduce the levels of categorical data.
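As a rough illustration of collapsing categorical levels via shared stems, the sketch below groups level labels whose stemmed tokens coincide; the naive suffix-stripping stemmer and the example levels are invented for the demonstration and are not the chapter's method in detail.

```python
def stem(word):
    """Deliberately naive suffix stripper, standing in for a real stemmer."""
    for suffix in ("ation", "ing", "ide", "ine", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def collapse_levels(levels):
    """Map stem-signature keys to the original levels sharing that signature."""
    groups = {}
    for level in levels:
        key = " ".join(sorted(stem(w) for w in level.lower().split()))
        groups.setdefault(key, []).append(level)
    return groups

# Three raw levels collapse to two reduced levels: the first two share stems.
levels = ["metformin 500mg", "metformins 500mg", "insulin injection"]
reduced = collapse_levels(levels)
```

A predictive model can then use the reduced keys as its categorical levels instead of the original, more numerous ones.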


Author(s):  
Jon Atle Gulla ◽  
Hans Olaf Borch ◽  
Jon Espen Ingvaldsen

Due to the large amount of information on the web and the difficulty of relating users' expressed information needs to document content, large-scale web search engines tend to return thousands of ranked documents. This chapter discusses the use of clustering to help users navigate through result sets and explore the domain. A newly developed system, HOBSearch, makes use of suffix tree clustering to overcome many of the weaknesses of traditional clustering approaches. By using result snippets rather than full documents, HOBSearch both speeds up clustering substantially and manages to tailor the clustering to the topics indicated in the user's query. An inherent problem with clustering, though, is the choice of cluster labels. Our experiments with HOBSearch show that cluster labels of acceptable quality can be generated without supervision or predefined structures, within the constraints imposed by large-scale web search.
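The key idea behind suffix tree clustering, that snippets sharing a common phrase form a base cluster whose phrase doubles as a readable label, can be approximated without the tree itself. The sketch below simply enumerates word bigrams over made-up snippets; a real implementation builds a generalized suffix tree over all suffixes.

```python
from collections import defaultdict

def phrase_clusters(snippets, n=2, min_docs=2):
    """Map each word n-gram shared by >= min_docs snippets to those snippets."""
    clusters = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = snippet.lower().split()
        for i in range(len(words) - n + 1):
            clusters[" ".join(words[i:i + n])].add(doc_id)
    return {label: docs for label, docs in clusters.items()
            if len(docs) >= min_docs}

snippets = [
    "jaguar speed on land",
    "jaguar speed record",
    "jaguar car dealership prices",
]
clusters = phrase_clusters(snippets)
```

Here the phrase "jaguar speed" groups the first two results and serves directly as the cluster label, which is exactly the labeling advantage the chapter attributes to phrase-based clustering.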


Author(s):  
Christian Aranha ◽  
Emmanuel Passos

This chapter integrates elements from Natural Language Processing (NLP), Information Retrieval, Data Mining, and Text Mining to support competitive intelligence (CI). It shows how text mining algorithms can support three important CI functionalities: filtering, event alerts, and search. Each of these can be mapped to a different pipeline of NLP tasks. The chapter goes in depth into NLP techniques such as spelling correction, stemming, augmenting, normalization, entity recognition, entity classification, acronym resolution, and co-reference processing. Each must be used at a specific moment to do a specific job, and all of these jobs are integrated into a whole system, 'assembled' in a manner specific to each application. A better understanding of the NLP theories provided herein will result in a better 'assembly'.
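The "assembly" of NLP tasks into application-specific pipelines can be sketched as composable stages; each stage below is a deliberately naive stand-in for the real technique (normalization, stopword filtering, stemming), and the stage choices per functionality are illustrative assumptions.

```python
def normalize(tokens):
    """Lowercase and strip trailing punctuation."""
    return [t.lower().strip(".,!?") for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of", "in"})):
    return [t for t in tokens if t not in stopwords]

def stem(tokens):
    """Toy stemmer: drop a trailing plural 's'."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_pipeline(*stages):
    """Compose stages left-to-right into one callable, per application."""
    def run(text):
        tokens = text.split()
        for stage in stages:
            tokens = stage(tokens)
        return tokens
    return run

# A 'Search' pipeline might use all three stages; 'Filtering' or 'Event
# Alerts' pipelines could be assembled from different subsets of the same
# building blocks, which is the chapter's point about assembly.
search_pipeline = build_pipeline(normalize, remove_stopwords, stem)
tokens = search_pipeline("The markets of Brazil.")
```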


Author(s):  
Roberto Penteado ◽  
Eric Boutin

Information overload demands that organizations develop new capabilities for analyzing data and texts to create the information they need. This chapter presents a bibliometric approach to mining structured text and data, tuned to the French school of information science. These methodologies and techniques allow organizations to identify the valuable information that leads to better decisions, enabling them to accomplish their mission and gain advantages over the competition. The authors argue that information treatment and analysis is the most critical organizational competence in our information society, and that organizations and universities should take measures to develop this new field of research.
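One staple technique in this bibliometric tradition is co-word analysis: counting how often keyword pairs co-occur across documents to map the structure of a field. A minimal sketch, with made-up keyword lists:

```python
from itertools import combinations
from collections import Counter

def coword_matrix(documents):
    """Count co-occurrences of keyword pairs within each document."""
    pairs = Counter()
    for keywords in documents:
        # sort so each unordered pair gets a single canonical key
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

docs = [
    ["text mining", "clustering", "ontology"],
    ["text mining", "clustering"],
    ["ontology", "semantics"],
]
matrix = coword_matrix(docs)
```

Strongly co-occurring pairs (here "clustering" with "text mining") reveal the themes an analyst would flag as central to the collection.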


Author(s):  
Domonkos Tikk ◽  
György Biro ◽  
Attila Törcsvári

Patent categorization (PC) is a typical application area of text categorization (TC). TC can be applied in different scenarios in the work of patent offices, depending on the stage at which categorization is needed. This is a challenging field for TC algorithms, since applications must deal simultaneously with a large number of categories (on the order of 1,000–10,000) organized in a hierarchy and a large number of long documents with huge vocabularies at training time, while remaining fast and accurate at on-the-fly categorization. In this chapter we present a hierarchical online classifier, called HITEC, that meets these requirements. The novelty of the method lies in the taxonomy-dependent architecture of the classifier, the applied weight-updating scheme, and the relaxed category selection method. We evaluate the method on two large English patent application databases, the WIPO-alpha and Espace A/B corpora, compare it to other TC algorithms on these collections, and show that it outperforms them significantly.
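HITEC itself uses a taxonomy-dependent architecture with learned weights; as a rough illustration of top-down hierarchical categorization, the sketch below routes a document greedily from the root, following the best-scoring child at each level and stopping early when no child matches (a crude analogue of relaxed category selection). The tiny taxonomy and keyword scores are invented for the example.

```python
# Hypothetical two-level taxonomy: node -> list of child categories.
TAXONOMY = {
    "root": ["A: chemistry", "B: physics"],
    "A: chemistry": ["A01: polymers", "A02: alloys"],
    "B: physics": [],
    "A01: polymers": [],
    "A02: alloys": [],
}
# Hypothetical per-category cue words standing in for learned weights.
KEYWORDS = {
    "A: chemistry": {"polymer", "alloy", "compound"},
    "B: physics": {"laser", "optics"},
    "A01: polymers": {"polymer", "resin"},
    "A02: alloys": {"alloy", "steel"},
}

def classify(document_words, node="root"):
    """Descend the taxonomy, at each level following the best-matching child."""
    children = TAXONOMY[node]
    if not children:
        return node
    best = max(children, key=lambda c: len(KEYWORDS[c] & document_words))
    if not KEYWORDS[best] & document_words:
        return node                  # no evidence below: stop at inner node
    return classify(document_words, best)

category = classify({"a", "novel", "polymer", "resin", "compound"})
```

Scoring only the children of the current node at each step is what keeps hierarchical classifiers tractable when the full category set runs into the thousands.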


Author(s):  
Lean Yu ◽  
Shouyang Wang ◽  
Kin Keung Lai

With the rapid growth of online information, there is a strong demand for Web text mining that helps people discover useful knowledge from Web documents. For this purpose, this chapter first proposes a back-propagation neural network (BPNN)-based Web text mining system for decision support. The system involves four main processes: Web document search, Web text processing, text feature conversion, and BPNN-based knowledge discovery. In particular, the BPNN is used as an intelligent learning agent that learns about the underlying Web documents. To scale the individual intelligent agent to large numbers of Web documents, we then provide a multi-agent-based neural network system that performs Web text mining in parallel. For illustration purposes, a simulated experiment is performed, and its results reveal that the proposed multi-agent neural network system is an effective solution for large-scale Web text mining.
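The text-feature-conversion and BPNN-learning steps can be sketched together: documents become bag-of-words vectors, and a small back-propagation network learns to score them. The vocabulary, toy training texts, and network size below are assumptions for the example, not the chapter's actual setup.

```python
import math, random

VOCAB = ["stock", "price", "market", "recipe", "cake", "flour"]

def features(text):
    """Text feature conversion: binary bag-of-words over a fixed vocabulary."""
    words = text.lower().split()
    return [1.0 if v in words else 0.0 for v in VOCAB]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# One hidden layer of 3 units, one output unit.
W1 = [[random.uniform(-0.5, 0.5) for _ in VOCAB] for _ in range(3)]
W2 = [random.uniform(-0.5, 0.5) for _ in range(3)]

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, y

def train(samples, lr=0.5, epochs=2000):
    """Plain back-propagation of the squared error, one sample at a time."""
    for _ in range(epochs):
        for x, target in samples:
            h, y = forward(x)
            delta_out = (y - target) * y * (1 - y)
            for j in range(3):
                delta_h = delta_out * W2[j] * h[j] * (1 - h[j])
                W2[j] -= lr * delta_out * h[j]
                for i in range(len(x)):
                    W1[j][i] -= lr * delta_h * x[i]

# Toy relevance task: finance documents are relevant (1), baking ones are not (0).
finance = [(features(t), 1.0) for t in ["stock market price", "stock price"]]
baking = [(features(t), 0.0) for t in ["cake recipe flour", "flour cake"]]
train(finance + baking)
_, score = forward(features("market price of a stock"))
```

In the multi-agent design the chapter then proposes, many such agents would each be trained on a partition of the documents in parallel.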


Author(s):  
Ying Liu ◽  
Han Tong Loh ◽  
Wen Feng Lu

This chapter introduces an approach to deriving a taxonomy from documents using a novel document profile model that enables document representations with semantic information systematically generated at the sentence level. A frequent word sequence method is proposed to search for salient semantic information and has been integrated into the document profile model. An experimental study of taxonomy generation using hierarchical agglomerative clustering shows a significant improvement in F-score based on the document profile model. A close examination reveals that the integration of semantic information makes a clear contribution compared to the classic bag-of-words approach. This study encourages us to further investigate applying the document profile model to a wide range of text-based mining tasks.
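Mining frequent word sequences at the sentence level, the kind of salient phrase the document profile model builds on, can be sketched naively as follows; the support threshold and example sentences are illustrative assumptions.

```python
from collections import Counter

def frequent_sequences(sentences, length=2, min_support=2):
    """Word sequences of a given length occurring in >= min_support sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        seqs = {tuple(words[i:i + length])
                for i in range(len(words) - length + 1)}
        for seq in seqs:                 # count each sentence at most once
            counts[seq] += 1
    return {" ".join(s) for s, c in counts.items() if c >= min_support}

sentences = [
    "the injection molding process",
    "injection molding reduces cost",
    "surface finish of the part",
]
salient = frequent_sequences(sentences)
```

Sequences like "injection molding" carry more semantic weight than the individual words do in a bag-of-words representation, which is the intuition behind enriching the document profile with them.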

