Emerging Technologies of Text Mining
Latest Publications


TOTAL DOCUMENTS

14
(FIVE YEARS 0)

H-INDEX

3
(FIVE YEARS 0)

Published By IGI Global

ISBN: 9781599043739, 9781599043753

Author(s):  
Wagner Francisco Castilho ◽  
Gentil José de Lucena Filho ◽  
Hércules Antonio do Prado ◽  
Edilson Ferneda

Cluster analysis (CA) techniques consist in estimating, for a given set of objects, dense regions of points separated by sparse regions, according to the dimensions that describe those objects. Regardless of the nature of the data, structured or unstructured, we look for homogeneous clouds of points that define clusters, from which we want to extract some meaning. In other words, when performing CA, the analyst searches a multidimensional space for underlying structures to which some meaning could be assigned. Broadly, a CA application involves two main activities: generating cluster configurations by means of an algorithm, and interpreting those configurations in order to approximate a solution that contributes to the application's objective. Generating a cluster configuration is typically a computational task, while the interpretation task carries a strong burden of subjectivity. Many approaches for generating cluster configurations are presented in the literature. Unfortunately, the interpretation task has not received as much attention, possibly due to the difficulty of modeling something that is subjective in nature. In this chapter, a method to guide the interpretation of a cluster configuration is proposed. The inherent subjectivity is addressed directly by describing the process with the apparatus of the Ontology of Language. The main contribution of this chapter is to provide a sound conceptual basis to guide the analyst in extracting meaning from the patterns found in a set of data, whether that data comes from databases, free texts, or web pages.
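The "dense regions separated by sparse regions" idea the abstract describes can be sketched with a minimal density-based clustering routine in the style of DBSCAN; the `eps` and `min_pts` parameters below are illustrative choices, not values from the chapter.

```python
import math

def density_cluster(points, eps=1.0, min_pts=2):
    """Group points in dense regions; isolated points are labeled -1 (noise)."""
    def neighbors(i):
        # standard DBSCAN convention: a point counts as its own neighbor
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = {}          # point index -> cluster id (-1 = noise)
    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1               # sparse region: mark as noise for now
            continue
        labels[i] = cluster_id           # dense core point starts a new cluster
        frontier = list(nbrs)
        while frontier:
            j = frontier.pop()
            if j in labels:
                if labels[j] == -1:
                    labels[j] = cluster_id   # noise absorbed as a border point
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:       # j is also a core point: expand
                frontier.extend(j_nbrs)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
labels = density_cluster(pts, eps=1.5, min_pts=2)
```

Running this yields two dense clusters and one noise point; assigning meaning to those clusters is the interpretation task the chapter focuses on.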


Author(s):  
Leandro Krug Wives ◽  
José Palazzo Moreira de Oliveira ◽  
Stanley Loh

This chapter introduces a technique for clustering textual documents using concepts. Document clustering organizes large collections of documents into clusters of related information, which helps users locate relevant information. Traditional document clustering techniques use words to represent the contents of documents, and the use of words may cause semantic mistakes. Concepts, instead, represent real-world events and objects, and people employ them to express ideas, thoughts, opinions, and intentions. Thus, concepts are more appropriate for representing the contents of a document, and their use aids the comprehension of large document collections, since each cluster can be summarized and its contents (i.e., its concepts) rapidly identified. To perform this task, the chapter presents a methodology for clustering documents using concepts, along with practical experiments in a case study demonstrating that the proposed approach achieves better results than the use of words.
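To illustrate why concepts avoid the "semantic mistakes" of word matching, here is a toy sketch in which words are mapped to concepts through a hand-built lexicon and documents are compared on concept overlap; the lexicon and the Jaccard measure are assumptions for the example, not the chapter's actual resources.

```python
# Hypothetical word -> concept lexicon; a real system would use a much
# richer resource.
CONCEPT_LEXICON = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "dog": "animal", "cat": "animal", "puppy": "animal",
}

def concept_profile(text):
    """Bag of concepts for a document (words outside the lexicon are ignored)."""
    words = text.lower().split()
    return {CONCEPT_LEXICON[w] for w in words if w in CONCEPT_LEXICON}

def concept_similarity(a, b):
    """Jaccard overlap of two concept profiles."""
    pa, pb = concept_profile(a), concept_profile(b)
    if not pa or not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)

# Two documents with no content words in common still match at the concept level.
sim = concept_similarity("the car and the truck", "an automobile passed by")
```

A word-based comparison of these two texts would find no meaningful overlap, while the concept profiles match exactly.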


Author(s):  
Li Weigang ◽  
Wu Man Qi

This chapter presents a study of Ant Colony Optimization (ACO) applied to the Interlegis Web portal, a Brazilian legislation Website. The AntWeb approach is inspired by the foraging behavior of ant colonies: it adaptively marks the most significant link by means of the shortest route to the target pages. The system treats the users of the Web portal as artificial ants and the links among its pages as the search network. To identify groups of visitors, Web mining is applied to extract knowledge from preprocessed Web log files. The chapter describes the theory, model, main utilities, and implementation of the AntWeb prototype in the Interlegis Web portal. The case study covers off-line Web mining; simulations with and without the use of AntWeb; and testing under modified parameters. The results demonstrate the sensitivity and accessibility of AntWeb and its benefits for Interlegis Web users.
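The core ACO mechanism, users-as-ants depositing pheromone on links while old trails evaporate, can be sketched as follows; the page names, deposit and evaporation rates are hypothetical, not the chapter's settings.

```python
def update_pheromone(pheromone, visited_links, deposit=1.0, evaporation=0.1):
    """One step: evaporate pheromone everywhere, then reinforce used links."""
    for link in pheromone:
        pheromone[link] *= (1.0 - evaporation)
    for link in visited_links:
        pheromone[link] = pheromone.get(link, 0.0) + deposit
    return pheromone

def best_link(pheromone):
    """The most significant link to highlight on the portal page."""
    return max(pheromone, key=pheromone.get)

trails = {("home", "laws"): 0.0, ("home", "news"): 0.0}
# Simulate sessions: most users go from the home page to the laws page.
for session in [("home", "laws"), ("home", "laws"), ("home", "news")]:
    update_pheromone(trails, [session])
```

Evaporation is what makes the marking adaptive: if traffic patterns shift, the previously strongest link fades instead of staying highlighted forever.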


Author(s):  
Patricia Bintzler Cerrito

The purpose of this chapter is to demonstrate how text mining can be used to reduce the number of levels in a categorical variable so that the variable can then be used in a predictive model. The method works particularly well when several levels of the variable share the same identifier, so that they can be combined into a text string of values. The stemming property of the linked words is used to create clusters of these strings. In this chapter, we validate the technique through kernel density estimation and compare it to other techniques used to reduce the levels of categorical data.
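As a rough illustration of collapsing categorical levels via shared stems, the sketch below groups level labels whose stemmed tokens coincide; the naive suffix-stripping stemmer and the example levels are invented for the demonstration and are not the chapter's method in detail.

```python
def stem(word):
    """Deliberately naive suffix stripper, standing in for a real stemmer."""
    for suffix in ("ation", "ing", "ide", "ine", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def collapse_levels(levels):
    """Map stem-signature keys to the original levels sharing that signature."""
    groups = {}
    for level in levels:
        key = " ".join(sorted(stem(w) for w in level.lower().split()))
        groups.setdefault(key, []).append(level)
    return groups

# Three raw levels collapse to two reduced levels: the first two share stems.
levels = ["metformin 500mg", "metformins 500mg", "insulin injection"]
reduced = collapse_levels(levels)
```

A predictive model can then use the reduced keys as its categorical levels instead of the original, more numerous ones.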


Author(s):  
Jon Atle Gulla ◽  
Hans Olaf Borch ◽  
Jon Espen Ingvaldsen

Due to the large amount of information on the web and the difficulty of relating users' expressed information needs to document content, large-scale web search engines tend to return thousands of ranked documents. This chapter discusses the use of clustering to help users navigate through result sets and explore the domain. A newly developed system, HOBSearch, makes use of suffix tree clustering to overcome many of the weaknesses of traditional clustering approaches. By using result snippets rather than full documents, HOBSearch both speeds up clustering substantially and manages to tailor the clustering to the topics indicated in the user's query. An inherent problem with clustering, though, is the choice of cluster labels. Our experiments with HOBSearch show that cluster labels of acceptable quality can be generated without supervision or predefined structures, within the constraints imposed by large-scale web search.
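The key idea behind suffix tree clustering, that snippets sharing a common phrase form a base cluster whose phrase doubles as a readable label, can be approximated without the tree itself. The sketch below simply enumerates word bigrams over made-up snippets; a real implementation builds a generalized suffix tree over all suffixes.

```python
from collections import defaultdict

def phrase_clusters(snippets, n=2, min_docs=2):
    """Map each word n-gram shared by >= min_docs snippets to those snippets."""
    clusters = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = snippet.lower().split()
        for i in range(len(words) - n + 1):
            clusters[" ".join(words[i:i + n])].add(doc_id)
    return {label: docs for label, docs in clusters.items()
            if len(docs) >= min_docs}

snippets = [
    "jaguar speed on land",
    "jaguar speed record",
    "jaguar car dealership prices",
]
clusters = phrase_clusters(snippets)
```

Here the phrase "jaguar speed" groups the first two results and serves directly as the cluster label, which is exactly the labeling advantage the chapter attributes to phrase-based clustering.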


Author(s):  
Christian Aranha ◽  
Emmanuel Passos

This chapter integrates elements from Natural Language Processing (NLP), Information Retrieval, Data Mining, and Text Mining to support competitive intelligence (CI). It shows how text mining algorithms can support three important CI functionalities: filtering, event alerts, and search. Each of these can be mapped to a different pipeline of NLP tasks. The chapter goes in depth into NLP techniques such as spelling correction, stemming, augmenting, normalization, entity recognition, entity classification, acronym resolution, and co-reference processing. Each must be used at a specific moment to do a specific job, and all of these jobs are integrated into a whole system, 'assembled' in a manner specific to each application. A better understanding of the NLP theories provided herein will result in a better 'assembly'.
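The "assembly" of NLP tasks into application-specific pipelines can be sketched as composable stages; each stage below is a deliberately naive stand-in for the real technique (normalization, stopword filtering, stemming), and the stage choices per functionality are illustrative assumptions.

```python
def normalize(tokens):
    """Lowercase and strip trailing punctuation."""
    return [t.lower().strip(".,!?") for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of", "in"})):
    return [t for t in tokens if t not in stopwords]

def stem(tokens):
    """Toy stemmer: drop a trailing plural 's'."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_pipeline(*stages):
    """Compose stages left-to-right into one callable, per application."""
    def run(text):
        tokens = text.split()
        for stage in stages:
            tokens = stage(tokens)
        return tokens
    return run

# A 'Search' pipeline might use all three stages; 'Filtering' or 'Event
# Alerts' pipelines could be assembled from different subsets of the same
# building blocks, which is the chapter's point about assembly.
search_pipeline = build_pipeline(normalize, remove_stopwords, stem)
tokens = search_pipeline("The markets of Brazil.")
```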


Author(s):  
Roberto Penteado ◽  
Eric Boutin

Information overload demands that organizations develop new capabilities for analyzing data and texts to create the information they need. This chapter presents a bibliometric approach to mining structured text and data, tuned to the French school of information science. These methodologies and techniques allow organizations to identify the valuable information that leads to better decisions, enabling them to accomplish their mission and gain advantages over the competition. The authors argue that information treatment and analysis is the most critical organizational competence in our information society, and that organizations and universities should take measures to develop this new field of research.
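One staple technique in this bibliometric tradition is co-word analysis: counting how often keyword pairs co-occur across documents to map the structure of a field. A minimal sketch, with made-up keyword lists:

```python
from itertools import combinations
from collections import Counter

def coword_matrix(documents):
    """Count co-occurrences of keyword pairs within each document."""
    pairs = Counter()
    for keywords in documents:
        # sort so each unordered pair gets a single canonical key
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

docs = [
    ["text mining", "clustering", "ontology"],
    ["text mining", "clustering"],
    ["ontology", "semantics"],
]
matrix = coword_matrix(docs)
```

Strongly co-occurring pairs (here "clustering" with "text mining") reveal the themes an analyst would flag as central to the collection.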


Author(s):  
Domonkos Tikk ◽  
György Biro ◽  
Attila Törcsvári

Patent categorization (PC) is a typical application area of text categorization (TC). TC can be applied in different scenarios in the work of patent offices, depending on the stage at which categorization is needed. This is a challenging field for TC algorithms, since applications must deal simultaneously with a large number of categories (on the order of 1,000–10,000) organized in a hierarchy and a large number of long documents with huge vocabularies at training time, while remaining fast and accurate at on-the-fly categorization. In this chapter we present a hierarchical online classifier, called HITEC, that meets these requirements. The novelty of the method lies in the taxonomy-dependent architecture of the classifier, the applied weight-updating scheme, and the relaxed category selection method. We evaluate the method on two large English patent application databases, the WIPO-alpha and Espace A/B corpora, compare it to other TC algorithms on these collections, and show that it outperforms them significantly.
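HITEC itself uses a taxonomy-dependent architecture with learned weights; as a rough illustration of top-down hierarchical categorization, the sketch below routes a document greedily from the root, following the best-scoring child at each level and stopping early when no child matches (a crude analogue of relaxed category selection). The tiny taxonomy and keyword scores are invented for the example.

```python
# Hypothetical two-level taxonomy: node -> list of child categories.
TAXONOMY = {
    "root": ["A: chemistry", "B: physics"],
    "A: chemistry": ["A01: polymers", "A02: alloys"],
    "B: physics": [],
    "A01: polymers": [],
    "A02: alloys": [],
}
# Hypothetical per-category cue words standing in for learned weights.
KEYWORDS = {
    "A: chemistry": {"polymer", "alloy", "compound"},
    "B: physics": {"laser", "optics"},
    "A01: polymers": {"polymer", "resin"},
    "A02: alloys": {"alloy", "steel"},
}

def classify(document_words, node="root"):
    """Descend the taxonomy, at each level following the best-matching child."""
    children = TAXONOMY[node]
    if not children:
        return node
    best = max(children, key=lambda c: len(KEYWORDS[c] & document_words))
    if not KEYWORDS[best] & document_words:
        return node                  # no evidence below: stop at inner node
    return classify(document_words, best)

category = classify({"a", "novel", "polymer", "resin", "compound"})
```

Scoring only the children of the current node at each step is what keeps hierarchical classifiers tractable when the full category set runs into the thousands.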


Author(s):  
Lean Yu ◽  
Shouyang Wang ◽  
Kin Keung Lai

With the rapid growth of online information, there is a strong demand for Web text mining that helps people discover useful knowledge from Web documents. For this purpose, this chapter first proposes a back-propagation neural network (BPNN)-based Web text mining system for decision support. The system involves four main processes: Web document search, Web text processing, text feature conversion, and BPNN-based knowledge discovery. In particular, the BPNN is used as an intelligent learning agent that learns about the underlying Web documents. To scale the individual intelligent agent to large numbers of Web documents, we then provide a multi-agent-based neural network system that performs Web text mining in parallel. For illustration purposes, a simulated experiment is performed, and its results reveal that the proposed multi-agent neural network system is an effective solution for large-scale Web text mining.
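The text-feature-conversion and BPNN-learning steps can be sketched together: documents become bag-of-words vectors, and a small back-propagation network learns to score them. The vocabulary, toy training texts, and network size below are assumptions for the example, not the chapter's actual setup.

```python
import math, random

VOCAB = ["stock", "price", "market", "recipe", "cake", "flour"]

def features(text):
    """Text feature conversion: binary bag-of-words over a fixed vocabulary."""
    words = text.lower().split()
    return [1.0 if v in words else 0.0 for v in VOCAB]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
# One hidden layer of 3 units, one output unit.
W1 = [[random.uniform(-0.5, 0.5) for _ in VOCAB] for _ in range(3)]
W2 = [random.uniform(-0.5, 0.5) for _ in range(3)]

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, y

def train(samples, lr=0.5, epochs=2000):
    """Plain back-propagation of the squared error, one sample at a time."""
    for _ in range(epochs):
        for x, target in samples:
            h, y = forward(x)
            delta_out = (y - target) * y * (1 - y)
            for j in range(3):
                delta_h = delta_out * W2[j] * h[j] * (1 - h[j])
                W2[j] -= lr * delta_out * h[j]
                for i in range(len(x)):
                    W1[j][i] -= lr * delta_h * x[i]

# Toy relevance task: finance documents are relevant (1), baking ones are not (0).
finance = [(features(t), 1.0) for t in ["stock market price", "stock price"]]
baking = [(features(t), 0.0) for t in ["cake recipe flour", "flour cake"]]
train(finance + baking)
_, score = forward(features("market price of a stock"))
```

In the multi-agent design the chapter then proposes, many such agents would each be trained on a partition of the documents in parallel.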


Author(s):  
Ying Liu ◽  
Han Tong Loh ◽  
Wen Feng Lu

This chapter introduces an approach to deriving a taxonomy from documents using a novel document profile model that enables document representations with semantic information systematically generated at the sentence level. A frequent word sequence method is proposed to search for salient semantic information and has been integrated into the document profile model. An experimental study of taxonomy generation using hierarchical agglomerative clustering shows a significant improvement in F-score based on the document profile model. A close examination reveals that the integration of semantic information makes a clear contribution compared to the classic bag-of-words approach. This study encourages us to further investigate applying the document profile model to a wide range of text-based mining tasks.
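Mining frequent word sequences at the sentence level, the kind of salient phrase the document profile model builds on, can be sketched naively as follows; the support threshold and example sentences are illustrative assumptions.

```python
from collections import Counter

def frequent_sequences(sentences, length=2, min_support=2):
    """Word sequences of a given length occurring in >= min_support sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        seqs = {tuple(words[i:i + length])
                for i in range(len(words) - length + 1)}
        for seq in seqs:                 # count each sentence at most once
            counts[seq] += 1
    return {" ".join(s) for s, c in counts.items() if c >= min_support}

sentences = [
    "the injection molding process",
    "injection molding reduces cost",
    "surface finish of the part",
]
salient = frequent_sequences(sentences)
```

Sequences like "injection molding" carry more semantic weight than the individual words do in a bag-of-words representation, which is the intuition behind enriching the document profile with them.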

