Text Mining: Machine Learning on Documents

Author(s):  
Dunja Mladenic

Intensive usage and growth of the World Wide Web, together with the daily increasing amount of text information in electronic form, have resulted in a growing need for computer-supported ways of dealing with text data. One of the most popular problems addressed with text mining methods is document categorization, which aims to classify documents into pre-defined categories based on their content. Other important problems addressed in text mining include content-based document search, automatic document summarization, automatic document clustering and construction of document hierarchies, document authorship detection, identification of plagiarism, topic identification and tracking, information extraction, hypertext analysis, and user profiling. If we agree that text mining is a fairly broad area dealing with computer-supported analysis of text, then the list of problems that can be addressed is rather long and open. Here we adopt this fairly open view but concentrate on the parts related to automatic data analysis and data mining.
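To make the categorization task concrete, the following is a minimal sketch of content-based document categorization with scikit-learn; it is our illustration, not code from the article, and the toy corpus, category labels, and test sentence are all invented.

```python
# A minimal sketch of document categorization: assign a document to a
# pre-defined category based on its content. Corpus and labels are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "stock markets fell sharply on inflation fears",
    "the central bank raised interest rates again",
    "the striker scored twice in the final match",
    "the team clinched the championship on penalties",
]
train_labels = ["finance", "finance", "sports", "sports"]

# TF-IDF turns each document into a weighted bag-of-words vector;
# naive Bayes then learns one term-weight profile per category.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["bank shares rallied after the rate decision"]))
# -> ['finance']
```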

Author(s):  
Junzo Watada
Keisuke Aoki
Masahiro Kawano
Muhammad Suzuri Hitam
...  

The availability of multimedia text document information has spread text mining among researchers. Text documents integrate numerical and linguistic data, making text mining interesting and challenging. We propose text mining based on a fuzzy quantification model and a fuzzy thesaurus. In text mining, we focus on: 1) sentences in Japanese text that are broken down into words; 2) a fuzzy thesaurus for finding words matching keywords in the text; 3) fuzzy multivariate analysis to analyze semantic meaning in predefined case studies. We use a fuzzy thesaurus to translate words written in Chinese and Japanese characters into keywords. This speeds up processing without requiring a dictionary to separate words. Fuzzy multivariate analysis is used to analyze the processed data and to extract latent, mutually related structures in the text data, i.e., to extract otherwise obscured knowledge. We apply dual scaling to mining library and Web page text information, and propose integrating the results into Kansei engineering for possible application in sales, marketing, and production.
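As a rough illustration of the fuzzy-thesaurus step, here is a hedged sketch (our construction, not the authors' code): the thesaurus assigns each (word, keyword) pair a membership grade in [0, 1], and a document's grade for a keyword is obtained by max-min composition over the words it contains. All grades below are invented.

```python
# Hypothetical fuzzy thesaurus: membership grade of each (word, keyword) pair.
fuzzy_thesaurus = {
    ("loan", "finance"): 0.9,
    ("interest", "finance"): 0.7,
    ("borrow", "finance"): 0.6,
    ("novel", "literature"): 0.8,
}

def keyword_grade(doc_words, keyword, thesaurus):
    """Max-min composition: grade of `keyword` for a document."""
    # doc_words maps each word to its membership in the document,
    # e.g. a normalized frequency in [0, 1].
    grades = [
        min(word_grade, thesaurus.get((word, keyword), 0.0))
        for word, word_grade in doc_words.items()
    ]
    return max(grades, default=0.0)

doc = {"loan": 1.0, "interest": 0.5, "novel": 0.2}
print(keyword_grade(doc, "finance", fuzzy_thesaurus))     # 0.9
print(keyword_grade(doc, "literature", fuzzy_thesaurus))  # 0.2
```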


2021, Vol 83 (1), pp. 72-79
Author(s):  
O.A. Kan
N.A. Mazhenov
K.B. Kopbalina
G.B. Turebaeva
...  

The main problem: The article deals with hiding text information in a graphic file. A formula for hiding text information in image pixels is proposed, and a steganography scheme for embedding secret text in random image pixels has been developed. Random bytes are pre-embedded in each row of pixels in the source image; the result of these operations is a key image. The text codes are embedded in random bytes of pixels of a given RGB channel. To form the secret message, the characters of the ASCII code table are used. Demonstration encryption and decryption programs have been developed in the Python 3.5.2 programming language; a graphic file is used as the decryption key. Purpose: To develop an algorithm for embedding text information in random pixels of an image. Methods: Among the methods of hiding information in graphic images, the LSB method is widely used, in which the lower bits of the image bytes responsible for color encoding are replaced by the bits of the secret message. Analysis of methods for hiding information in graphic files and modeling of the algorithms showed an increased level of protection of hidden information from detection. Results and their significance: Using the proposed steganography scheme and the algorithm for embedding the bytes of a secret message in a graphic file, protection against detection of the hidden information is significantly increased. The advantage of this scheme is that decryption uses a key image in which random bytes have been pre-embedded. In addition, all the pixel bits of the container image remain in use for displaying color shades. The developed scheme also allows not only the transmission of secret information but also the addition of digital fingerprints or hidden tags to an image.
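For orientation, here is a minimal sketch of the classic LSB embedding that the Methods section refers to. It is a deliberate simplification, not the article's scheme: it writes the message bits sequentially into the blue channel's least significant bits, whereas the article embeds text codes into random pixels selected via a pre-built key image.

```python
# A simplified LSB sketch: sequential pixels, fixed (blue) channel.
# The article's scheme instead uses random pixels and a key image.
def embed(pixels, message):
    """Hide an ASCII message in the least significant bits of blue."""
    bits = "".join(f"{ord(ch):08b}" for ch in message + "\0")  # NUL ends it
    assert len(bits) <= len(pixels), "image too small for this message"
    out = [list(p) for p in pixels]
    for i, bit in enumerate(bits):
        out[i][2] = (out[i][2] & 0xFE) | int(bit)  # replace blue LSB
    return [tuple(p) for p in out]

def extract(pixels):
    """Read blue-channel LSBs, 8 at a time, until the NUL marker."""
    chars, byte = [], ""
    for r, g, b in pixels:
        byte += str(b & 1)
        if len(byte) == 8:
            if byte == "00000000":
                break
            chars.append(chr(int(byte, 2)))
            byte = ""
    return "".join(chars)

image = [(120, 200, 33)] * 1024          # stand-in for real RGB pixel data
print(extract(embed(image, "secret")))   # -> 'secret'
```

Because only the lowest bit of one channel changes, the visual difference from the container image is imperceptible; the article's random-pixel placement additionally makes the payload harder to locate.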


2020, Vol 5 (4), pp. 43-55
Author(s):  
Gianpiero Bianchi
Renato Bruni
Cinzia Daraio
Antonio Laureti Palma
Giulio Perani
...  

Abstract

Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities' websites. The information automatically extracted can potentially be updated more often than once per year, and is safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators about the efficiency of universities' websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for "profiling" the analyzed universities.

Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information needed to compute our indicators has been extracted from the universities' websites using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB in a semi-structured form that allows information to be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web has been combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All the above was used to cluster 79 Italian universities based on structural and digital indicators.

Findings: The main findings of this study concern the evaluation of the digitalization potential of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities' websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to the above indicators.

Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications: The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value: This work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition, and nontrivial text mining operations (Bruni & Bianchi, 2020).
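As a hedged sketch of the final step only, the following shows how universities could be grouped by mixed structural and digital indicators using k-means; the indicator names and values are invented, not taken from the study's ETER or web mining data.

```python
# Clustering universities on standardized structural + digital indicators.
# All names and numbers below are invented for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

universities = ["Uni A", "Uni B", "Uni C", "Uni D"]
# columns: students (thousands), website pages, monthly visits (thousands)
X = np.array([
    [60.0, 120_000, 900.0],
    [12.0,  15_000, 110.0],
    [55.0,  95_000, 840.0],
    [ 9.0,  11_000,  95.0],
])

# Standardize so large-magnitude indicators do not dominate the distance.
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)

for name, label in zip(universities, labels):
    print(name, "-> cluster", label)
```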


2021, Vol 2083 (4), pp. 042044
Author(s):  
Zuhua Dai
Yuanyuan Liu
Shilong Di
Qi Fan

Abstract Aspect-level sentiment analysis belongs to fine-grained sentiment analysis, which has attracted extensive research in academic circles in recent years. For this task, the recurrent neural network (RNN) model is usually used for feature extraction, but the model cannot effectively obtain the structural information of the text. Recent studies have begun to use the graph convolutional network (GCN) to model the syntactic dependency tree of the text to solve this problem. For short text data, the text alone is often not enough to accurately determine the emotional polarity of the aspect words, and knowledge graphs have not been effectively used as external knowledge to enrich the semantic information. To solve these problems, this paper proposes a graph convolutional network (GCN) model that can process syntactic information, knowledge graphs, and text semantic information. The model works on a "syntax-knowledge" graph to extract syntactic information and common-sense information at the same time. Compared with the latest models, the proposed model effectively improves the accuracy of aspect-level sentiment classification on two datasets.
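To fix ideas, here is a minimal sketch of one graph-convolution layer over a token dependency graph; it is the generic GCN update, not the paper's syntax-knowledge model, and the adjacency matrix and feature sizes are invented.

```python
# One standard GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

# Four tokens linked by undirected dependency arcs, 8-dim embeddings.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))       # token embeddings
W = rng.normal(size=(8, 8))       # layer weights
print(gcn_layer(A, H, W).shape)   # (4, 8)
```

In the paper's setting, A would encode the syntactic dependency tree augmented with knowledge-graph edges, so each token aggregates both syntactic and common-sense neighbors.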


Author(s):  
Byung-Kwon Park
Il-Yeol Song

As the amount of data grows very fast inside and outside of an enterprise, it is becoming important to seamlessly analyze both structured and unstructured data for total business intelligence. Especially, as most business data are unstructured text documents, including Web pages on the Internet, we need a Text OLAP solution to perform multidimensional analysis of text documents in the same way as structured relational data. We first survey representative works selected to demonstrate how the technologies of text mining and information retrieval, the major technologies for handling text data, can be applied to multidimensional analysis of text documents. We then survey representative works selected to demonstrate how unstructured text documents and structured relational data can be associated and consolidated to obtain total business intelligence. Finally, we present a future business intelligence platform architecture as well as related research topics. We expect the proposed total heterogeneous business intelligence architecture, which integrates information retrieval, text mining, and information extraction technologies together with relational OLAP technologies, to provide a better platform toward total business intelligence.
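As a loose illustration of the idea (our construction, not any surveyed system), a "text cube" can be emulated by attaching relational dimensions to documents and rolling term statistics up per cell:

```python
# Toy text-cube roll-up: group documents by OLAP-style dimensions and
# aggregate a term-level measure per cell. Data is invented.
from collections import Counter
import pandas as pd

docs = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "text": [
        "supply delays hurt sales",
        "sales improved after new campaign",
        "strong sales in retail stores",
        "retail margins under pressure",
    ],
})

def top_terms(texts, k=2):
    """Most frequent terms in a group of documents (toy tokenizer)."""
    counts = Counter(word for t in texts for word in t.split())
    return [word for word, _ in counts.most_common(k)]

# Roll up along (region, quarter), as relational OLAP would, but with
# a textual measure (top terms) instead of a numeric aggregate.
print(docs.groupby(["region", "quarter"])["text"].apply(top_terms))
```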


2020, pp. 1686-1704
Author(s):  
Emna Hkiri
Souheyl Mallat
Mounir Zrigui

The event extraction task consists of determining and classifying events within an open-domain text. It is still very new for the Arabic language, whereas it has attained maturity for languages such as English and French. Event extraction has also been shown to help Natural Language Processing tasks such as information retrieval, question answering, text mining, and machine translation achieve higher performance. In this article, we present an ongoing effort to build a system for event extraction from Arabic texts using the GATE platform and other tools.
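To illustrate the task itself (not the authors' GATE-based system, which targets Arabic), a toy rule-based extractor can pair event triggers with their arguments:

```python
# A toy trigger-plus-arguments event extractor. The trigger lexicon,
# pattern, and sentences are invented; real systems use far richer
# grammars and morphological analysis, especially for Arabic.
import re

TRIGGERS = {"bombed": "Attack", "met": "Meeting", "elected": "Election"}
PATTERN = re.compile(r"(\w+) (bombed|met|elected) (\w+)")

def extract_events(text):
    return [
        {"type": TRIGGERS[trigger], "trigger": trigger,
         "arg1": subject, "arg2": obj}
        for subject, trigger, obj in PATTERN.findall(text)
    ]

print(extract_events("Rebels bombed Aleppo. Delegates met Tuesday."))
```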


2020, Vol 14 (5), pp. 779-790
Author(s):  
Ruriko Watanabe
Nobutada Fujii
Daisuke Kokuryo
Toshiya Kaihara
Yoichi Abe
...  

This study was conducted to devise a method for supporting consulting companies in responding to client demands irrespective of the expertise of individual consultants. With the emphasis on revitalizing small and medium-sized enterprises, support systems for consulting services to serve them are becoming increasingly important; those systems must support solutions to the difficulties that such enterprises must address. Consulting companies can respond to a wide variety of management consultations. Nevertheless, because the consultation contents are highly specialized, service proposals and problem detection depend on the experience and intuition of the consultant, and stable service often cannot be provided. A support system must therefore provide stable services independent of the ability of consultants. In this study, analyzing customer information describing the contents of consultations with client companies is the first step in constructing a support system that can predict future problems. Text data such as consultants' visit histories, consultation contents from e-mail, and call-center records are used for the analyses, because these contents describe current problems and might also indicate future ones. This report describes a method to analyze such text data using text mining. The target problem is fraud, which includes uncertainty: cases in which it is not clear whether a fraud problem has occurred at the company. To address this uncertainty, a method using logistic regression models is proposed to represent inferred values as probabilities rather than as binary labels, because some misidentified companies might nevertheless have difficulties. Computer experiments are conducted to verify the effectiveness of the proposed method and to compare consultants' forecasts with achieved results. The verification experiment shows, first, that the proposed method is applicable to problems that include uncertainty, and second, that it can discover companies with a fraud problem of which they are unaware.
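A hedged sketch of the scoring step follows: a logistic regression over text features outputs a fraud-problem probability rather than a binary label, which is the uncertainty handling the study argues for. The consultation texts and labels are invented.

```python
# Scoring client companies with a probability of a (possibly unnoticed)
# fraud problem. Texts and labels are toy data, not the study's records.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

histories = [
    "customer complaints about double billing and missing invoices",
    "routine consultation on payroll software upgrade",
    "suspicious cash withdrawals flagged by the accountant",
    "request for advice on marketing budget allocation",
]
has_fraud_problem = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(histories, has_fraud_problem)

# predict_proba exposes the graded judgment: a company scored 0.55 can be
# handled differently from one scored 0.99, instead of a hard yes/no.
new_case = ["accountant reports unexplained invoice adjustments"]
print(model.predict_proba(new_case)[0][1])
```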


Author(s):  
Masaomi Kimura

Text mining has been growing, mainly due to the need to extract useful information from vast amounts of textual data. Our target here is text data: a collection of freely described responses to questionnaires. Unlike research papers, newspaper articles, call-center logs, and web pages, which are the usual targets of text mining analysis, freely described questionnaire responses have specific characteristics: each piece of data consists of a small number of short sentences, while the wide variety of content precludes the application of the clustering algorithms commonly used to classify such data. In this paper, we suggest a way to extract the opinions expressed by multiple respondents, based on the modification relationships included in each sentence of the freely described data. Applications of our method are also presented after the introduction of our approach.
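To suggest what mining modification relationships can look like, here is a hedged sketch using spaCy's English dependency parser as a stand-in (the paper analyzes Japanese responses with its own method): adjectival modifiers and predicative adjectives are paired with the nouns they describe, and opinions voiced by multiple respondents surface as high counts.

```python
# Counting modifier-head opinion pairs in free-text questionnaire answers.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

answers = [
    "The staff were friendly but the room was noisy.",
    "Friendly staff, terrible parking.",
    "The noisy room ruined my stay.",
]

pairs = Counter()
for doc in nlp.pipe(answers):
    for tok in doc:
        if tok.dep_ == "amod":                      # "noisy room"
            pairs[(tok.text.lower(), tok.head.text.lower())] += 1
        elif tok.dep_ == "acomp":                   # "staff were friendly"
            for child in tok.head.children:
                if child.dep_ == "nsubj":
                    pairs[(tok.text.lower(), child.text.lower())] += 1

print(pairs.most_common(3))
```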


2001, Vol 52 (13), pp. 1148-1156
Author(s):  
Ronald N. Kostoff
J. Antonio del Río
James A. Humenik
Esther Ofilia García
Ana María Ramírez

Author(s):  
Lin Wang
Yanfen Huang
Muhd Khaizer Omar

The rapid development of networks has led to the recognition of blended learning as an effective learning model. Text mining was used to analyze blended learning practice data from 17 countries provided by the Christensen Institute. By classifying and extracting text information on the selection of blended learning models and the challenges of blended learning, the factors that hinder its implementation were analyzed. The distribution of blended learning courses and practice models in each country was discussed, as was the relationship between region and implementation model. The results demonstrate that the practice of blended learning in primary and secondary schools is mature in four courses: English, Mathematics, Science, and Social Research. The choice of blended learning implementation model appears unaffected by region, but tends toward mixed models. The practice cycle of blended learning is long and requires long-term stable support, while the preparation of teachers' and students' abilities is the largest obstacle to the effective development of blended learning. This study provides references for improving the efficiency of blended learning practice, especially in the aspects of practice model selection, infrastructure preparation, and teacher and student ability training.

