Text Mining: Machine Learning on Documents

Author(s):  
Dunja Mladenic

Intensive usage and growth of the World Wide Web, together with the daily increasing amount of text information in electronic form, have resulted in a growing need for computer-supported ways of dealing with text data. One of the most popular problems addressed with text mining methods is document categorization, which aims to classify documents into pre-defined categories based on their content. Other important problems addressed in text mining include content-based document search, automatic document summarization, automatic document clustering and construction of document hierarchies, document authorship detection, identification of plagiarism, topic identification and tracking, information extraction, hypertext analysis, and user profiling. If we agree that text mining is a fairly broad area dealing with computer-supported analysis of text, then the list of problems that can be addressed is rather long and open. Here we adopt this fairly open view but concentrate on the parts related to automatic data analysis and data mining.
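To make the categorization task concrete, the following is a minimal sketch of content-based document categorization with scikit-learn; it is our illustration, not code from the article, and the toy corpus, category labels, and test sentence are all invented.

```python
# A minimal sketch of document categorization: assign a document to a
# pre-defined category based on its content. Corpus and labels are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "stock markets fell sharply on inflation fears",
    "the central bank raised interest rates again",
    "the striker scored twice in the final match",
    "the team clinched the championship on penalties",
]
train_labels = ["finance", "finance", "sports", "sports"]

# TF-IDF turns each document into a weighted bag-of-words vector;
# naive Bayes then learns one term-weight profile per category.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["bank shares rallied after the rate decision"]))
# -> ['finance']
```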

Author(s):  
Junzo Watada
Keisuke Aoki
Masahiro Kawano
Muhammad Suzuri Hitam
...  

The availability of multimedia text document information has spread text mining among researchers. Text documents integrate numerical and linguistic data, making text mining interesting and challenging. We propose text mining based on a fuzzy quantification model and a fuzzy thesaurus. In text mining, we focus on: 1) sentences in Japanese text that are broken down into words; 2) a fuzzy thesaurus for finding words matching keywords in the text; 3) fuzzy multivariate analysis to analyze semantic meaning in predefined case studies. We use a fuzzy thesaurus to translate words written in Chinese and Japanese characters into keywords. This speeds up processing without requiring a dictionary to separate words. Fuzzy multivariate analysis is used to analyze the processed data and to extract latent, mutually related structures in the text data, i.e., to extract otherwise obscured knowledge. We apply dual scaling to mining library and Web page text information, and propose integrating the results into Kansei engineering for possible application in sales, marketing, and production.
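As a rough illustration of the fuzzy-thesaurus step, here is a hedged sketch (our construction, not the authors' code): the thesaurus assigns each (word, keyword) pair a membership grade in [0, 1], and a document's grade for a keyword is obtained by max-min composition over the words it contains. All grades below are invented.

```python
# Hypothetical fuzzy thesaurus: membership grade of each (word, keyword) pair.
fuzzy_thesaurus = {
    ("loan", "finance"): 0.9,
    ("interest", "finance"): 0.7,
    ("borrow", "finance"): 0.6,
    ("novel", "literature"): 0.8,
}

def keyword_grade(doc_words, keyword, thesaurus):
    """Max-min composition: grade of `keyword` for a document."""
    # doc_words maps each word to its membership in the document,
    # e.g. a normalized frequency in [0, 1].
    grades = [
        min(word_grade, thesaurus.get((word, keyword), 0.0))
        for word, word_grade in doc_words.items()
    ]
    return max(grades, default=0.0)

doc = {"loan": 1.0, "interest": 0.5, "novel": 0.2}
print(keyword_grade(doc, "finance", fuzzy_thesaurus))     # 0.9
print(keyword_grade(doc, "literature", fuzzy_thesaurus))  # 0.2
```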


2021, Vol 83 (1), pp. 72-79
Author(s):  
O.A. Kan
N.A. Mazhenov
K.B. Kopbalina
G.B. Turebaeva
...  

The main problem: The article deals with hiding text information in a graphic file. A formula for hiding text information in image pixels is proposed, and a steganography scheme for embedding secret text in random image pixels has been developed. Random bytes are pre-embedded in each row of pixels in the source image; the result of these operations is a key image. The text codes are embedded in random bytes of pixels of a given RGB channel. To form the secret message, the characters of the ASCII code table are used. Demonstration encryption and decryption programs have been developed in the Python 3.5.2 programming language; a graphic file is used as the decryption key. Purpose: To develop an algorithm for embedding text information in random pixels of an image. Methods: Among the methods of hiding information in graphic images, the LSB method is widely used, in which the lower bits of the image bytes responsible for color encoding are replaced by the bits of the secret message. Analysis of methods for hiding information in graphic files and modeling of the algorithms showed an increased level of protection of hidden information from detection. Results and their significance: Using the proposed steganography scheme and the algorithm for embedding the bytes of a secret message in a graphic file, protection against detection of the hidden information is significantly increased. The advantage of this scheme is that decryption uses a key image in which random bytes have been pre-embedded. In addition, all the pixel bits of the container image remain in use for displaying color shades. The developed scheme also allows not only the transmission of secret information but also the addition of digital fingerprints or hidden tags to an image.
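For orientation, here is a minimal sketch of the classic LSB embedding that the Methods section refers to. It is a deliberate simplification, not the article's scheme: it writes the message bits sequentially into the blue channel's least significant bits, whereas the article embeds text codes into random pixels selected via a pre-built key image.

```python
# A simplified LSB sketch: sequential pixels, fixed (blue) channel.
# The article's scheme instead uses random pixels and a key image.
def embed(pixels, message):
    """Hide an ASCII message in the least significant bits of blue."""
    bits = "".join(f"{ord(ch):08b}" for ch in message + "\0")  # NUL ends it
    assert len(bits) <= len(pixels), "image too small for this message"
    out = [list(p) for p in pixels]
    for i, bit in enumerate(bits):
        out[i][2] = (out[i][2] & 0xFE) | int(bit)  # replace blue LSB
    return [tuple(p) for p in out]

def extract(pixels):
    """Read blue-channel LSBs, 8 at a time, until the NUL marker."""
    chars, byte = [], ""
    for r, g, b in pixels:
        byte += str(b & 1)
        if len(byte) == 8:
            if byte == "00000000":
                break
            chars.append(chr(int(byte, 2)))
            byte = ""
    return "".join(chars)

image = [(120, 200, 33)] * 1024          # stand-in for real RGB pixel data
print(extract(embed(image, "secret")))   # -> 'secret'
```

Because only the lowest bit of one channel changes, the visual difference from the container image is imperceptible; the article's random-pixel placement additionally makes the payload harder to locate.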


2020, Vol 5 (4), pp. 43-55
Author(s):  
Gianpiero Bianchi
Renato Bruni
Cinzia Daraio
Antonio Laureti Palma
Giulio Perani
...  

Abstract

Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from universities' websites. The information automatically extracted can potentially be updated more often than once per year, and is safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators about the efficiency of universities' websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for "profiling" the analyzed universities.

Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information needed to compute our indicators has been extracted from the universities' websites using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB in a semi-structured form that allows information to be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web has been combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All the above was used to cluster 79 Italian universities based on structural and digital indicators.

Findings: The main findings of this study concern the evaluation of the digitalization potential of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities' websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to the above indicators.

Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications: The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value: This work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition, and nontrivial text mining operations (Bruni & Bianchi, 2020).
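As a hedged sketch of the final step only, the following shows how universities could be grouped by mixed structural and digital indicators using k-means; the indicator names and values are invented, not taken from the study's ETER or web mining data.

```python
# Clustering universities on standardized structural + digital indicators.
# All names and numbers below are invented for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

universities = ["Uni A", "Uni B", "Uni C", "Uni D"]
# columns: students (thousands), website pages, monthly visits (thousands)
X = np.array([
    [60.0, 120_000, 900.0],
    [12.0,  15_000, 110.0],
    [55.0,  95_000, 840.0],
    [ 9.0,  11_000,  95.0],
])

# Standardize so large-magnitude indicators do not dominate the distance.
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)

for name, label in zip(universities, labels):
    print(name, "-> cluster", label)
```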


2021, Vol 2083 (4), pp. 042044
Author(s):  
Zuhua Dai
Yuanyuan Liu
Shilong Di
Qi Fan

Abstract Aspect-level sentiment analysis belongs to fine-grained sentiment analysis, which has attracted extensive research in academic circles in recent years. For this task, the recurrent neural network (RNN) model is usually used for feature extraction, but the model cannot effectively obtain the structural information of the text. Recent studies have begun to use the graph convolutional network (GCN) to model the syntactic dependency tree of the text to solve this problem. For short text data, the text alone is often not enough to accurately determine the emotional polarity of the aspect words, and knowledge graphs have not been effectively used as external knowledge to enrich the semantic information. To solve these problems, this paper proposes a graph convolutional network (GCN) model that can process syntactic information, knowledge graphs, and text semantic information. The model works on a "syntax-knowledge" graph to extract syntactic information and common-sense information at the same time. Compared with the latest models, the proposed model effectively improves the accuracy of aspect-level sentiment classification on two datasets.
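To fix ideas, here is a minimal sketch of one graph-convolution layer over a token dependency graph; it is the generic GCN update, not the paper's syntax-knowledge model, and the adjacency matrix and feature sizes are invented.

```python
# One standard GCN layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

# Four tokens linked by undirected dependency arcs, 8-dim embeddings.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))       # token embeddings
W = rng.normal(size=(8, 8))       # layer weights
print(gcn_layer(A, H, W).shape)   # (4, 8)
```

In the paper's setting, A would encode the syntactic dependency tree augmented with knowledge-graph edges, so each token aggregates both syntactic and common-sense neighbors.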


Author(s):  
Byung-Kwon Park
Il-Yeol Song

As the amount of data grows very fast inside and outside of an enterprise, it is becoming important to seamlessly analyze both structured and unstructured data for total business intelligence. Especially, as most business data are unstructured text documents, including Web pages on the Internet, we need a Text OLAP solution to perform multidimensional analysis of text documents in the same way as structured relational data. We first survey representative works selected to demonstrate how the technologies of text mining and information retrieval, the major technologies for handling text data, can be applied to multidimensional analysis of text documents. We then survey representative works selected to demonstrate how unstructured text documents and structured relational data can be associated and consolidated to obtain total business intelligence. Finally, we present a future business intelligence platform architecture as well as related research topics. We expect the proposed total heterogeneous business intelligence architecture, which integrates information retrieval, text mining, and information extraction technologies together with relational OLAP technologies, to provide a better platform toward total business intelligence.
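As a loose illustration of the idea (our construction, not any surveyed system), a "text cube" can be emulated by attaching relational dimensions to documents and rolling term statistics up per cell:

```python
# Toy text-cube roll-up: group documents by OLAP-style dimensions and
# aggregate a term-level measure per cell. Data is invented.
from collections import Counter
import pandas as pd

docs = pd.DataFrame({
    "region":  ["EU", "EU", "US", "US"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "text": [
        "supply delays hurt sales",
        "sales improved after new campaign",
        "strong sales in retail stores",
        "retail margins under pressure",
    ],
})

def top_terms(texts, k=2):
    """Most frequent terms in a group of documents (toy tokenizer)."""
    counts = Counter(word for t in texts for word in t.split())
    return [word for word, _ in counts.most_common(k)]

# Roll up along (region, quarter), as relational OLAP would, but with
# a textual measure (top terms) instead of a numeric aggregate.
print(docs.groupby(["region", "quarter"])["text"].apply(top_terms))
```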


2020, pp. 1686-1704
Author(s):  
Emna Hkiri
Souheyl Mallat
Mounir Zrigui

The event extraction task consists of determining and classifying events within an open-domain text. It is still very new for the Arabic language, whereas it has attained maturity for languages such as English and French. Event extraction has also been shown to help Natural Language Processing tasks such as information retrieval, question answering, text mining, and machine translation achieve higher performance. In this article, we present an ongoing effort to build a system for event extraction from Arabic texts using the GATE platform and other tools.
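To illustrate the task itself (not the authors' GATE-based system, which targets Arabic), a toy rule-based extractor can pair event triggers with their arguments:

```python
# A toy trigger-plus-arguments event extractor. The trigger lexicon,
# pattern, and sentences are invented; real systems use far richer
# grammars and morphological analysis, especially for Arabic.
import re

TRIGGERS = {"bombed": "Attack", "met": "Meeting", "elected": "Election"}
PATTERN = re.compile(r"(\w+) (bombed|met|elected) (\w+)")

def extract_events(text):
    return [
        {"type": TRIGGERS[trigger], "trigger": trigger,
         "arg1": subject, "arg2": obj}
        for subject, trigger, obj in PATTERN.findall(text)
    ]

print(extract_events("Rebels bombed Aleppo. Delegates met Tuesday."))
```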


2020, Vol 14 (5), pp. 779-790
Author(s):  
Ruriko Watanabe
Nobutada Fujii
Daisuke Kokuryo
Toshiya Kaihara
Yoichi Abe
...  

This study was conducted to devise a method for supporting consulting companies in responding to client demands irrespective of the expertise of individual consultants. With the emphasis on revitalizing small and medium-sized enterprises, support systems for consulting services to serve them are becoming increasingly important; those systems must support solutions to the difficulties that such enterprises must address. Consulting companies can respond to a wide variety of management consultations. Nevertheless, because the consultation contents are highly specialized, service proposals and problem detection depend on the experience and intuition of the consultant, and stable service often cannot be provided. A support system must therefore provide stable services independent of the ability of consultants. In this study, analyzing customer information describing the contents of consultations with client companies is the first step in constructing a support system that can predict future problems. Text data such as consultants' visit histories, consultation contents from e-mail, and call-center records are used for the analyses, because these contents describe current problems and might also indicate future ones. This report describes a method to analyze such text data using text mining. The target problem is fraud, which includes uncertainty: cases in which it is not clear whether a fraud problem has occurred at the company. To address this uncertainty, a method using logistic regression models is proposed to represent inferred values as probabilities rather than as binary labels, because some misidentified companies might nevertheless have difficulties. Computer experiments are conducted to verify the effectiveness of the proposed method and to compare consultants' forecasts with achieved results. The verification experiment shows, first, that the proposed method is applicable to problems that include uncertainty, and second, that it can discover companies with a fraud problem of which they are unaware.
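A hedged sketch of the scoring step follows: a logistic regression over text features outputs a fraud-problem probability rather than a binary label, which is the uncertainty handling the study argues for. The consultation texts and labels are invented.

```python
# Scoring client companies with a probability of a (possibly unnoticed)
# fraud problem. Texts and labels are toy data, not the study's records.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

histories = [
    "customer complaints about double billing and missing invoices",
    "routine consultation on payroll software upgrade",
    "suspicious cash withdrawals flagged by the accountant",
    "request for advice on marketing budget allocation",
]
has_fraud_problem = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(histories, has_fraud_problem)

# predict_proba exposes the graded judgment: a company scored 0.55 can be
# handled differently from one scored 0.99, instead of a hard yes/no.
new_case = ["accountant reports unexplained invoice adjustments"]
print(model.predict_proba(new_case)[0][1])
```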


Author(s):  
Masaomi Kimura

Text mining has been growing, mainly due to the need to extract useful information from vast amounts of textual data. Our target here is text data: a collection of freely described responses to questionnaires. Unlike research papers, newspaper articles, call-center logs, and web pages, which are the usual targets of text mining analysis, freely described questionnaire responses have specific characteristics: each piece of data consists of a small number of short sentences, while the wide variety of content precludes the application of the clustering algorithms commonly used to classify such data. In this paper, we suggest a way to extract the opinions expressed by multiple respondents, based on the modification relationships included in each sentence of the freely described data. Applications of our method are also presented after the introduction of our approach.
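To suggest what mining modification relationships can look like, here is a hedged sketch using spaCy's English dependency parser as a stand-in (the paper analyzes Japanese responses with its own method): adjectival modifiers and predicative adjectives are paired with the nouns they describe, and opinions voiced by multiple respondents surface as high counts.

```python
# Counting modifier-head opinion pairs in free-text questionnaire answers.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

answers = [
    "The staff were friendly but the room was noisy.",
    "Friendly staff, terrible parking.",
    "The noisy room ruined my stay.",
]

pairs = Counter()
for doc in nlp.pipe(answers):
    for tok in doc:
        if tok.dep_ == "amod":                      # "noisy room"
            pairs[(tok.text.lower(), tok.head.text.lower())] += 1
        elif tok.dep_ == "acomp":                   # "staff were friendly"
            for child in tok.head.children:
                if child.dep_ == "nsubj":
                    pairs[(tok.text.lower(), child.text.lower())] += 1

print(pairs.most_common(3))
```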


2001, Vol 52 (13), pp. 1148-1156
Author(s):  
Ronald N. Kostoff
J. Antonio del Río
James A. Humenik
Esther Ofilia García
Ana María Ramírez

Author(s):  
Lin Wang
Yanfen Huang
Muhd Khaizer Omar

The rapid development of networks has led to the recognition of blended learning as an effective learning model. Text mining was used to analyze blended learning practice data from 17 countries provided by the Christensen Institute. By classifying and extracting text information on the selection of blended learning models and the challenges of blended learning, the factors that hinder its implementation were analyzed. The distribution of blended learning courses and practice models in each country was discussed, as was the relationship between region and implementation model. The results demonstrate that the practice of blended learning in primary and secondary schools is mature in four courses: English, Mathematics, Science, and Social Research. The choice of blended learning implementation model appears unaffected by region, but tends toward mixed models. The practice cycle of blended learning is long and requires long-term stable support, while the preparation of teachers' and students' abilities is the largest obstacle to the effective development of blended learning. This study provides references for improving the efficiency of blended learning practice, especially in the aspects of practice model selection, infrastructure preparation, and teacher and student ability training.

