An Analysis on Text Mining Techniques for Smart Literature Review

With the development of web technologies, databases, social networks and similar platforms, a large amount of text data is generated each day. Most of the data on the internet is in unstructured form, and this unstructured data can provide valuable knowledge. Text mining techniques are widely used to extract such knowledge from text data. A large number of research papers are published in journals and conferences each day. These papers are very valuable for future research and investigations and act as a source for future innovations. Researchers write review papers to give updated knowledge about a specific field, but review papers cover a limited number of papers and involve manually reading each one. Due to the large volume of research papers published each day, it is not possible for researchers to go through each paper to find updated knowledge about their field of interest. To automate the literature analysis process, different text mining techniques have been used. This paper provides a review of text mining techniques used in automatic literature analysis. We collected papers in which previous literature is analyzed with text mining techniques to extract valuable knowledge. This review presents an overview of text mining techniques, their evaluation criteria, their limitations, and the challenges of exploring literature to find research trends.

Author(s):  
Masaomi Kimura ◽  

Text mining has been growing, mainly due to the need to extract useful information from vast amounts of textual data. Our target here is freely described text data collected from questionnaires. Unlike research papers, newspaper articles, call-center logs and web pages, which are the usual targets of text mining analysis, free-text questionnaire responses have specific characteristics: each piece of data consists of a small number of short sentences, and the wide variety of content precludes the application of the clustering algorithms normally used to classify such data. In this paper, we suggest a way to extract opinions that are expressed by multiple respondents, based on the modification relationships contained in each sentence of the free-text data. After introducing our approach, we also present some applications of our method.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Dan Zhang

With the rapid development of mobile internet technology, dynamic data contains a large amount of unstructured data, such as text and multimedia data, so it is essential to analyze and process this unstructured data to obtain potentially valuable information. This article starts with the theoretical research on text complexity analysis and analyzes the sources of text complexity and its five characteristics (dynamics, complexity, concealment, sentiment, and ambiguity) in light of how user needs are expressed in the network environment. Second, based on the specific process of text mining, namely data collection, data processing, and data visualization, it proposes subdividing user demand analysis into three stages (text complexity acquisition, recognition, and expression) to obtain a text complexity analysis based on text mining technology. Then, drawing on computational linguistics and mathematical-statistical analysis, combined with machine learning and information retrieval technology, text in any format is converted into a content format that can be used for machine learning, and patterns or knowledge are derived from this content format. Next, through a comparison of text mining technologies, combined with the hierarchical structure model of text complexity analysis, a quantitative relationship complexity analysis framework based on text mining technology is proposed, which is embodied in the use of web crawler technology. Experimental results show that the collected quantitative relationship information can be identified and expressed so that it is converted into product features. Market data and text data can be integrated to help improve model performance, and the use of text data can further improve prediction accuracy.


2021 ◽  
Author(s):  
Fei Shen ◽  
Wenting Yu ◽  
Chen Min ◽  
Qianying Ye ◽  
Chuanli Xia ◽  
...  

Text mining has been a dominant approach to extracting useful information from massive unstructured data online. But existing tools for Chinese word segmentation are not ideal for processing social media text data in Cantonese. This project developed CyberCan (https://github.com/shenfei1010/CyberCan), a lexicon of contemporary Cantonese based on more than 100 million pieces of internet text. We compared CyberCan with existing Mandarin and Cantonese lexicons in terms of word segmentation performance. Findings suggest that CyberCan outperforms all existing lexicons by a considerable margin.
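To illustrate why lexicon coverage matters for word segmentation, the following is a minimal sketch of greedy forward maximum matching, one common lexicon-based segmentation strategy. The abstract does not say which segmenter was used with CyberCan, so the algorithm choice, the function name `max_match_segment`, and the toy lexicon entries are all assumptions made for illustration; none of them are taken from the CyberCan repository.

```python
def max_match_segment(text, lexicon, max_len=None):
    """Greedy forward maximum matching (illustrative, not CyberCan's method).

    At each position, take the longest substring found in the lexicon;
    fall back to a single character when nothing matches.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    max_len = max(1, max_len)  # guard against a zero or negative window

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicons, for illustration only.
small_lexicon = {"香港"}
larger_lexicon = {"香港", "香港人", "好靚"}

print(max_match_segment("香港人好靚", small_lexicon))
print(max_match_segment("香港人好靚", larger_lexicon))
```

With the smaller lexicon the unmatched characters fall back to single-character tokens, while the larger lexicon recovers multi-character words. This is the kind of difference a segmentation comparison between lexicons would measure.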


2021 ◽  
Vol 54 (7) ◽  
pp. 1-36
Author(s):  
Luciano Ignaczak ◽  
Guilherme Goldschmidt ◽  
Cristiano André Da Costa ◽  
Rodrigo Da Rosa Righi

The growth of data volume has changed cybersecurity activities, demanding a higher level of automation. In this new cybersecurity landscape, text mining emerged as an alternative to improve the efficiency of the activities involving unstructured data. This article proposes a Systematic Literature Review (SLR) to present the application of text mining in the cybersecurity domain. Using a systematic protocol, we identified 2,196 studies, out of which 83 were summarized. As a contribution, we propose a taxonomy to demonstrate the different activities in the cybersecurity domain supported by text mining. We also detail the strategies evaluated in the application of text mining tasks and the use of neural networks to support activities involving unstructured data. The work also discusses text classification performance with a view to its application in real-world solutions. The SLR also highlights open gaps for future research, such as the analysis of non-English content and the intensification in the usage of neural networks.


Author(s):  
Jonathan S. Lewis

Text mining presents an efficient, scalable method to separate signals and noise in large-scale text data, and therefore to effectively analyze open-ended survey responses as well as the tremendous amount of text that students, faculty, and staff produce through their interactions online. Traditional qualitative methods are impractical when working with these data, and text mining methods are consonant with current literature on thematic analysis. This chapter provides a tutorial for researchers new to this method, including a lengthy discussion of preprocessing tasks and knowledge extraction from both supervised and unsupervised activities, potential data sources, and the range of software (both proprietary and open-source) available to them. Examples are provided throughout the paper of text mining at work in two studies involving data collected from college students. Limitations of this method and implications for future research and policy are discussed.


2017 ◽  
Vol 9 (2) ◽  
pp. 168781401668500 ◽  
Author(s):  
Xiaochuan Li ◽  
Fang Duan ◽  
David Mba ◽  
Ian Bennett

Determining prognosis for rotating machinery could potentially reduce maintenance costs and improve safety and availability. Complex rotating machines are usually equipped with multiple sensors, which enable the development of multidimensional prognostic models. By considering the possible synergy among different sensor signals, multivariate models may provide more accurate prognosis than those using single-source information. Consequently, numerous research papers focusing on the theoretical considerations and practical implementations of multivariate prognostic models have been published in the last decade. However, only a limited number of review papers have been written on the subject. This article focuses on multidimensional prognostic models that have been applied to predict the failures of rotating machinery with multiple sensors. The theory and basic functioning of these techniques, their relative merits and drawbacks and how these models have been used to predict the remnant life of a machine are discussed in detail. Furthermore, this article summarizes the rotating machines to which these models have been applied and discusses future research challenges. The authors also provide seven evaluation criteria that can be used to compare the reviewed techniques. By reviewing the models reported in the literature, this article provides a guide for researchers considering prognosis options for multi-sensor rotating equipment.


Author(s):  
Hiroko Oe ◽  
Max Weeks

This research aims to develop a discussion framework for Kawaii cultural study based on a bibliometric analysis and text mining approach. First, a bibliometric analysis is conducted on literature pertaining to ‘Kawaii and Japanese pop culture’ extracted from the academic database; from this standpoint, the current research topics in the field of Kawaii study are discussed. Second, we aim to provide direction for future research by mining the text data disseminated by three special exhibitions launched by Japanese museums on the theme of ‘Japanese Kawaii culture’ and planned by Kawaii cultural experts and curators. From the results of these two studies, the present research develops a discussion framework containing key dimensions and factors for researchers in this field of study.


2003 ◽  
Vol 4 (6) ◽  
pp. 674-677 ◽  
Author(s):  
Christian Blaschke ◽  
Lynette Hirschman ◽  
Alexander Yeh ◽  
Alfonso Valencia

An increasing number of groups are now working in the area of text mining, focusing on a wide range of problems and applying both statistical and linguistic approaches. However, it is not possible to compare the different approaches, because there are no common standards or evaluation criteria; in addition, the various groups are addressing different problems, often using private datasets. As a result, it is impossible to determine how well the existing systems perform, and particularly what performance level can be expected in real applications. This is similar to the situation in text processing in the late 1980s, prior to the Message Understanding Conferences (MUCs). With the introduction of a common evaluation and standardized evaluation metrics as part of these conferences, it became possible to compare approaches, to identify those techniques that did or did not work and to make progress. This progress has resulted in a common pipeline of processes and a set of shared tools available to the general research community. The field of biology is ripe for a similar experiment. Inspired by this example, the BioLINK group (Biological Literature, Information and Knowledge [1]) is organizing a CASP-like evaluation for the text data-mining community applied to biology. The two main tasks specifically address two major bottlenecks for text mining in biology: (1) the correct detection of gene and protein names in text; and (2) the extraction of functional information related to proteins based on the GO classification system. For further information and participation details, see http://www.pdg.cnb.uam.es/BioLink/BioCreative.eval.html
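The abstract above refers to standardized evaluation metrics without naming them. As a hedged illustration, the sketch below computes precision, recall and F1 for a name-detection task by exact matching of predicted spans against gold-standard spans. The choice of these three metrics, the exact-match criterion, the function name `precision_recall_f1`, and the `(document_id, start, end)` span format are assumptions, not details taken from the BioLINK evaluation.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over sets of annotations.

    Each annotation is assumed to be a hashable tuple such as
    (document_id, start_offset, end_offset); this format is hypothetical.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans for illustration only.
gold_spans = {("d1", 0, 4), ("d1", 10, 15), ("d2", 3, 8)}
predicted_spans = {("d1", 0, 4), ("d2", 3, 8), ("d2", 20, 25), ("d1", 30, 33)}

p, r, f = precision_recall_f1(predicted_spans, gold_spans)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Scoring every system against the same gold annotations with the same function is what makes results comparable across groups, which is the point the abstract makes about shared evaluations.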


2020 ◽  
Author(s):  
Amir Karami ◽  
Brandon Bookstaver ◽  
Melissa Nolan

BACKGROUND The COVID-19 pandemic has impacted nearly all aspects of life and has posed significant threats to international health and the economy. Given the rapidly unfolding nature of the current pandemic, there is an urgent need to streamline literature synthesis of the growing scientific research to elucidate targeted solutions. While traditional systematic literature review studies provide valuable insights, these studies have restrictions, including analyzing a limited number of papers, having various biases, being time-consuming and labor-intensive, focusing on a few topics, being incapable of trend analysis, and lacking data-driven tools. OBJECTIVE This study addresses these restrictions in the literature and practice by analyzing two biomedical concepts, clinical manifestations of disease and therapeutic chemical compounds, with text mining methods in a corpus containing COVID-19 research papers, and by finding associations between the two biomedical concepts. METHODS This research collected papers representing COVID-19 pre-prints and peer-reviewed research published in 2020. We used frequency analysis to find highly frequent manifestations and therapeutic chemicals, representing the importance of the two biomedical concepts. This study also applied topic modeling to find the relationship between the two biomedical concepts. RESULTS We analyzed 9,298 research papers published through May 5, 2020 and found 3,645 disease-related and 2,434 chemical-related articles. The most frequent clinical manifestations of disease terminology included COVID-19, SARS, cancer, pneumonia, fever, and cough. The most frequent chemical-related terminology included Lopinavir, Ritonavir, Oxygen, Chloroquine, Remdesivir, and water. Topic modeling provided 25 categories showing relationships between our two overarching categories. These categories represent statistically significant associations between multiple aspects of each category, some connections of which were novel and not previously identified by the scientific community. CONCLUSIONS Appreciation of this context is vital due to the lack of a systematic large-scale literature review survey and the importance of fast literature review during the current COVID-19 pandemic for developing treatments. This study is beneficial to researchers for obtaining a macro-level picture of the literature, to educators for knowing the scope of the literature, to journals for exploring the most discussed disease symptoms and pharmaceutical targets, and to policymakers and funding agencies for creating scientific strategic plans regarding COVID-19.
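The frequency analysis step described in the abstract above can be sketched roughly as follows. This is a minimal illustration that counts how many documents mention each term from a predefined list. The abstract does not describe how terms were identified or counted, so the whole-word, case-insensitive matching, the function name `term_document_frequency`, and the sample texts are assumptions.

```python
import re
from collections import Counter


def term_document_frequency(documents, terms):
    """Count, for each term, the number of documents that mention it.

    Matching is case-insensitive on whole words; this matching rule is an
    assumption and not the procedure reported in the abstract.
    """
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    counts = Counter()
    for document in documents:
        for term, pattern in patterns.items():
            if pattern.search(document):
                counts[term] += 1
    return counts


# Made-up sentences standing in for paper abstracts.
sample_documents = [
    "Patients presented with fever and cough.",
    "Remdesivir reduced fever duration.",
    "Chloroquine and remdesivir were compared.",
]
sample_terms = ["fever", "cough", "remdesivir", "chloroquine"]

for term, count in term_document_frequency(sample_documents, sample_terms).most_common():
    print(term, count)
```

Counting documents rather than raw occurrences keeps one long paper from dominating the ranking. Either choice is defensible, and the abstract does not say which was used.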

