Business environmental analysis for textual data using data mining and sentence-level classification

2019 ◽  
Vol 119 (1) ◽  
pp. 69-88 ◽  
Author(s):  
Yoon-Sung Kim ◽  
Hae-Chang Rim ◽  
Do-Gil Lee

Purpose: The purpose of this paper is to propose a methodology for analyzing large amounts of unstructured textual data according to the categories of a business environmental analysis framework. Design/methodology/approach: This paper uses machine learning to classify a vast amount of unstructured textual data by category of a business environmental analysis framework. In general, producing large amounts of high-quality training data for machine-learning-based systems is costly. Semi-supervised learning techniques are used to improve the classification performance. Additionally, the lack-of-features problem from which traditional classification systems have suffered is resolved by applying semantic features obtained through word embedding, a recent technique in text mining. Findings: The proposed methodology can be used for various business environmental analyses, and the system is fully automated in both the training and classifying phases. Semi-supervised learning can solve the problems caused by insufficient training data. The proposed semantic features can help improve traditional classification systems. Research limitations/implications: This paper focuses on classifying the sentences in large document collections that contain information relevant to business environmental analysis. However, the proposed methodology has a limitation with respect to advanced analyses that could directly help managers establish strategies, since it does not summarize the environmental variables implied in the classified sentences. Advanced summarization and recommendation techniques could extract the environmental variables from the classified sentences and assist managers in establishing effective strategies. Originality/value: The feature selection technique developed in this paper has not been used in traditional systems for business and industry, so the whole process can be fully automated. The methodology is also practical, in that it can be applied to various business environmental analysis frameworks. In addition, the system is more economical than traditional systems because of semi-supervised learning, and it can resolve the lack-of-features problem from which traditional systems suffer. This work is valuable for analyzing environmental factors and establishing strategies for companies.
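The semi-supervised component described above can be illustrated with a minimal self-training loop. This is a generic sketch, not the authors' actual pipeline: it assumes sentences have already been mapped to embedding vectors, and it stands in a simple nearest-centroid classifier for whichever model the paper uses. Confident predictions on unlabeled data are absorbed into the training set and the model is refit.

```python
import numpy as np

def centroid_fit(X, y):
    # Nearest-centroid classifier: one mean vector per class label.
    classes = sorted(set(y))
    return {c: X[np.array([yi == c for yi in y])].mean(axis=0) for c in classes}

def centroid_predict(model, X):
    # Returns predicted labels and a rough confidence score
    # (closer to a centroid, relative to the others -> higher confidence).
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    conf = 1.0 - dists.min(axis=0) / (dists.sum(axis=0) + 1e-9)
    labels = [classes[i] for i in dists.argmin(axis=0)]
    return labels, conf

def self_train(X_l, y_l, X_u, threshold=0.6, rounds=3):
    # Self-training loop: predict on the unlabeled pool, absorb confident
    # predictions into the labeled set, and refit.
    y_l = list(y_l)
    for _ in range(rounds):
        model = centroid_fit(X_l, y_l)
        if len(X_u) == 0:
            break
        labels, conf = centroid_predict(model, X_u)
        keep = conf >= threshold
        if not keep.any():
            break
        X_l = np.vstack([X_l, X_u[keep]])
        y_l += [lab for lab, k in zip(labels, keep) if k]
        X_u = X_u[~keep]
    return centroid_fit(X_l, y_l)
```

The confidence threshold and round count are illustrative knobs; in practice they trade off the amount of absorbed pseudo-labeled data against the risk of reinforcing early mistakes.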

2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Huu-Thanh Duong ◽  
Tram-Anh Nguyen-Thi

Abstract: In the literature, machine-learning-based studies of sentiment analysis usually rely on supervised learning, which requires pre-labeled datasets that are large enough in the target domains. Such datasets are tedious, expensive and time-consuming to build, and the resulting models are hard to apply to unseen data. This paper approaches semi-supervised learning for Vietnamese sentiment analysis, where labeled datasets are limited. We summarize many preprocessing techniques that were performed to clean and normalize the data, together with negation handling and intensification handling, to improve performance. Moreover, data augmentation techniques, which generate new data from the original data to enrich the training data without user intervention, are also presented. In the experiments, we examine various aspects and obtain competitive results, which may motivate future work.
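The label-free data augmentation idea mentioned above can be sketched with two simple token-level operations (random swap and random deletion, in the style of easy-data-augmentation methods); this is a generic illustration, not the paper's exact technique, and the function names are hypothetical:

```python
import random

def random_swap(tokens, n=1, seed=None):
    # Swap n random token pairs; preserves the bag of words.
    rng = random.Random(seed)
    out = tokens[:]
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.2, seed=None):
    # Drop each token with probability p, keeping at least one token.
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def augment(sentence, copies=2, seed=0):
    # Generate several new training sentences from one original,
    # reusing the original's sentiment label without user intervention.
    tokens = sentence.split()
    out = []
    for k in range(copies):
        out.append(" ".join(random_swap(tokens, seed=seed + k)))
        out.append(" ".join(random_deletion(tokens, seed=seed + k)))
    return out
```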


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Nasser Assery ◽  
Yuan (Dorothy) Xiaohong ◽  
Qu Xiuli ◽  
Roy Kaushik ◽  
Sultan Almalki

Purpose: This study aims to propose an unsupervised learning model to evaluate the credibility of disaster-related Twitter data and to present a performance comparison with commonly used supervised machine learning models. Design/methodology/approach: First, historical tweets on two recent hurricane events are collected via the Twitter API. Then a credibility scoring system is implemented, in which the tweet features are analyzed to assign each tweet a credibility score and credibility label. After that, supervised machine learning classification is implemented using various classification algorithms, and their performances are compared. Findings: The proposed unsupervised learning model could enhance emergency response by providing a fast way to determine the credibility of disaster-related tweets. Additionally, the comparison of the supervised classification models reveals that the Random Forest classifier performs significantly better than the SVM and Logistic Regression classifiers in classifying the credibility of disaster-related tweets. Originality/value: In this paper, an unsupervised 10-point scoring model is proposed to evaluate tweets' credibility based on user-based and content-based features. This technique could be used to evaluate the credibility of disaster-related tweets on future hurricanes and has the potential to enhance emergency response during critical events. The comparative study of different supervised learning methods reveals effective supervised learning methods for evaluating the credibility of Twitter data.
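A 10-point scoring model of this kind can be sketched as a checklist of boolean user-based and content-based features, each contributing one point. The specific features and thresholds below are invented for illustration; the paper's actual feature set is not reproduced here:

```python
def credibility_score(tweet):
    # Hypothetical checks, one point each, 10 points maximum.
    checks = [
        tweet.get("verified", False),              # user-based features
        tweet.get("followers", 0) > 1000,
        tweet.get("account_age_days", 0) > 365,
        tweet.get("has_profile_image", False),
        tweet.get("has_url", False),               # content-based features
        tweet.get("has_media", False),
        tweet.get("retweets", 0) > 10,
        tweet.get("mentions_location", False),
        not tweet.get("all_caps", False),
        len(tweet.get("text", "")) > 40,
    ]
    return sum(bool(c) for c in checks)

def credibility_label(score, low=4, high=7):
    # Map the 0-10 score onto a coarse label (assumed cut-offs).
    if score >= high:
        return "credible"
    if score >= low:
        return "uncertain"
    return "not_credible"
```

Because the score is computed directly from observable features, no labeled training data is needed, which is what makes the approach fast enough for emergency response.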


2021 ◽  
Author(s):  
Haibin Di ◽  
Chakib Kada Kloucha ◽  
Cen Li ◽  
Aria Abubakar ◽  
Zhun Li ◽  
...  

Abstract: Delineating seismic stratigraphic features and depositional facies is important to successful reservoir mapping and identification in the subsurface. Robust seismic stratigraphy interpretation is confronted with two major challenges. The first is to maximally automate the process, particularly given the increasing size of seismic data and the complexity of target stratigraphies, while the second is to efficiently incorporate available structures into stratigraphy model building. Machine learning, particularly the convolutional neural network (CNN), has been introduced to assist seismic stratigraphy interpretation through supervised learning. However, the small amount of available expert labels greatly restricts the performance of such supervised CNNs. Moreover, most of the existing CNN implementations are based on amplitude only, which fails to use necessary structural information, such as faults, for constraining the machine learning. To resolve both challenges, this paper presents a semi-supervised learning workflow for fault-guided seismic stratigraphy interpretation, which consists of two components. The first component is seismic feature engineering (SFE), which aims at learning the provided seismic and fault data through an unsupervised convolutional autoencoder (CAE), while the second is stratigraphy model building (SMB), which aims at building an optimal mapping function between the features extracted by the SFE CAE and the target stratigraphic labels provided by an experienced interpreter through a supervised CNN. The two components are connected by embedding the encoder of the SFE CAE into the SMB CNN, which forces the SMB learning to be based on features common to the entire study area rather than only those at the limited training data; correspondingly, the risk of overfitting is greatly reduced. More innovatively, the fault constraint is introduced by customizing the SMB CNN with two output branches, one to match the target stratigraphies and the other to reconstruct the input fault, so that the fault continues to contribute to the SMB learning process. The performance of this fault-guided seismic stratigraphy interpretation is validated by application to a real seismic dataset, where the machine prediction not only matches the manual interpretation accurately but also clearly illustrates the depositional process in the study area.
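Training a network with two output branches, as described above, amounts to minimizing a combined objective: a classification loss for the stratigraphy branch plus a reconstruction loss for the fault branch. The NumPy sketch below shows such a joint loss in generic form (cross-entropy plus mean squared error, with an assumed weighting `w`); it is not the authors' exact formulation:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the true stratigraphy classes.
    n = len(labels)
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

def two_branch_loss(strat_logits, strat_labels, fault_pred, fault_true, w=0.5):
    # Branch 1: match the target stratigraphies (cross-entropy).
    # Branch 2: reconstruct the input fault (mean squared error), so the
    # fault information keeps contributing to the SMB learning.
    ce = cross_entropy(softmax(strat_logits), strat_labels)
    mse = ((fault_pred - fault_true) ** 2).mean()
    return ce + w * mse
```

Minimizing the second term forces the shared encoder features to retain fault information even where stratigraphic labels are absent.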


2020 ◽  
Vol 27 (8) ◽  
pp. 1891-1912
Author(s):  
Hengqin Wu ◽  
Geoffrey Shen ◽  
Xue Lin ◽  
Minglei Li ◽  
Boyu Zhang ◽  
...  

Purpose: This study proposes an approach to solve the fundamental problem in using query-based methods (i.e. search engines and patent retrieval tools) to screen patents of information and communication technology in construction (ICTC). The fundamental problem is that ICTC incorporates various techniques and thus cannot be simply represented by man-made queries. To address this concern, this study develops a binary classifier by utilizing deep learning and NLP techniques to automatically identify whether a patent is relevant to ICTC, thus accurately screening a corpus of ICTC patents. Design/methodology/approach: This study employs NLP techniques to convert the textual data of patents into numerical vectors. Then, a supervised deep learning model is developed to learn the relations between the input vectors and the outputs. Findings: The validation results indicate that (1) the proposed approach performs better in screening ICTC patents than traditional machine learning methods; (2) besides the United States Patent and Trademark Office (USPTO), which provides structured and well-written patents, the approach can also accurately screen patents from the Derwent Innovations Index (DIX), in which patents are written in different genres. Practical implications: This study contributes a specific collection of ICTC patents, which is not provided by the patent offices. Social implications: The proposed approach offers an alternative manner of gathering a corpus of patents for domains like ICTC that neither exist as a searchable classification in patent offices nor are accurately represented by man-made queries. Originality/value: A deep learning model with two layers of neurons is developed to learn the non-linear relations between the input features and the outputs, providing better performance than traditional machine learning models. This study uses advanced NLP techniques, lemmatization and part-of-speech (POS) tagging, to process the textual data of ICTC patents.


Text classification and clustering approaches are essential in big data environments. Many classification algorithms have been proposed for supervised learning applications, and in the era of big data a large volume of training data is available for many machine learning tasks. However, some of that data may be mislabeled or improperly labeled: incorrect labels introduce label noise, which in turn degrades the learning performance of a classifier. A general approach to addressing label noise is to apply noise filtering techniques to identify and remove noisy instances before learning, and a range of noise filtering approaches has been developed to improve classifier performance. This paper proposes a noise filtering approach for text data applied during the training phase. Many supervised learning algorithms produce high error rates due to noise in the training dataset; our work eliminates such noise and provides an accurate classification system.
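A common baseline for the noise filtering step described above is a nearest-neighbour consistency check: a training example is flagged as likely label noise when the majority label of its nearest neighbours disagrees with its own label. The sketch below is a generic illustration over pre-vectorized text, not this paper's specific filter:

```python
import numpy as np
from collections import Counter

def knn_label_filter(X, y, k=3):
    # Flag a training example as label noise when the majority label of
    # its k nearest neighbours (excluding itself) disagrees with its own
    # label; return the indices of the examples to keep for training.
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        neighbours = np.argsort(d)[:k]
        majority = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)
    return keep
```

The classifier is then trained only on the kept indices, which is the "filter before learning" strategy the paragraph describes.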


2020 ◽  
Vol 115 (3) ◽  
pp. 1839-1867
Author(s):  
Piotr Nawrocki ◽  
Bartlomiej Sniezynski

Abstract: In this paper we present an original adaptive task scheduling system, which optimizes the energy consumption of mobile devices using machine learning mechanisms and context information. The system learns how to allocate resources appropriately: how to schedule services/tasks optimally between the device and the cloud, which is especially important in mobile systems. Decisions are made taking the context into account (e.g. network connection type, location, potential time and cost of executing the application or service). In this study, a supervised learning agent architecture and a service selection algorithm are proposed to solve this problem. Adaptation is performed online, on the mobile device. Information about the context, the task description, the decision made and its results, such as power consumption, is stored and constitutes training data for a supervised learning algorithm, which updates the knowledge used to determine the optimal location for executing a given type of task. To verify the proposed solution, appropriate software was developed and a series of experiments were conducted. The results show that, owing to the experience gathered and the learning process performed, the decision module has become more efficient in assigning tasks to either the mobile device or cloud resources.
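The observe-then-decide loop described above can be sketched as a minimal online agent: each completed execution records the energy cost for a (context, location) pair, and new decisions pick the location with the lower estimated cost. This is a simplified stand-in (hypothetical interface, average-cost estimates) rather than the authors' agent architecture:

```python
from collections import defaultdict

class PlacementAgent:
    # Online supervised placement sketch: learn average energy per
    # (context, location) from past executions and choose the cheaper one.
    def __init__(self):
        self.stats = defaultdict(lambda: [0.0, 0])  # (ctx, loc) -> [sum, n]

    def observe(self, context, location, energy):
        # Record the measured energy cost of one completed execution.
        s = self.stats[(context, location)]
        s[0] += energy
        s[1] += 1

    def decide(self, context):
        # Pick the location with the lowest estimated cost; fall back to
        # the device when no experience exists for this context yet.
        costs = {}
        for loc in ("device", "cloud"):
            total, n = self.stats[(context, loc)]
            costs[loc] = total / n if n else float("inf")
        if all(c == float("inf") for c in costs.values()):
            return "device"
        return min(costs, key=costs.get)
```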


2019 ◽  
Vol 30 (1) ◽  
pp. 45-66 ◽  
Author(s):  
Anette Rantanen ◽  
Joni Salminen ◽  
Filip Ginter ◽  
Bernard J. Jansen

Purpose: User-generated social media comments can be a useful source of information for understanding online corporate reputation. However, manual classification of these comments is challenging due to their high volume and unstructured nature. The purpose of this paper is to develop a classification framework and machine learning model to overcome these limitations. Design/methodology/approach: The authors create a multi-dimensional classification framework for online corporate reputation that includes six main dimensions synthesized from prior literature: quality, reliability, responsibility, successfulness, pleasantness and innovativeness. To evaluate the classification framework's performance on real data, the authors retrieve 19,991 social media comments about two Finnish banks and use a convolutional neural network (CNN) to automatically classify the comments based on manually annotated training data. Findings: After parameter optimization, the neural network achieves an accuracy between 52.7 and 65.2 percent on real-world data, which is reasonable given the high number of classes. The findings also indicate that prior work has not captured all the facets of online corporate reputation. Practical implications: For practical purposes, the authors provide a comprehensive classification framework for online corporate reputation, which companies and organizations operating in various domains can use. Moreover, the authors demonstrate that a limited amount of training data can yield a satisfactory multiclass classifier when using a CNN. Originality/value: This is the first attempt at automatically classifying online corporate reputation using an online-specific classification framework.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Pengcheng Li ◽  
Qikai Liu ◽  
Qikai Cheng ◽  
Wei Lu

Purpose: This paper aims to identify data set entities in scientific literature. To address the poor recognition caused by a lack of training corpora in existing studies, a distant-supervision-based approach is proposed to identify data set entities automatically from large-scale scientific literature in an open domain. Design/methodology/approach: First, the authors use a dictionary combined with a bootstrapping strategy to create a labelled corpus for supervised learning. Second, a bidirectional encoder representations from transformers (BERT)-based neural model is applied to identify data set entities in the scientific literature automatically. Finally, two data augmentation techniques, entity replacement and entity masking, are introduced to enhance the model's generalisability and improve the recognition of data set entities. Findings: In the absence of training data, the proposed method can effectively identify data set entities in large-scale scientific papers. The BERT-based vectorised representation and the data augmentation techniques enable significant improvements in the generality and robustness of named entity recognition models, especially for long-tailed data set entities. Originality/value: This paper provides a practical research method for automatically recognising data set entities in scientific literature. To the best of the authors' knowledge, this is the first attempt to apply distant supervision to the study of data set entity recognition. The authors introduce a robust vectorised representation and two data augmentation strategies (entity replacement and entity masking) to address a problem inherent in distantly supervised learning methods, which existing research has mostly ignored. The experimental results demonstrate that the approach effectively improves the recognition of data set entities, especially long-tailed ones.
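The two augmentation strategies named above, entity replacement and entity masking, can be sketched at the token level as follows. This is a generic illustration over tokenized sentences with known entity spans, with an invented span representation and pool, not the authors' implementation:

```python
import random

def entity_masking(tokens, spans, mask="[MASK]"):
    # Replace every token inside a data set entity span with a mask token,
    # so the model learns from context rather than memorized entity names.
    out = tokens[:]
    for start, end in spans:
        for i in range(start, end):
            out[i] = mask
    return out

def entity_replacement(tokens, spans, pool, seed=0):
    # Swap each entity span for another data set name drawn from a pool;
    # spans are applied right to left so earlier offsets stay valid.
    rng = random.Random(seed)
    out = tokens[:]
    for start, end in sorted(spans, reverse=True):
        out[start:end] = rng.choice(pool)
    return out
```

Both operations produce new labelled sentences for free, which is what helps the long-tailed entities that rarely appear in the distantly labelled corpus.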


2018 ◽  
Vol 42 (3) ◽  
pp. 343-354 ◽  
Author(s):  
Mike Thelwall

Purpose: The purpose of this paper is to investigate whether machine learning induces gender biases, in the sense of results that are more accurate for male authors or for female authors. It also investigates whether training separate male and female variants could improve the accuracy of machine learning for sentiment analysis. Design/methodology/approach: This paper uses ratings-balanced sets of reviews of restaurants and hotels (three sets) to train algorithms with and without gender selection. Findings: Accuracy is higher on female-authored reviews than on male-authored reviews for all data sets, so applications of sentiment analysis using mixed-gender data sets will over-represent the opinions of women. Training on same-gender data improves performance less than having additional data from both genders. Practical implications: End users of sentiment analysis should be aware that its small gender biases can affect the conclusions drawn from it and should apply correction factors when necessary. Users of systems that incorporate sentiment analysis should be aware that performance will vary by author gender. Developers do not need to create gender-specific algorithms unless they have more training data than their system can cope with. Originality/value: This is the first demonstration of gender bias in machine learning sentiment analysis.


Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1081
Author(s):  
Spyros Theocharides ◽  
Marios Theristis ◽  
George Makrides ◽  
Marios Kynigos ◽  
Chrysovalantis Spanias ◽  
...  

A main challenge in integrating intermittent photovoltaic (PV) power generation remains the accuracy of day-ahead forecasts and the establishment of robustly performing methods. The purpose of this work is to address these technological challenges by evaluating the day-ahead PV production forecasting performance of different machine learning models under different supervised learning regimes and minimal input features. Specifically, the day-ahead forecasting capability of Bayesian neural network (BNN), support vector regression (SVR) and regression tree (RT) models was investigated by employing the same dataset for training and performance verification, thus enabling a valid comparison. The training regime analysis demonstrated that the performance of the investigated models was strongly dependent on the timeframe of the training set, the training data sequence and the application of irradiance condition filters. Furthermore, accurate results were obtained utilizing only the measured power output and other calculated parameters for training. Consequently, useful information is provided for establishing a robust day-ahead forecasting methodology that utilizes calculated input parameters and an optimal supervised learning approach. Finally, the obtained results demonstrated that the optimally constructed BNN outperformed all the other machine learning models, achieving forecasting errors lower than 5%.
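The core supervised-regression setup described above, regressing measured power output on calculated input parameters, can be sketched with an ordinary-least-squares baseline. This is a deliberately simple stand-in for the BNN/SVR/RT models compared in the study, with hypothetical feature columns (e.g. solar geometry, clear-sky irradiance):

```python
import numpy as np

def fit_day_ahead(X, y):
    # Ordinary least squares with an intercept: regress measured power
    # output y on calculated input features X.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict_day_ahead(coef, X):
    # Apply the fitted coefficients to the next day's calculated features.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return Xb @ coef
```

The training-regime questions the abstract raises (timeframe of the training set, data ordering, irradiance filters) correspond here to which rows of `X` and `y` are passed to `fit_day_ahead`.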

