Analysis of Textual Data Based on Inductive Learning Techniques

2013 ◽  
Vol 3 (2) ◽  
pp. 40-57
Author(s):  
Shigeaki Sakurai

This paper introduces knowledge discovery methods based on inductive learning techniques from textual data. The author argues three methods extracting features of the textual data. First one activates a key concept dictionary, second one does a key phrase pattern dictionary, and third one does a named entity extractor. These features are used in order to generate rules representing relationships between the features and text classes. The rules are described in the format of a fuzzy decision tree. Also, these features are used in order to acquire a classification model based on SVM (Support Vector Machine). The model can classify new textual data into the text classes with high classification accuracy. Lastly, this paper introduces two application tasks based on these methods and verifies the effect of the methods.

Sensors ◽  
2020 ◽  
Vol 21 (1) ◽  
pp. 194
Author(s):  
Sarah Gonzalez ◽  
Paul Stegall ◽  
Harvey Edwards ◽  
Leia Stirling ◽  
Ho Chit Siu

The field of human activity recognition (HAR) often utilizes wearable sensors and machine learning techniques in order to identify the actions of the subject. This paper considers the activity recognition of walking and running while using a support vector machine (SVM) that was trained on principal components derived from wearable sensor data. An ablation analysis is performed in order to select the subset of sensors that yield the highest classification accuracy. The paper also compares principal components across trials to inform the similarity of the trials. Five subjects were instructed to perform standing, walking, running, and sprinting on a self-paced treadmill, and the data were recorded while using surface electromyography sensors (sEMGs), inertial measurement units (IMUs), and force plates. When all of the sensors were included, the SVM had over 90% classification accuracy using only the first three principal components of the data with the classes of stand, walk, and run/sprint (combined run and sprint class). It was found that sensors that were placed only on the lower leg produce higher accuracies than sensors placed on the upper leg. There was a small decrease in accuracy when the force plates are ablated, but the difference may not be operationally relevant. Using only accelerometers without sEMGs was shown to decrease the accuracy of the SVM.


Author(s):  
Wanli Wang ◽  
Botao Zhang ◽  
Kaiqi Wu ◽  
Sergey A Chepinskiy ◽  
Anton A Zhilenkov ◽  
...  

In this paper, a hybrid method based on deep learning is proposed to visually classify terrains encountered by mobile robots. Considering the limited computing resource on mobile robots and the requirement for high classification accuracy, the proposed hybrid method combines a convolutional neural network with a support vector machine to keep a high classification accuracy while improve work efficiency. The key idea is that the convolutional neural network is used to finish a multi-class classification and simultaneously the support vector machine is used to make a two-class classification. The two-class classification performed by the support vector machine is aimed at one kind of terrain that users are mostly concerned with. Results of the two classifications will be consolidated to get the final classification result. The convolutional neural network used in this method is modified for the on-board usage of mobile robots. In order to enhance efficiency, the convolutional neural network has a simple architecture. The convolutional neural network and the support vector machine are trained and tested by using RGB images of six kinds of common terrains. Experimental results demonstrate that this method can help robots classify terrains accurately and efficiently. Therefore, the proposed method has a significant potential for being applied to the on-board usage of mobile robots.


Author(s):  
Alok Kumar Shukla ◽  
Pradeep Singh ◽  
Manu Vardhan

The explosion of the high-dimensional dataset in the scientific repository has been encouraging interdisciplinary research on data mining, pattern recognition and bioinformatics. The fundamental problem of the individual Feature Selection (FS) method is extracting informative features for classification model and to seek for the malignant disease at low computational cost. In addition, existing FS approaches overlook the fact that for a given cardinality, there can be several subsets with similar information. This paper introduces a novel hybrid FS algorithm, called Filter-Wrapper Feature Selection (FWFS) for a classification problem and also addresses the limitations of existing methods. In the proposed model, the front-end filter ranking method as Conditional Mutual Information Maximization (CMIM) selects the high ranked feature subset while the succeeding method as Binary Genetic Algorithm (BGA) accelerates the search in identifying the significant feature subsets. One of the merits of the proposed method is that, unlike an exhaustive method, it speeds up the FS procedure without lancing of classification accuracy on reduced dataset when a learning model is applied to the selected subsets of features. The efficacy of the proposed (FWFS) method is examined by Naive Bayes (NB) classifier which works as a fitness function. The effectiveness of the selected feature subset is evaluated using numerous classifiers on five biological datasets and five UCI datasets of a varied dimensionality and number of instances. The experimental results emphasize that the proposed method provides additional support to the significant reduction of the features and outperforms the existing methods. For microarray data-sets, we found the lowest classification accuracy is 61.24% on SRBCT dataset and highest accuracy is 99.32% on Diffuse large B-cell lymphoma (DLBCL). In UCI datasets, the lowest classification accuracy is 40.04% on the Lymphography using k-nearest neighbor (k-NN) and highest classification accuracy is 99.05% on the ionosphere using support vector machine (SVM).


Author(s):  
Malcolm Beynon

The general fuzzy decision tree approach encapsulates the benefits of being an inductive learning technique to classify objects, utilising the richness of the data being considered, as well as the readability and interpretability that accompanies its operation in a fuzzy environment. This chapter offers a description of fuzzy decision tree based research, including the exposition of small and large fuzzy decision trees to demonstrate their construction and practicality. The two large fuzzy decision trees described are associated with a real application, namely, the identification of workplace establishments in the United Kingdom that pay a noticeable proportion of their employees less than the legislated minimum wage. Two separate fuzzy decision tree analyses are undertaken on a low-pay database, which utilise different numbers of membership functions to fuzzify the continuous attributes describing the investigated establishments. The findings demonstrate the sensitivity of results when there are changes in the compactness of the fuzzy representation of the associated data.


2020 ◽  
Vol 24 (5) ◽  
pp. 1141-1160
Author(s):  
Tomás Alegre Sepúlveda ◽  
Brian Keith Norambuena

In this paper, we apply sentiment analysis methods in the context of the first round of the 2017 Chilean elections. The purpose of this work is to estimate the voting intention associated with each candidate in order to contrast this with the results from classical methods (e.g., polls and surveys). The data are collected from Twitter, because of its high usage in Chile and in the sentiment analysis literature. We obtained tweets associated with the three main candidates: Sebastián Piñera (SP), Alejandro Guillier (AG) and Beatriz Sánchez (BS). For each candidate, we estimated the voting intention and compared it to the traditional methods. To do this, we first acquired the data and labeled the tweets as positive or negative. Afterward, we built a model using machine learning techniques. The classification model had an accuracy of 76.45% using support vector machines, which yielded the best model for our case. Finally, we use a formula to estimate the voting intention from the number of positive and negative tweets for each candidate. For the last period, we obtained a voting intention of 35.84% for SP, compared to a range of 34–44% according to traditional polls and 36% in the actual elections. For AG we obtained an estimate of 37%, compared with a range of 15.40% to 30.00% for traditional polls and 20.27% in the elections. For BS we obtained an estimate of 27.77%, compared with the range of 8.50% to 11.00% given by traditional polls and an actual result of 22.70% in the elections. These results are promising, in some cases providing an estimate closer to reality than traditional polls. Some differences can be explained due to the fact that some candidates have been omitted, even though they held a significant number of votes.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Jianghui Wen ◽  
Yeshu Liu ◽  
Yu Shi ◽  
Haoran Huang ◽  
Bing Deng ◽  
...  

Abstract Background Long-chain non-coding RNA (lncRNA) is closely related to many biological activities. Since its sequence structure is similar to that of messenger RNA (mRNA), it is difficult to distinguish between the two based only on sequence biometrics. Therefore, it is particularly important to construct a model that can effectively identify lncRNA and mRNA. Results First, the difference in the k-mer frequency distribution between lncRNA and mRNA sequences is considered in this paper, and they are transformed into the k-mer frequency matrix. Moreover, k-mers with more species are screened by relative entropy. The classification model of the lncRNA and mRNA sequences is then proposed by inputting the k-mer frequency matrix and training the convolutional neural network. Finally, the optimal k-mer combination of the classification model is determined and compared with other machine learning methods in humans, mice and chickens. The results indicate that the proposed model has the highest classification accuracy. Furthermore, the recognition ability of this model is verified to a single sequence. Conclusion We established a classification model for lncRNA and mRNA based on k-mers and the convolutional neural network. The classification accuracy of the model with 1-mers, 2-mers and 3-mers was the highest, with an accuracy of 0.9872 in humans, 0.8797 in mice and 0.9963 in chickens, which is better than those of the random forest, logistic regression, decision tree and support vector machine.


Author(s):  
YUTING SU ◽  
JING ZHANG ◽  
YU HAN ◽  
JING CHEN ◽  
QINGZHONG LIU

A novel approach for detecting video logo-removal forgery is proposed by measuring inconsistency of blur. Our approach is based on the assumption that if a digital video undergoes logo-removal forgery; the blurriness of the forged region is expected to be different as compared to the nontampered parts of the video. Blurriness is first estimated by analyzing the spatial and temporal statistical property of logo areas, and suspicious areas are roughly located; then features are extracted and a fine classification is implemented by applying support vector machine (SVM) to extract features. If the suspicious areas and the reference areas are classified into different classes, the video is judged as a forged video. Experimental results show that our method is robust to video lossy compression for logo-removal forgery detection with the advantages of high classification accuracy and low computation cost.


2020 ◽  
Vol 13 (1-2) ◽  
pp. 43-52
Author(s):  
Boudewijn van Leeuwen ◽  
Zalán Tobak ◽  
Ferenc Kovács

AbstractClassification of multispectral optical satellite data using machine learning techniques to derive land use/land cover thematic data is important for many applications. Comparing the latest algorithms, our research aims to determine the best option to classify land use/land cover with special focus on temporary inundated land in a flat area in the south of Hungary. These inundations disrupt agricultural practices and can cause large financial loss. Sentinel 2 data with a high temporal and medium spatial resolution is classified using open source implementations of a random forest, support vector machine and an artificial neural network. Each classification model is applied to the same data set and the results are compared qualitatively and quantitatively. The accuracy of the results is high for all methods and does not show large overall differences. A quantitative spatial comparison demonstrates that the neural network gives the best results, but that all models are strongly influenced by atmospheric disturbances in the image.


2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Zijin Wu

With the development of the country’s economy, there is a flourishing situation in the field of culture and art. However, the diversification of artistic expressions has not brought development to folk music. On the contrary, it brought a huge impact, and some national music even fell into the dilemma of being lost. This article is mainly aimed at the recognition and classification of folk music emotions and finds the model that can make the classification accuracy rate as high as possible. The classification model used in this article is mainly after determining the use of Support Vector Machine (SVM) classification method, a variety of attempts have been made to feature extraction, and good results have been achieved. Explore the Deep Belief Network (DBN) pretraining and reverse fine-tuning process, using DBN to learn the fusion characteristics of music. According to the abstract characteristics learned by them, the recognition and classification of folk music emotions are carried out. The DBN is improved by adding “Dropout” to each Restricted Boltzmann Machine (RBM) and adjusting the increase standard of weight and bias. The improved network can avoid the overfitting problem and speed up the training of the network. Through experiments, it is found that using the fusion features proposed in this paper, through classification, the classification accuracy has been improved.


Information ◽  
2019 ◽  
Vol 10 (6) ◽  
pp. 186 ◽  
Author(s):  
Ajees A P ◽  
Manju K ◽  
Sumam Mary Idicula

Named Entity Recognition (NER) is the process of identifying the elementary units in a text document and classifying them into predefined categories such as person, location, organization and so forth. NER plays an important role in many Natural Language Processing applications like information retrieval, question answering, machine translation and so forth. Resolving the ambiguities of lexical items involved in a text document is a challenging task. NER in Indian languages is always a complex task due to their morphological richness and agglutinative nature. Even though different solutions were proposed for NER, it is still an unsolved problem. Traditional approaches to Named Entity Recognition were based on the application of hand-crafted features to classical machine learning techniques such as Hidden Markov Model (HMM), Support Vector Machine (SVM), Conditional Random Field (CRF) and so forth. But the introduction of deep learning techniques to the NER problem changed the scenario, where the state of art results have been achieved using deep learning architectures. In this paper, we address the problem of effective word representation for NER in Indian languages by capturing the syntactic, semantic and morphological information. We propose a deep learning based entity extraction system for Indian languages using a novel combined word representation, including character-level, word-level and affix-level embeddings. We have used ‘ARNEKT-IECSIL 2018’ shared data for training and testing. Our results highlight the improvement that we obtained over the existing pre-trained word representations.


Sign in / Sign up

Export Citation Format

Share Document