STING Algorithm Used English Sentiment Classification in a Parallel Environment

Author(s):  
Nguyen Duy Dat ◽  
Vo Ngoc Phu ◽  
Vo Thi Ngoc Tran ◽  
Vo Thi Ngoc Chau ◽  
Tuan A. Nguyen

Sentiment classification is significant in everyday life, in political activities, commodity production, and commerce. In this research, we propose a new model for Big Data sentiment classification in a parallel network environment. Our model applies the STING Algorithm (SA), from the data mining field, to English document-level sentiment classification with Hadoop Map (M)/Reduce (R), based on the 90,000 English sentences of the training data set, in a Cloudera parallel network environment, a distributed system. To our knowledge, no prior scientific study is similar to this survey. Our model can classify the sentiment of millions of English documents with short execution times in the parallel network environment. We tested the model on the 25,000 English documents of the testing data set and achieved 61.2% accuracy. Our English training data set comprises 45,000 positive and 45,000 negative English sentences.
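As a toy illustration of the Map/Reduce style of document-level sentiment classification (not the authors' STING implementation), the sketch below scores each token in the map phase and folds the scores into a label in the reduce phase. The lexicon and the zero threshold are hypothetical; the paper instead learns from its 90,000-sentence training set.

```python
from functools import reduce

# Hypothetical toy lexicon; the paper's model is trained on labeled sentences instead.
LEXICON = {"good": 1, "great": 1, "excellent": 1, "bad": -1, "poor": -1, "terrible": -1}

def map_phase(document):
    """Map step: emit a sentiment score for every token in one document."""
    return [LEXICON.get(tok.strip(".,!?;:"), 0) for tok in document.lower().split()]

def reduce_phase(scores):
    """Reduce step: fold the token scores into a document-level label."""
    total = reduce(lambda acc, s: acc + s, scores, 0)
    return "positive" if total >= 0 else "negative"

def classify(documents):
    return [reduce_phase(map_phase(doc)) for doc in documents]

print(classify(["A great and excellent read", "terrible, just bad"]))
# → ['positive', 'negative']
```

In a real Hadoop job the map and reduce steps would run as separate tasks over document splits; here they are chained in-process only to keep the data flow visible.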

2018 ◽  
Vol 13 (3) ◽  
pp. 408-428 ◽  
Author(s):  
Phu Vo Ngoc

We have surveyed many significant approaches over many years because sentiment classification makes crucial contributions that can be applied in everyday life, such as in political activities, commodity production, and commercial activities. We propose a novel model using Latent Semantic Analysis (LSA) and the Dennis Coefficient (DNC) for big data sentiment classification in English. Many LSA vectors (LSAVs) are successfully reformed using the DNC. We use the DNC and the LSAVs to classify the 11,000,000 documents of our testing data set against the 5,000,000 documents of our training data set in English. This novel model uses many sentiment lexicons of our basis English sentiment dictionary (bESD). We tested the proposed model in both a sequential environment and a distributed network system; the results of the sequential system are not as good as those of the parallel environment. We achieved 88.76% accuracy on the testing data set, which is better than the accuracies of many previous semantic-analysis models. We also compared the novel model with previous models, and the experimental results of our proposed model are better than those of the previous models. Many different fields can use the results of the novel model in commercial applications and sentiment-classification surveys.
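The Dennis coefficient appears in several forms in the similarity-measure literature; the sketch below uses a common binary-vector formulation built from the 2x2 contingency counts, plus a toy nearest-prototype classification step. Neither the formulation choice nor the prototypes are taken from the paper; they are assumptions for illustration.

```python
import math

def dennis_coefficient(x, y):
    """Dennis similarity for two binary vectors, from the 2x2 contingency
    counts: a = both 1, b = x only, c = y only, d = both 0."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    d = sum(1 for xi, yi in zip(x, y) if not xi and not yi)
    n = a + b + c + d
    denom = math.sqrt(n * (a + b) * (a + c))
    return (a * d - b * c) / denom if denom else 0.0

# Toy step: label a test vector by its most similar class prototype.
prototypes = {"positive": [1, 1, 0, 1], "negative": [0, 0, 1, 1]}
test_vec = [1, 1, 0, 0]
label = max(prototypes, key=lambda lab: dennis_coefficient(test_vec, prototypes[lab]))
print(label)  # → positive
```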


2016 ◽  
Vol 2016 (4) ◽  
pp. 21-36 ◽  
Author(s):  
Tao Wang ◽  
Ian Goldberg

Website fingerprinting allows a local, passive observer monitoring a web-browsing client’s encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.


2021 ◽  
Vol 2021 (29) ◽  
pp. 141-147
Author(s):  
Michael J. Vrhel ◽  
H. Joel Trussell

A database of realizable filters is created and searched to obtain the best filter that, when placed in front of an existing camera, results in improved colorimetric capabilities for the system. The image data with the external filter is combined with image data without the filter to provide a six-band system. The colorimetric accuracy of the system is quantified using simulations that include a realistic signal-dependent noise model. Using a training data set, we selected the optimal filter based on four criteria: Vora Value, Figure of Merit, training average ΔE, and training maximum ΔE. Each selected filter was used on testing data. The filters chosen using the training ΔE criteria consistently outperformed the theoretical criteria.
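A minimal sketch of the training-ΔE selection criterion: each candidate filter is modeled only by its end-to-end mapping from a measured Lab value to an estimated Lab value, and ΔE is taken as the simple CIE76 Euclidean distance. The candidate filters and training pairs here are hypothetical; the paper's six-band simulation and signal-dependent noise model are not reproduced.

```python
import math

def delta_e(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two Lab triples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))

def select_filter(filters, training_pairs):
    """Return the filter id with the lowest average Delta-E on the training set."""
    def avg_de(fid):
        estimate = filters[fid]
        return sum(delta_e(estimate(m), ref) for m, ref in training_pairs) / len(training_pairs)
    return min(filters, key=avg_de)

# Two hypothetical candidate filters, modeled only by their Lab estimates.
filters = {
    "identity": lambda lab: lab,
    "offset": lambda lab: (lab[0] + 2.0, lab[1], lab[2]),
}
training_pairs = [((50, 0, 0), (50, 0, 0)), ((60, 5, 5), (60, 5, 5))]
print(select_filter(filters, training_pairs))  # → identity
```

Swapping `avg_de` for a maximum over the training pairs gives the paper's other data-driven criterion, training maximum ΔE.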


The project “Disease Prediction Model” focuses on predicting the type of skin cancer. It constructs a Convolutional Neural Network (CNN) sequential model to identify the type of a skin cancer, a disease which takes a huge toll on human well-being. Since automated methods greatly increase the accuracy of identifying the type of skin cancer, we use the CNN algorithm to build our model, making use of a sequential architecture. The data set considered for this project was collected from NCBI and is well known as the HAM10000 dataset; it consists of a large collection of dermatoscopic images of the most common pigmented skin lesions, gathered from different patients. Once the dataset is collected and cleaned, it is split into training and testing data sets. We built the model with a CNN, trained it on the training data, and then evaluated it on the testing data. Once the model is applied to the testing data, plots are made to analyze the relation between epochs and the loss function, and to analyze accuracy against epochs for both the training and testing data.


2021 ◽  
Vol 12 (1) ◽  
pp. 1-11
Author(s):  
Kishore Sugali ◽  
Chris Sprunger ◽  
Venkata N Inukollu

Artificial Intelligence and Machine Learning have been around for a long time. In recent years, there has been a surge in popularity of applications integrating AI and ML technology. As with traditional development, software testing is a critical component of a successful AI/ML application. The development methodology used in AI/ML contrasts significantly with traditional development, and in light of these distinctions, various software testing challenges arise. The emphasis of this paper is on the challenge of effectively splitting the data into training and testing data sets. By applying a k-Means clustering strategy to the data set followed by a decision tree, we can significantly increase the likelihood that the training data set represents the domain of the full dataset, and thus avoid training a model that is likely to fail because it has only learned a subset of the full data domain.
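A rough sketch of the idea, assuming one-dimensional features and a minimal k-means so the split logic stays visible (the paper's decision-tree step is omitted): cluster first, then draw the train/test split within each cluster so every cluster is represented in training.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal 1-D k-means; returns a cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def cluster_aware_split(points, k=2, train_frac=0.8, seed=0):
    """Split so that every cluster contributes to the training set."""
    rng = random.Random(seed)
    labels = kmeans(points, k)
    train, test = [], []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        rng.shuffle(members)
        cut = max(1, int(train_frac * len(members)))  # at least one per cluster
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test
```

With two well-separated clusters such as `[1, 2, 3, 100, 101, 102]`, a naive random 80/20 split can leave one cluster entirely out of training; the cluster-aware split cannot.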


2011 ◽  
Vol 145 ◽  
pp. 455-459 ◽  
Author(s):  
Jie Lun Chiang ◽  
Yu Shiue Tsai

In Taiwan, even though the average annual rainfall is as high as 2500 mm, water shortages sometimes occur during the dry season. In recent years especially, water shortages have seriously affected agriculture, industry, commerce, and even essential daily water use. Under the threat of future climate change, efficient use of water resources becomes even more challenging. For a comparative study, a support vector machine (SVM) and three other models (artificial neural networks, a maximum likelihood classifier, and a Bayesian classifier) were established to predict reservoir drought status over the next 10-90 days in Tsengwen Reservoir. (The ten-day time interval was applied in this study because it is the conventional time unit for reservoir operation.) Four features (which are easily obtainable in most reservoir offices), namely reservoir storage capacity, inflows, the critical limit of the operation rule curves, and the index of the ten-day period in the year, were used as input data to predict drought. Records from 1975 to 1999 were selected as training data, and those from 2000 to 2010 as testing data. The empirical results showed that the SVM outperforms the other three approaches for drought prediction. Unsurprisingly, the longer the prediction period, the lower the prediction accuracy. However, the accuracy of predicting the next 50 days is about 85% in both the training and testing data sets with the SVM. As a result, we believe that the SVM model has high potential for predicting reservoir drought due to its high prediction accuracy and simple input data.
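The feature construction described above might be sketched as follows; the record layout, the drought flag, and the `horizon` parameter are assumptions for illustration, not the authors' code. Each sample pairs the four features of one ten-day period with the drought status several periods ahead (horizon=5 corresponds to the roughly 50-day lead time mentioned in the abstract).

```python
def make_samples(records, horizon=5):
    """Pair the four features of one ten-day period with the drought flag
    observed `horizon` ten-day periods later."""
    samples = []
    for i in range(len(records) - horizon):
        storage, inflow, rule_limit, tenday_index, _ = records[i]
        label = records[i + horizon][4]  # drought flag at the target period
        samples.append(((storage, inflow, rule_limit, tenday_index), label))
    return samples

# Hypothetical records: (storage, inflow, rule-curve limit, ten-day index, drought flag)
records = [(400 + i, 20, 350, i, i % 2) for i in range(10)]
samples = make_samples(records)
print(len(samples))  # → 5
```

The resulting (features, label) pairs would then be split chronologically (1975-1999 for training, 2000-2010 for testing) and fed to the SVM or any of the comparison classifiers.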


Author(s):  
Chao Hu ◽  
Byeng D. Youn ◽  
Pingfeng Wang ◽  
Joung Taek Yoon

Prognostics aims at determining whether a failure of an engineered system (e.g., a nuclear power plant) is impending and estimating the remaining useful life (RUL) before the failure occurs. The traditional data-driven prognostic approach involves the following three steps: (Step 1) construct multiple candidate algorithms using a training data set; (Step 2) evaluate their respective performance using a testing data set; and (Step 3) select the one with the best performance while discarding all the others. There are three main challenges in the traditional data-driven prognostic approach: (i) lack of robustness in the selected standalone algorithm; (ii) waste of the resources for constructing the algorithms that are discarded; and (iii) demand for the testing data in addition to the training data. To address these challenges, this paper proposes an ensemble approach for data-driven prognostics. This approach combines multiple member algorithms with a weighted-sum formulation where the weights are estimated by using one of the three weighting schemes, namely the accuracy-based weighting, diversity-based weighting and optimization-based weighting. In order to estimate the prediction error required by the accuracy- and optimization-based weighting schemes, we propose the use of the k-fold cross validation (CV) as a robust error estimator. The performance of the proposed ensemble approach is verified with three engineering case studies. It can be seen from all the case studies that the ensemble approach achieves better accuracy in RUL predictions compared to any sole algorithm when the member algorithms with good diversity show comparable prediction accuracy.
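As a sketch of the accuracy-based weighting scheme, the stdlib Python below estimates each member algorithm's error with k-fold cross validation and assigns weights inversely proportional to that error; the member algorithms here are toy regressors for illustration, not the prognostic algorithms of the paper.

```python
def kfold_errors(algorithms, data, k=5):
    """Estimate each member's prediction error with k-fold cross validation.

    `algorithms` maps a name to a factory fit(train) -> predict(x);
    `data` is a list of (x, y) pairs. Returns mean absolute error per name."""
    folds = [data[i::k] for i in range(k)]
    errors = {}
    for name, fit in algorithms.items():
        total, count = 0.0, 0
        for i in range(k):
            train = [s for j, fold in enumerate(folds) if j != i for s in fold]
            predict = fit(train)  # fit on k-1 folds, score on the held-out fold
            for x, y in folds[i]:
                total += abs(predict(x) - y)
                count += 1
        errors[name] = total / count
    return errors

def accuracy_weights(errors):
    """Accuracy-based weighting: weight each member inversely to its CV error."""
    inv = {name: 1.0 / (e + 1e-12) for name, e in errors.items()}
    s = sum(inv.values())
    return {name: v / s for name, v in inv.items()}

# Toy members: a constant-mean predictor and a through-origin linear fit.
algorithms = {
    "mean": lambda train: (lambda x, m=sum(y for _, y in train) / len(train): m),
    "linear": lambda train: (
        lambda x, a=sum(xi * yi for xi, yi in train) / sum(xi * xi for xi, _ in train): a * x
    ),
}
data = [(x, 2.0 * x) for x in range(1, 11)]
weights = accuracy_weights(kfold_errors(algorithms, data))
# Weighted-sum ensemble prediction, each member refit on all data.
prediction = sum(w * algorithms[n](data)(12) for n, w in weights.items())
```

On this noiseless y = 2x data the linear member has near-zero CV error, so it dominates the weighted sum and the ensemble predicts roughly 24 at x = 12; diversity- and optimization-based weighting would replace only the `accuracy_weights` step.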


Telematika ◽  
2021 ◽  
Vol 18 (1) ◽  
pp. 37
Author(s):  
Rismiyati Rismiyati ◽  
Ardytha Luthfiarta

Purpose: This study aims to differentiate the quality of salak fruit with machine learning. Salak is classified into two classes, a good class and a bad class.
Design/methodology/approach: The algorithm used in this research is transfer learning with the VGG16 architecture. The data set used in this research consists of 370 images of salak, 190 from the good class and 180 from the bad class. Each image is preprocessed by resizing and normalizing its pixel values. The preprocessed images are split into 80% training data and 20% testing data. The training data are used to fine-tune a pretrained VGG16 model. The parameters varied during training are the epoch count, momentum, and learning rate. The resulting model is then used for testing; accuracy, precision, and recall are monitored to determine the best model for classifying the images.
Findings/result: The highest accuracy obtained in this study is 95.83%, achieved with a learning rate of 0.0001 and momentum of 0.9. The precision and recall for this model are 97.2% and 94.6%.
Originality/value/state of the art: Transfer learning has not previously been used to classify salak.


2020 ◽  
Vol 3 (2) ◽  
Author(s):  
Jianyao Liu

Data mining technology has become more and more important in economics and the financial market. To help banks predict customer behavior, namely whether existing customers will continue to use their credit cards, we utilize data mining technology to construct a convenient and effective model, a Decision Tree. By classifying customers according to different features step by step, our Decision Tree model enables banks to predict customer behavior well. The main steps of our experiment include collecting statistics from the bank, utilizing Min-Max normalization to preprocess the data set, employing the training data set to construct the model, examining the model with the testing data set, and analyzing the results.
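The Min-Max preprocessing step maps each numeric feature linearly onto a fixed range, typically [0, 1], via v' = (v - min) / (max - min). A minimal sketch, where the sample feature values are hypothetical:

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Rescale one numeric column so its smallest value maps to new_min
    and its largest to new_max."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [new_min for _ in column]  # constant column: collapse to new_min
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in column]

# Hypothetical customer-age column before it feeds the decision tree.
print(min_max_normalize([20, 35, 50, 80]))
```

Applying the same `lo`, `hi`, and `scale` learned from the training data to the testing data keeps the two sets on a consistent scale before the decision tree is examined.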

