STING Algorithm Used English Sentiment Classification in a Parallel Environment

Author(s):  
Nguyen Duy Dat ◽  
Vo Ngoc Phu ◽  
Vo Thi Ngoc Tran ◽  
Vo Thi Ngoc Chau ◽  
Tuan A. Nguyen

Sentiment classification is significant in everyday life, in political activities, commodity production, and commerce. In this research, we propose a new model for Big Data sentiment classification in a parallel network environment. Our model applies the STING Algorithm (SA), from the data mining field, to English document-level sentiment classification with Hadoop Map (M)/Reduce (R), based on the 90,000 English sentences of the training data set, in a Cloudera parallel network environment, a distributed system. To our knowledge, no prior scientific study is similar to this survey. Our model can classify the sentiment of millions of English documents with short execution times in the parallel network environment. We tested the model on the 25,000 English documents of the testing data set and achieved 61.2% accuracy. Our English training data set comprises 45,000 positive and 45,000 negative English sentences.
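As a toy illustration of the Map/Reduce style of document-level sentiment classification (not the authors' STING implementation), the sketch below scores each token in the map phase and folds the scores into a label in the reduce phase. The lexicon and the zero threshold are hypothetical; the paper instead learns from its 90,000-sentence training set.

```python
from functools import reduce

# Hypothetical toy lexicon; the paper's model is trained on labeled sentences instead.
LEXICON = {"good": 1, "great": 1, "excellent": 1, "bad": -1, "poor": -1, "terrible": -1}

def map_phase(document):
    """Map step: emit a sentiment score for every token in one document."""
    return [LEXICON.get(tok.strip(".,!?;:"), 0) for tok in document.lower().split()]

def reduce_phase(scores):
    """Reduce step: fold the token scores into a document-level label."""
    total = reduce(lambda acc, s: acc + s, scores, 0)
    return "positive" if total >= 0 else "negative"

def classify(documents):
    return [reduce_phase(map_phase(doc)) for doc in documents]

print(classify(["A great and excellent read", "terrible, just bad"]))
# → ['positive', 'negative']
```

In a real Hadoop job the map and reduce steps would run as separate tasks over document splits; here they are chained in-process only to keep the data flow visible.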

2018 ◽  
Vol 13 (3) ◽  
pp. 408-428 ◽  
Author(s):  
Phu Vo Ngoc

We have surveyed many significant approaches over many years because sentiment classification makes crucial contributions that can be applied in everyday life, such as in political activities, commodity production, and commercial activities. We propose a novel model using Latent Semantic Analysis (LSA) and the Dennis Coefficient (DNC) for big data sentiment classification in English. Many LSA vectors (LSAVs) are successfully reformed using the DNC. We use the DNC and the LSAVs to classify the 11,000,000 documents of our testing data set against the 5,000,000 documents of our training data set in English. This novel model uses many sentiment lexicons of our basis English sentiment dictionary (bESD). We tested the proposed model in both a sequential environment and a distributed network system; the results of the sequential system are not as good as those of the parallel environment. We achieved 88.76% accuracy on the testing data set, which is better than the accuracies of many previous semantic-analysis models. We also compared the novel model with previous models, and the experimental results of our proposed model are better than those of the previous models. Many different fields can use the results of the novel model in commercial applications and sentiment-classification surveys.
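The Dennis coefficient appears in several forms in the similarity-measure literature; the sketch below uses a common binary-vector formulation built from the 2x2 contingency counts, plus a toy nearest-prototype classification step. Neither the formulation choice nor the prototypes are taken from the paper; they are assumptions for illustration.

```python
import math

def dennis_coefficient(x, y):
    """Dennis similarity for two binary vectors, from the 2x2 contingency
    counts: a = both 1, b = x only, c = y only, d = both 0."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)
    d = sum(1 for xi, yi in zip(x, y) if not xi and not yi)
    n = a + b + c + d
    denom = math.sqrt(n * (a + b) * (a + c))
    return (a * d - b * c) / denom if denom else 0.0

# Toy step: label a test vector by its most similar class prototype.
prototypes = {"positive": [1, 1, 0, 1], "negative": [0, 0, 1, 1]}
test_vec = [1, 1, 0, 0]
label = max(prototypes, key=lambda lab: dennis_coefficient(test_vec, prototypes[lab]))
print(label)  # → positive
```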


2016 ◽  
Vol 2016 (4) ◽  
pp. 21-36 ◽  
Author(s):  
Tao Wang ◽  
Ian Goldberg

Website fingerprinting allows a local, passive observer monitoring a web-browsing client’s encrypted channel to determine her web activity. Previous attacks have shown that website fingerprinting could be a threat to anonymity networks such as Tor under laboratory conditions. However, there are significant differences between laboratory conditions and realistic conditions. First, in laboratory tests we collect the training data set together with the testing data set, so the training data set is fresh, but an attacker may not be able to maintain a fresh data set. Second, laboratory packet sequences correspond to a single page each, but for realistic packet sequences the split between pages is not obvious. Third, packet sequences may include background noise from other types of web traffic. These differences adversely affect website fingerprinting under realistic conditions. In this paper, we tackle these three problems to bridge the gap between laboratory and realistic conditions for website fingerprinting. We show that we can maintain a fresh training set with minimal resources. We demonstrate several classification-based techniques that allow us to split full packet sequences effectively into sequences corresponding to a single page each. We describe several new algorithms for tackling background noise. With our techniques, we are able to build the first website fingerprinting system that can operate directly on packet sequences collected in the wild.


2021 ◽  
Vol 2021 (29) ◽  
pp. 141-147
Author(s):  
Michael J. Vrhel ◽  
H. Joel Trussell

A database of realizable filters is created and searched to obtain the best filter that, when placed in front of an existing camera, results in improved colorimetric capabilities for the system. The image data with the external filter is combined with image data without the filter to provide a six-band system. The colorimetric accuracy of the system is quantified using simulations that include a realistic signal-dependent noise model. Using a training data set, we selected the optimal filter based on four criteria: Vora Value, Figure of Merit, training average ΔE, and training maximum ΔE. Each selected filter was used on testing data. The filters chosen using the training ΔE criteria consistently outperformed the theoretical criteria.
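A minimal sketch of the training-ΔE selection criterion: each candidate filter is modeled only by its end-to-end mapping from a measured Lab value to an estimated Lab value, and ΔE is taken as the simple CIE76 Euclidean distance. The candidate filters and training pairs here are hypothetical; the paper's six-band simulation and signal-dependent noise model are not reproduced.

```python
import math

def delta_e(lab1, lab2):
    """CIE76 color difference: Euclidean distance between two Lab triples."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))

def select_filter(filters, training_pairs):
    """Return the filter id with the lowest average Delta-E on the training set."""
    def avg_de(fid):
        estimate = filters[fid]
        return sum(delta_e(estimate(m), ref) for m, ref in training_pairs) / len(training_pairs)
    return min(filters, key=avg_de)

# Two hypothetical candidate filters, modeled only by their Lab estimates.
filters = {
    "identity": lambda lab: lab,
    "offset": lambda lab: (lab[0] + 2.0, lab[1], lab[2]),
}
training_pairs = [((50, 0, 0), (50, 0, 0)), ((60, 5, 5), (60, 5, 5))]
print(select_filter(filters, training_pairs))  # → identity
```

Swapping `avg_de` for a maximum over the training pairs gives the paper's other data-driven criterion, training maximum ΔE.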


The project “Disease Prediction Model” focuses on predicting the type of skin cancer. It constructs a Convolutional Neural Network (CNN) sequential model to identify the type of a skin cancer, a disease which takes a huge toll on human well-being. Since automated methods greatly increase the accuracy of identifying the type of skin cancer, we use the CNN algorithm to build our model, making use of a sequential architecture. The data set considered for this project was collected from NCBI and is well known as the HAM10000 dataset; it consists of a large collection of dermatoscopic images of the most common pigmented skin lesions, gathered from different patients. Once the dataset is collected and cleaned, it is split into training and testing data sets. We built the model with a CNN, trained it on the training data, and then evaluated it on the testing data. Once the model is applied to the testing data, plots are made to analyze the relation between epochs and the loss function, and to analyze accuracy against epochs for both the training and testing data.


2021 ◽  
Vol 12 (1) ◽  
pp. 1-11
Author(s):  
Kishore Sugali ◽  
Chris Sprunger ◽  
Venkata N Inukollu

Artificial Intelligence and Machine Learning have been around for a long time. In recent years, there has been a surge in popularity of applications integrating AI and ML technology. As with traditional development, software testing is a critical component of a successful AI/ML application. The development methodology used in AI/ML contrasts significantly with traditional development, and in light of these distinctions, various software testing challenges arise. The emphasis of this paper is on the challenge of effectively splitting the data into training and testing data sets. By applying a k-Means clustering strategy to the data set followed by a decision tree, we can significantly increase the likelihood that the training data set represents the domain of the full dataset, and thus avoid training a model that is likely to fail because it has only learned a subset of the full data domain.
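A rough sketch of the idea, assuming one-dimensional features and a minimal k-means so the split logic stays visible (the paper's decision-tree step is omitted): cluster first, then draw the train/test split within each cluster so every cluster is represented in training.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal 1-D k-means; returns a cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(p - centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def cluster_aware_split(points, k=2, train_frac=0.8, seed=0):
    """Split so that every cluster contributes to the training set."""
    rng = random.Random(seed)
    labels = kmeans(points, k)
    train, test = [], []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        rng.shuffle(members)
        cut = max(1, int(train_frac * len(members)))  # at least one per cluster
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test
```

With two well-separated clusters such as `[1, 2, 3, 100, 101, 102]`, a naive random 80/20 split can leave one cluster entirely out of training; the cluster-aware split cannot.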


2011 ◽  
Vol 145 ◽  
pp. 455-459 ◽  
Author(s):  
Jie Lun Chiang ◽  
Yu Shiue Tsai

In Taiwan, even though the average annual rainfall is as high as 2500 mm, water shortages sometimes occur during the dry season. In recent years especially, water shortages have seriously affected agriculture, industry, commerce, and even essential daily water use. Under the threat of future climate change, efficient use of water resources becomes even more challenging. For a comparative study, a support vector machine (SVM) and three other models (artificial neural networks, a maximum likelihood classifier, and a Bayesian classifier) were established to predict reservoir drought status over the next 10-90 days in Tsengwen Reservoir. (The ten-day time interval was applied in this study because it is the conventional time unit for reservoir operation.) Four features (which are easily obtainable in most reservoir offices), namely reservoir storage capacity, inflows, the critical limit of the operation rule curves, and the index of the ten-day period in the year, were used as input data to predict drought. Records from 1975 to 1999 were selected as training data, and those from 2000 to 2010 as testing data. The empirical results showed that the SVM outperforms the other three approaches for drought prediction. Unsurprisingly, the longer the prediction period, the lower the prediction accuracy. However, the accuracy of predicting the next 50 days is about 85% in both the training and testing data sets with the SVM. As a result, we believe that the SVM model has high potential for predicting reservoir drought due to its high prediction accuracy and simple input data.
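The feature construction described above might be sketched as follows; the record layout, the drought flag, and the `horizon` parameter are assumptions for illustration, not the authors' code. Each sample pairs the four features of one ten-day period with the drought status several periods ahead (horizon=5 corresponds to the roughly 50-day lead time mentioned in the abstract).

```python
def make_samples(records, horizon=5):
    """Pair the four features of one ten-day period with the drought flag
    observed `horizon` ten-day periods later."""
    samples = []
    for i in range(len(records) - horizon):
        storage, inflow, rule_limit, tenday_index, _ = records[i]
        label = records[i + horizon][4]  # drought flag at the target period
        samples.append(((storage, inflow, rule_limit, tenday_index), label))
    return samples

# Hypothetical records: (storage, inflow, rule-curve limit, ten-day index, drought flag)
records = [(400 + i, 20, 350, i, i % 2) for i in range(10)]
samples = make_samples(records)
print(len(samples))  # → 5
```

The resulting (features, label) pairs would then be split chronologically (1975-1999 for training, 2000-2010 for testing) and fed to the SVM or any of the comparison classifiers.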


Author(s):  
Chao Hu ◽  
Byeng D. Youn ◽  
Pingfeng Wang ◽  
Joung Taek Yoon

Prognostics aims at determining whether a failure of an engineered system (e.g., a nuclear power plant) is impending and estimating the remaining useful life (RUL) before the failure occurs. The traditional data-driven prognostic approach involves the following three steps: (Step 1) construct multiple candidate algorithms using a training data set; (Step 2) evaluate their respective performance using a testing data set; and (Step 3) select the one with the best performance while discarding all the others. There are three main challenges in the traditional data-driven prognostic approach: (i) lack of robustness in the selected standalone algorithm; (ii) waste of the resources for constructing the algorithms that are discarded; and (iii) demand for the testing data in addition to the training data. To address these challenges, this paper proposes an ensemble approach for data-driven prognostics. This approach combines multiple member algorithms with a weighted-sum formulation where the weights are estimated by using one of the three weighting schemes, namely the accuracy-based weighting, diversity-based weighting and optimization-based weighting. In order to estimate the prediction error required by the accuracy- and optimization-based weighting schemes, we propose the use of the k-fold cross validation (CV) as a robust error estimator. The performance of the proposed ensemble approach is verified with three engineering case studies. It can be seen from all the case studies that the ensemble approach achieves better accuracy in RUL predictions compared to any sole algorithm when the member algorithms with good diversity show comparable prediction accuracy.
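As a sketch of the accuracy-based weighting scheme, the stdlib Python below estimates each member algorithm's error with k-fold cross validation and assigns weights inversely proportional to that error; the member algorithms here are toy regressors for illustration, not the prognostic algorithms of the paper.

```python
def kfold_errors(algorithms, data, k=5):
    """Estimate each member's prediction error with k-fold cross validation.

    `algorithms` maps a name to a factory fit(train) -> predict(x);
    `data` is a list of (x, y) pairs. Returns mean absolute error per name."""
    folds = [data[i::k] for i in range(k)]
    errors = {}
    for name, fit in algorithms.items():
        total, count = 0.0, 0
        for i in range(k):
            train = [s for j, fold in enumerate(folds) if j != i for s in fold]
            predict = fit(train)  # fit on k-1 folds, score on the held-out fold
            for x, y in folds[i]:
                total += abs(predict(x) - y)
                count += 1
        errors[name] = total / count
    return errors

def accuracy_weights(errors):
    """Accuracy-based weighting: weight each member inversely to its CV error."""
    inv = {name: 1.0 / (e + 1e-12) for name, e in errors.items()}
    s = sum(inv.values())
    return {name: v / s for name, v in inv.items()}

# Toy members: a constant-mean predictor and a through-origin linear fit.
algorithms = {
    "mean": lambda train: (lambda x, m=sum(y for _, y in train) / len(train): m),
    "linear": lambda train: (
        lambda x, a=sum(xi * yi for xi, yi in train) / sum(xi * xi for xi, _ in train): a * x
    ),
}
data = [(x, 2.0 * x) for x in range(1, 11)]
weights = accuracy_weights(kfold_errors(algorithms, data))
# Weighted-sum ensemble prediction, each member refit on all data.
prediction = sum(w * algorithms[n](data)(12) for n, w in weights.items())
```

On this noiseless y = 2x data the linear member has near-zero CV error, so it dominates the weighted sum and the ensemble predicts roughly 24 at x = 12; diversity- and optimization-based weighting would replace only the `accuracy_weights` step.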


Telematika ◽  
2021 ◽  
Vol 18 (1) ◽  
pp. 37
Author(s):  
Rismiyati Rismiyati ◽  
Ardytha Luthfiarta

Purpose: This study aims to differentiate the quality of salak fruit with machine learning. Salak is classified into two classes, a good class and a bad class.
Design/methodology/approach: The algorithm used in this research is transfer learning with the VGG16 architecture. The data set used in this research consists of 370 images of salak, 190 from the good class and 180 from the bad class. Each image is preprocessed by resizing and normalizing its pixel values. The preprocessed images are split into 80% training data and 20% testing data. The training data are used to fine-tune a pretrained VGG16 model. The parameters varied during training are the epoch count, momentum, and learning rate. The resulting model is then used for testing; accuracy, precision, and recall are monitored to determine the best model for classifying the images.
Findings/result: The highest accuracy obtained in this study is 95.83%, achieved with a learning rate of 0.0001 and momentum of 0.9. The precision and recall for this model are 97.2% and 94.6%.
Originality/value/state of the art: Transfer learning has not previously been used to classify salak.


2020 ◽  
Vol 3 (2) ◽  
Author(s):  
Jianyao Liu

Data mining technology has become more and more important in economics and the financial market. To help banks predict customer behavior, namely whether existing customers will continue to use their credit cards, we utilize data mining technology to construct a convenient and effective model, a Decision Tree. By classifying customers according to different features step by step, our Decision Tree model enables banks to predict customer behavior well. The main steps of our experiment include collecting statistics from the bank, utilizing Min-Max normalization to preprocess the data set, employing the training data set to construct the model, examining the model with the testing data set, and analyzing the results.
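The Min-Max preprocessing step maps each numeric feature linearly onto a fixed range, typically [0, 1], via v' = (v - min) / (max - min). A minimal sketch, where the sample feature values are hypothetical:

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Rescale one numeric column so its smallest value maps to new_min
    and its largest to new_max."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [new_min for _ in column]  # constant column: collapse to new_min
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in column]

# Hypothetical customer-age column before it feeds the decision tree.
print(min_max_normalize([20, 35, 50, 80]))
```

Applying the same `lo`, `hi`, and `scale` learned from the training data to the testing data keeps the two sets on a consistent scale before the decision tree is examined.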

