Implementation of Cosine Similarity in an automatic classifier for comments

2019 ◽  
Vol 3 (2) ◽  
pp. 110
Author(s):  
Muhammad Habibi

Classifying large volumes of text is necessary to extract the information they contain. Student comments containing suggestions and criticisms about lecturers and the lecture process in the learning evaluation system are not well classified, which makes the assessment process difficult. A classification model is therefore needed that can automatically assign comments to categories. The method used is cosine similarity, which measures the similarity between two objects represented as two vectors. The data used in this study were 1,630 comments across several categories. The model was tested using k-fold cross-validation with k = 10. The results showed that the classification model reached an accuracy of 80.87%.
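The cosine similarity idea described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the whitespace tokenizer, the nearest-example decision rule, the function names, and the toy comments are all assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn a comment into a sparse term-frequency vector (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty comment is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(comment: str, labelled: list[tuple[str, str]]) -> str:
    """Return the category of the most similar labelled comment."""
    query = vectorize(comment)
    best = max(labelled, key=lambda pair: cosine_similarity(query, vectorize(pair[0])))
    return best[1]


# Invented toy data, not taken from the 1,630 comments in the study.
training = [
    ("the lecturer explains clearly", "praise"),
    ("the classroom is too hot", "facility"),
]
print(classify("lecturer explains the material clearly", training))
```

In practice the term vectors would more likely be weighted (for example with TF-IDF) and compared against many labelled comments per category; the sketch only shows the similarity measure and the decision rule.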


Author(s):  
Abdul Kholik ◽  
Agus Harjoko ◽  
Wahyono Wahyono

High vehicle density is a problem that occurs in every city, and its main impact is congestion. Classification of vehicle density levels on particular roads is required because there are at least 7 vehicle density level conditions. Monitoring by the police, the Department of Transportation, and road operators currently relies on video-based surveillance such as CCTV, which is still watched manually by people. Deep learning is a machine learning approach based on artificial neural networks that has been actively developed and researched recently because it has delivered good results on a variety of soft-computing problems. This research uses a convolutional neural network architecture and varies its supporting parameters to maximize accuracy. After the parameters were changed, the classification model was tested using K-fold cross-validation, a confusion matrix, and a model test on held-out test data. K-fold cross-validation with K = 5 yielded an average score of 92.83%. In the model test on 100 test samples, the model classified 81 correctly.
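K-fold cross-validation, as used in the evaluation above, can be sketched in a few lines. The helper names and the `fit_and_score` callback are hypothetical; the sketch assumes contiguous, unshuffled folds, which the abstract does not specify.

```python
def k_fold_indices(n_samples: int, k: int) -> list[tuple[list[int], list[int]]]:
    """Split range(n_samples) into k (train, test) index pairs without shuffling."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def cross_validate(n_samples: int, k: int, fit_and_score) -> float:
    """Average the score returned by fit_and_score(train, test) over the k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Here `fit_and_score` would train a model on the `train` indices and return its accuracy on the `test` indices; the reported figure is then the mean over the K folds.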



Author(s):  
Piska dwi Nurfadila ◽  
Aji Prasetya Wibawa ◽  
Ilham Ari Elbaith Zaeni ◽  
Andrew Nafalski

Economic journal articles have previously been classified using the VSM (Vector Space Model) approach and the Cosine Similarity method. The results of those studies are considered less than optimal because Stopword Removal was carried out using a dictionary of basic words (tuning), so the omitted words were limited to basic words only. This study shows that frequency-based Stopword Removal improves the accuracy of the Cosine Similarity method. The rationale is that terms with a certain frequency are assumed to be insignificant words that give less relevant results. The Cosine Similarity method with frequency-based Stopword Removal was tested using K-fold Cross Validation. The method achieved an accuracy of 64.28%, a precision of 64.76%, and a recall of 65.26%. The execution time after pre-processing was 0.05033 seconds.
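A frequency-based stopword filter of the kind described above might look like the sketch below. The abstract does not state which frequency statistic or cut-off was used, so the document-frequency ratio, the `max_doc_ratio` parameter, and the function names are assumptions chosen for illustration.

```python
from collections import Counter


def frequency_stopwords(documents: list[list[str]], max_doc_ratio: float = 0.8) -> set[str]:
    """Return terms that occur in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_doc_ratio * len(documents)
    return {term for term, count in doc_freq.items() if count > limit}


def remove_stopwords(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Drop every token that is in the stopword set, keeping the original order."""
    return [t for t in tokens if t not in stopwords]
```

Unlike a fixed dictionary, the stopword set here is derived from the corpus itself, so it can pick up domain-specific filler terms that a general word list would miss.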



2021 ◽  
Vol 39 (15_suppl) ◽  
pp. 3044-3044
Author(s):  
David Haan ◽  
Anna Bergamaschi ◽  
Yuhong Ning ◽  
William Gibb ◽  
Michael Kesling ◽  
...  

3044 Background: Epigenomics assays have recently become popular tools for identifying molecular biomarkers, both in tissue and in plasma. In particular, the 5-hydroxymethyl-cytosine (5hmC) method has been shown to enable the epigenomic regulation of gene expression and subsequent gene activity, with different patterns, across several tumor and normal tissue types. In this study we show that 5hmC profiles enable discrete classification of tumor and normal tissue for breast, colorectal, lung, ovary, and pancreas. Such classification was also recapitulated in cfDNA from patients with breast, colorectal, lung, ovarian, and pancreatic cancers. Methods: DNA was isolated from 176 fresh frozen tissues from breast, colorectal, lung, ovary, and pancreas (44 per tumor per tissue type and up to 11 tumor tissues for each stage (I-IV)) and up to 10 normal tissues per tissue type. cfDNA was isolated from plasma from 783 non-cancer individuals and 569 cancer patients. Plasma-isolated cfDNA and tumor genomic DNA were enriched for the 5hmC fraction using chemical labelling, sequenced, and aligned to a reference genome to construct feature sets of 5hmC patterns. Results: 5hmC multinomial logistic regression analysis was employed across tumor and normal tissues and identified a set of specific and discrete tumor and normal tissue gene-based features. This indicates that we can classify samples regardless of source, with a high degree of accuracy, based on tissue of origin and also distinguish between normal and tumor status. Next, we applied a stacked ensemble machine learning algorithm combining multiple logistic regression models across diverse feature sets to the cfDNA dataset composed of 783 non-cancers and 569 cancers comprising 67 breast, 118 colorectal, 210 lung, 71 ovarian, and 100 pancreatic cancers. We identified a genomic signature that enables the classification of non-cancer versus cancer with an outer-fold cross-validation sensitivity of 49% (CI 45%-53%) at 99% specificity. Further, individual cancer outer-fold cross-validation sensitivity at 99% specificity was measured as follows: breast 30% (CI 119% -42%); colorectal 41% (CI 32%-50%); lung 49% (CI 42%-56%); ovarian 72% (CI 60-82%); pancreatic 56% (CI 46%-66%). Conclusions: This study demonstrates that 5hmC profiles can distinguish cancer and normal tissues based on their origin. Further, 5hmC changes in cfDNA enable detection of several cancer types: breast, colorectal, lung, ovarian, and pancreatic cancers. Our technology provides a non-invasive tool for cancer detection with low-risk sample collection, enabling better compliance than current screening methods. Among other utilities, we believe our technology could be applied to asymptomatic high-risk individuals, thus enriching for those subjects who most need a diagnostic imaging follow-up.
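Reporting sensitivity at a fixed specificity, as in the results above, means choosing the decision threshold from the non-cancer scores first and only then counting detected cancers. The sketch below shows one way to do that; the function name, the strict "score above threshold" rule, and the toy scores are assumptions, not the authors' procedure.

```python
import math


def sensitivity_at_specificity(control_scores, case_scores, specificity=0.99):
    """Fraction of cases called positive at a threshold that keeps at least
    the requested share of controls negative (positive means score > threshold)."""
    if not control_scores or not case_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0 < specificity <= 1:
        raise ValueError("specificity must be in (0, 1]")
    ordered = sorted(control_scores)
    # Smallest number of controls that must fall at or below the threshold.
    needed = math.ceil(specificity * len(ordered))
    threshold = ordered[needed - 1]
    detected = sum(score > threshold for score in case_scores)
    return detected / len(case_scores)


# Invented toy scores: 3 of 4 controls stay negative, 3 of 4 cases are detected.
print(sensitivity_at_specificity([0.1, 0.2, 0.3, 0.9], [0.2, 0.5, 0.8, 0.95], 0.75))
```

In a cross-validated setting the threshold would be fixed on the training folds and the sensitivity measured on the held-out fold, so that the reported value is not tuned on the samples it is evaluated on.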



Mekatronika ◽  
2021 ◽  
Vol 3 (1) ◽  
pp. 27-31
Author(s):  
Ken-ji Ee ◽  
Ahmad Fakhri Bin Ab. Nasir ◽  
Anwar P. P. Abdul Majeed ◽  
Mohd Azraai Mohd Razman ◽  
Nur Hafieza Ismail

An animal classification system is a technology that automatically classifies the animal class (type) and is useful in many applications. Many types of learning models have recently been applied to this technology. Nonetheless, extracting and classifying animal features is non-trivial, particularly with the deep learning approach, for a successful animal classification system. Transfer Learning (TL) has been demonstrated to be a powerful tool for extracting essential features. However, the use of such a method in animal classification applications is somewhat limited. The present study aims to determine a suitable TL-conventional classifier pipeline for animal classification. VGG16 and VGG19 were used to extract features and were then coupled with either a k-Nearest Neighbour (k-NN) or a Support Vector Machine (SVM) classifier. Prior to that, a total of 4000 images were gathered, covering five classes: cows, goats, buffalos, dogs, and cats. The data were split 80:20 into training and test sets. The classifiers' hyperparameters were tuned by grid search with five-fold cross-validation. The study showed that the best TL pipeline is VGG16 with an optimised SVM, which yielded an average classification accuracy of 0.975. The findings of the present investigation could facilitate animal classification applications, e.g. for monitoring animals in wildlife.
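The grid search step mentioned above amounts to scoring every combination of candidate hyperparameter values and keeping the best one. The sketch below is a generic illustration: `grid_search`, the `score` callback, and the example parameter names `C` and `gamma` are assumptions, and in the study's setting `score` would be the mean five-fold cross-validation accuracy of the classifier.

```python
from itertools import product


def grid_search(param_grid: dict, score) -> tuple[dict, float]:
    """Evaluate score(params) for every combination in param_grid and
    return the best parameter dict together with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:  # ties keep the earlier combination
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical stand-in for a cross-validated accuracy function.
def toy_score(params):
    return -((params["C"] - 10) ** 2) - (params["gamma"] - 0.1) ** 2


print(grid_search({"C": [1, 10, 100], "gamma": [0.01, 0.1]}, toy_score))
```

Tuning inside cross-validation on the 80% training split keeps the 20% test split untouched, so the final accuracy is measured on images that played no part in choosing the hyperparameters.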



2020 ◽  
Vol 10 (6) ◽  
pp. 1999 ◽  
Author(s):  
Milica M. Badža ◽  
Marko Č. Barjaktarović

The classification of brain tumors is performed by biopsy, which is not usually conducted before definitive brain surgery. Improvements in technology and machine learning can help radiologists diagnose tumors without invasive measures. A machine-learning algorithm that has achieved substantial results in image segmentation and classification is the convolutional neural network (CNN). We present a new CNN architecture for brain tumor classification of three tumor types. The developed network is simpler than existing pre-trained networks, and it was tested on T1-weighted contrast-enhanced magnetic resonance images. The performance of the network was evaluated using four approaches: combinations of two 10-fold cross-validation methods and two databases. The generalization capability of the network was tested with one of the 10-fold methods, subject-wise cross-validation, and the improvement was tested by using an augmented image database. The best result for the 10-fold cross-validation method was obtained for record-wise cross-validation on the augmented data set; in that case, the accuracy was 96.56%. With good generalization capability and good execution speed, the newly developed CNN architecture could be used as an effective decision-support tool for radiologists in medical diagnostics.
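The difference between record-wise and subject-wise cross-validation is whether images from the same patient may land in both the training and the test fold. A subject-wise split can be sketched as below; the function name, the greedy balancing strategy, and the toy patient identifiers are assumptions for illustration only.

```python
from collections import defaultdict


def subject_wise_folds(subject_ids: list[str], k: int) -> list[list[int]]:
    """Return k lists of record indices such that all records of one
    subject end up in the same fold."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    folds = [[] for _ in range(k)]
    # Place the largest subjects first, each into the currently smallest fold.
    for subject in sorted(by_subject, key=lambda s: (-len(by_subject[s]), s)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_subject[subject])
    return folds


# Invented example: seven images from four patients, split into two folds.
ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]
print(subject_wise_folds(ids, 2))
```

A record-wise split would instead shuffle the individual images, which tends to give higher scores because near-identical slices from one patient can appear on both sides of the split; the subject-wise variant is the stricter test of generalization to unseen patients.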



2013 ◽  
Vol 658 ◽  
pp. 647-651 ◽  
Author(s):  
Jun Jie Zhu ◽  
Xiao Jun Zhang ◽  
Ji Hua Gu ◽  
He Ming Zhao ◽  
Qiang Zhou ◽  
...  

This paper studies the classification of pathological versus normal voice based on the sustained vowel /a/. First, 18 original acoustic features are extracted. Then, based on the extracted parameters, pathological voice is recognized using AD Tree. During the classification stage, cross-validation of the features is also used as a reference. The method is validated on a sound database provided by the Massachusetts Eye and Ear Infirmary (MEEI). Under 10-fold cross-validation, and compared with 7 other kinds of classifiers, the experimental results show that AD Tree achieves the highest recognition rate, 95.2%. The results indicate that all the extracted parameters are reasonable for the subsequent recognition process and that AD Tree is a good recognition method for pathological voice research.



2008 ◽  
Vol 17 (05) ◽  
pp. 957-971
Author(s):  
ATAOLLAH EBRAHIMZADEH ◽  
ABOLFAZL RANJBAR ◽  
MEHRDAD ARDEBLILPOUR

Classification of communication signals is in increasing demand. In this paper, we present a new technique that identifies a variety of digital communication signal types. The technique uses a radial basis function neural network (RBFN) as the classifier. Swarm intelligence, as an evolutionary algorithm, is used to construct the RBFN. A combination of higher-order moments and higher-order cumulants up to the eighth order is selected as the features of the considered digital signal types. In conjunction with the RBFN, we use k-fold cross-validation to improve generalization ability. Simulation results show that the proposed technique performs well in classifying different communication signals, even at very low signal-to-noise ratios.



Author(s):  
Gede Aditra Pradnyana ◽  
I Komang Agus Suryantara ◽  
I Gede Mahendra Darmawiguna

An impression can be interpreted as a psychological feeling toward a product, and it plays an important role in decision making. Understanding data in the domain of impressions is therefore very useful. This research aimed to evaluate the performance of the K-Nearest Neighbors method in classifying endek image impressions, using K-Fold Cross Validation. The images were taken from 3 locations, namely CV. Artha Dharma, Agung Bali Collection, and Pengrajin Sri Rejeki. The image impressions were obtained by consulting an endek expert, Dr. D.A Tirta Ray, M.Si. Data mining was performed with the K-Nearest Neighbors method, a classification method that classifies new objects based on their attributes and on training samples that have already been classified. K-Fold Cross Validation testing yielded an accuracy of 91% for K values in K-Nearest Neighbors of 3, 4, 7, and 8.
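The K-Nearest Neighbors rule described above can be sketched as a majority vote among the K closest training samples. The Euclidean distance, the feature tuples, and the labels below are assumptions for illustration; the abstract does not say which image features or distance were used.

```python
import math
from collections import Counter


def knn_predict(train: list[tuple[tuple[float, ...], str]], query: tuple[float, ...], k: int = 3) -> str:
    """Label the query by majority vote among its k nearest training samples."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented two-dimensional features with two hypothetical impression labels.
samples = [
    ((0.0, 0.0), "calm"),
    ((0.0, 1.0), "calm"),
    ((5.0, 5.0), "bold"),
    ((6.0, 5.0), "bold"),
    ((5.0, 6.0), "bold"),
]
print(knn_predict(samples, (0.2, 0.1), k=3))
```

Choosing K is itself a tuning problem, which is why the study reports the cross-validated accuracy for several K values rather than for a single one.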



2020 ◽  
Vol 9 (3) ◽  
pp. 376-390
Author(s):  
Nur Fitriyah ◽  
Budi Warsito ◽  
Di Asih I Maruddani

The emergence of PT Aplikasi Karya Anak Bangsa, better known as Gojek, since 2015 has made daily activities more convenient for people in Indonesia. Sentiment analysis of Twitter can be one way to see how Gojek users respond to the services provided. Responses were classified into positive and negative sentiment using the Support Vector Machine method, with 10-fold cross-validation for model evaluation. The kernels used were the linear kernel and the RBF kernel. Data were labeled both manually and by sentiment scoring. The test results showed that the RBF kernel achieved the highest overall accuracy and kappa accuracy for both manual labeling and sentiment scoring. With manual labeling, the overall accuracy was 79.19% and the kappa accuracy was 16.52%. With sentiment scoring, the overall accuracy was 79.19% and the kappa accuracy was 21%. The greater the overall accuracy and kappa accuracy, the better the performance of the classification model. Keywords: Gojek, Twitter, Support Vector Machine, overall accuracy, kappa accuracy
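Overall accuracy and kappa can diverge sharply when one class dominates, which is one plausible reading of the gap between the two figures above. The sketch below computes both measures; it assumes "kappa accuracy" refers to Cohen's kappa, and the labels are invented toy data rather than the study's tweets.

```python
from collections import Counter


def overall_accuracy(actual: list[str], predicted: list[str]) -> float:
    """Share of samples whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def cohen_kappa(actual: list[str], predicted: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_e is the agreement
    expected by chance from the two label distributions."""
    n = len(actual)
    p_o = overall_accuracy(actual, predicted)
    actual_counts, predicted_counts = Counter(actual), Counter(predicted)
    p_e = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Toy case: always predicting the majority class on an 80/20 split.
actual = ["pos"] * 8 + ["neg"] * 2
predicted = ["pos"] * 10
print(overall_accuracy(actual, predicted), cohen_kappa(actual, predicted))
```

In the toy case the classifier is right 80% of the time yet agrees with the true labels no better than chance, so kappa is zero; this is why kappa is a useful companion to accuracy on imbalanced sentiment data.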



Revista Vitae ◽  
2019 ◽  
Vol 26 (2) ◽  
pp. 94-103
Author(s):  
Hugo Italo Romero ◽  
Ivan RAMÍREZ-MORALES ◽  
Cinthia ROMERO FLORES

Background: concern about the quality of water for human consumption has become widespread among the population. The taste of drinking water and some problems associated with it have increased the demand for bottled water. As a result, a growing number of companies have shown interest in producing bottled water. Objective: to evaluate a novel automatic classification model that differentiates bottled water from tap water. Methods: the voltammetric technique used a three-electrode setup. The output current was used for data analysis. From the grid search results, six pairs of values with similar results were pre-selected for the parameters σ and C. High values of accuracy, specificity, and sensitivity were achieved on the test dataset. The final decision was made after an ANOVA test on 100 repetitions of 5-fold cross-validation; 3000 models were evaluated with the parameter combinations described above for the SVM. Results: the oxidation and reduction peaks of the water samples were prominent. Absolute values of current (I) increased in the case of public water samples, possibly due to the higher concentration of chloride ions, which contribute more to conductivity. 5-fold cross-validation test mean specificity resulted in C parameter values greater than 0 and between 0 and 30; σ values greater than 10 and between 0 and 15 were found for tap water and bottled water, respectively. The combination (σ = 10, C = 30) gave the best results: accuracy 0.988 ± 0.037, specificity 0.973 ± 0.085, and sensitivity 1 ± 0.09. Conclusions: the voltammograms showed higher current values for tap water samples, 9.94e-6 μA compared to 7.99e-6 μA, due to the higher chloride ion concentration in the former. The parameter combination (σ = 10, C = 20) was selected as optimal since there was no significant difference between it and the former.


