Ways to build text collections for training classifiers

2021 ◽  
Vol 87 (7) ◽  
pp. 76-84
Author(s):  
N. I. Mulatov ◽  
A. S. Mokhov ◽  
V. О. Tolcheev

We address the problem of building a Russian-language text collection (dataset) of bibliographic descriptions of scientific articles for training classifiers. Various approaches to creating such collections are considered, and the expediency of using expert estimates to assign class labels is assessed. Known datasets are analyzed, requirements for the generated text array are formulated, and the choice of subject area (Computer Science) is justified. We propose a technique for forming a collection when Russian-language articles are scarce: publications (bibliographic descriptions) from available English-language electronic libraries (ACM Digital Library, IEEE Xplore Digital Library, CiteSeerX) are machine-translated, with additional expert quality control of the translation. The resulting bibliographic collection was studied using clustering (Latent Semantic Analysis) and visualization (Principal Component Analysis). Training and test samples were compiled, and "standard" classifiers (K-Nearest Neighbor, Logistic Regression, Random Forest) were applied. Standard quality measures (accuracy, precision, recall) were then calculated for both rigid and soft classification. For rigid classification all calculated measures (for the studied classifiers) fell within [0.79; 0.87], and for soft classification within [0.91; 0.95]. The experiments showed almost identical results for Russian and English bibliographic descriptions (the difference did not exceed 2%). The proposed method reduces the complexity of labeling compared with the expert approach, addresses the lack of Russian-language documents, and allows sufficiently large, balanced bibliographic datasets to be formed for training and testing classifiers.
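
The quality measures named in this abstract (accuracy, precision, recall) can be illustrated with a minimal sketch. It assumes macro-averaging over classes, which the abstract does not specify, and uses made-up labels rather than data from the paper; the function name `classification_report` is hypothetical.

```python
from collections import Counter

def classification_report(y_true, y_pred):
    """Return accuracy plus macro-averaged precision and recall.

    Macro-averaging is an assumption; the abstract does not say
    how per-class scores were combined.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the true class but was missed
    accuracy = sum(tp.values()) / len(y_true)
    precisions = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels]
    recalls = [tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0 for c in labels]
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
    }

# Hypothetical labels for three made-up classes (not data from the paper).
y_true = ["ai", "ai", "db", "db", "net", "net"]
y_pred = ["ai", "db", "db", "db", "net", "ai"]
report = classification_report(y_true, y_pred)
print(report)
```

Classes that are never predicted get a precision of 0.0 here; other conventions (for example, skipping such classes) are equally defensible.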

Author(s):  
Mohammed Jawad Al Dujaili ◽  
Abbas Ebrahimi-Moghadam ◽  
Ahmed Fatlawi

Recognizing emotion in speech is one of the most active research topics in speech processing and human-computer interaction. Despite a wide range of studies in this area, a large gap remains between natural human feelings and the computer's perception of them. In general, a speech emotion recognition system can be divided into three main stages: feature extraction, feature selection, and classification. In this paper, fundamental frequency (F0), energy (E), zero-crossing rate (ZCR), and Fourier parameter (FP) features, together with various combinations of them, are extracted from the data vector. The principal component analysis (PCA) algorithm is then used to reduce the number of features. To evaluate system performance, each emotional state is then classified using classifiers such as support vector machine (SVM) and K-nearest neighbor (KNN). For comparison, similar experiments were performed on emotional speech in German and English, and these comparisons yielded significant results.
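
Two of the features listed in this abstract, energy and zero-crossing rate, are simple enough to sketch directly. The version below is a minimal illustration under common textbook definitions, not the authors' implementation; the function names and the synthetic 200 Hz test frame are assumptions.

```python
import math

def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

# Synthetic frame: 50 ms of a 200 Hz sine sampled at 8 kHz (made-up signal).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]

energy = short_time_energy(frame)
zcr = zero_crossing_rate(frame)
print(f"energy={energy:.2f}, zcr={zcr:.4f}")
```

A pure tone crosses zero twice per period, so the rate for this frame comes out near 2 * 200 / 8000 = 0.05; noisier or unvoiced speech frames typically give higher values.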


Author(s):  
Zhu Siyu ◽  
He Chongnan ◽  
Song Mingjuan ◽  
Li Linna

In response to the frequent counterfeiting of Wuchang rice in the market, an effective method for identifying brand rice is proposed. Taking near-infrared spectroscopy data from a total of 373 grains of rice from four origins (Wuchang, Shangzhi, Yanshou, and Fangzheng) as the observations, kernel principal component analysis (KPCA) was employed to reduce the dimensionality, and Fisher discriminant analysis (FDA) and the k-nearest neighbor algorithm (KNN) were each used to identify brand rice. Both recognition methods perform well, with KNN performing somewhat better. However, the shortcomings of KNN are obvious: it examines samples along only one dimension, so its discrimination is not fine-grained enough. To further improve recognition accuracy, a fuzzy k-nearest neighbor set is defined and fuzzy probability theory is employed to obtain a new recognition method: the Two-Parameter KNN discrimination method. Compared with the KNN algorithm, this method adds an examination dimension. It examines not only the proportion of samples from each pattern class in the k-nearest neighbor set, but also the degree of similarity between the center of each pattern class and the sample to be identified. The recognition process is therefore finer-grained and the recognition accuracy is higher. In the identification of brand rice, the discriminant accuracy of the Two-Parameter KNN algorithm is significantly higher than that of FDA and that of the KNN algorithm.
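
The two quantities this abstract says the method examines can be sketched separately. This is not the authors' Two-Parameter KNN: the abstract does not give the fuzzy-probability rule that combines them, so the sketch only computes (1) class proportions in the k-nearest set and (2) the distance from the query to each class center, using Euclidean distance as an assumed similarity measure and made-up 2-D points.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neighbor_proportions(train, query, k):
    """Share of each class among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in counts.items()}

def center_distances(train, query):
    """Distance from query to the mean vector (center) of each class."""
    groups = {}
    for features, label in train:
        groups.setdefault(label, []).append(features)
    result = {}
    for label, rows in groups.items():
        center = [sum(col) / len(rows) for col in zip(*rows)]
        result[label] = euclidean(center, query)
    return result

# Made-up training points for two hypothetical classes "A" and "B".
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.2), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.8), "B"), ((2.8, 3.2), "B"),
]
query = (1.1, 1.0)

proportions = neighbor_proportions(train, query, k=3)
distances = center_distances(train, query)
knn_label = max(proportions, key=proportions.get)  # plain KNN majority vote
print(proportions, distances, knn_label)
```

Plain KNN uses only the first quantity; the second gives extra evidence when the vote among neighbors is close, which is the intuition the abstract describes.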


2020 ◽  
Vol 8 (5) ◽  
pp. 2522-2527

In this paper, we design a method for fingerprint and iris recognition using feature-level fusion and decision-level fusion in a multimodal biometric system for children. Initially, Histogram of Oriented Gradients (HOG), Gabor, and maximum filter response features are extracted from both the fingerprint and iris domains and evaluated for identification accuracy. Combining the feature vectors of all available features across the biometric traits is recommended for fusion. Principal Component Analysis (PCA) is applied to the fused vector to select features. The reduced features are fed into fusion classifiers: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Naive Bayes (NB). A suitable combination of features and fusion classifiers is identified for the children's multimodal biometric system. Experiments conducted on a database of children's fingerprints and irises show that the fused combination outperforms the individual ones. In addition, the proposed model improves on the unimodal biometric system.


2020 ◽  
Vol 2 (2) ◽  
pp. 29-38
Author(s):  
Abdur Rohman Harits Martawireja ◽  
Hilman Mujahid Purnama ◽  
Atika Nur Rahmawati

Human face recognition is an important research field, and many recent applications use it, both commercially and in law enforcement. Face recognition is a biometric-based system that identifies a person from the features of their face with high accuracy. It can be applied to security systems. Many methods can be used in face recognition applications for system security; this article discusses two of them, Two-Dimensional Principal Component Analysis and Kernel Fisher Discriminant Analysis, with K-Nearest Neighbor as the classification method. Both methods are tested using cross-validation. Results from previous research show that a face recognition system using Two-Dimensional Principal Component Analysis achieved an accuracy of 88.73% with 5-fold cross-validation and 89.25% with 2-fold cross-validation, while Kernel Fisher Discriminant with 2-fold cross-validation achieved an average accuracy of 83.10%.


2018 ◽  
Vol 7 (3.33) ◽  
pp. 128
Author(s):  
Ki Young Lee ◽  
Kyu Ho Kim ◽  
Jeong Jin Kang ◽  
Sung Jai Choi ◽  
Yong Soon Im ◽  
...  

Real-time facial expression recognition and analysis technology has recently drawn attention in computer vision, computer graphics, and HCI. Recognition of a user's emotion on the basis of video and voice is drawing particular interest. The technology may help managers of households or hospitals. In the present study, video and voice were converted into digital data in MATLAB, and the PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), and KNN (K-Nearest Neighbor) algorithms were used to analyze emotions through machine learning. The manager of the psychological analysis counseling system can understand a user's emotion in a smartphone environment. Based on the psychological analysis results it provides, the system may help the manager hold a smooth conversation or develop a smooth relationship with a user.


Foods ◽  
2019 ◽  
Vol 8 (1) ◽  
pp. 38 ◽  
Author(s):  
Xiaohong Wu ◽  
Jin Zhu ◽  
Bin Wu ◽  
Chao Zhao ◽  
Jun Sun ◽  
...  

The detection of liquor quality is an important process in the liquor industry, and the quality of Chinese liquors is partly determined by the aromas of the liquors. The electronic nose (e-nose) refers to an artificial olfactory technology. The e-nose system can quickly detect different types of Chinese liquors according to their aromas. In this study, an e-nose system was designed to identify six types of Chinese liquors, and a novel feature extraction algorithm, called fuzzy discriminant principal component analysis (FDPCA), was developed for feature extraction from e-nose signals by combining discriminant principal component analysis (DPCA) and fuzzy set theory. In addition, principal component analysis (PCA), DPCA, K-nearest neighbor (KNN) classifier, leave-one-out (LOO) strategy and k-fold cross-validation (k = 5, 10, 20, 25) were employed in the e-nose system. The maximum classification accuracy of feature extraction for Chinese liquors was 98.378% using FDPCA, showing this algorithm to be extremely effective. The experimental results indicate that an e-nose system coupled with FDPCA is a feasible method for classifying Chinese liquors.
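
The evaluation strategies named in this abstract, k-fold cross-validation and leave-one-out (LOO) with a KNN classifier, can be sketched as follows. This is a minimal illustration on made-up 2-D points, not e-nose data; fold assignment by striding and the function names are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def cross_validate(data, n_folds, k=1):
    """Fraction of held-out samples classified correctly, pooled over folds.

    Setting n_folds == len(data) gives leave-one-out evaluation.
    """
    if not 2 <= n_folds <= len(data):
        raise ValueError("n_folds must be between 2 and len(data)")
    folds = [data[i::n_folds] for i in range(n_folds)]  # strided split
    correct = 0
    for i, test_fold in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in test_fold)
    return correct / len(data)

# Two well-separated, made-up clusters of ten points each.
data = [((0.1 * i, 0.0), "low") for i in range(10)]
data += [((5.0 + 0.1 * i, 0.0), "high") for i in range(10)]

five_fold = cross_validate(data, n_folds=5)
leave_one_out = cross_validate(data, n_folds=len(data))
print(five_fold, leave_one_out)
```

With real data the samples would normally be shuffled (or stratified by class) before splitting; the strided split here keeps the sketch deterministic.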


2020 ◽  
Vol 1 (1) ◽  
pp. 17-21
Author(s):  
Steve Oscar ◽  
Mohammed Nazim Uddin ◽  

Modern life is becoming more linked to our devices, and work is being done in a more regulated way. As life becomes more complicated, it is becoming challenging to keep track of human health and fitness, leading to unexpected illnesses and diseases. Moreover, a lack of activity monitoring and corresponding reminders is preventing the adoption of a healthier lifestyle. This research provides a practical approach for identifying human activity using accelerometer data obtained from wearable devices. The model automatically finds patterns among 33 different physical exercises, such as running, rowing, cycling, and jogging, and identifies them. The principal component analysis algorithm was applied to the statistical features to make the system more robust. Classification of the physical exercises was performed on the reduced features using WEKA. An overall accuracy of 85.51% was obtained using 10-fold cross-validation with the K-nearest neighbor algorithm, compared with 84% for Random Forest. The accuracy obtained was better than that of previous models and could help recognition systems monitor user activity more precisely.

