nearest neighbors
Recently Published Documents





Raheem Sarwar ◽ Saeed-Ul Hassan

The authorship identification task aims to identify the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. Moreover, due to the unavailability of reliable POS taggers and sentence segmenters, all existing authorship identification studies on Urdu text are limited to word n-gram features. To overcome these limitations, we formulate a stylometric feature space that is not limited to word n-gram features. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on the nearest neighbors classifier to predict the original author of the anonymous text sample. To evaluate our solution, we create a significantly larger corpus than those used in existing studies and conduct several experimental studies, which show that our solution overcomes the limitations of existing studies and reaches an accuracy of 94.03%, higher than all previous authorship identification works.
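
The point-set idea in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three toy features, the chunk size, the set-to-set distance, and the names `chunk_features`, `to_point_set`, `set_distance`, and `predict_author` are all hypothetical.

```python
import math
from collections import Counter

def chunk_features(words):
    """Map one chunk of tokens to a small stylometric vector (toy features, assumed)."""
    n = len(words)
    avg_len = sum(len(w) for w in words) / n            # average token length
    type_token = len(set(words)) / n                    # vocabulary richness
    short_rate = sum(1 for w in words if len(w) <= 3) / n  # share of short tokens
    return (avg_len, type_token, short_rate)

def to_point_set(text, chunk_size=20):
    """Split a text into fixed-size word chunks and turn each chunk into a point."""
    words = text.split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [chunk_features(c) for c in chunks if c]

def set_distance(a, b):
    """Symmetric average of nearest-point distances between two point sets (assumed metric)."""
    def one_way(x, y):
        return sum(min(math.dist(p, q) for q in y) for p in x) / len(x)
    return (one_way(a, b) + one_way(b, a)) / 2

def predict_author(anonymous_text, training, k=1):
    """training: list of (author, text). Majority vote among the k nearest samples."""
    query = to_point_set(anonymous_text)
    ranked = sorted(training, key=lambda s: set_distance(query, to_point_set(s[1])))
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the features only need whitespace-separated tokens, a sketch like this does not depend on a POS tagger or sentence segmenter, which matches the constraint the abstract describes for Urdu.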

2022 ◽ Vol 2022 ◽ pp. 1-11
Yunsheng Song ◽ Xiaohan Kong ◽ Chao Zhang

Because it makes no assumptions about the underlying distribution of the data and has strong generalization ability, the k-nearest neighbor (kNN) classification algorithm is widely used in face recognition, text classification, emotional analysis, and other fields. However, kNN must compute the similarity between the unlabeled instance and all training instances during prediction, which makes it difficult to apply to large-scale data. To overcome this difficulty, an increasing number of acceleration algorithms based on data partition have been proposed, but they lack theoretical analysis of the effect of data partition on classification performance. This paper analyzes that effect theoretically using empirical risk minimization and proposes a large-scale k-nearest neighbor classification algorithm based on neighbor relationship preservation. The process of searching for the nearest neighbors is converted into a constrained optimization problem. The paper then estimates the difference in the objective function value under the optimal solution with and without data partition. According to this estimate, minimizing the similarity between instances in different subsets can largely reduce the effect of data partition. The minibatch k-means clustering algorithm is chosen to perform the data partition for its effectiveness and efficiency. Finally, the nearest neighbors of the test instance are searched repeatedly in the set generated by successively merging candidate subsets until they no longer change, where the candidate subsets are selected based on the similarity between the test instance and the cluster centers. Experimental results on public datasets show that the proposed algorithm largely keeps the same nearest neighbors as the original kNN classification algorithm, shows no significant difference in classification accuracy from it, and achieves better results than two state-of-the-art algorithms.
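
The search procedure described in this abstract (partition the training set, then merge candidate subsets nearest-center-first until the neighbors stop changing) can be sketched roughly as follows. This is an illustration under stated assumptions and not the authors' implementation: plain Lloyd's k-means with a deterministic, evenly spaced initialization stands in for minibatch k-means, Euclidean distance stands in for the similarity measure, and the function names are hypothetical.

```python
import math
from collections import Counter

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's k-means (a stand-in for the minibatch variant; assumed init)."""
    step = len(points) // n_clusters
    centers = [points[i * step] for i in range(n_clusters)]  # evenly spaced seeds
    assign = [0] * len(points)
    for _ in range(n_iter):
        assign = [min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assign

def partitioned_knn_predict(query, X, y, centers, assign, k=3):
    """Merge clusters nearest-center-first; stop once the k neighbors stop changing."""
    order = sorted(range(len(centers)), key=lambda c: math.dist(query, centers[c]))
    candidates, neighbors = [], None
    for c in order:
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue  # an empty subset cannot change the neighbors
        candidates += members
        found = sorted(candidates, key=lambda i: math.dist(query, X[i]))[:k]
        if found == neighbors and len(found) == k:
            break  # merging another subset did not change the neighbor set
        neighbors = found
    label = Counter(y[i] for i in neighbors).most_common(1)[0][0]
    return label, neighbors
```

The early stop is a heuristic: it avoids scanning every subset, but unlike exhaustive kNN it is not guaranteed to return the exact neighbors for every partition, which is the kind of gap the abstract's theoretical analysis addresses.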

2022 ◽ Vol 15 (2) ◽ pp. 247-260
Jin-Ting Zhang ◽ Tianming Zhu

2022 ◽ pp. 195-208
Guy E. Blelloch ◽ Magdalen Dobson

Baydaulet Urmashev ◽ Zholdas Buribayev ◽ Zhazira Amirgaliyeva ◽ Aisulu Ataniyazova ◽ Mukhtar Zhassuzak ◽

Detecting weeds during cultivation is very important for detecting and preventing plant diseases and avoiding significant crop losses. Traditional methods for this process require large costs and human resources and expose workers to the risk of contamination with harmful chemicals. To address these problems, and to save herbicides and pesticides and obtain environmentally friendly products, a program for detecting agricultural pests is proposed that uses the classical K-Nearest Neighbors, Random Forest, and Decision Tree algorithms as well as the YOLOv5 neural network. After analyzing the geographical areas of the country, a proprietary database with more than 1000 images per class was formed from images of the collected weeds. A brief review is given of scientific papers describing methods for identifying, classifying, and discriminating weeds based on machine learning algorithms, convolutional neural networks, and deep learning algorithms. As a result of the research, a weed detection system based on the YOLOv5 architecture was developed, and quality estimates of the above algorithms were obtained. According to the assessment, the weed detection accuracy of the K-Nearest Neighbors, Random Forest, and Decision Tree classifiers was 83.3%, 87.5%, and 80%, respectively. Because the images of each weed species differ in resolution and level of illumination, the neural network's results fall in the range 0.82–0.92 for each class. Quantitative results obtained on real data demonstrate that the proposed approach can provide good results in classifying low-resolution images of weeds.

2021 ◽ Vol 12 (1) ◽ pp. 115
Khongorzul Dashdondov ◽ Mi-Hwa Song

Natural gas (NG), typically methane, is released into the air, causing significant air pollution and environmental and health problems. There is now a broad need for machine-based methods to predict gas losses. In this article, we propose to predict NG leakage levels through feature selection based on a factorial analysis (FA) of the USA's urban natural gas open data. The method has three modules. First, we select essential features using FA. Then, the dataset is labeled by k-means clustering with OrdinalEncoder (OE)-based normalization. The final module uses several algorithms (extreme gradient boost (XGBoost), K-nearest neighbors (KNN), decision tree (DT), random forest (RF), Naive Bayes (NB), and multilayer perceptron (MLP)) to predict gas leakage levels. The proposed method is evaluated by accuracy, F1-score, mean squared error (MSE), and area under the ROC curve (AUC). The test results indicate that the F-OE-based classification method improves prediction performance. F-OE-based XGBoost (F-OE-XGBoost) showed the best performance, with 95.14% accuracy, an F1-score of 95.75%, an MSE of 0.028, and an AUC of 96.29%. The second-best results, an accuracy of 95.09%, an F1-score of 95.60%, an MSE of 0.029, and an AUC of 96.11%, were achieved by the F-OE-RF model.
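
For readers unfamiliar with the evaluation measures listed in this abstract, three of them can be computed as in the sketch below. Several points are assumptions: MSE is read as mean squared error over ordinal-encoded leakage levels, the F1-score is macro-averaged (the abstract does not say which averaging was used), AUC is omitted, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted level equals the true level."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between ordinal-encoded levels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

With ordinal labels, MSE penalizes a prediction that is two levels off more heavily than one that is a single level off, which accuracy and F1 do not distinguish.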
