Classification of Imbalanced Data with Random Sets and Mean-Variance Filtering

Author(s):  
Nikulin Vladimir

Imbalanced data represent a significant problem because the corresponding classifier tends to ignore patterns that have smaller representation in the training set. We propose to consider a large number of balanced training subsets in which representatives of the larger class are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients, where rows represent random subsets and columns represent features. Based on this matrix, we assess the stability of the influence of each feature, and we propose to keep in the model only features with stable influence. The final model is an average of the single models, which are not necessarily linear regressions. The model proved efficient and competitive during the PAKDD-2007 Data Mining Competition.
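To make the procedure concrete, here is a minimal sketch of the described scheme in Python: random balanced subsets, one linear regression per subset, and a mean-variance filter on the resulting coefficient matrix. The |mean|/std stability criterion and the `threshold` value are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def stable_features(X, y, n_subsets=100, threshold=2.0, seed=None):
    """Fit linear regressions on random balanced subsets and keep
    features whose coefficients have a stable influence."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # assume 1 labels the smaller class
    majority = np.flatnonzero(y == 0)
    coefs = np.empty((n_subsets, X.shape[1]))
    for i in range(n_subsets):
        # draw a random majority sample of the same size as the minority
        sample = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, sample])
        coefs[i] = LinearRegression().fit(X[idx], y[idx]).coef_
    # mean-variance filtering: keep features with |mean| / std above threshold
    ratio = np.abs(coefs.mean(axis=0)) / (coefs.std(axis=0) + 1e-12)
    return np.flatnonzero(ratio >= threshold)
```

The surviving feature indices would then be used to refit the averaged final model on each balanced subset.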

Author(s):  
Anisa Anisa ◽  
Mesran Mesran

Data mining is the discovery of information by searching for patterns and trends in very large amounts of data to support future decision making. Classification techniques derive such patterns from collected records (a training set) with a class attribute; C4.5 is a decision-tree induction algorithm of this kind. By applying it to the employment data of graduates, gathered through alumni questionnaires, information about graduates' interests, talents, and work can be obtained. Patterns of work are sought in the large-scale data and analyzed with the C4.5 algorithm, which investigates the attributes that influence the outcome and derives interconnected rules that classify objects into different classes or categories, thereby revealing the patterns of work. The application used is Tanagra, data mining software for academic and research purposes that covers methods from data analysis through classification.
Keywords: analysis, Data Mining, C4.5 method, Tanagra, patterns of work
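As an illustration of the decision-tree induction step, here is a brief, hypothetical sketch in Python. The data frame and its columns are invented for illustration, and scikit-learn implements CART rather than C4.5; CART with the entropy criterion is used here as a common stand-in, while Tanagra (and WEKA) offer C4.5 directly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical alumni questionnaire data: each row is a graduate,
# "field_of_work" is the class attribute to predict.
df = pd.DataFrame({
    "gpa":           [3.1, 2.8, 3.6, 3.4, 2.9, 3.8],
    "interest":      [0, 1, 0, 1, 1, 0],      # 0 = technical, 1 = managerial
    "internship":    [1, 0, 1, 1, 0, 1],
    "field_of_work": ["IT", "sales", "IT", "mgmt", "sales", "IT"],
})
X, y = df.drop(columns="field_of_work"), df["field_of_work"]

# Entropy-based splitting approximates C4.5's information-gain criterion.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # the induced rules
```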


Author(s):  
Ghulam Fatima ◽  
Sana Saeed

In the data mining community, data sets with imbalanced class distributions have received growing attention. The evolving field of data mining and knowledge discovery seeks precise and effective computational tools for investigating such data sets and extracting new knowledge from them. Sampling methods re-balance imbalanced data sets and consequently improve the performance of classifiers. In the classification of imbalanced data sets, over-fitting and under-fitting are two striking problems. In this study, a novel weighted ensemble method is proposed to diminish the influence of over-fitting and under-fitting when classifying these kinds of data sets. Forty imbalanced data sets with varying imbalance ratios are used in a comparative study. The performance of the proposed method is compared with four standard classifiers: decision tree (DT), k-nearest neighbor (KNN), support vector machines (SVM), and neural network (NN). The evaluation is completed with two over-sampling procedures, the adaptive synthetic sampling approach (ADASYN) and the synthetic minority over-sampling technique (SMOTE). The proposed scheme was effective in diminishing the impact of over-fitting and under-fitting on the classification of these data sets.
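The abstract does not give the paper's weighting scheme, but a generic version of the idea, SMOTE re-balancing followed by a weighted soft-voting ensemble over the same four base classifiers, can be sketched as follows; the `weights` values are placeholders, not the paper's learned weights.

```python
from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def weighted_ensemble(X_train, y_train, weights=(1, 1, 2, 2)):
    # Re-balance the training set first, then combine the four base
    # classifiers with (assumed) performance-based voting weights.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
    ensemble = VotingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(max_depth=5)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
            ("svm", SVC(probability=True)),  # soft voting needs probabilities
            ("nn", MLPClassifier(max_iter=500)),
        ],
        voting="soft",
        weights=list(weights),
    )
    return ensemble.fit(X_res, y_res)
```

In practice the weights would be tuned on validation data, which is where a method like this can trade off over-fitting against under-fitting.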


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Haiyan Wang ◽  
Peidi Xu ◽  
Jinghua Zhao

The KNN algorithm is one of the most famous algorithms in machine learning and data mining. It does not preprocess the data before classification, which leads to long classification times and more errors. To solve these problems, this paper first proposes a PK-means++ algorithm, which better ensures the stability of a random experiment. Then, based on it and on spherical region division, an improved algorithm, KNNPK+, is proposed. The algorithm selects the centers of the spherical regions appropriately and then constructs an initial classifier for the training set, improving both the accuracy and the speed of classification.
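The abstract does not spell out KNNPK+ itself, but the underlying idea, partitioning the training set into spherical regions around well-chosen centers and restricting the neighbor search to one region per query, can be sketched as follows. The class name `RegionKNN` and the use of scikit-learn's standard k-means++ initialization in place of PK-means++ are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

class RegionKNN:
    """Partition the training set into spherical regions via k-means++
    centers, then run KNN only inside the region closest to each query."""
    def __init__(self, n_regions=8, k=5):
        self.kmeans = KMeans(n_clusters=n_regions, init="k-means++", n_init=10)
        self.k = k

    def fit(self, X, y):
        labels = self.kmeans.fit_predict(X)
        # One small KNN classifier per spherical region.
        self.region_knn = {
            r: KNeighborsClassifier(n_neighbors=min(self.k, np.sum(labels == r)))
                 .fit(X[labels == r], y[labels == r])
            for r in range(self.kmeans.n_clusters)
        }
        return self

    def predict(self, X):
        X = np.asarray(X)
        regions = self.kmeans.predict(X)
        return np.array([
            self.region_knn[r].predict(x.reshape(1, -1))[0]
            for x, r in zip(X, regions)
        ])
```

Searching only one region cuts the per-query cost roughly by the number of regions, at some risk of missing true neighbors near region boundaries.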


2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Mohammad Haekal ◽  
Henki Bayu Seta ◽  
Mayanda Mega Santoni

To predict the water quality of the Ciliwung river, data from online monitoring were processed using data mining methods. First, the monitoring data were compiled into a Microsoft Excel table and then processed into a decision tree using the Decision Tree algorithm in the WEKA application. The Decision Tree method was chosen because it is simple, easy to understand, and highly accurate. A total of 5,476 water-quality monitoring records for the Ciliwung river were processed. Classification with the decision tree showed that, of these 5,476 records, 1,059 (19.3242%) indicated the Ciliwung river was Not Polluted and 4,417 (80.6758%) indicated it was Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation (10 folds), and Percentage Split (66%). All four test options showed very high accuracy, above 99%. From these results it can be predicted that the Ciliwung river is polluted with reference to Government Regulation of the Republic of Indonesia No. 82 of 2001, and it was also found that using the WEKA application with the Decision Tree algorithm to process the monitoring data with three parameters (pH, DO, and nitrate) is very accurate and appropriate. Keywords: river water quality, Data Mining, Decision Tree algorithm, WEKA application.
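The paper uses WEKA's decision tree with its built-in test options; an analogous sketch in Python with scikit-learn is shown below. The monitoring rows, class labels, and thresholds here are invented for illustration, and WEKA's "Cross-Validation folds 10" corresponds to `cv=10` (reduced to `cv=3` only because this toy set has six rows).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical monitoring records: columns are pH, DO (mg/L), nitrate (mg/L);
# labels follow the paper's two classes (0 = not polluted, 1 = polluted).
X = np.array([[7.1, 6.8, 4.0], [6.5, 2.1, 12.5], [7.3, 7.2, 3.1],
              [6.2, 1.8, 14.0], [6.9, 5.9, 5.5], [6.0, 1.5, 15.2]])
y = np.array([0, 1, 0, 1, 0, 1])

tree = DecisionTreeClassifier()
scores = cross_val_score(tree, X, y, cv=3)   # cv=10 on the real 5,476 records
print(f"mean accuracy: {scores.mean():.4f}")
```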


Author(s):  
Fransiskus Ginting ◽  
Efori Buulolo ◽  
Edward Robinson Siagian

Data Mining is the discovery of information by extracting patterns that reveal trends in very large amounts of data, and it supports the storage of data for making decisions in the future. Classification techniques derive such patterns from collected records (a training set). Regional revenue is generally derived from local taxes and levies; on a national average, local taxes have not yet been able to make a large contribution to the formation of regional revenue. By utilizing regional revenue data, forecasts and predictions of future regional revenue can be produced that match reality, so that the planned regional budget (RAPBD) can run smoothly. Simple Linear Regression, often abbreviated SLR, is a statistical method used to make predictions about quality and quantity characteristics and to describe the processes associated with processing regional revenue data. In the testing phase, Visual Basic .NET can then help process valid regional revenue data. Keywords: Data Mining, Local Revenue, Simple Linear Regression Algorithm, Visual Basic .NET 2008
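The paper implements SLR in Visual Basic .NET; the method itself reduces to two closed-form least-squares formulas, sketched here in Python on invented revenue figures (the years and amounts are placeholders, not the paper's data).

```python
import numpy as np

# Hypothetical regional revenue (in billions of rupiah) per year.
years   = np.array([2014, 2015, 2016, 2017, 2018], dtype=float)
revenue = np.array([120.0, 131.5, 140.2, 152.8, 161.0])

# Simple linear regression, y = a + b*x, via the usual least-squares formulas:
#   b = cov(x, y) / var(x),  a = mean(y) - b * mean(x)
b = np.cov(years, revenue, bias=True)[0, 1] / np.var(years)
a = revenue.mean() - b * years.mean()

print(f"forecast for 2019: {a + b * 2019:.1f}")
```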


2018 ◽  
Vol 35 (4) ◽  
pp. 133-136
Author(s):  
R. N. Ibragimov

The article examines the impact of internal and external risks on the stability of the financial system of the Altai Territory. A classification of the internal and external risks of decline that affect the sustainable development of the financial system is presented. A risk management strategy is proposed that allows risks to be monitored; these measures will help reduce the loss of financial stability and ensure the long-term development of the region's economy.


Author(s):  
Ruben Kosyan ◽  
Viacheslav Krylenko

There are many coast classifications that indicate the main coastal features. As a rule, they consider the "static" state of a coast, regardless of its evolutionary features and likely further transformation. Since most coastal zone studies aim to support economic activity, it is clear that a classification of coast types should convey all the information its users require. Accordingly, a coast classification should include criteria characterizing both the dynamic features of the coast and the conditions and opportunities for economic activity. Such a classification should, of course, be based on a geomorphological typification of coasts. A suitable typification has been developed by leading Russian scientists and can be used with minimal modification. The authors propose to add to the basic information (the geomorphological type of the coast) an evaluative part for each coast sector, comprising an estimate of the probability of coastal change and of the complexity of stabilizing the coast for economic activity. This method will allow the dynamics of specific coastal sections and the intensity of coastal processes to be assessed and, as a result, the stability of the coastal area.


Author(s):  
Rodrigo Madurga ◽  
Noemí García-Romero ◽  
Beatriz Jiménez ◽  
Ana Collazo ◽  
Francisco Pérez-Rodríguez ◽  
...  

Abstract Molecular classification of glioblastoma has enabled a deeper understanding of the disease. The four-subtype model (Proneural, Classical, Mesenchymal, and Neural) has been replaced by a model that discards the Neural subtype, which was found to be associated with samples containing a high proportion of normal tissue. Such samples can be misclassified, preventing biological and clinical insights into the different tumor subtypes from coming to light. In this work, we present a model that tackles both the molecular classification of samples and the discrimination of those with a high content of normal cells. We performed a transcriptomic in silico analysis on glioblastoma (GBM) samples (n = 810) and tested different criteria to optimize the number of genes needed for molecular classification. We used the gene expression of normal brain samples (n = 555) to design an additional gene signature to detect samples with a high normal tissue content. Microdissection samples of different structures within GBM (n = 122) were used to validate the final model. Finally, the model was tested in a cohort of 43 patients and confirmed by histology. Based on the expression of 20 genes, our model is able to discriminate samples with a high content of normal tissue and to classify the remaining ones. We have shown that taking normal cells into consideration can prevent errors in the classification and the subsequent misinterpretation of the results. Moreover, considering only samples with a low content of normal cells, we found an association between the complexity of the samples and survival for the three molecular subtypes.
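The paper's exact 20-gene model is not reproduced here, but the two-step logic it describes, first screening out normal-like samples against a normal-brain signature and then assigning the rest to the nearest subtype, can be sketched generically; the correlation-based scoring, the `normal_cutoff` value, and the function name are assumptions for illustration.

```python
import numpy as np

def classify_gbm(expr, subtype_centroids, normal_signature, normal_cutoff=0.8):
    """expr: (n_samples, n_genes) expression matrix over the signature genes;
    subtype_centroids: dict mapping subtype name -> (n_genes,) centroid."""
    def corr(a, b):  # Pearson correlation between two expression profiles
        a, b = a - a.mean(), b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    labels = []
    for sample in expr:
        if corr(sample, normal_signature) >= normal_cutoff:
            labels.append("normal-like")          # high normal-tissue content
        else:
            labels.append(max(subtype_centroids,  # Proneural/Classical/Mesenchymal
                              key=lambda s: corr(sample, subtype_centroids[s])))
    return labels
```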

