Classification of Imbalanced Data with Random Sets and Mean-Variance Filtering

Author(s):  
Nikulin Vladimir

Imbalanced data represent a significant problem because the corresponding classifier tends to ignore patterns that have smaller representation in the training set. We propose to consider a large number of balanced training subsets in which representatives of the larger class are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients, where rows represent random subsets and columns represent features. Based on this matrix, we assess the stability of the influence of each feature, and we propose to keep in the model only features with stable influence. The final model is an average of the single models, which are not necessarily linear regressions. The model proved efficient and competitive during the PAKDD-2007 Data Mining Competition.
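To make the procedure concrete, here is a minimal sketch of the described scheme in Python: random balanced subsets, one linear regression per subset, and a mean-variance filter on the resulting coefficient matrix. The |mean|/std stability criterion and the `threshold` value are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def stable_features(X, y, n_subsets=100, threshold=2.0, seed=None):
    """Fit linear regressions on random balanced subsets and keep
    features whose coefficients have a stable influence."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)   # assume 1 labels the smaller class
    majority = np.flatnonzero(y == 0)
    coefs = np.empty((n_subsets, X.shape[1]))
    for i in range(n_subsets):
        # draw a random majority sample of the same size as the minority
        sample = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, sample])
        coefs[i] = LinearRegression().fit(X[idx], y[idx]).coef_
    # mean-variance filtering: keep features with |mean| / std above threshold
    ratio = np.abs(coefs.mean(axis=0)) / (coefs.std(axis=0) + 1e-12)
    return np.flatnonzero(ratio >= threshold)
```

The surviving feature indices would then be used to refit the averaged final model on each balanced subset.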

Author(s):  
Anisa Anisa ◽  
Mesran Mesran

Data mining is the discovery of information by searching for patterns and trends in very large amounts of data to support future decision making. Classification techniques derive such patterns from collected records (a training set) with a class attribute; C4.5 is a decision-tree induction algorithm of this kind. By applying it to the employment data of graduates, gathered through alumni questionnaires, information about graduates' interests, talents, and work can be obtained. Patterns of work are sought in the large-scale data and analyzed with the C4.5 algorithm, which investigates the attributes that influence the outcome and derives interconnected rules that classify objects into different classes or categories, thereby revealing the patterns of work. The application used is Tanagra, data mining software for academic and research purposes that covers methods from data analysis through classification.
Keywords: analysis, Data Mining, C4.5 method, Tanagra, patterns of work
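As an illustration of the decision-tree induction step, here is a brief, hypothetical sketch in Python. The data frame and its columns are invented for illustration, and scikit-learn implements CART rather than C4.5; CART with the entropy criterion is used here as a common stand-in, while Tanagra (and WEKA) offer C4.5 directly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical alumni questionnaire data: each row is a graduate,
# "field_of_work" is the class attribute to predict.
df = pd.DataFrame({
    "gpa":           [3.1, 2.8, 3.6, 3.4, 2.9, 3.8],
    "interest":      [0, 1, 0, 1, 1, 0],      # 0 = technical, 1 = managerial
    "internship":    [1, 0, 1, 1, 0, 1],
    "field_of_work": ["IT", "sales", "IT", "mgmt", "sales", "IT"],
})
X, y = df.drop(columns="field_of_work"), df["field_of_work"]

# Entropy-based splitting approximates C4.5's information-gain criterion.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # the induced rules
```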


Author(s):  
Ghulam Fatima ◽  
Sana Saeed

In the data mining community, data sets with imbalanced class distributions have received growing attention. The evolving field of data mining and knowledge discovery seeks precise and effective computational tools for investigating such data sets and extracting new knowledge from them. Sampling methods re-balance imbalanced data sets and consequently improve the performance of classifiers. In the classification of imbalanced data sets, over-fitting and under-fitting are two striking problems. In this study, a novel weighted ensemble method is proposed to diminish the influence of over-fitting and under-fitting when classifying these kinds of data sets. Forty imbalanced data sets with varying imbalance ratios are used in a comparative study. The performance of the proposed method is compared with four standard classifiers: decision tree (DT), k-nearest neighbor (KNN), support vector machines (SVM), and neural network (NN). The evaluation is completed with two over-sampling procedures, the adaptive synthetic sampling approach (ADASYN) and the synthetic minority over-sampling technique (SMOTE). The proposed scheme was effective in diminishing the impact of over-fitting and under-fitting on the classification of these data sets.
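The abstract does not give the paper's weighting scheme, but a generic version of the idea, SMOTE re-balancing followed by a weighted soft-voting ensemble over the same four base classifiers, can be sketched as follows; the `weights` values are placeholders, not the paper's learned weights.

```python
from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

def weighted_ensemble(X_train, y_train, weights=(1, 1, 2, 2)):
    # Re-balance the training set first, then combine the four base
    # classifiers with (assumed) performance-based voting weights.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
    ensemble = VotingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(max_depth=5)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
            ("svm", SVC(probability=True)),  # soft voting needs probabilities
            ("nn", MLPClassifier(max_iter=500)),
        ],
        voting="soft",
        weights=list(weights),
    )
    return ensemble.fit(X_res, y_res)
```

In practice the weights would be tuned on validation data, which is where a method like this can trade off over-fitting against under-fitting.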


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Haiyan Wang ◽  
Peidi Xu ◽  
Jinghua Zhao

The KNN algorithm is one of the most famous algorithms in machine learning and data mining. It does not preprocess the data before classification, which leads to long classification times and more errors. To solve these problems, this paper first proposes a PK-means++ algorithm, which better ensures the stability of a random experiment. Then, based on it and on spherical region division, an improved algorithm, KNNPK+, is proposed. The algorithm selects the centers of the spherical regions appropriately and then constructs an initial classifier for the training set, improving both the accuracy and the speed of classification.
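The abstract does not spell out KNNPK+ itself, but the underlying idea, partitioning the training set into spherical regions around well-chosen centers and restricting the neighbor search to one region per query, can be sketched as follows. The class name `RegionKNN` and the use of scikit-learn's standard k-means++ initialization in place of PK-means++ are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

class RegionKNN:
    """Partition the training set into spherical regions via k-means++
    centers, then run KNN only inside the region closest to each query."""
    def __init__(self, n_regions=8, k=5):
        self.kmeans = KMeans(n_clusters=n_regions, init="k-means++", n_init=10)
        self.k = k

    def fit(self, X, y):
        labels = self.kmeans.fit_predict(X)
        # One small KNN classifier per spherical region.
        self.region_knn = {
            r: KNeighborsClassifier(n_neighbors=min(self.k, np.sum(labels == r)))
                 .fit(X[labels == r], y[labels == r])
            for r in range(self.kmeans.n_clusters)
        }
        return self

    def predict(self, X):
        X = np.asarray(X)
        regions = self.kmeans.predict(X)
        return np.array([
            self.region_knn[r].predict(x.reshape(1, -1))[0]
            for x, r in zip(X, regions)
        ])
```

Searching only one region cuts the per-query cost roughly by the number of regions, at some risk of missing true neighbors near region boundaries.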


2021 ◽  
Vol 12 (2) ◽  
Author(s):  
Mohammad Haekal ◽  
Henki Bayu Seta ◽  
Mayanda Mega Santoni

To predict the water quality of the Ciliwung river, data from online monitoring were processed using data mining methods. First, the monitoring data were compiled into a Microsoft Excel table and then processed into a decision tree using the Decision Tree algorithm in the WEKA application. The Decision Tree method was chosen because it is simple, easy to understand, and highly accurate. A total of 5,476 water-quality monitoring records for the Ciliwung river were processed. Classification with the decision tree showed that, of these 5,476 records, 1,059 (19.3242%) indicated the Ciliwung river was Not Polluted and 4,417 (80.6758%) indicated it was Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation (10 folds), and Percentage Split (66%). All four test options showed very high accuracy, above 99%. From these results it can be predicted that the Ciliwung river is polluted with reference to Government Regulation of the Republic of Indonesia No. 82 of 2001, and it was also found that using the WEKA application with the Decision Tree algorithm to process the monitoring data with three parameters (pH, DO, and nitrate) is very accurate and appropriate. Keywords: river water quality, Data Mining, Decision Tree algorithm, WEKA application.
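The paper uses WEKA's decision tree with its built-in test options; an analogous sketch in Python with scikit-learn is shown below. The monitoring rows, class labels, and thresholds here are invented for illustration, and WEKA's "Cross-Validation folds 10" corresponds to `cv=10` (reduced to `cv=3` only because this toy set has six rows).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical monitoring records: columns are pH, DO (mg/L), nitrate (mg/L);
# labels follow the paper's two classes (0 = not polluted, 1 = polluted).
X = np.array([[7.1, 6.8, 4.0], [6.5, 2.1, 12.5], [7.3, 7.2, 3.1],
              [6.2, 1.8, 14.0], [6.9, 5.9, 5.5], [6.0, 1.5, 15.2]])
y = np.array([0, 1, 0, 1, 0, 1])

tree = DecisionTreeClassifier()
scores = cross_val_score(tree, X, y, cv=3)   # cv=10 on the real 5,476 records
print(f"mean accuracy: {scores.mean():.4f}")
```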


Author(s):  
Fransiskus Ginting ◽  
Efori Buulolo ◽  
Edward Robinson Siagian

Data Mining is the discovery of information by extracting patterns that reveal trends in very large amounts of data, and it supports the storage of data for making decisions in the future. Classification techniques derive such patterns from collected records (a training set). Regional revenue is generally derived from local taxes and levies; on a national average, local taxes have not yet been able to make a large contribution to the formation of regional revenue. By utilizing regional revenue data, forecasts and predictions of future regional revenue can be produced that match reality, so that the planned regional budget (RAPBD) can run smoothly. Simple Linear Regression, often abbreviated SLR, is a statistical method used to make predictions about quality and quantity characteristics and to describe the processes associated with processing regional revenue data. In the testing phase, Visual Basic .NET can then help process valid regional revenue data. Keywords: Data Mining, Local Revenue, Simple Linear Regression Algorithm, Visual Basic .NET 2008
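The paper implements SLR in Visual Basic .NET; the method itself reduces to two closed-form least-squares formulas, sketched here in Python on invented revenue figures (the years and amounts are placeholders, not the paper's data).

```python
import numpy as np

# Hypothetical regional revenue (in billions of rupiah) per year.
years   = np.array([2014, 2015, 2016, 2017, 2018], dtype=float)
revenue = np.array([120.0, 131.5, 140.2, 152.8, 161.0])

# Simple linear regression, y = a + b*x, via the usual least-squares formulas:
#   b = cov(x, y) / var(x),  a = mean(y) - b * mean(x)
b = np.cov(years, revenue, bias=True)[0, 1] / np.var(years)
a = revenue.mean() - b * years.mean()

print(f"forecast for 2019: {a + b * 2019:.1f}")
```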


2018 ◽  
Vol 35 (4) ◽  
pp. 133-136
Author(s):  
R. N. Ibragimov

The article examines the impact of internal and external risks on the stability of the financial system of the Altai Territory. A classification of the internal and external risks of decline that affect the sustainable development of the financial system is presented. A risk management strategy is proposed that allows risks to be monitored; these measures will help reduce the loss of financial stability and ensure the long-term development of the region's economy.


Author(s):  
Ruben Kosyan ◽  
Viacheslav Krylenko

There are many coast classifications that indicate the main coastal features. As a rule, they consider the "static" state of a coast, regardless of its evolutionary features and likely further transformation. Since most coastal zone studies aim to support economic activity, it is clear that a classification of coast types should convey all the information its users require. Accordingly, a coast classification should include criteria characterizing both the dynamic features of the coast and the conditions and opportunities for economic activity. Such a classification should, of course, be based on a geomorphological typification of coasts. A suitable typification has been developed by leading Russian scientists and can be used with minimal modification. The authors propose to add to the basic information (the geomorphological type of the coast) an evaluative part for each coast sector, comprising an estimate of the probability of coastal change and of the complexity of stabilizing the coast for economic activity. This method will allow the dynamics of specific coastal sections and the intensity of coastal processes to be assessed and, as a result, the stability of the coastal area.


Author(s):  
Rodrigo Madurga ◽  
Noemí García-Romero ◽  
Beatriz Jiménez ◽  
Ana Collazo ◽  
Francisco Pérez-Rodríguez ◽  
...  

Abstract Molecular classification of glioblastoma has enabled a deeper understanding of the disease. The four-subtype model (Proneural, Classical, Mesenchymal, and Neural) has been replaced by a model that discards the Neural subtype, which was found to be associated with samples containing a high proportion of normal tissue. Such samples can be misclassified, preventing biological and clinical insights into the different tumor subtypes from coming to light. In this work, we present a model that tackles both the molecular classification of samples and the discrimination of those with a high content of normal cells. We performed a transcriptomic in silico analysis on glioblastoma (GBM) samples (n = 810) and tested different criteria to optimize the number of genes needed for molecular classification. We used the gene expression of normal brain samples (n = 555) to design an additional gene signature to detect samples with a high normal tissue content. Microdissection samples of different structures within GBM (n = 122) were used to validate the final model. Finally, the model was tested in a cohort of 43 patients and confirmed by histology. Based on the expression of 20 genes, our model is able to discriminate samples with a high content of normal tissue and to classify the remaining ones. We have shown that taking normal cells into consideration can prevent errors in the classification and the subsequent misinterpretation of the results. Moreover, considering only samples with a low content of normal cells, we found an association between the complexity of the samples and survival for the three molecular subtypes.
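The paper's exact 20-gene model is not reproduced here, but the two-step logic it describes, first screening out normal-like samples against a normal-brain signature and then assigning the rest to the nearest subtype, can be sketched generically; the correlation-based scoring, the `normal_cutoff` value, and the function name are assumptions for illustration.

```python
import numpy as np

def classify_gbm(expr, subtype_centroids, normal_signature, normal_cutoff=0.8):
    """expr: (n_samples, n_genes) expression matrix over the signature genes;
    subtype_centroids: dict mapping subtype name -> (n_genes,) centroid."""
    def corr(a, b):  # Pearson correlation between two expression profiles
        a, b = a - a.mean(), b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    labels = []
    for sample in expr:
        if corr(sample, normal_signature) >= normal_cutoff:
            labels.append("normal-like")          # high normal-tissue content
        else:
            labels.append(max(subtype_centroids,  # Proneural/Classical/Mesenchymal
                              key=lambda s: corr(sample, subtype_centroids[s])))
    return labels
```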

