Integration of Naive Bayes with the SMOTE Sampling Technique for Handling Imbalanced Data (INTEGRASI NAIVE BAYES DENGAN TEKNIK SAMPLING SMOTE UNTUK MENANGANI DATA TIDAK SEIMBANG)

2020 ◽  
Vol 14 (1) ◽  
pp. 34
Author(s):  
Nina Sulistiyowati ◽  
Mohamad Jajuli

Classifying data with unbalanced classes is a major problem in machine learning and data mining. On unbalanced data, almost all classification algorithms produce much higher accuracy for the majority class than for the minority class. This research implements the Synthetic Minority Over-sampling Technique (SMOTE) to address the imbalance in credit customer data from the Rawamerta teachers' cooperative. The research methodology follows SEMMA, with the stages Sample, Explore, Modify, Model, and Assess. The Sample stage selected the cooperative's credit customer data for 2015-2017, 878 records in total, with the attributes income, total deposits, loan amount, duration of installments, services, installments, and credit status. The Explore stage found that the "current" class is the majority class with 813 records, while the "traffic" class is the minority class with 65 records, which shows an imbalance between the two classes. The Modify stage applies 500% SMOTE. The Model stage classifies using Naïve Bayes. Naïve Bayes with SMOTE classified 1131 records correctly and 72 incorrectly, while without SMOTE it classified 818 records correctly and 60 incorrectly.

Keywords: Naïve Bayes, SMOTE, unbalanced data
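The 500% SMOTE step and the reported counts can be checked with a small sketch. This is a minimal illustration of SMOTE-style interpolation, not the authors' implementation: the function name `smote`, the neighbour count `k=5`, the seed, and the two random toy features standing in for the cooperative's attributes are all assumptions.

```python
import math
import random


def smote(minority, percent=500, k=5, seed=0):
    """Return synthetic minority points built by SMOTE-style interpolation.

    percent=500 creates 5 synthetic points per original minority point.
    """
    rng = random.Random(seed)
    per_point = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x by Euclidean distance (x itself excluded)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_point):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy stand-in for the 65 minority records (hypothetical numeric features).
toy_rng = random.Random(1)
minority = [(toy_rng.random(), toy_rng.random()) for _ in range(65)]
synthetic = smote(minority, percent=500)

minority_after = len(minority) + len(synthetic)  # 65 + 325 = 390
total_after = 813 + minority_after               # 1203, which equals 1131 + 72
acc_with = 1131 / (1131 + 72)                    # overall accuracy with SMOTE
acc_without = 818 / (818 + 60)                   # overall accuracy without SMOTE
print(minority_after, total_after, round(acc_with, 4), round(acc_without, 4))
```

The counts are consistent: 813 majority records plus 390 minority records after oversampling give the 1203 records classified in the SMOTE run. Overall accuracy works out to roughly 94.0% with SMOTE and 93.2% without, but the two figures are computed on datasets of different size, and the abstract does not report per-class results.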

2019 ◽  
Vol 8 (4) ◽  
pp. 4039-4042

Learning from unbalanced data has recently emerged as a predominant problem in several applications, and because multi-label classification is an evolving data mining task, learning from unbalanced multi-label data is being examined. However, the available SMOTE-based algorithms use the same sampling rate for every instance of the minority class, which leads to sub-optimal performance. To deal with this problem, a new Particle Swarm Optimization based SMOTE (PSOSMOTE) algorithm is proposed. PSOSMOTE employs different sampling rates for different minority class instances and searches for the optimal combination of sampling rates to handle the classification of unbalanced datasets. A Bayesian technique is then combined with Random Forest for multi-label classification (BARF-MLC) to address the inherent label dependencies among samples. The approach is assessed alongside classifiers such as ML-FOREST, Predictive Clustering Trees (PCT), and Hierarchy of Multi Label Classifier (HOMER) on metrics including precision, recall, F-measure, accuracy, and error rate.


Information ◽  
2020 ◽  
Vol 11 (12) ◽  
pp. 557
Author(s):  
Alexandre M. de Carvalho ◽  
Ronaldo C. Prati

One of the significant challenges in machine learning is the classification of imbalanced data. In many situations, standard classifiers cannot learn how to distinguish minority class examples from the others. Since many real problems are unbalanced, this problem has become very relevant and deeply studied today. This paper presents a new preprocessing method based on Delaunay tessellation and the preprocessing algorithm SMOTE (Synthetic Minority Over-sampling Technique), which we call DTO-SMOTE (Delaunay Tessellation Oversampling SMOTE). DTO-SMOTE constructs a mesh of simplices (in this paper, we use tetrahedrons) for creating synthetic examples. We compare results with five preprocessing algorithms (GEOMETRIC-SMOTE, SVM-SMOTE, SMOTE-BORDERLINE-1, SMOTE-BORDERLINE-2, and SMOTE), eight classification algorithms, and 61 binary-class data sets. For some classifiers, DTO-SMOTE has higher performance than others in terms of Area Under the ROC curve (AUC), Geometric Mean (GEO), and Generalized Index of Balanced Accuracy (IBA).
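The idea of creating a synthetic example inside a simplex, rather than on a line segment as in plain SMOTE, can be sketched as follows. The abstract does not say how points are drawn within each simplex, so the flat Dirichlet weighting used here is an assumption, as are the function name `sample_in_simplex` and the toy tetrahedron. The Delaunay tessellation step itself is omitted.

```python
import random


def sample_in_simplex(vertices, rng):
    """Draw one point inside the simplex spanned by `vertices`.

    Barycentric weights come from a flat Dirichlet distribution
    (normalised exponential draws), so the result is a convex
    combination of the vertices.
    """
    raw = [rng.expovariate(1.0) for _ in vertices]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(vertices[0])
    point = tuple(
        sum(w * v[d] for w, v in zip(weights, vertices)) for d in range(dim)
    )
    return point, weights


# A tetrahedron in 3-D whose four corners stand in for existing examples.
tetra = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
rng = random.Random(0)
point, weights = sample_in_simplex(tetra, rng)
print(point, weights)
```

Because the weights are non-negative and sum to one, the generated point always lies inside the tetrahedron or on its boundary.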


Author(s):  
Dilip Kumar Sharma ◽  
Sarika Lohana ◽  
Saurabh Arora ◽  
Ashutosh Dixit ◽  
Mohit Tiwari ◽  
...  

2020 ◽  
Author(s):  
Valerio Carruba

Asteroid families are groups of asteroids that are the product of collisions or of the rotational fission of a parent object. These groups are mainly identified in domains of proper elements or frequencies. Because of robotic telescope surveys, the number of known asteroids has increased from about 10,000 in the early 1990s to more than 750,000 today. Traditional approaches for identifying new members of asteroid families, like the hierarchical clustering method (HCM), may struggle to keep up with the growing rate of new discoveries. Here we used machine learning classification algorithms to identify new family members based on the orbital distribution in proper (a, e, sin(i)) of previously known family constituents. We compared the outcome of nine classification algorithms from stand-alone and ensemble approaches. The Extremely Randomized Trees (ExtraTree) method had the highest precision, making it possible to retrieve up to 97% of family members identified with standard HCM.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Dennie te Molder ◽  
Wasin Poncheewin ◽  
Peter J. Schaap ◽  
Jasper J. Koehorst

Abstract Background: The genus Xanthomonas has long been considered to consist predominantly of plant pathogens, but over the last decade there has been an increasing number of reports on non-pathogenic and endophytic members. As Xanthomonas species are prevalent pathogens on a wide variety of important crops around the world, there is a need to distinguish between these plant-associated phenotypes. To date a large number of Xanthomonas genomes have been sequenced, which enables the application of machine learning (ML) approaches on the genome content to predict this phenotype. Until now such approaches to the pathogenomics of Xanthomonas strains have been hampered by the fragmentation of information regarding pathogenicity of individual strains over many studies. Unification of this information into a single resource was therefore considered to be an essential step. Results: Mining of 39 papers considering both plant-associated phenotypes allowed for a phenotypic classification of 578 Xanthomonas strains. For 65 plant-pathogenic and 53 non-pathogenic strains the corresponding genomes were available and de novo annotated for the presence of Pfam protein domains, which were used as features to train and compare three ML classification algorithms: CART, Lasso and Random Forest. Conclusion: The literature resource in combination with recursive feature extraction used in the ML classification algorithms provided further insights into the virulence-enabling factors, but also highlighted domains linked to traits not present in pathogenic strains.


Software engineering is an important area that deals with the development and maintenance of software. After software is developed, it is important to track its performance and to check whether it functions according to customer requirements. To ensure this, faulty and non-faulty modules must be identified. For this purpose, one can use a model for binary classification of faults. The outputs of different techniques differ with respect to the fault dataset used, complexity, the classification algorithm implemented, and so on. Various machine learning techniques can be used for this purpose, but this paper deals with the best classification algorithms available to date: decision tree, random forest, naive Bayes, and logistic regression (tree-based and Bayesian-based techniques). The motive behind the project is to identify the faulty modules within a software product before the actual software testing takes place. As a result, the time consumed by testers, or their workload, can be reduced to an extent. This work is useful to those working in the software industry and to those carrying out research in software engineering where the software development lifecycle is discussed.


2021 ◽  
Author(s):  
Bongs Lainjo

Abstract Background: Information technology has continued to shape contemporary thematic trends. Advances in communication have impacted almost all themes, ranging from education, engineering, and healthcare to many other aspects of our daily lives. Method: This paper attempts to review the different dynamics of the thematic IoT platforms. A select number of themes are extensively analyzed with emphasis on data mining (DM), personalized healthcare (PHC), and thematic trends of a select number of subjectively identified IoT-related publications over three years. In this paper, the number of IoT-related publications is used as a proxy representing the number of apps. DM remains the trailblazer, serving as a theme with crosscutting qualities that drive artificial intelligence (AI), machine learning (ML), and data transformation. A case study in PHC illustrates the importance, complexity, productivity optimization, and nuances contributing to a successful IoT platform. Among the initial 99 IoT themes, 18 are extensively analyzed using the number of IoT publications to demonstrate a combination of different thematic dynamics, including subtleties that influence escalating IoT publication themes. Results: Based on findings amongst the 99 themes, the annual median number of IoT-related publications across all themes rose each year: 5510, 8930, 11700, and 14800 for 2016, 2017, 2018, and 2019, respectively, indicating an upbeat prognosis of IoT dynamics. Conclusion: The vulnerabilities that come with the successful implementation of IoT systems are highlighted, including the successes currently achieved by institutions promoting the benefits of IoT-related systems like the case study. Security continues to be an issue of significant importance.


2021 ◽  
Author(s):  
Raz Mohammad Sahar ◽  
T. Srivinasa Rao ◽  
S. Anuradha ◽  
B. Srinivasa Rao

Gender classification is among the significant problems in signal processing. Previously, the problem was handled using image classification methods, which mainly involve extracting data from a collection of images. Recently, however, researchers around the globe have shown interest in gender classification using voice features. According to a critical study of some human vocal attributes, gender classification goes beyond just the frequency and pitch of a human voice. Feature selection, a form of dimensionality reduction, is among the difficult problems encountered in machine learning. A similar obstacle is encountered when choosing gender-specific features, which serve an analytical purpose in determining a person's gender. This work examines the effectiveness and importance of classification algorithms for the problem of gender classification from voice. Audio data such as pitch and frequency help in determining gender. Machine learning offers encouraging outcomes for classification problems in all domains, and the algorithms in an area can be evaluated using performance metrics. This paper evaluates five machine learning classification algorithms on gender classification from audio data: Gradient Boosting, Decision Trees, Random Forest, Neural Network, and Support Vector Machine. The major criterion in assessing any algorithm must be performance, and the misclassification rate should be low in classification problems. In business markets, the location and gender of people are essentially related to AdSense. This research aims to compare various machine learning algorithms in order to find the most suitable one for gender identification in audio data.


2020 ◽  
Vol 5 (3) ◽  
pp. 138-146 ◽  

Most online customers use cards to pay for their purchases. As cards become the most popular payment method, cases of fraud associated with them also increase. The primary goal of this project is to distinguish fraudulent transactions from non-fraudulent ones. To do so, data mining methods are first used to examine the patterns and characteristics of fraudulent and non-fraudulent transactions. Machine learning techniques are then used to predict fraudulent and non-fraudulent transactions automatically. The Logistic Regression (LR) algorithm is used. The combination of machine learning and data mining techniques thus identifies fraudulent and non-fraudulent transactions by learning patterns in the data. Models are built using these algorithms, and precision, accuracy, and recall are calculated and compared.
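The three reported metrics can be computed from true and predicted labels as in the sketch below. The function name `binary_metrics` and the ten toy labels (1 = fraudulent, 0 = non-fraudulent) are assumptions for illustration only and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy) for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Hypothetical labels: 3 fraudulent and 7 non-fraudulent transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
precision, recall, accuracy = binary_metrics(y_true, y_pred)
print(precision, recall, accuracy)
```

In this toy case accuracy is 0.8 while precision and recall are both 2/3, which shows why accuracy alone can look better than the fraud-class results when fraudulent transactions are rare.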

