scholarly journals DFC

Author(s):  
Nida Meddouri ◽  
Hela Khoufi ◽  
Mondher Maddouri

Knowledge discovery in databases (KDD) is a research theme that aims to exploit the large data sets collected every day from various fields of computing applications. The underlying idea is to extract hidden knowledge from a data set. It comprises several tasks that form a process, one of which is data mining. Classification and clustering are data mining techniques. Several approaches have been proposed for classification, such as induction of decision trees, Bayes nets, support vector machines, and formal concept analysis (FCA). The choice of FCA can be explained by its ability to extract hidden knowledge. Recently, researchers have been interested in ensemble methods (sequential or parallel) that combine a set of classifiers. The classifiers are combined by a voting technique. There has been little focus on FCA in the context of ensemble learning. This paper presents a new approach that builds only a part of the lattice, containing the best possible concepts. This approach is based on parallel ensemble learning. It improves on the state-of-the-art FCA-based methods since it handles more voluminous data.
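
The voting step mentioned above can be illustrated with a minimal sketch. It assumes plain majority voting; the abstract does not say which voting rule the authors use. The names `majority_vote` and `ensemble_predict` and the three toy classifiers are hypothetical, introduced only for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties are broken in favour of the label that appears first
    in `predictions` (an arbitrary choice made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(classifiers, instance):
    """Ask every classifier for a label, then combine the labels by vote."""
    return majority_vote([clf(instance) for clf in classifiers])


# Three hypothetical rule-based classifiers, each standing in for one
# member of the ensemble (for example, one formal concept used as a rule).
classifiers = [
    lambda x: "yes" if x["a"] else "no",
    lambda x: "yes" if x["b"] else "no",
    lambda x: "no",
]

print(ensemble_predict(classifiers, {"a": True, "b": True}))
```

Because each member only needs to return a label, the members can be trained independently, which is what makes the parallel setting possible.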

2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, namely multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
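
For readers unfamiliar with the two reported statistics, the sketch below computes the coefficient of determination (R2) and the mean squared error (MSE) using their standard definitions. The function names and the toy values are assumptions made for illustration; they are not the authors' data or code.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical, not taken from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

print(mean_squared_error(observed, predicted))
print(r_squared(observed, predicted))
```

A model that always predicts the mean of the observed values scores R2 = 0 under this definition, so values near 1 indicate that most of the variance is explained.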


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ruolan Zeng ◽  
Jiyong Deng ◽  
Limin Dang ◽  
Xinliang Yu

A three-descriptor quantitative structure–activity/toxicity relationship (QSAR/QSTR) model was developed for the skin permeability of a sufficiently large data set consisting of 274 compounds, by applying a support vector machine (SVM) together with a genetic algorithm. The optimal SVM model has a coefficient of determination R2 of 0.946 and a root mean square (rms) error of 0.253 for the training set of 139 compounds, and an R2 of 0.872 and an rms error of 0.302 for the test set of 135 compounds. Compared with other models reported in the literature, our SVM model shows better statistical performance while dealing with more samples in the test set. Therefore, an SVM algorithm was successfully applied to develop a nonlinear QSAR model for skin permeability.


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Johannes Masino ◽  
Jakob Thumm ◽  
Guillaume Levasseur ◽  
Michael Frey ◽  
Frank Gauterin ◽  
...  

This work aims at classifying the road condition with data mining methods using simple acceleration sensors and gyroscopes installed in vehicles. Two classifiers are developed with a support vector machine (SVM) to distinguish between different types of road surfaces, such as asphalt and concrete, and obstacles, such as potholes or railway crossings. From the sensor signals, frequency-based features are extracted and evaluated automatically with MANOVA. The selected features and their significance for predicting the classes are discussed. The best features are used for designing the classifiers. Finally, the methods developed and applied in this work are implemented in a Matlab toolbox with a graphical user interface. The toolbox visualizes the classification results on maps, thus enabling manual verification of the results. Cross-validation yields an average accuracy of 81.0% for classifying obstacles and 96.1% for classifying road material. The results are discussed on a comprehensive exemplary data set.


2020 ◽  
Vol 2 (2) ◽  
pp. 01-17
Author(s):  
Khamami Herusantoso ◽  
Ardyanto Dwi Saputra

Within dwell time, customs clearance is considered the most complex phase, even though it takes the shortest time among the phases, which also include pre-clearance and post-clearance. In order to improve the efficiency and effectiveness of the services performed in the customs clearance process, the customs authorities must start considering the help of database analysis in identifying obstacles instead of depending on personal analysis. Useful information is hidden in the importation data set and can be extracted through data mining techniques. This study explores the customs clearance process of import cargo whose document is declared through the red channel at Prime Customs Office Type A of Tanjung Priok (PCO Tanjung Priok), and applies a specific data mining classifier, the decision tree with the J48 algorithm, to evaluate the process. There are 11 classification models developed using unpruned, online pruning, and post-pruning features. The best model is chosen to extract the hidden knowledge that describes factors affecting the customs clearance process and allows the customs authorities to improve their services in the future.


2021 ◽  
pp. 42-51
Author(s):  
Muhammed J. A. Patwary ◽  
S. Akter ◽  
M. S. Bin Alam ◽  
A. N. M. Rezaul Karim

Bank deposits are a vital issue for any financial institution. It is very challenging to predict whether a customer will become a depositor by analyzing related information. Some recent reports demonstrate that economic depression and the continuous decline of the economy negatively impact business organizations and the banking sector. Under such conditions, banks struggle to attract customers' attention. Thus, marketing is a preferred tool for the banking sector to draw customers' attention to term deposits. The purpose of this paper is to study the performance of ensemble learning algorithms as a novel approach to predicting whether a new customer will take out a term deposit. Data from a Portuguese retail bank are used for our study, containing 45,211 phone contacts with 16 input attributes and one decision attribute. The data are preprocessed using a discretization technique. 40,690 samples are used for training the classifiers, and 4,521 samples are used for testing. In this work, the performance of three widely used classification algorithms, Support Vector Machine (SVM), Neural Network (NN), and Naive Bayes (NB), is analyzed. Then the ability of ensemble methods to improve the efficiency of basic classification algorithms is investigated and experimentally demonstrated. Experimental results show that the performance metrics of the Neural Network (Bagging) are higher than those of the other ensemble methods. Its accuracy, sensitivity, and specificity are 96.62%, 97.14%, and 99.08%, respectively. Although all input attributes are considered in the classification method, a final descriptive analysis shows that some input attributes are more important for this classification. Overall, it is shown that ensemble methods outperform the traditional algorithms in this domain. We believe our contribution can be used as a depositor prediction system to provide additional support for bank deposit prediction.
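
The three reported metrics follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not the study's confusion matrix, which the abstract does not give. The last lines only check that the stated training and test sizes add up to the stated total.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp / fn: positive instances classified correctly / incorrectly.
    tn / fp: negative instances classified correctly / incorrectly.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts chosen only to illustrate the formulas.
accuracy, sensitivity, specificity = classification_metrics(tp=40, fp=5, tn=45, fn=10)
print(accuracy, sensitivity, specificity)

# Split sizes stated in the abstract: training + testing = all phone contacts.
train_size, test_size, total = 40_690, 4_521, 45_211
print(train_size + test_size == total)
```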


Author(s):  
Nida Meddouri ◽  
Mondher Maddouri

Knowledge discovery in databases (KDD) aims to exploit the large amounts of data collected every day in various fields of computing application. The idea is to extract hidden knowledge from a set of data. It gathers several tasks that constitute a process, such as data selection, pre-processing, transformation, data mining, and visualization. Data mining techniques include supervised classification and unsupervised classification. Classification consists of predicting the class of new instances with a classifier built on learning data of labeled instances. Several approaches have been proposed, such as the induction of decision trees, Bayes, nearest neighbor search, neural networks, support vector machines, and formal concept analysis. Learning formal concepts always refers to the mathematical structure of the concept lattice. This article presents a state of the art on formal concept analysis classifiers. The authors present different ways to calculate the closure operators from nominal data and also present a new approach that builds only a part of the lattice, including the best concepts. This approach is based on Dagging (an ensemble method), which generates an ensemble of classifiers, each one representing a formal concept, and combines them by a voting rule. Experimental results are given to demonstrate the efficiency of the proposed method.
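
As background for the closure operators mentioned above, the sketch below shows the textbook closure of an attribute set in a formal context: take the objects that share the attributes (the extent), then take everything those objects have in common (the intent). It is one standard construction, not necessarily any of the variants the authors compare. Encoding nominal data as `attribute=value` strings, and the names `extent`, `intent`, `closure`, and `context`, are assumptions made for illustration.

```python
def extent(context, attributes):
    """Objects of the context that have every attribute in `attributes`."""
    attributes = set(attributes)
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def intent(context, objects):
    """Attributes shared by every object in `objects`."""
    common = set().union(*context.values())  # start from all attributes
    for obj in objects:
        common &= context[obj]
    return common


def closure(context, attributes):
    """Closure of an attribute set: the intent of its extent."""
    return intent(context, extent(context, attributes))


# Toy formal context (hypothetical): object -> set of nominal attribute values.
context = {
    "o1": {"color=red", "size=small"},
    "o2": {"color=red", "size=small"},
    "o3": {"color=blue", "size=small"},
}

seed = {"color=red"}
print(sorted(extent(context, seed)))   # objects covered by the seed
print(sorted(closure(context, seed)))  # closed attribute set (concept intent)
```

The pair (`extent(context, seed)`, `closure(context, seed)`) is a formal concept. A classifier built from it would predict the majority class of the objects in the extent whenever a new instance has all attributes of the intent.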


2013 ◽  
Vol 5 (1) ◽  
pp. 66-83 ◽  
Author(s):  
Iman Rahimi ◽  
Reza Behmanesh ◽  
Rosnah Mohd. Yusuff

The objective of this article is to evaluate and assess the efficiency of poultry meat farms, as a case study, with a new method. The poultry farm industry is one of the most important sub-sectors in comparison to other ones. The purpose of this study is the prediction and assessment of the efficiency of poultry farms as decision making units (DMUs). Although several methods have been proposed for solving this problem, the authors need a methodology that discriminates performance powerfully. Their methodology comprises data envelopment analysis and data mining techniques such as artificial neural network (ANN), decision tree (DT), and cluster analysis (CA). As a case study, data for the analysis were collected from 22 poultry companies in Iran. Moreover, because the data set is small and data mining techniques require a large data set, they employed the k-fold cross validation method to validate the authors’ model. After the efficiency of each DMU is assessed and the DMUs are clustered, applying the model and presenting the decision rules results in a precise and accurate optimizing technique.
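
The k-fold cross validation mentioned above can be sketched as follows for a data set of 22 samples. The value k = 5 is an assumption; the abstract does not state which k was used. The function name `k_fold_indices` is hypothetical, and shuffling is omitted so that the output is deterministic.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, test) index pairs.

    Fold sizes differ by at most one; every sample is used for
    testing exactly once and for training k - 1 times.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


# 22 samples as in the case study; k = 5 is an assumed value.
for train, test in k_fold_indices(22, 5):
    print(len(train), len(test), test)
```

With so few samples, each model is still trained on most of the data while every company is held out once, which is why the technique suits small data sets.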


2008 ◽  
Vol 07 (04) ◽  
pp. 721-736 ◽  
Author(s):  
HSIAO-FAN WANG ◽  
ZU-WEN CHAN

In this study, we propose a general pruning procedure to reduce the dimension of a large database so that the properties of the extracted subset can be well defined. Since learning functions have been widely applied, we take this group of functions as an example to demonstrate the proposed procedure. Based on the concept of Support Vector Machine (SVM), three major stages of preliminary pruning, fitting function, and refining are proposed to discover a subset that possesses the characteristics of some learning function from the given large data set. Three models were used to illustrate and evaluate the proposed pruning procedure, and the results have been shown to be promising in application.


2007 ◽  
Vol 19 (3) ◽  
pp. 816-855 ◽  
Author(s):  
Hyunjung Shin ◽  
Sungzoon Cho

The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. When applied to a large data set, however, it requires a large memory and a long time for training. To cope with the practical difficulty, we propose a pattern selection algorithm based on neighborhood properties. The idea is to select only the patterns that are likely to be located near the decision boundary. Those patterns are expected to be more informative than the randomly selected patterns. The experimental results provide promising evidence that it is possible to successfully employ the proposed algorithm ahead of SVM training.
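
The idea of keeping only patterns near the decision boundary can be illustrated with a deliberately simplified criterion: keep a pattern when at least one of its k nearest neighbors carries a different class label. This is a stand-in chosen for illustration, not the authors' algorithm, whose exact neighborhood properties are not given in the abstract. The function name and the toy data are hypothetical.

```python
import math


def select_boundary_patterns(points, labels, k=3):
    """Return indices of patterns whose k nearest neighbors include another class.

    Brute-force neighbor search (quadratic in the number of patterns),
    which is acceptable for a small illustration.
    """
    selected = []
    for i, point in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        neighbors = sorted(others, key=lambda j: math.dist(point, points[j]))[:k]
        if any(labels[j] != labels[i] for j in neighbors):
            selected.append(i)
    return selected


# Toy one-dimensional data: class 0 on the left, class 1 on the right.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(select_boundary_patterns(points, labels, k=2))
```

Only the two patterns adjacent to the class change are kept here; an SVM trained on the selected subset would see far fewer patterns, which is the motivation given in the abstract.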

