Research on the Confidence Regression Based on KNN Algorithm

Improved minimum-minimum roughness algorithm for clustering categorical data

2021 ◽ Vol 8 (10) ◽ pp. 43-50

Author(s): Truong et al.

Keyword(s): Machine Learning ◽ Data Mining ◽ Hierarchical Clustering ◽ Categorical Data ◽ Clustering Algorithm ◽ Clustering Algorithms ◽ Experimental Results ◽ Data Sets ◽ Top Down ◽ Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with fewer values and the leaf node with more objects, leading to undesirable clustering results. To overcome these shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.
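
The abstract does not spell out how roughness is computed, so the following is only a minimal sketch of the rough-set quantities that a min-min roughness criterion is usually built on: the roughness of a value class with respect to another attribute, its mean over all value classes, and the attribute whose best mean roughness is lowest. The function names and the toy rows are hypothetical, and the IMMR modifications are not reproduced here.

```python
from collections import defaultdict

def partition(rows, attr):
    """Group object indices by their value on one attribute (equivalence classes)."""
    classes = defaultdict(set)
    for idx, row in enumerate(rows):
        classes[row[attr]].add(idx)
    return list(classes.values())

def roughness(target, classes):
    """1 - |lower approximation| / |upper approximation| of `target`."""
    lower = sum(len(c) for c in classes if c <= target)  # classes fully inside target
    upper = sum(len(c) for c in classes if c & target)   # classes touching target
    return 1.0 - lower / upper

def mean_roughness(rows, attr_i, attr_j):
    """Average roughness of each value class of attr_i with respect to attr_j."""
    targets = partition(rows, attr_i)
    classes = partition(rows, attr_j)
    return sum(roughness(t, classes) for t in targets) / len(targets)

def min_min_roughness(rows):
    """Return (attribute index, score) with the smallest minimum mean roughness."""
    n_attrs = len(rows[0])
    best = None
    for i in range(n_attrs):
        score = min(mean_roughness(rows, i, j) for j in range(n_attrs) if j != i)
        if best is None or score < best[1]:
            best = (i, score)
    return best

# Hypothetical categorical objects: (colour, size, label)
rows = [
    ("red", "small", "yes"),
    ("red", "small", "yes"),
    ("blue", "large", "no"),
    ("blue", "large", "yes"),
]
print(min_min_roughness(rows))
```

In this toy table the first two attributes induce the same partition, so each has zero roughness with respect to the other, and the first one is selected as the splitting attribute.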

Identification of DNA-binding proteins via Hypergraph based Laplacian Support Vector Machine

Current Bioinformatics ◽ 10.2174/1574893616666210806091922 ◽ 2021 ◽ Vol 16

Author(s): Yuqing Qian ◽ Hao Meng ◽ Weizhong Lu ◽ Zhijun Liao ◽ Yijie Ding ◽ ...

Keyword(s): Machine Learning ◽ DNA Binding ◽ Large Scale ◽ Binding Proteins ◽ Predictive Accuracy ◽ DNA Binding Proteins ◽ Research Field ◽ Support Vector ◽ Data Sets ◽ Independent Test

Background: The identification of DNA binding proteins (DBP) is an important research field. Experiment-based methods for detecting DBP are time-consuming and labor-intensive. Objective: To solve the problem of large-scale DBP identification, some machine learning methods have been proposed. However, these methods have insufficient predictive accuracy. Our aim is to develop a sequence-based machine learning model to predict DBP. Methods: In our study, we extract six types of features (NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We use Multiple Kernel Learning based on the Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we construct a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, a Laplacian Support Vector Machine (LapSVM) is employed to train the predictive model. Our method is tested on the PDB186, PDB1075, PDB2272 and PDB14189 data sets. Result: Compared with other methods, our model achieves the best results on benchmark data sets. Conclusion: Accuracies of 87.1% and 74.2% are achieved on PDB186 (independent test of PDB1075) and PDB2272 (independent test of PDB14189), respectively.
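
To make the HSIC step more concrete, here is a small sketch of the standard biased empirical HSIC estimate, trace(K H L H) / (n - 1)^2, between a feature kernel K and a label kernel L. It only scores how strongly one kernel depends on the labels; the weight optimization of MKL-HSIC, the hypergraph construction and LapSVM itself are not shown, and the helper names and toy data are assumptions made for illustration.

```python
def center(K):
    """Double-center a square Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(K)
    Kc = center(K)
    # trace(K H L H) equals trace((H K H) L) by cyclicity of the trace.
    return sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n)) / (n - 1) ** 2

def linear_kernel(xs):
    """Gram matrix of 1-D features under the linear kernel."""
    return [[a * b for b in xs] for a in xs]

# Hypothetical toy data: labels and two candidate 1-D feature views.
labels = [1, 1, -1, -1]
L = linear_kernel(labels)                          # label kernel y y^T
K_good = linear_kernel([1.0, 1.1, -1.0, -0.9])     # tracks the labels
K_bad = linear_kernel([1.0, -1.0, 1.0, -1.0])      # unrelated to the labels

print(hsic(K_good, L), hsic(K_bad, L))
```

A kernel whose features follow the labels receives a clearly higher score than an unrelated one, which is the kind of signal a multiple kernel learning scheme could use when weighting kernels.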

EKMPRFG: Ensemble of KNN, Multilayer Perceptron and Random Forest using Grading for Android Malware Classification

International Journal of Recent Technology and Engineering - 2 ◽ 10.35940/ijrte.e5866.018520 ◽ 2020 ◽ Vol 8 (5) ◽ pp. 3353-3360

Keyword(s): Machine Learning ◽ Feature Selection ◽ Standard Deviation ◽ Principal Component ◽ Machine Learning Algorithms ◽ Machine Learning Techniques ◽ Data Sets ◽ Android Malware ◽ Android Malware Detection ◽ Significant Research

Android is the most popular operating system, with over 2.5 billion devices across the globe. The popularity of this OS has unfortunately made the devices, and the services they enable, vulnerable to numerous security threats. As a result, significant research is being done in the field of Android malware detection employing machine learning algorithms. Our current work emphasizes the possible use of machine learning techniques for the detection of malware on such Android devices. The proposed EKMPRFG is applied to the classification of Android malware after a preprocessing phase involving a hybrid feature selection model that uses the proposed Standard Deviation of Standard Deviation of Ranks (SDSDR) and several other built-in feature selection algorithms, such as Correlation based Feature Selection (CFS), Classifier SubsetEval, Consistency SubsetEval, and Filtered SubsetEval, followed by Principal Component Analysis (PCA) for dimensionality reduction. The experimental results obtained on two data sets indicate that EKMPRFG outperforms existing works in terms of prediction accuracy and weighted F-measure values.

Scientific report: Training workshop interdisciplinary life sciences

10.7287/peerj.preprints.654 ◽ 2014

Author(s): Gordon Akudibillah ◽ Sonja E.M. Boas ◽ Benoit M. Carreres ◽ Marchien Dallinga ◽ Aalt-Jan van Dijk ◽ ...

Keyword(s): Experimental Data ◽ Computational Models ◽ Life Sciences ◽ Cellular Level ◽ Research Field ◽ Complex Nature ◽ Data Sets ◽ Training Workshop ◽ Open Problems ◽ Diverse Range

This preprint is the outcome of the “Training Workshop Interdisciplinary Life Sciences”, held in October 2013 at the Lorentz Center, Leiden, The Netherlands. The motivation to organize this event stems from the following considerations. The enormous progress in laboratory techniques and facilities leads to the availability of huge amounts of data at all levels of complexity (molecules, cells, tissues, organs, organisms, populations, ecosystems). Data at the cellular level in particular reveal details of life processes we were unaware of until recently. However, it has become clear that huge amounts of data alone do not automatically lead to understanding. The data explosion in the Life Sciences teaches one lesson: life processes are of a highly intricate and integrative nature. To really understand the dynamic processes in living organisms, one must integrate experimental data sets into quantitative and predictive models. Only then may one hope to grasp the functioning of these complex systems and be able to convert information into understanding. In the field of physics, for instance, this strong interaction between experiment and theory has been common practice for centuries, culminating in the 20th century being called the “Century of Physics”. In contrast to physics, the complex nature of the Life Sciences forces us to work in an interdisciplinary fashion. The necessary expertise is available, but scattered over many scientific disciplines. Only the combined efforts of biologists, chemists, mathematicians, physicists, engineers, and informaticians will lead to progress in tackling the huge challenge of understanding the complexity of life. Researchers in the Life Sciences often focus their research on a rather narrow research field. However, the majority of the upcoming generation of researchers in the Life Sciences should be trained to expand their skills so that they become able to tackle complex, multi-dimensional systems. The knowledge they have to incorporate in their research will stem from a diverse range of disciplines, so they should be trained to integrate a broad range of modelling approaches in order to deduce quantitative, predictive and often multi-scale models from highly diverse data sets. Present curricula in the Life Sciences hardly offer this kind of training yet. This workshop intends to start filling this gap. Three teams worked on the following open problems: 1) modeling the influence of temperature on the regulation of flowering time in Arabidopsis thaliana; 2) validation of computational models of angiogenesis against experimental data; 3) reconstructing the gene network that regulates branching in tomato. This preprint bundles the reports of the three teams.

Clustering helps to improve price prediction in online booking systems

International Journal of Web Information Systems ◽ 10.1108/ijwis-11-2020-0065 ◽ 2021 ◽ Vol 17 (1) ◽ pp. 45-53

Author(s): Le Hong Trang ◽ Tran Duong Huy ◽ Anh Ngoc Le

Keyword(s): Machine Learning ◽ Empirical Study ◽ Sentiment Analysis ◽ Design Methodology ◽ Prediction Performance ◽ Experimental Results ◽ Data Sets ◽ Classification Models ◽ Content Type ◽ Price Prediction

Purpose: Pricing on online booking systems is a difficult task for the host. The systems usually set prices that are lower than the premises and their quality would justify, which benefits only the system by making it easy to attract customers to the service. The price of a new accommodation is often set based on location, the number of beds, the type of house and so on. The main problem is to predict the most reasonable price for the host. This paper aims to study the use of machine learning and sentiment analysis for predicting prices on online booking systems. Design/methodology/approach: In particular, an empirical study is first performed for some well-known classification models for the problem. The authors then propose to apply k-means, a clustering technique, together with Gradient Boost and XGBoost models to improve the prediction performance. Experiments are conducted on real Airbnb data sets collected in London. Findings: Experimental results are given and compared to show that the authors’ method outperforms an up-to-date method. Originality/value: The authors use k-means and sampling together with Gradient Boost and XGBoost models to improve the prediction performance.
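
The cluster-then-predict idea can be illustrated with a short sketch. It runs a plain k-means on hypothetical listing features and then fits one predictor per cluster. To stay dependency-free, the per-cluster predictor is simply the mean price of the cluster, a deliberately crude stand-in for the Gradient Boost and XGBoost models named in the abstract; the features, prices and function names are all assumptions.

```python
def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def fit_cluster_means(points, prices, centroids):
    """Stand-in regressor: mean price per cluster (NOT Gradient Boost or XGBoost)."""
    sums, counts = [0.0] * len(centroids), [0] * len(centroids)
    for p, y in zip(points, prices):
        c = nearest(p, centroids)
        sums[c] += y
        counts[c] += 1
    return [s / n if n else 0.0 for s, n in zip(sums, counts)]

def predict(p, centroids, cluster_means):
    """Predict the price of a listing from the cluster it falls into."""
    return cluster_means[nearest(p, centroids)]

# Hypothetical listings: (number of beds, location score) and nightly prices.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1)]
prices = [50, 200, 60, 220]

centroids = kmeans(points, k=2)
cluster_means = fit_cluster_means(points, prices, centroids)
print(predict((1.1, 1.0), centroids, cluster_means))
print(predict((9.0, 9.0), centroids, cluster_means))
```

In a fuller pipeline, the per-cluster mean would be replaced by a regression model trained only on the listings of that cluster, so that each model sees a more homogeneous price range.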

Quality of Refractory Materials in the Technological Process

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.524-527.2026 ◽ 2012 ◽ Vol 524-527 ◽ pp. 2026-2030

Author(s): Marek Šolc ◽ Štefan Markulik ◽ Eva Grambalová

Keyword(s): Experimental Data ◽ Technological Process ◽ Maximum Concentration ◽ Statistical Processing ◽ Experimental Results ◽ Data Sets ◽ Refractory Materials ◽ Processing Data

When addressing issues related to the technology or quality of refractory products, experimental test results are among the supporting documents. These more or less extensive data sets characterize the observed phenomenon, e.g. some physical or chemical quantity, with some precision. From this perspective, the role of statistical data processing is to condense as far as possible a set of experimental data that is sometimes extremely abundant but hard to survey, and to determine the "seriousness" of this data set. When processing data, it should be noted that they characterize not the observed variable in full, but only a selected part of it.

A study of Turkish emotion classification with pretrained language models

Journal of Information Science ◽ 10.1177/0165551520985507 ◽ 2021 ◽ pp. 016555152098550

Author(s): Alaettin Uçan ◽ Murat Dörterler ◽ Ebru Akçapınar Sezer

Keyword(s): Machine Learning ◽ Language Model ◽ Experimental Studies ◽ Classification Performance ◽ Research Field ◽ Language Models ◽ Classification Model ◽ Data Sets ◽ Emotion Classification ◽ Model Approach

Emotion classification is a research field that aims to detect the emotions in a text using machine learning methods. In traditional machine learning (TML) methods, feature engineering processes cause the loss of some meaningful information, and classification performance is negatively affected. In addition, the success of modelling using deep learning (DL) approaches depends on the sample size. More samples are needed for Turkish due to the unique characteristics of the language. However, emotion classification data sets in Turkish are quite limited. In this study, the pretrained language model approach was used to create a stronger emotion classification model for Turkish. Well-known pretrained language models were fine-tuned for this purpose. The performances of these fine-tuned models for Turkish emotion classification were comprehensively compared with the performances of TML and DL methods in experimental studies. The proposed approach provides state-of-the-art performance for Turkish emotion classification.

Research on the Controllable Confidence Machine

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.1079-1080.851 ◽ 2014 ◽ Vol 1079-1080 ◽ pp. 851-855

Author(s): Fang Chun Jiang ◽ Sheng Feng Tian

Keyword(s): Machine Learning ◽ Experimental Data ◽ Classification Accuracy ◽ Research Result ◽ Data Sets ◽ Threshold Values

Manageable-confidence machine learning is one of the important approaches to applying confidence machines. This paper builds on a two-class confidence classifier, using a two-class classifier as a tool to convert the learning results of classifiers and achieving confidence management by setting threshold values. The research achieved manageable overall classification accuracy as well as manageable positive and negative classification accuracy. The method was tested on 5 experimental data sets on cardiopathy and diabetes and achieved good results.
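
The abstract gives no details of how classifier outputs are converted, so the sketch below only illustrates the general idea of managing confidence with a threshold: a two-class score is turned into a label plus a confidence value, and predictions whose confidence falls below the threshold are rejected instead of guessed. The score-to-confidence mapping, the function names and the toy data are assumptions, not the authors' method.

```python
def accept_with_threshold(scores, threshold):
    """Turn positive-class scores in [0, 1] into (label, accepted) pairs.

    Confidence is the distance of the score from 0.5, rescaled to [0, 1];
    predictions below `threshold` confidence are rejected rather than guessed.
    """
    out = []
    for s in scores:
        label = 1 if s >= 0.5 else 0
        confidence = abs(s - 0.5) * 2
        out.append((label, confidence >= threshold))
    return out

def accuracy_and_coverage(decisions, truth):
    """Accuracy on accepted predictions and the fraction that was accepted."""
    accepted = [(lab, y) for (lab, ok), y in zip(decisions, truth) if ok]
    if not accepted:
        return None, 0.0
    correct = sum(lab == y for lab, y in accepted)
    return correct / len(accepted), len(accepted) / len(truth)

# Hypothetical classifier scores and true labels.
scores = [0.95, 0.9, 0.6, 0.55, 0.4, 0.1, 0.05, 0.45]
truth = [1, 1, 0, 1, 1, 0, 0, 0]

for t in (0.0, 0.5):
    print(t, accuracy_and_coverage(accept_with_threshold(scores, t), truth))
```

Raising the threshold trades coverage for accuracy, which is what makes the accuracy adjustable. Separate thresholds for the positive and negative class would allow the two class-wise accuracies to be tuned independently.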

From Data to Assessment Models, Demonstrated through a Digital Twin of Marine Risers

10.4043/30985-ms ◽ 2021

Author(s): Ehsan Kharazmi ◽ Zhicheng Wang ◽ Dixia Fan ◽ Samuel Rudy ◽ Themis Sapsis ◽ ...

Keyword(s): Machine Learning ◽ Experimental Data ◽ Complex Systems ◽ Fatigue Damage ◽ Complete Characterization ◽ Sensor Data ◽ Data Sets ◽ Multiple Sources ◽ Marine Risers ◽ Vortex Induced Vibrations

Assessing the fatigue damage in marine risers due to vortex-induced vibrations (VIV) serves as a comprehensive example of using machine learning methods to derive assessment models of complex systems. A complete characterization of the response of such complex systems is usually unavailable despite massive experimental data and computation results. Machine learning algorithms can use multi-fidelity data sets from multiple sources, including real-time sensor data from the field, systematic experimental data, and simulation data. Here we develop a three-pronged approach to demonstrate how tools in machine learning are employed to develop data-driven models that can be used for accurate and efficient fatigue damage predictions for marine risers subject to VIV.
