Classifying clinically actionable genetic mutations using KNN and SVM

Cancer is one of the major causes of death in humans. Early identification of the genetic mutations that drive tumor growth enables personalized treatment of the disease and can save the lives of the majority of patients. With this aim, Kaggle conducted a competition to classify clinically actionable gene mutations based on clinical evidence and other features related to the mutations. The dataset contains 3321 training data points that can be classified into 9 classes. In this work, an attempt is made to classify these data points using K-nearest neighbors (KNN) and linear support vector machines (SVM) in a multi-class setting. As the features are categorical, one-hot encoding and response coding are applied to make them suitable for the classifiers. Prediction performance is evaluated using log loss; KNN performed better, with a log loss of 1.10 compared with 1.24 for SVM.
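As a rough illustration of two ideas named in this abstract, the sketch below shows one common form of response coding (replacing a category with smoothed per-class probabilities) and a plain multi-class log loss. The smoothing constant `alpha`, the function names, and the toy gene and class values are assumptions made for illustration; the abstract does not specify how the authors implemented either step.

```python
import math
from collections import Counter, defaultdict


def response_code(values, labels, n_classes, alpha=1.0):
    """Map each category to a smoothed vector of per-class probabilities.

    alpha is an assumed Laplace smoothing constant, not a value from the paper.
    """
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1
    table = {}
    for value, class_counts in counts.items():
        denom = sum(class_counts.values()) + alpha * n_classes
        table[value] = [(class_counts[k] + alpha) / denom for k in range(n_classes)]
    return table


def multiclass_log_loss(labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(labels, probs):
        p = min(max(row[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(labels)


# Hypothetical toy data: gene names as a categorical feature, 3 classes.
genes = ["BRCA1", "TP53", "BRCA1", "EGFR"]
classes = [0, 1, 0, 2]
table = response_code(genes, classes, n_classes=3)

# A predictor that spreads probability evenly over 9 classes scores ln(9).
uniform_rows = [[1.0 / 9] * 9 for _ in range(4)]
baseline = multiclass_log_loss([0, 3, 8, 5], uniform_rows)
```

For scale, the uniform 9-class baseline works out to ln(9), roughly 2.20, so the reported log loss values of 1.10 and 1.24 both sit well below it.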

A Comparison of Scaling Methods to Obtain Calibrated Probabilities of Activity for Ligand-Target Predictions

10.26434/chemrxiv.11526132, 2020

Author(s): Lewis Mervin, Avid M. Afzal, Ola Engkvist, Andreas Bender

Keyword(s): Target Prediction, Support Vector, Machine Learning Method, Learning Method, Protein Target, Bioactivity Prediction, Vector Machines, Scaling Methods, Data Points, Compound Target

In the context of bioactivity prediction, the question of how to calibrate a score produced by a machine learning method into a reliable probability of binding to a protein target has not yet been satisfactorily addressed. In this study, we compared the performance of three such methods, namely Platt Scaling, Isotonic Regression and Venn-ABERS, in calibrating prediction scores for ligand-target prediction with the Naïve Bayes, Support Vector Machines and Random Forest algorithms, using bioactivity data available at AstraZeneca (40 million compound-target data points across 2112 targets). Performance was assessed using Stratified Shuffle Split (SSS) and Leave 20% of Scaffolds Out (L20SO) validation.
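To make one of the three calibration methods concrete, here is a minimal sketch of isotonic regression using the pool-adjacent-violators procedure: raw scores are sorted, and neighbouring groups are merged until the fitted probabilities are non-decreasing in the score. The function name and the toy scores and activity labels are hypothetical, and tied scores get no special handling. This is not the authors' pipeline.

```python
def isotonic_calibrate(scores, labels):
    """Return (score, calibrated_probability) pairs, sorted by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum_of_labels, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge backwards while a block's mean exceeds the mean of its successor.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [(scores[i], p) for i, p in zip(order, fitted)]


# Hypothetical classifier scores and binary activity labels.
raw_scores = [-1.2, -0.3, 0.1, 0.8, 1.5]
active = [0, 1, 0, 1, 1]
calibrated = isotonic_calibrate(raw_scores, active)
```

In this toy case the out-of-order pair (label 1 at score -0.3, label 0 at score 0.1) is pooled into a shared probability of 0.5, which is the characteristic step-function behaviour of isotonic calibration.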

Persian Handwritten Number Recognition Using Adapted Framing Feature and Support Vector Machines

International Journal of Computational Intelligence and Applications, 10.1142/s1469026816500048, 2016, Vol 15 (01), pp. 1650004, Cited By ~ 3

Author(s): Hedieh Sajedi, Mehran Bahador

Keyword(s): Support Vector Machines, Recognition Rate, Nearest Neighbors, Polynomial Kernel, Support Vector, K Nearest Neighbors, New Approach, Number Recognition, Vector Machines

In this paper, a new approach for the segmentation and recognition of Persian handwritten numbers is presented. The method combines the framing feature technique with the outer profile feature; we call the result the adapted framing feature. In the proposed approach, segmentation of numbers into digits is carried out automatically. In the classification stage, Support Vector Machines (SVM) and k-Nearest Neighbors (k-NN) are used. Experiments are conducted on the IFHCDB database, consisting of 17,740 numeral images, and the HODA database, consisting of 102,352 numeral images. At the isolated-digit level, SVM with a polynomial kernel achieves a recognition rate of 99.27% on IFHCDB and 99.07% on HODA. The experiments show that the proposed method achieves higher accuracy than previous work.
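For readers unfamiliar with the polynomial kernel mentioned here, the sketch below shows its usual form, (gamma * <x, y> + coef0) ** degree. The abstract does not report the degree or the other kernel parameters, so the defaults and the two toy vectors are assumptions for illustration only.

```python
def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


# Hypothetical two-dimensional feature vectors.
k = polynomial_kernel([1.0, 2.0], [3.0, 4.0], degree=2)
```

With these toy inputs the dot product is 11, so the degree-2 kernel value is (11 + 1) squared.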

Evaluating Grayware Characteristics and Risks

Journal of Computer Networks and Communications, 10.1155/2011/569829, 2011, Vol 2011, pp. 1-28, Cited By ~ 1

Author(s): Zhongqiang Chen, Zhanyan Liang, Yuan Zhang, Zhongrong Chen

Keyword(s): Information Gain, Feature Space, Training Data, Support Vector, Learning Models, Generalization Capability, Self Organizing Maps, Defense Strategies, Security Applications, Vector Machines

Grayware encyclopedias collect known species to provide information for incident analysis; however, their lack of categorization and generalization capability renders them ineffective in the development of defense strategies against clustered strains. A grayware categorization framework is therefore proposed here, not only to classify grayware according to diverse taxonomic features but also to facilitate evaluation of grayware risk to cyberspace. Armed with Support Vector Machines, the framework builds learning models based on training data extracted automatically from grayware encyclopedias and visualizes categorization results with Self-Organizing Maps. The features used in the learning models are selected with information gain, and the high dimensionality of the feature space is reduced by word stemming and stopword removal. The grayware categorizations on diversified features reveal that grayware typically attempts to improve its penetration rate by resorting to multiple installation mechanisms and reduced code footprints. The framework also shows that grayware evades detection by attacking victims' security applications and resists removal by enhancing its clotting capability with infected hosts. Our analysis further points out that species in the categories Spyware and Adware continue to dominate the grayware landscape and pose extremely critical threats to the Internet ecosystem.
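Feature selection by information gain, as mentioned in this abstract, can be sketched as follows: the gain of a feature is the entropy of the class labels minus the entropy that remains after splitting on that feature. The binary term-presence features and the category labels below are hypothetical toy data, not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical term-presence features for four encyclopedia entries.
categories = ["spyware", "spyware", "adware", "adware"]
term_a = [1, 1, 0, 0]  # perfectly separates the two categories
term_b = [1, 0, 1, 0]  # carries no information about the category
gain_a = information_gain(term_a, categories)
gain_b = information_gain(term_b, categories)
```

A selection step would then keep the terms with the highest gain; here `term_a` would be retained and `term_b` discarded.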

Optimizing Training Data and Hyperparameters of Support Vector Machines Using a Memetic Algorithm

Advances in Intelligent Systems and Computing - Man-Machine Interactions 6, 10.1007/978-3-030-31964-9_22, 2019, pp. 229-238

Author(s): Wojciech Dudzik, Michal Kawulok, Jakub Nalepa

Keyword(s): Support Vector Machines, Memetic Algorithm, Training Data, Support Vector, Vector Machines

Detection of Disease Symptoms on Hyperspectral 3D Plant Models

ISPRS Annals of Photogrammetry Remote Sensing and Spatial Information Sciences, 10.5194/isprs-annals-iii-7-89-2016, 2016, Vol III-7, pp. 89-96, Cited By ~ 2

Author(s): Ribana Roscher, Jan Behmann, Anne-Katrin Mahlein, Jan Dupuis, Heiner Kuhlmann, ...

Keyword(s): Support Vector Machines, Sparse Representation, Test Data, Training Data, Hyperspectral Images, Spectral Information, Support Vector, Leaf Spot Disease, Disease Symptoms, Vector Machines

We analyze the benefit of combining hyperspectral image information with 3D geometry information for the detection of Cercospora leaf spot disease symptoms on sugar beet plants. Besides the commonly used One-Class Support Vector Machines, we utilize an unsupervised sparse representation-based approach with a group sparsity prior. Geometry information is incorporated by representing each sample of interest with an inclination-sorted dictionary, which can be seen as a 1D topographic dictionary. We compare this approach with a sparse representation-based approach without geometry information and with One-Class Support Vector Machines. One-Class Support Vector Machines are applied to hyperspectral data without geometry information as well as to hyperspectral images with additional pixelwise inclination information. Our results show a gain in accuracy when geometry information is used in addition to spectral information, regardless of the approach. However, the two methods place different demands on the data when applied to new test data sets. One-Class Support Vector Machines require full inclination information for both test and training data, whereas the topographic dictionary approach needs only spectral information to reconstruct test data once the dictionary has been built from spectra with inclination.

An Incremental Isomap Method for Hyperspectral Dimensionality Reduction and Classification

Photogrammetric Engineering & Remote Sensing, 10.14358/pers.87.7.445, 2021, Vol 87 (6), pp. 445-455

Author(s): Yi Ma, Zezhong Zheng, Yutang Ma, Mingcang Zhu, Ran Huang, ...

Keyword(s): Manifold Learning, Nearest Neighbor, Hyperspectral Image, Hyperspectral Data, Training Data, Support Vector, Data Sets, K Nearest Neighbor, Data Set, Data Points

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix of size N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N^2). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
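The quadratic memory cost described here is easy to put into numbers. The sketch below assumes a dense matrix of 8-byte floating-point entries and uses hypothetical point counts; whether the landmark step in the paper reduces the analysis to a matrix over the landmarks alone is likewise an assumption made only to show the scaling.

```python
def dense_matrix_bytes(n_points, bytes_per_entry=8):
    """Memory needed for a dense n_points x n_points similarity matrix."""
    return n_points * n_points * bytes_per_entry


# Hypothetical sizes: 100,000 pixels versus 1,000 sampled landmarks.
full_bytes = dense_matrix_bytes(100_000)
landmark_bytes = dense_matrix_bytes(1_000)
reduction = full_bytes // landmark_bytes
```

Under these assumptions the full matrix needs 80 GB while the landmark matrix needs 8 MB, a factor of 10,000 for a 100-fold reduction in points.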

Recognition of Gait Activities Using Acceleration Data from A Smartphone and A Wearable Device

Proceedings, 10.3390/proceedings2019031060, 2019, Vol 31 (1), pp. 60, Cited By ~ 1

Author(s): Irvin Hussein Lopez-Nava, Matias Garcia-Constantino, Jesus Favela

Keyword(s): Assisted Living, Inertial Sensor, Ambient Assisted Living, Human Gait, Support Vector, K Nearest Neighbors, Acceleration Data, Vector Machines, Young Subjects, Physical Spaces

Activity recognition is an important task in many fields, such as ambient intelligence, pervasive healthcare, and surveillance. In particular, the recognition of human gait can be useful to identify the characteristics of the places or physical spaces in which people move, such as whether the person is walking on level ground or walking down stairs. For example, ascending or descending stairs can be a risky activity for older adults because of a possible fall, which can have more severe consequences than if it occurred on a flat surface. While portable and wearable devices have been widely used to detect Activities of Daily Living (ADLs), few research works in the literature have focused on characterizing only actions of human gait. In the present study, a method is introduced for recognizing gait activities using acceleration data obtained from a smartphone and a wearable inertial sensor placed on the subject's ankle. The acceleration signals were segmented based on the automatic detection of strides, also called gait cycles. Subsequently, a feature vector was extracted from the segmented signals and used to train four classifiers using the Naive Bayes, C4.5, Support Vector Machines, and K-Nearest Neighbors algorithms. Data was collected from seven young subjects who performed five gait activities: (i) going down an incline, (ii) going up an incline, (iii) walking on level ground, (iv) going down stairs, and (v) going up stairs. The results demonstrate the viability of using the proposed method and technologies in ambient assisted living contexts.
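The pipeline outlined in this abstract (segment per stride, extract a feature vector, classify) can be sketched with a small K-Nearest Neighbors example. The abstract does not list the features used, so the summary statistics below, the acceleration values, and the two activity labels are all hypothetical stand-ins.

```python
import math
import statistics
from collections import Counter


def stride_features(segment):
    """Assumed features for one stride: mean, spread, and extremes."""
    return [
        statistics.mean(segment),
        statistics.pstdev(segment),
        min(segment),
        max(segment),
    ]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    nearest = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical acceleration magnitudes, one list per detected stride.
strides = [
    ([0.9, 1.0, 1.1, 1.0], "level_ground"),
    ([1.0, 1.1, 0.9, 1.0], "level_ground"),
    ([0.95, 1.05, 1.0, 1.0], "level_ground"),
    ([0.2, 1.8, 0.4, 1.6], "down_stairs"),
    ([0.3, 1.7, 0.2, 1.9], "down_stairs"),
    ([0.1, 2.0, 0.5, 1.5], "down_stairs"),
]
train_x = [stride_features(segment) for segment, _ in strides]
train_y = [label for _, label in strides]

predicted = knn_predict(train_x, train_y, stride_features([0.25, 1.75, 0.3, 1.7]))
```

In practice the features would usually be scaled before computing distances; that step is omitted here to keep the sketch short.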

Assessment of Interventions in Fuel Management Zones Using Remote Sensing

ISPRS International Journal of Geo-Information, 10.3390/ijgi9090533, 2020, Vol 9 (9), pp. 533, Cited By ~ 2

Author(s): Ricardo Afonso, André Neves, Carlos Viegas Damásio, João Moura Pires, Fernando Birra, ...

Keyword(s): Satellite Images, Vegetation Indices, Nearest Neighbors, Machine Learning Algorithms, Support Vector, Fuel Management, K Nearest Neighbors, Management Zones, Vector Machines, Sentinel 2

Every year, wildfires strike the Portuguese territory and are a concern for public entities and the population. To prevent a wildfire's progression and minimize its impact, Fuel Management Zones (FMZs) have been stipulated, by law, around buildings and settlements and along national roads and other infrastructure. FMZs require monitoring of the vegetation condition so that maintenance and cleaning of these zones can proceed promptly. To improve FMZ monitoring, this paper proposes the use of satellite images, such as those from Sentinel-1 and Sentinel-2, along with vegetation indices and extracted temporal characteristics (max, min, mean and standard deviation) associated with the vegetation within and outside the FMZs, to determine whether the zones were treated. These characteristics feed machine-learning algorithms such as XGBoost, Support Vector Machines, K-nearest neighbors and Random Forest. The results show that it is possible to detect an intervention in an FMZ with high accuracy, namely with an F1-score ranging from 90% to 94% and a Kappa ranging from 0.80 to 0.89.
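The four temporal characteristics named here (max, min, mean and standard deviation) can be computed from a time series of a vegetation index as in the sketch below. The index values are hypothetical, and the choice of the population standard deviation is an assumption, since the abstract does not say which variant was used.

```python
import statistics


def temporal_features(series):
    """Summarize one zone's vegetation-index time series."""
    return {
        "max": max(series),
        "min": min(series),
        "mean": statistics.mean(series),
        "std": statistics.pstdev(series),  # population form, an assumption
    }


# Hypothetical index values for one zone; the drop mimics a clearing.
index_series = [0.71, 0.69, 0.70, 0.32, 0.30]
features = temporal_features(index_series)
```

A feature dictionary like this, computed for the vegetation inside and outside a zone, is the kind of input a classifier could use to flag a treated zone.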

Optimizing the prediction accuracy of load-settlement behavior of single pile using a self-learning data mining approach

MATEC Web of Conferences, 10.1051/matecconf/201925802010, 2019, Vol 258, pp. 02010

Author(s): Doddy Prayogo, Yudas Tadeus Teddy Susanto

Keyword(s): Data Mining, Prediction Accuracy, Soil Layer, Training Data, Support Vector, Data Mining Approach, Settlement Behavior, Data Points, Single Piles, Self Learning

Pile foundations are usually used when the upper soil layers are soft clay and, hence, unable to support the structures' loads. Piles are needed to carry these loads deep into the hard soil layer. Therefore, the safety and stability of pile-supported structures depend on the behavior of the piles. Additionally, an accurate prediction of the piles' behavior is very important to ensure satisfactory performance of the structures. Although many methods in the literature estimate the settlement of piles both theoretically and experimentally, methods for comprehensively predicting the load-settlement behavior of piles are very limited. This study develops a new data mining approach called self-learning support vector machine (SL-SVM) to predict the load-settlement behavior of single piles. SL-SVM performance is investigated using 446 training data points and 53 test data points of cone penetration test (CPT) data obtained from the previous literature. The prediction accuracy is then compared with that of other prediction methods using three statistical measurements: mean absolute error (MAE), coefficient of correlation (R), and root mean square error (RMSE). The results show that SL-SVM achieves better accuracy than LS-SVM and BPNN. This confirms the capability of the proposed data mining method to accurately model the load-settlement behavior of single piles from CPT data. The paper offers useful insights for geotechnical engineers involved in estimating pile behavior.
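The three statistical measurements used for comparison can be written out directly. The sketch below uses their standard definitions, with R taken as the Pearson correlation coefficient; the settlement values are hypothetical and are not data from the study.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def pearson_r(actual, predicted):
    """Coefficient of correlation between measured and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical measured and predicted settlements.
measured = [1.0, 2.0, 3.0]
estimated = [1.0, 2.0, 4.0]
scores = {
    "MAE": mae(measured, estimated),
    "RMSE": rmse(measured, estimated),
    "R": pearson_r(measured, estimated),
}
```

Lower MAE and RMSE and a higher R indicate a better fit. Note that R alone can be high for predictions that are systematically scaled or shifted, which is one reason it is usually reported alongside the two error measures.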
