An Imbalance SVM for MicroRNA Target Genes Prediction

2014 ◽  
Vol 577 ◽  
pp. 1245-1251
Author(s):  
Zhi Ru Chen ◽  
Wen Xue Hong ◽  
Pei Pei Zhao

Imbalanced miRNA target sample data lower the prediction accuracy of the support vector machine (SVM). This paper proposes an SVM algorithm, based on the idea of biased discriminant analysis, to predict target genes. An optimal feature set is selected as input data, and a kernel optimization objective function is constructed from the biased discriminant analysis criterion in the empirical feature space. A conformal transformation of the kernel is used to optimize the kernel matrix step by step. A comparative analysis of experimental results on human, mouse and rat data shows that the imbalance SVM with biased discriminant achieves higher specificity, sensitivity and prediction accuracy, indicating stronger generalization ability and better robustness.
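The abstract does not give the form of its conformal transformation. A minimal sketch, assuming the form commonly used in the kernel-optimization literature, k~(x, y) = q(x) k(x, y) q(y) with a positive factor function q, is shown below. The RBF base kernel, the sample points and the factor values in `q` are all illustrative assumptions, not the authors' choices.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) base kernel; gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kernel_matrix(points, gamma=0.5):
    """Pairwise kernel matrix K[i][j] = k(points[i], points[j])."""
    return [[rbf_kernel(p, r, gamma) for r in points] for p in points]

def conformal_transform(K, q):
    """Rescale K entrywise: K_tilde[i][j] = q[i] * K[i][j] * q[j]."""
    n = len(K)
    return [[q[i] * K[i][j] * q[j] for j in range(n)] for i in range(n)]

# Hypothetical data: three 2-D samples and hand-picked factor values.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(points)
q = [1.0, 2.0, 0.5]
K_tilde = conformal_transform(K, q)
```

In an iterative scheme of the kind the abstract describes, the values in `q` would be updated at each step to improve the chosen objective; the sketch only shows a single rescaling, which keeps the matrix symmetric.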

2014 ◽  
Vol 602-605 ◽  
pp. 1634-1637
Author(s):  
Fang Nian Wang ◽  
Shen Shen Wang ◽  
Wan Fang Che ◽  
Yun Bai

This paper studies an intrusion detection method based on RS-LSSVM. First, an attribute reduction algorithm based on the generalized decision table is proposed to remove interfering features and reduce the dimension of the input feature space. Then the classification method based on the least squares support vector machine (LSSVM) is analyzed. The dimension-reduced sample data are used to train the LSSVM, yielding a classification model that is able to detect unknown intrusions. Simulation results show that the proposed method effectively removes unnecessary features and improves the performance of network intrusion detection.
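For orientation, the textbook LS-SVM classifier replaces the SVM quadratic program with a single linear system in the bias b and the multipliers alpha. The sketch below follows that standard formulation and is not taken from the paper; the linear kernel, the toy data and the regularization value `gamma` are assumptions, and the attribute-reduction step is not shown.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def lssvm_train(X, y, kernel, gamma):
    """Build and solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(X)
    A = [[0.0] + [float(v) for v in y]]
    for i in range(n):
        row = [float(y[i])]
        for j in range(n):
            omega = y[i] * y[j] * kernel(X[i], X[j])
            row.append(omega + (1.0 / gamma if i == j else 0.0))
        A.append(row)
    sol = solve(A, [0.0] + [1.0] * n)
    return sol[0], sol[1:]  # bias b, multipliers alpha

def lssvm_predict(x, X, y, alpha, b, kernel):
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
    return 1 if s >= 0 else -1

# Hypothetical one-dimensional, linearly separable toy set.
X = [(-2.0,), (-1.0,), (1.0,), (2.0,)]
y = [-1, -1, 1, 1]
b, alpha = lssvm_train(X, y, linear_kernel, gamma=10.0)
predictions = [lssvm_predict(x, X, y, alpha, b, linear_kernel) for x in X]
```

Because training reduces to one linear solve, every sample receives a nonzero multiplier, which is one reason dimension reduction before training can matter for LS-SVM models.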


Author(s):  
Beaulah Jeyavathana Rajendran ◽  
Kanimozhi K. V.

Tuberculosis is a dangerous infectious disease characterized by the development of tubercles in the tissues. It mainly affects the lungs but can also affect other parts of the body, and it can be readily diagnosed by radiologists. The main objective of this chapter is to obtain the best solution, selected by means of modified particle swarm optimization, which is regarded as the optimal feature descriptor. Tuberculosis is detected in five stages commonly used in medical image processing: image pre-processing, lung segmentation, feature extraction, feature selection and classification. In feature extraction, the gray-level co-occurrence matrix (GLCM) approach is used to extract the features, and the optimal features are selected from the extracted feature sets by random forest. Finally, a support vector machine classifier is used for image classification. Experiments were carried out and intermediate results obtained. The proposed system achieves better classification accuracy than the existing method.
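To make the GLCM step concrete, the sketch below counts how often gray level i is followed by gray level j at a fixed pixel offset and derives the contrast feature from those counts. The 3x3 image, the three gray levels and the horizontal offset are illustrative assumptions; the chapter's actual settings are not stated in the abstract.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count co-occurrences of gray levels at offset (dy, dx); not symmetrized."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts

def contrast(counts):
    """Contrast = sum over (i, j) of p(i, j) * (i - j)^2, with p the normalized counts."""
    total = sum(sum(row) for row in counts)
    n = len(counts)
    weighted = sum(counts[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))
    return weighted / total

# Hypothetical 3x3 image quantized to three gray levels (0, 1, 2).
image = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 2],
]
counts = glcm(image, levels=3)
feature = contrast(counts)
```

Other common GLCM descriptors (energy, homogeneity, correlation) are computed from the same normalized matrix, usually over several offsets and angles.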


2006 ◽  
Vol 2 ◽  
pp. 117693510600200 ◽  
Author(s):  
Edward R. Dougherty ◽  
Marcel Brun

The issue of wide feature-set variability has recently been raised in the context of expression-based classification using microarray data. This paper addresses this concern by demonstrating the natural manner in which many feature sets of a certain size chosen from a large collection of potential features can be so close to being optimal that they are statistically indistinguishable. Feature-set optimality is inherently related to sample size because it only arises on account of the tendency for diminished classifier accuracy as the number of features grows too large for satisfactory design from the sample data. The paper considers optimal feature sets in the framework of a model in which the features are grouped in such a way that intra-group correlation is substantial whereas inter-group correlation is minimal, the intent being to model the situation in which there are groups of highly correlated co-regulated genes and there is little correlation between the co-regulated groups. This is accomplished by using a block model for the covariance matrix that reflects these conditions. Focusing on linear discriminant analysis, we demonstrate how these assumptions can lead to very large numbers of close-to-optimal feature sets.
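A block covariance model of the kind described can be sketched as follows. The sketch assumes equal variance for every feature, a single intra-group correlation `rho`, and exactly zero correlation between groups (the abstract only says inter-group correlation is minimal); the group sizes are hypothetical.

```python
def block_covariance(group_sizes, rho, sigma2=1.0):
    """Covariance with correlation rho inside each group and 0 between groups."""
    n = sum(group_sizes)
    # Map each feature index to the index of the group it belongs to.
    group_of = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cov[i][j] = sigma2
            elif group_of[i] == group_of[j]:
                cov[i][j] = rho * sigma2
    return cov

# Hypothetical example: two co-regulated groups of sizes 2 and 3.
cov = block_covariance([2, 3], rho=0.8)
```

With strong intra-group correlation, swapping one feature for another from the same group changes the classifier very little, which gives an intuition for why many feature sets end up close to optimal.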


2018 ◽  
Vol 2 (1) ◽  
Author(s):  
عمر صابر قاسم ◽  
محمد علي محمد

Selecting the features needed for data classification is a problem of great importance in determining the efficiency of the classification technique used, especially when the data are very large, as with gene-based leukemia data. A proposed hybrid algorithm (AGA_SVM) was used that combines the Adaptive Genetic Algorithm with the Support Vector Machine technique. The adaptive genetic algorithm transforms the data from the high-dimensional pattern space (High-D Patterns Space) to the low-dimensional feature space (Low-D Feature Space) in order to identify the features necessary for classification, which is then carried out by the support vector machine. Application to the leukemia data showed that the classification rate was 100% for both training and test cases with the proposed method (AGA_SVM), whereas the conventional method misclassified several cases, which indicates the efficiency of the proposed method compared with the conventional one.


2021 ◽  
Vol 12 ◽  
Author(s):  
Chen-Yuan Kuo ◽  
Tsung-Ming Tai ◽  
Pei-Lin Lee ◽  
Chiu-Wang Tseng ◽  
Chieh-Yu Chen ◽  
...  

Brain age is an imaging-based biomarker with excellent feasibility for characterizing individual brain health and may serve as a single quantitative index for clinical and domain-specific usage. Brain age has been successfully estimated using extensive neuroimaging data from healthy participants with various feature extraction and conventional machine learning (ML) approaches. Recently, several end-to-end deep learning (DL) analytical frameworks have been proposed as alternative approaches to predict individual brain age with higher accuracy. However, the optimal approach to select and assemble appropriate input feature sets for DL analytical frameworks remains to be determined. In the Predictive Analytics Competition 2019, we proposed a hierarchical analytical framework which first used ML algorithms to investigate the potential contribution of different input features for predicting individual brain age. The obtained information then served as a priori knowledge for determining the input feature sets of the final ensemble DL prediction model. Systematic evaluation revealed that ML approaches with multiple concurrent input features, including tissue volume and density, achieved higher prediction accuracy when compared with approaches with a single input feature set [Ridge regression: mean absolute error (MAE) = 4.51 years, R2 = 0.88; support vector regression, MAE = 4.42 years, R2 = 0.88]. Based on this evaluation, a final ensemble DL brain age prediction model integrating multiple feature sets was constructed with reasonable computation capacity and achieved higher prediction accuracy when compared with ML approaches in the training dataset (MAE = 3.77 years; R2 = 0.90). Furthermore, the proposed ensemble DL brain age prediction model also demonstrated sufficient generalizability in the testing dataset (MAE = 3.33 years). 
In summary, this study provides initial evidence of how to efficiently integrate ML and advanced DL approaches into a unified analytical framework for predicting individual brain age with higher accuracy. With the increase in large open multiple-modality neuroimaging datasets, ensemble DL strategies with appropriate input feature sets serve as a candidate approach for predicting individual brain age in the future.
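The two metrics reported above can be computed as in the sketch below. It assumes R2 denotes the coefficient of determination (some brain-age studies report the squared Pearson correlation instead), and the ages used are made-up values, not data from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between chronological and predicted age."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical chronological ages and model predictions, in years.
ages = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 29.0, 43.0, 48.0]
mae = mean_absolute_error(ages, predicted)
r2 = r_squared(ages, predicted)
```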


2020 ◽  
Author(s):  
Richard Woodman ◽  
Kimberley Bryant ◽  
Michael J Sorich ◽  
Alberto Pilotto ◽  
Arduino Aleksander Mangoni

BACKGROUND The Multidimensional Prognostic Index (MPI) is an aggregate comprehensive geriatric assessment scoring system, derived from eight domains, that predicts adverse outcomes including 12-month mortality (12MM). However, prediction accuracy using the three MPI categories (mild, moderate, severe risk), as in previous investigations, was relatively poor in a recent study of older hospitalized Australian patients. Prediction modelling using the component domains of the MPI together with additional clinical features and machine learning (ML) algorithms might improve prediction accuracy.
OBJECTIVE To assess whether prediction accuracy for 12MM using logistic regression with maximum likelihood estimation (LR-MLE) with the 3-category MPI together with age and gender (feature-set 1) can be improved by adding 10 clinical features (sodium, haemoglobin, albumin, creatinine, urea, urea/creatinine ratio, estimated glomerular filtration rate, C-reactive protein, body mass index and anticholinergic risk score) (feature-set 2), and by replacing the 3-category MPI in feature-sets 1 and 2 with the eight separate MPI domains (feature-sets 3 and 4 respectively). Also, to assess the prediction accuracy of ML algorithms using the same feature-sets.
METHODS MPI and clinical features were collected in patients aged ≥65 years admitted to either the General Medical or the Acute Care of the Elderly wards of a South Australian hospital between September 2015 and February 2017. The diagnostic accuracy of LR-MLE was assessed together with that of nine ML algorithms: decision-trees, random-forests, eXtreme gradient-boosting (XGBoost), support-vector-machines, naïve-bayes, k-nearest-neighbours, ridge regression, logistic regression without regularisation and neural-networks. A 70:30 training:test split of the data and a grid-search of hyper-parameters with 10-fold cross-validation were employed during training of the ML algorithms. The area under the receiver operating characteristic curve (AUC) was used to assess prediction accuracy.
RESULTS A total of 737 patients (F:M = 50.2%:49.8%) with median (IQR) age 80 (72-86) years had complete MPI data recorded on admission and complete 12-month follow-up. The AUC for LR-MLE was 0.632, 0.688, 0.738 and 0.757 for feature-sets 1 to 4 respectively. The best overall accuracy among the nine ML algorithms was obtained with XGBoost (0.635, 0.706, 0.756 and 0.757 for feature-sets 1 to 4 respectively).
CONCLUSIONS The use of MPI domains (feature-sets 3 and 4) with LR-MLE considerably improved prediction accuracy compared with that obtained using the traditional 3-category MPI. The XGBoost ML algorithm slightly improved accuracy compared with LR-MLE for feature-sets 1-3 but not for feature-set 4. Adding clinical data also provided small gains in accuracy for LR-MLE and for some, but not all, ML algorithms. Consideration should be given to using the underlying MPI domains of aggregate scoring systems, additional clinical data and ML algorithms when assessing the risk of 12MM.
CLINICALTRIAL N/A
Keywords: Machine learning, Multidimensional Prognostic Index, mortality, diagnostic accuracy, XGBoost
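As a reminder of what the AUC values measure, the sketch below computes AUC as the probability that a randomly chosen positive case (here, a patient who died within 12 months) receives a higher predicted risk than a randomly chosen negative case, counting ties as one half. The labels and risk scores are invented for illustration and are unrelated to the study data.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = died within 12 months) and predicted risks.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]
value = auc(labels, scores)
```

The pairwise loop is quadratic in the number of cases; rank-based implementations give the same value more efficiently on larger cohorts.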


Author(s):  
Xiaojing Gao ◽  
Heru Xue ◽  
Xin Pan ◽  
Xiaoling Luo

Microscopic images of bovine milk somatic cells are used to classify neutrophils, epithelial cells, macrophages and lymphocytes. Pattern recognition technology is used to approach the classification and recognition problem in terms of different properties, levels and spaces. The proposed RKSGA-SVM algorithm is used to realize somatic cell image recognition. First, color, morphological and texture features of the four cell types are extracted separately, including geometric and moment-invariant features. Second, the ReliefF algorithm is used to calculate the weights of all features, and a preliminary feature set is obtained according to a preset cumulative contribution rate. Third, redundant features are eliminated by the Kolmogorov–Smirnov (KS) test, giving the advanced optimal feature set. The selected feature sets have remarkable distinguishing ability. Finally, the weighted optimal feature sets are obtained by applying the weighted coefficient method to the advanced optimal feature sets. The overall accuracy of the RKSGA-SVM algorithm is 99.00%, and the Kappa coefficient is 0.987. The proposed algorithm balances classification accuracy, redundancy elimination and feature dimension reduction. While maintaining high classification accuracy, the feature set reduces the feature dimension and the amount of computation, improves operating efficiency and saves storage space. Experiments show that the proposed feature selection method is feasible and well suited to extracting feature sets for somatic cell classification.
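The Kappa coefficient quoted above corrects overall accuracy for the agreement expected by chance. A minimal sketch of Cohen's kappa computed from a confusion matrix follows; the 2x2 matrix is a made-up example, not the paper's four-class result.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1.0 - expected)

# Hypothetical two-class confusion matrix with 100 samples.
confusion = [
    [45, 5],
    [10, 40],
]
kappa = cohen_kappa(confusion)
```

In this example the overall accuracy is 0.85, while kappa is lower because half of that agreement would be expected by chance alone.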


2012 ◽  
Vol 461 ◽  
pp. 753-756
Author(s):  
Chong Xing ◽  
Yao Wang ◽  
You Zhou ◽  
Yan Chun Liang

Non-coding RNA prediction has recently become one of the most important research topics in bioinformatics. In this paper, on the basis of principal component analysis, we present a tRNA prediction strategy using the least squares support vector machine (LS-SVM). The occurrence frequencies of single nucleotides and 2-nucleotides, together with (G-C)% and (A-T)%, were chosen as input features. Tests showed a prediction accuracy of 90.51% on a prokaryotic tRNA dataset. The experimental results indicate that the method is effective for prokaryotic ncRNA prediction.
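The input features described above can be sketched as a simple composition vector. The sketch assumes a DNA alphabet, overlapping 2-nucleotide windows, and that (G-C)% and (A-T)% mean the percentage of G plus C and of A plus T bases; the function name and the example sequence are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"

def composition_features(seq):
    """Return single-nucleotide and 2-nucleotide frequencies plus GC% and AT%."""
    seq = seq.upper()
    n = len(seq)
    mono = {base: seq.count(base) / n for base in ALPHABET}
    # Overlapping windows of length 2: positions (0,1), (1,2), ...
    pairs = [seq[i:i + 2] for i in range(n - 1)]
    di = {a + b: pairs.count(a + b) / len(pairs)
          for a, b in product(ALPHABET, repeat=2)}
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / n
    at_percent = 100.0 * (seq.count("A") + seq.count("T")) / n
    return mono, di, gc_percent, at_percent

# Hypothetical short sequence; real tRNA genes are roughly 70 to 90 bases long.
mono, di, gc_percent, at_percent = composition_features("ACGTACGG")
```

This yields 4 + 16 + 2 = 22 raw features per sequence, which a principal component analysis step could then compress before classification.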


Author(s):  
PAK KIN WONG ◽  
CHI MAN VONG ◽  
CHUN SHUN CHEUNG ◽  
KA IN WONG

To predict the performance of a diesel engine, current practice relies on black-box identification, in which numerous experiments must be carried out to obtain numerical values for model training. Although many diesel engine models based on artificial neural networks (ANNs) have been developed, they have many drawbacks, such as local minima, the burden on the user of selecting an optimal network structure, large training data requirements and poor generalization performance, which make them difficult to put into practice. This paper proposes using the extreme learning machine (ELM), which can overcome most of these drawbacks, to model the emission characteristics and the brake-specific fuel consumption of a diesel engine under scarce and exponential sample data sets. The resulting ELM model is compared with models developed using popular ANNs such as the radial basis function neural network (RBFNN) and advanced techniques such as the support vector machine (SVM) and its variants, namely the least squares support vector machine (LS-SVM) and the relevance vector machine (RVM). Furthermore, some emission outputs of diesel engines suffer from exponentiality (i.e., the output y grows exponentially with the input x), which degrades prediction accuracy. A logarithmic transformation is therefore applied to pre-process and post-process the sample data sets in order to improve the prediction accuracy of the model. Evaluation results show that ELM with the logarithmic transformation outperforms SVM, LS-SVM, RVM and RBFNN, with or without the logarithmic transformation, in both model accuracy and training time.
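The logarithmic pre-processing and post-processing step can be illustrated independently of the learner. The sketch below swaps the ELM for an ordinary least-squares line, purely to keep the example short: the model is fitted to log(y) and its predictions are mapped back with exp. The synthetic data y = 2 exp(0.5 x) and all names are assumptions for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic exponential output, standing in for an emission measurement.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Pre-processing: train on log(y), which is linear in x for this data.
log_ys = [math.log(y) for y in ys]
slope, intercept = fit_line(xs, log_ys)

def predict(x):
    """Post-processing: map the model's log-scale output back with exp."""
    return math.exp(slope * x + intercept)

extrapolated = predict(4.0)
```

The same wrapper applies to any regressor: the transformation only changes the target the model sees and the scale on which its outputs are reported.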

