Bagging Based Cross-Media Retrieval Algorithm

Author(s):  
Gongwen Xu ◽  
Yu Zhang ◽  
Mingshan Yin ◽  
Wenzhong Hong ◽  
Ran Zou ◽  
...  

Abstract It is very challenging to propose a strong learning algorithm with high prediction accuracy for cross-media retrieval, whereas finding a weak learning algorithm whose accuracy is only slightly better than random prediction is easy. Inspired by this idea, we propose a Bagging-based cross-media retrieval algorithm (called BCMR) in this paper. First, we use bootstrap sampling to draw random samples from the original training set; each bootstrap replicate is set to the same size as the original dataset. Second, 50 bootstrap replicates are used to train 50 weak classifiers independently. We take advantage of homogeneous individual classifiers and integrate eight different baseline methods in our experiments. Finally, we combine the 50 weak classifiers into the final strong classifier through a voting-based integration strategy. Using collective wisdom to eliminate bad decisions greatly enhances the generalization ability of the integrated model. Extensive experiments performed on three datasets show that BCMR effectively improves the accuracy of cross-media retrieval.
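The procedure described above follows the standard bagging recipe: bootstrap replicates the size of the training set, independently trained weak learners, and voting. A minimal sketch under that reading is given below; the decision-tree weak learner is only a stand-in for the cross-media baseline methods used in the paper, and integer class labels and NumPy arrays are assumed.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagged_ensemble(X, y, base_model=None, n_estimators=50, seed=0):
    """Train n_estimators weak models, each on a bootstrap replicate as large as X."""
    rng = np.random.default_rng(seed)
    base_model = base_model or DecisionTreeClassifier(max_depth=3)
    models, n = [], len(X)
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        models.append(clone(base_model).fit(X[idx], y[idx]))
    return models

def vote(models, X):
    """Fuse the weak predictions by plurality voting (integer labels assumed)."""
    preds = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```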

2021 ◽  
Vol 11 (5) ◽  
pp. 2039
Author(s):  
Hyunseok Shin ◽  
Sejong Oh

In machine learning applications, classification schemes have been widely used for prediction tasks. Typically, to develop a prediction model, the given dataset is divided into training and test sets; the training set is used to build the model and the test set is used to evaluate it. Furthermore, random sampling is traditionally used to divide datasets. The problem, however, is that the performance of the model is evaluated differently depending on how the training and test sets are divided. Therefore, in this study, we propose an improved sampling method for the accurate evaluation of a classification model. We first generate numerous candidate train/test splits using the R-value-based sampling method. We evaluate the similarity of the distributions of the candidate splits with the whole dataset, and the split with the smallest distribution difference is selected as the final train/test set. Histograms and feature importance are used to evaluate the similarity of distributions. The proposed method produces more representative training and test sets than previous sampling methods, including random and non-random sampling.
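A hedged sketch of the selection idea follows: generate candidate train/test splits, compare per-feature histograms of each split against the whole dataset, and keep the split with the smallest difference. Plain random splits stand in for the R-value-based candidate generation, and the L1 histogram distance is an assumed choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_distance(X_part, X_full, bins=10):
    """L1 distance between per-feature histograms of a subset and the full dataset."""
    d = 0.0
    for j in range(X_full.shape[1]):
        lo, hi = X_full[:, j].min(), X_full[:, j].max()
        h_part, _ = np.histogram(X_part[:, j], bins=bins, range=(lo, hi), density=True)
        h_full, _ = np.histogram(X_full[:, j], bins=bins, range=(lo, hi), density=True)
        d += np.abs(h_part - h_full).sum()
    return d

def most_representative_split(X, y, n_candidates=100, test_size=0.2, seed=0):
    """Keep the candidate split whose train and test parts both resemble the whole set."""
    best, best_d = None, np.inf
    for s in range(n_candidates):
        parts = train_test_split(X, y, test_size=test_size, random_state=seed + s, stratify=y)
        d = split_distance(parts[0], X) + split_distance(parts[1], X)
        if d < best_d:
            best, best_d = parts, d
    return best   # (X_train, X_test, y_train, y_test)
```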


2021 ◽  
Author(s):  
Zhenhao Li

Tuberculosis (TB) is a precipitating cause of lung cancer, and lung cancer patients with coexisting TB are difficult to differentiate from patients with TB alone. The aim of this study is to develop a prediction model that distinguishes the comorbid patients from those with isolated TB. In this work, based on laboratory data from 389 patients, 81 features, including the main laboratory examinations of blood tests, biochemical tests, coagulation assays, tumor markers, and baseline information, were initially used as integrated markers and then reduced to a discrimination system consisting of 31 top-ranked indices. Patients diagnosed with TB (PCR >1 mtb/ml) served as negative samples; lung cancer patients with TB, confirmed by pathological examination and TB PCR >1 mtb/ml, served as positive samples. We used the Spatially Uniform ReliefF (SURF) algorithm to determine feature importance, and the predictive model was built using the Random Forest machine learning algorithm. For cross-validation, the samples were randomly split into four training sets and one test set. The selected features comprise four tumor markers (Scc, Cyfra21-1, CEA, ProGRP and NSE), fifteen blood biochemical indices (GLU, IBIL, K, CL, Ur, NA, TBA, CHOL, SA, TG, A/G, AST, CA, CREA and CRP), six routine blood indices (EO#, EO%, MCV, RDW-S, LY# and MPV) and four coagulation indices (APTT ratio, APTT, PTA, TT ratio). The model presented robust and stable classification performance and can easily differentiate the comorbidity group from the isolated TB group, with AUC, ACC, sensitivity and specificity of 0.8817, 0.8654, 0.8594 and 0.8656 on the training set, respectively. Overall, this work may provide a novel strategy for identifying TB patients with lung cancer from routine admission laboratory examinations, with the advantages of being timely and economical. It also indicates that our model, with enough indices, may further increase the effectiveness and efficiency of diagnosis.
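A rough sketch of that pipeline appears below: rank features with a ReliefF-family score, keep the top-ranked indices, and train a Random Forest evaluated by cross-validation. The skrebate package's SURF implementation is assumed to be available; the feature matrix, binary labels, and number of retained indices are placeholders rather than the study's data.

```python
import numpy as np
from skrebate import SURF                          # assumed ReliefF-family implementation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def build_model(X, y, n_keep=31, seed=0):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    surf = SURF().fit(X, y)                        # score every feature
    top = np.argsort(surf.feature_importances_)[::-1][:n_keep]
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    # 5-fold CV loosely mirrors the 4:1 train/test split described in the abstract.
    auc = cross_val_score(clf, X[:, top], y, cv=5, scoring="roc_auc")
    clf.fit(X[:, top], y)
    return clf, top, auc
```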


2020 ◽  
Vol 34 (04) ◽  
pp. 6853-6860
Author(s):  
Xuchao Zhang ◽  
Xian Wu ◽  
Fanglan Chen ◽  
Liang Zhao ◽  
Chang-Tien Lu

The success of training accurate models strongly depends on the availability of a sufficient collection of precisely labeled data. However, real-world datasets contain erroneously labeled data samples that substantially hinder the performance of machine learning models. Meanwhile, well-labeled data is usually expensive to obtain and only a limited amount is available for training. In this paper, we consider the problem of training a robust model by using large-scale noisy data in conjunction with a small set of clean data. To leverage the information contained in the clean labels, we propose a novel self-paced robust learning algorithm (SPRL) that trains the model in a process from more reliable (clean) data instances to less reliable (noisy) ones under the supervision of well-labeled data. The self-paced learning process hedges the risk of selecting corrupted data into the training set. Moreover, theoretical analyses on the convergence of the proposed algorithm are provided under mild assumptions. Extensive experiments on synthetic and real-world datasets demonstrate that our proposed approach achieves a considerable improvement in effectiveness and robustness over existing methods.
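The following sketch illustrates the generic self-paced idea described above, not the exact SPRL objective: each round refits the model on all clean samples plus the noisy samples whose current loss falls below a threshold that grows over rounds. Logistic regression and the quantile-based pace are assumptions made for brevity, and labels are assumed to be integers 0..K-1 with every class present in the clean set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_paced_fit(X_clean, y_clean, X_noisy, y_noisy, rounds=5, start_q=0.2, step=0.2):
    model = LogisticRegression(max_iter=1000).fit(X_clean, y_clean)
    for r in range(rounds):
        # Per-sample cross-entropy loss of the noisy pool under the current model.
        proba = model.predict_proba(X_noisy)
        losses = -np.log(np.clip(proba[np.arange(len(y_noisy)), y_noisy], 1e-12, None))
        # Pace: admit the easiest q-fraction of noisy samples, growing each round.
        q = min(1.0, start_q + r * step)
        keep = losses <= np.quantile(losses, q)
        X_round = np.vstack([X_clean, X_noisy[keep]])
        y_round = np.concatenate([y_clean, y_noisy[keep]])
        model = LogisticRegression(max_iter=1000).fit(X_round, y_round)
    return model
```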


1994 ◽  
Vol 05 (02) ◽  
pp. 115-122
Author(s):  
MOSTEFA GOLEA

We describe a Hebb-type algorithm for learning unions of nonoverlapping perceptrons with binary weights. Two perceptrons are said to be nonoverlapping if they do not share any input variables. The learning algorithm is able to find both the network architecture and the weight values necessary to represent the target function. Moreover, the algorithm is local, homogeneous, and simple enough to be biologically plausible. We investigate the average behavior of this algorithm as a function of the size of the training set. We find that, as the size of the training set increases, the hypothesis network built by the algorithm “converges” to the target network, both in terms of the number of perceptrons and the connectivity. Moreover, the generalization rate converges exponentially to perfect generalization as a function of the number of training examples. The analytic expressions are in excellent agreement with the numerical simulations. To our knowledge, this is the first average case analysis of an algorithm that finds both the weight values and the network connectivity.
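To make the problem setup concrete, the toy example below builds a target that is the union (logical OR) of two perceptrons defined on disjoint input variables and computes a naive Hebbian correlation between each input and the label. It only illustrates the setting, not the paper's algorithm; the thresholds, input dimension, and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Two nonoverlapping perceptrons: one reads x[0:3], the other x[3:6]; x[6:8] is unused.
    p1 = int(x[:3].sum() >= 2)
    p2 = int(x[3:6].sum() >= 2)
    return p1 | p2                      # union of the two perceptrons

X = rng.integers(0, 2, size=(2000, 8))
y = np.array([target(x) for x in X])

# Hebbian-style statistic: mean product of the (+/-1)-coded input and label.
hebb = ((2 * X - 1) * (2 * y[:, None] - 1)).mean(axis=0)
print(np.round(hebb, 2))   # the six relevant inputs come out clearly positive, the last two near zero
```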


1996 ◽  
Vol 8 (7) ◽  
pp. 1391-1420 ◽  
Author(s):  
David H. Wolpert

This is the second of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. The first paper discusses a particular set of ways to compare learning algorithms, according to which there are no distinctions between learning algorithms. This second paper concentrates on different ways of comparing learning algorithms from those used in the first paper. In particular, this second paper discusses the associated a priori distinctions that do exist between learning algorithms. It is shown, loosely speaking, that for loss functions other than zero-one (e.g., quadratic loss), there are a priori distinctions between algorithms. However, even for such loss functions, it is shown here that any algorithm is equivalent on average to its “randomized” version, and in this sense still has no first-principles justification in terms of average error. Nonetheless, as this paper discusses, it may be that (for example) cross-validation has better head-to-head minimax properties than “anti-cross-validation” (choosing the learning algorithm with the largest cross-validation error). This may be true even for zero-one loss, a loss function for which the notion of “randomization” is not relevant. This paper also analyzes averages over hypotheses rather than targets. Such analyses hold for all possible priors over targets. Accordingly they prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure. In fact, for a very natural restriction of the class of learning algorithms, one should use anti-cross-validation rather than cross-validation (!).
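The snippet below only makes the two selection rules named above concrete on a toy problem: cross-validation picks the candidate learner with the smallest cross-validation error, while "anti-cross-validation" picks the one with the largest. The candidate learners and dataset are arbitrary illustrations, not anything from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
candidates = {"1-NN": KNeighborsClassifier(n_neighbors=1),
              "tree": DecisionTreeClassifier(random_state=0)}
cv_error = {name: 1 - cross_val_score(m, X, y, cv=5).mean() for name, m in candidates.items()}

chosen_by_cv = min(cv_error, key=cv_error.get)         # cross-validation
chosen_by_anti_cv = max(cv_error, key=cv_error.get)    # anti-cross-validation
print(cv_error, chosen_by_cv, chosen_by_anti_cv)
```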


Author(s):  
JUNHAI ZHAI ◽  
HONGYU XU ◽  
YAN LI

Extreme learning machine (ELM) is an efficient and practical learning algorithm used for training single hidden layer feed-forward neural networks (SLFNs). ELM can provide good generalization performance at extremely fast learning speed. However, ELM suffers from instability and over-fitting, especially on relatively large datasets. Based on probabilistic SLFNs, this paper proposes an approach that fuses extreme learning machines with the fuzzy integral (F-ELM). The proposed algorithm consists of three stages. First, the bootstrap technique is employed to generate several subsets of the original dataset. Second, probabilistic SLFNs are trained with the ELM algorithm on each subset. Finally, the trained probabilistic SLFNs are fused with the fuzzy integral. The experimental results show that the proposed approach alleviates the problems mentioned above to some extent and increases prediction accuracy.
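A sketch of the three-stage recipe is shown below: bootstrap replicates, one small ELM per replicate (random hidden layer plus least-squares output weights), and a fusion step. To keep the example short, score averaging stands in for the fuzzy-integral fusion, and integer labels 0..K-1 are assumed.

```python
import numpy as np

class TinyELM:
    """Single-hidden-layer net trained ELM-style: random hidden weights, solved output weights."""
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)              # random hidden-layer activations
        T = np.eye(n_classes)[y]                      # one-hot targets
        self.beta = np.linalg.pinv(H) @ T             # least-squares output weights
        return self

    def scores(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta   # class scores (not calibrated)

def f_elm(X, y, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for i in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap replicate of the dataset
        models.append(TinyELM(seed=seed + i).fit(X[idx], y[idx]))
    return models

def fused_predict(models, X):
    # Plain averaging stands in for the fuzzy-integral fusion used in the paper.
    return np.mean([m.scores(X) for m in models], axis=0).argmax(axis=1)
```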


1994 ◽  
Vol 05 (01) ◽  
pp. 67-75 ◽  
Author(s):  
BYOUNG-TAK ZHANG

Much previous work on training multilayer neural networks has attempted to speed up the backpropagation algorithm using more sophisticated weight modification rules, whereby all the given training examples are used in a random or predetermined sequence. In this paper we investigate an alternative approach in which the learning proceeds on an increasing number of selected training examples, starting with a small training set. We derive a measure of criticality of examples and present an incremental learning algorithm that uses this measure to select a critical subset of given examples for solving the particular task. Our experimental results suggest that the method can significantly improve training speed and generalization performance in many real applications of neural networks. This method can be used in conjunction with other variations of gradient descent algorithms.
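A hedged sketch of that selective strategy follows: start from a small random subset and, at each round, add the examples on which the current network errs the most, one plausible stand-in for the criticality measure derived in the paper. A scikit-learn MLP replaces the original backpropagation network, labels are assumed to be integers 0..K-1, and every class is assumed to appear in the initial subset.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def incremental_train(X, y, init=50, add=50, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    selected = rng.choice(len(X), size=init, replace=False)
    for _ in range(rounds):
        net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500,
                            random_state=seed).fit(X[selected], y[selected])
        proba = net.predict_proba(X)
        # "Criticality" stand-in: how badly the current net misjudges each example.
        criticality = 1.0 - proba[np.arange(len(y)), y]
        criticality[selected] = -np.inf               # never re-select used examples
        selected = np.concatenate([selected, np.argsort(criticality)[::-1][:add]])
    return net, selected
```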


2020 ◽  
Vol 10 (7) ◽  
pp. 2411-2421
Author(s):  
Fan Lin ◽  
Elena Z. Lazarus ◽  
Seung Y. Rhee

Linkage mapping has been widely used to identify quantitative trait loci (QTL) in many plants and usually requires a time-consuming and labor-intensive fine mapping process to find the causal gene underlying the QTL. Previously, we described QTG-Finder, a machine-learning algorithm to rationally prioritize candidate causal genes in QTLs. While it showed good performance, QTG-Finder could only be used in Arabidopsis and rice because of the limited number of known causal genes in other species. Here we tested the feasibility of enabling QTG-Finder to work on species that have few or no known causal genes by using orthologs of known causal genes as the training set. The model trained with orthologs could recall about 64% of Arabidopsis and 83% of rice causal genes when the top 20% ranked genes were considered, which is similar to the performance of models trained with known causal genes. The average precision was 0.027 for Arabidopsis and 0.029 for rice. We further extended the algorithm to include polymorphisms in conserved non-coding sequences and gene presence/absence variation as additional features. Using this algorithm, QTG-Finder2, we trained and cross-validated Sorghum bicolor and Setaria viridis models. The S. bicolor model was validated by causal genes curated from the literature and could recall 70% of causal genes when the top 20% ranked genes were considered. In addition, we applied the S. viridis model and public transcriptome data to prioritize a plant height QTL and identified 13 candidate genes. QTG-Finder2 can accelerate the discovery of causal genes in any plant species and facilitate agricultural trait improvement.
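The ranking step can be pictured as below: fit a classifier on gene feature vectors labeled causal versus non-causal (labels that, in the ortholog-training setting above, would come from orthologs of known causal genes), then rank the genes inside a QTL by predicted probability and inspect the top 20%. The Random Forest settings, feature matrix, and gene identifiers are placeholders, not the actual QTG-Finder2 features or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_qtl_genes(X_train, y_train, X_qtl, gene_ids, top_fraction=0.2, seed=0):
    """Return the top-ranked candidate causal genes within a QTL."""
    clf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                                 random_state=seed).fit(X_train, y_train)
    scores = clf.predict_proba(X_qtl)[:, 1]          # probability of the "causal" class
    order = np.argsort(scores)[::-1]
    k = max(1, int(top_fraction * len(gene_ids)))
    return [gene_ids[i] for i in order[:k]]
```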


2019 ◽  
Vol 19 (01) ◽  
pp. 1940009 ◽  
Author(s):  
AHMAD MOHSIN ◽  
OLIVER FAUST

Cardiovascular disease has been the leading cause of death worldwide. Electrocardiogram (ECG)-based heart disease diagnosis is simple, fast, cost effective and non-invasive. However, interpreting ECG waveforms can be taxing for a clinician who has to deal with hundreds of patients during a day. We propose computing machinery to reduce the workload of clinicians and to streamline the clinical work processes. Replacing human labor with machine work can lead to cost savings. Furthermore, it is possible to improve the diagnosis quality by reducing inter- and intra-observer variability. To support that claim, we created a computer program that recognizes normal, Dilated Cardiomyopathy (DCM), Hypertrophic Cardiomyopathy (HCM) or Myocardial Infarction (MI) ECG signals. The computer program combined Discrete Wavelet Transform (DWT) based feature extraction and K-Nearest Neighbor (K-NN) classification for discriminating the signal classes. The system was verified with tenfold cross validation based on labeled data from the PTB diagnostic ECG database. During the validation, we adjusted the number of neighbors [Formula: see text] for the machine learning algorithm. For [Formula: see text], the training set accuracy and cross-validation accuracy were 98.33% and 95%, respectively. However, for [Formula: see text], the training set accuracy remained constant while the cross-validation accuracy dropped drastically to 80%. Hence, the setting [Formula: see text] prevails. Furthermore, a confusion matrix showed that normal data was identified with 96.7% accuracy, 99.6% sensitivity and 99.4% specificity. This means an error of 3.3% will occur: for every 30 normal signals, the classifier will mislabel only 1 of them as HCM. With these results, we are confident that the proposed system can improve the speed and accuracy with which normal and diseased subjects are identified. Diseased subjects can be treated earlier, which improves their probability of survival.
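A sketch of the DWT-plus-K-NN pipeline is shown below: each ECG segment is decomposed with a discrete wavelet transform, the sub-band coefficients are summarized by a few statistics, and a k-nearest-neighbor classifier is evaluated with tenfold cross-validation. The wavelet ('db4'), decomposition level, summary statistics, and default neighbor count are assumptions, as the abstract does not specify them.

```python
import numpy as np
import pywt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def dwt_features(segment, wavelet="db4", level=4):
    """Summarize each DWT sub-band of one ECG segment with simple statistics."""
    coeffs = pywt.wavedec(np.asarray(segment, dtype=float), wavelet, level=level)
    feats = []
    for c in coeffs:                                  # one approximation + `level` detail bands
        feats += [c.mean(), c.std(), np.abs(c).max()]
    return np.array(feats)

def evaluate(segments, labels, k=3):
    X = np.array([dwt_features(s) for s in segments])
    knn = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(knn, X, labels, cv=10)     # tenfold cross-validation, as in the study
```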


1992 ◽  
Vol 03 (01) ◽  
pp. 19-30 ◽  
Author(s):  
AKIRA NAMATAME ◽  
YOSHIAKI TSUKAMOTO

We propose a new learning algorithm, structural learning with complementary coding, for concept learning problems. We introduce a new grouping measure that forms a similarity matrix over the training set and show that this similarity matrix provides a sufficient condition for the linear separability of the set. Using this sufficient condition, one can determine a suitable composition of linearly separable threshold functions that exactly classifies the set of labeled vectors. In the case of nonlinear separability, the internal representation of the connectionist network, that is, the number of hidden units and the value-space of these units, is pre-determined before learning based on the structure of the similarity matrix. A three-layer neural network is then constructed in which each linearly separable threshold function is computed by a linear-threshold unit whose weights are determined by a one-shot learning algorithm requiring a single presentation of the training set. The structural learning algorithm proceeds to capture the connection weights so as to realize the pre-determined internal representation. The pre-structured internal representation, the activation value spaces at the hidden layer, defines intermediate concepts. The target concept is then learned as a combination of those intermediate concepts. The ability to create the pre-structured internal representation based on the grouping measure distinguishes structural learning from earlier methods such as backpropagation.
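Two ingredients named above can be illustrated briefly: complementary coding (append the complement of each binary input) and a similarity matrix over the coded training set. The particular grouping measure shown, the fraction of components on which two examples agree, is an assumed stand-in; how the paper derives the network structure from its matrix is not reproduced here.

```python
import numpy as np

def complementary_code(X):
    """Append the complement of each binary input: x -> [x, 1 - x]."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, 1.0 - X])

def similarity_matrix(X):
    """Fraction of components on which each pair of binary examples agrees."""
    C = complementary_code(X)
    return (C @ C.T) / X.shape[1]

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1]])
print(np.round(similarity_matrix(X), 2))
```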

