The Existence of A Priori Distinctions Between Learning Algorithms

1996
Vol 8 (7)
pp. 1391-1420
Author(s):
David H. Wolpert

This is the second of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. The first paper discusses a particular set of ways to compare learning algorithms, according to which there are no distinctions between learning algorithms. This second paper concentrates on ways of comparing learning algorithms different from those used in the first paper, and discusses the associated a priori distinctions that do exist between learning algorithms. It is shown, loosely speaking, that for loss functions other than zero-one (e.g., quadratic loss), there are a priori distinctions between algorithms. However, even for such loss functions, it is shown here that any algorithm is equivalent on average to its “randomized” version, and in this sense still has no first-principles justification in terms of average error. Nonetheless, as this paper discusses, it may be that (for example) cross-validation has better head-to-head minimax properties than “anti-cross-validation” (choosing the learning algorithm with the largest cross-validation error). This may be true even for zero-one loss, a loss function for which the notion of “randomization” would not be relevant. This paper also analyzes averages over hypotheses rather than targets. Such analyses hold for all possible priors over targets. Accordingly they prove, as a particular example, that cross-validation cannot be justified as a Bayesian procedure. In fact, for a very natural restriction of the class of learning algorithms, one should use anti-cross-validation rather than cross-validation (!).

1996
Vol 8 (7)
pp. 1341-1390
Author(s):
David H. Wolpert

This is the first of two papers that use off-training set (OTS) error to investigate the assumption-free relationship between learning algorithms. This first paper discusses the senses in which there are no a priori distinctions between learning algorithms. (The second paper discusses the senses in which there are such distinctions.) It is shown, loosely speaking, that for any two algorithms A and B, there are “as many” targets (or priors over targets) for which A has lower expected OTS error than B as vice versa, for loss functions like zero-one loss. In particular, this is true if A is cross-validation and B is “anti-cross-validation” (choosing the learning algorithm with the largest cross-validation error). This paper ends with a discussion of the implications of these results for computational learning theory. It is shown that one cannot say: if the empirical misclassification rate is low, the Vapnik-Chervonenkis dimension of your generalizer is small, and the training set is large, then with high probability your OTS error is small. Other implications for “membership queries” algorithms and “punting” algorithms are also discussed.
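Both papers turn on the contrast between cross-validation and “anti-cross-validation” as selection rules. A minimal sketch of the two rules, using a synthetic dataset and two arbitrary scikit-learn learners (all illustrative assumptions, not from the papers):

```python
# "Anti-cross-validation" as defined in these papers: pick the learner with
# the LARGEST cross-validation error, rather than the smallest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
learners = {"knn": KNeighborsClassifier(),
            "tree": DecisionTreeClassifier(random_state=0)}

# Zero-one cross-validation error for each candidate learner.
cv_error = {name: 1.0 - cross_val_score(clf, X, y, cv=5).mean()
            for name, clf in learners.items()}

chosen_by_cv = min(cv_error, key=cv_error.get)       # cross-validation
chosen_by_anti_cv = max(cv_error, key=cv_error.get)  # anti-cross-validation
print(cv_error, chosen_by_cv, chosen_by_anti_cv)
```

The papers' point is that, averaged uniformly over targets under zero-one OTS loss, neither rule can be shown a priori to beat the other.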


2019
Vol 19 (01)
pp. 1940009
Author(s):
AHMAD MOHSIN
OLIVER FAUST

Cardiovascular disease has been the leading cause of death worldwide. Electrocardiogram (ECG)-based heart disease diagnosis is simple, fast, cost-effective, and non-invasive. However, interpreting ECG waveforms can be taxing for a clinician who has to deal with hundreds of patients during a day. We propose computing machinery to reduce the workload of clinicians and to streamline clinical work processes. Replacing human labor with machine work can lead to cost savings. Furthermore, it is possible to improve diagnosis quality by reducing inter- and intra-observer variability. To support that claim, we created a computer program that recognizes normal, Dilated Cardiomyopathy (DCM), Hypertrophic Cardiomyopathy (HCM), or Myocardial Infarction (MI) ECG signals. The program combines Discrete Wavelet Transform (DWT) based feature extraction and K-Nearest Neighbor (K-NN) classification to discriminate the signal classes. The system was verified with tenfold cross-validation on labeled data from the PTB diagnostic ECG database. During validation, we adjusted the number of neighbors k for the machine learning algorithm. For one setting of k, training-set accuracy and cross-validation accuracy were 98.33% and 95%, respectively. When k was changed, the training-set accuracy remained constant, but the cross-validation accuracy dropped drastically to 80%. Hence, the former setting of k prevails. Furthermore, a confusion matrix showed that normal data were identified with 96.7% accuracy, 99.6% sensitivity, and 99.4% specificity. This corresponds to an error rate of 3.3%: for every 30 normal signals, the classifier will mislabel only one as HCM. With these results, we are confident that the proposed system can improve the speed and accuracy with which normal and diseased subjects are identified. Diseased subjects can be treated earlier, which improves their probability of survival.
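A minimal sketch of the described pipeline, with DWT features feeding a K-NN classifier scored by tenfold cross-validation. The wavelet ('db4'), decomposition level, n_neighbors value, and the random stand-in signals are illustrative assumptions, not the paper's settings:

```python
import numpy as np
import pywt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def dwt_features(beat, wavelet="db4", level=4):
    # Summarize each DWT sub-band by simple statistics (mean, std).
    coeffs = pywt.wavedec(beat, wavelet, level=level)
    return np.array([f(c) for c in coeffs for f in (np.mean, np.std)])

rng = np.random.default_rng(0)
signals = rng.normal(size=(120, 256))   # stand-in for segmented ECG beats
labels = rng.integers(0, 4, size=120)   # normal / DCM / HCM / MI

X = np.vstack([dwt_features(s) for s in signals])
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, labels, cv=10)
print(f"tenfold CV accuracy: {scores.mean():.3f}")
```

Comparing `scores.mean()` against the resubstitution score for several values of k reproduces the kind of training-vs-cross-validation gap the abstract describes.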


2020
Vol 10 (14)
pp. 4870
Author(s):
Luca Coviello
Marco Cristoforetti
Giuseppe Jurman
Cesare Furlanello

We introduce the Grape Berries Counting Net (GBCNet), a tool for accurate fruit yield estimation from smartphone cameras, obtained by adapting deep learning algorithms originally developed for crowd counting. We test GBCNet with a cross-validation procedure on two original datasets, CR1 and CR2, of grape pictures taken in-field before veraison. A total of 35,668 berries were manually annotated for the task. GBCNet performs well both on the seven-variety dataset CR1, although with an accuracy level that depends on the variety, and on the single-variety dataset CR2: the Mean Average Error (MAE) ranges from 0.85% for Pinot Gris to 11.73% for Marzemino on CR1 and reaches 7.24% on the Teroldego CR2 dataset.
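One plausible reading of the percentage MAE figures is the mean absolute counting error relative to the mean true berry count; a small sketch under that assumption, with hypothetical counts:

```python
# Relative MAE for a counting model: mean absolute error over images,
# expressed as a percentage of the mean true count. The counts below are
# hypothetical, not the CR1/CR2 data.
import numpy as np

def mae_percent(y_true, y_pred):
    return 100.0 * np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

true_counts = np.array([210, 180, 240])   # berries per image (hypothetical)
pred_counts = np.array([205, 190, 230])
print(f"MAE: {mae_percent(true_counts, pred_counts):.2f}%")
```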


2018
Vol 10 (11)
pp. 1675
Author(s):
Devin Routh
Lindsi Seegmiller
Charlie Bettigole
Catherine Kuhn
Chadwick D. Oliver
...

Mixture tuned matched filtering (MTMF) image classification capitalizes on the increasing spectral and spatial resolutions of available hyperspectral image data to identify the presence, and potentially the abundance, of a given cover type or endmember. Previous studies using MTMF have relied on extensive user input to obtain a reliable classification. In this study, we extend the traditional MTMF classification by using a selection of supervised learning algorithms with rigorous cross-validation. Our approach removes the need for subjective user input to finalize the classification, ultimately enhancing the replicability and reliability of the results. We illustrate this approach with an MTMF classification case study focused on leafy spurge (Euphorbia esula), an invasive forb in Western North America, using free 30-m hyperspectral data from the National Aeronautics and Space Administration’s (NASA) Hyperion sensor. For our data, the protocol reveals a potential overall accuracy inflation of between 18.4% and 30.8% when cross-validation is omitted, depending on the supervised learning algorithm used. We propose this new protocol as a final step for the MTMF classification algorithm and suggest that future researchers report a broader suite of accuracy statistics to affirm the underlying efficacy of their classifications.
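The accuracy-inflation comparison is the gap between resubstitution accuracy (scoring on the training data) and cross-validated accuracy. A minimal sketch of that comparison; the synthetic features and the choice of random forest are illustrative assumptions, not the study's pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for per-pixel MTMF outputs (e.g., MF score and infeasibility).
X, y = make_classification(n_samples=300, n_features=20, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

resub = clf.score(X, y)                          # no cross-validation
cv = cross_val_score(clf, X, y, cv=10).mean()    # rigorous cross-validation
print(f"apparent inflation: {100 * (resub - cv):.1f} percentage points")
```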


2020
pp. 1-11
Author(s):
Jie Liu
Lin Lin
Xiufang Liang

The online English teaching system places certain demands on intelligent scoring, and the most difficult stage of intelligent scoring in English testing is scoring compositions with an intelligent model. To make English composition scoring more intelligent, this study builds on machine learning algorithms, combines them with intelligent image recognition technology, and proposes an improved MSER-based character candidate region extraction algorithm together with a convolutional neural network-based pseudo-character region filtering algorithm. In addition, to verify that the proposed model meets the requirements of the task, that is, to verify the feasibility of the algorithm, the performance of the model is analyzed through designed experiments. Moreover, the basic conditions for composition scoring are fed into the model as constraints. The results show that the proposed algorithm has practical value and can be applied to English assessment systems and online homework evaluation systems.
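A minimal sketch of the two-stage idea: MSER proposes character candidate regions, and a trained classifier then filters out pseudo-characters. The OpenCV calls are standard; the CNN itself is only stubbed with a hypothetical heuristic, since the abstract does not specify the architecture:

```python
import cv2
import numpy as np

def candidate_regions(gray):
    # Stage 1: MSER candidate extraction (the paper improves on plain MSER).
    mser = cv2.MSER_create()
    _, bboxes = mser.detectRegions(gray)
    return bboxes  # one (x, y, w, h) box per candidate region

def is_character(patch) -> bool:
    # Stage 2 stub: a trained CNN would score the patch; this contrast
    # heuristic is a hypothetical placeholder for that network.
    return patch.std() > 10

# Render some text onto a blank grayscale image as a toy input.
gray = np.zeros((64, 256), dtype=np.uint8)
cv2.putText(gray, "score: 85", (5, 40), cv2.FONT_HERSHEY_SIMPLEX, 1, 255, 2)

kept = [b for b in candidate_regions(gray)
        if is_character(gray[b[1]:b[1] + b[3], b[0]:b[0] + b[2]])]
print(len(kept), "candidate regions kept")
```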


2021
Vol 12 (2)
Author(s):
Mohammad Haekal
Henki Bayu Seta
Mayanda Mega Santoni

To predict the water quality of the Ciliwung River, data from online monitoring were processed using a data mining method. In this method, the monitoring data were first tabulated in Microsoft Excel and then processed into a decision tree using the Decision Tree algorithm in the WEKA application. The Decision Tree method was chosen because it is simple, easy to understand, and highly accurate. A total of 5,476 monitoring records of Ciliwung River water quality were processed. Classification with the decision tree found that 1,059 of these 5,476 records (19.3242%) indicate the Ciliwung River is Not Polluted, while 4,417 records (80.6758%) indicate it is Polluted. The monitoring data were then evaluated using four test options: Use Training Set, Supplied Test Set, Cross-Validation with 10 folds, and Percentage Split 66%. All four test options showed very high accuracy, above 99%. From these results it can be predicted that the Ciliwung River is indicated as polluted with reference to Government Regulation of the Republic of Indonesia No. 82 of 2001, and that using the WEKA application with the Decision Tree algorithm to process the monitoring data with three parameters (pH, DO, and nitrate) is very accurate and precise. Keywords: river water quality, data mining, Decision Tree algorithm, WEKA application.
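The four WEKA test options can be mirrored with scikit-learn's DecisionTreeClassifier, as in the sketch below; the random feature values standing in for the pH/DO/nitrate columns and the held-out slice standing in for a supplied test set are illustrative assumptions, not the actual monitoring data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5476, 3))        # columns: pH, DO, nitrate (stand-ins)
y = rng.integers(0, 2, size=5476)     # 0 = not polluted, 1 = polluted
clf = DecisionTreeClassifier(random_state=0)

# 1) Use training set (resubstitution accuracy)
print(clf.fit(X, y).score(X, y))
# 2) Supplied test set (here simulated with a held-out 20% slice)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print(clf.fit(X_tr, y_tr).score(X_te, y_te))
# 3) Cross-validation with 10 folds
print(cross_val_score(clf, X, y, cv=10).mean())
# 4) Percentage split: 66% train, 34% test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=0)
print(clf.fit(X_tr, y_tr).score(X_te, y_te))
```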


2021
Vol 13 (9)
pp. 1779
Author(s):
Xiaoyan Yin
Zhiqun Hu
Jiafeng Zheng
Boyong Li
Yuanyuan Zuo

Radar beam blockage is an important error source that affects the quality of weather radar data. An echo-filling network (EFnet) based on a deep learning algorithm is proposed to correct the echo intensity in the occluded area of the Nanjing S-band new-generation weather radar (CINRAD/SA). The training dataset is constructed from labels, namely the echo intensities at the 0.5° elevation in the unblocked area, and from input features, namely the intensities in a cube spanning multiple elevations and gates corresponding to the locations of the labels. Two loss functions are used to train the network: one is the common mean square error (MSE), and the other is a self-defined loss function that increases the weight of strong echoes. Considering that the radar beam broadens with distance and height, the 0.5° elevation scan is divided into six range bands of 25 km each, and a separate model is trained for each band. The models are evaluated with three indicators: explained variance (EVar), mean absolute error (MAE), and correlation coefficient (CC). Two cases are presented to compare the echo-filling models trained with the different loss functions. The results suggest that EFnet can effectively correct echo reflectivity and improve data quality in the occluded area, with better results for strong echoes when the self-defined loss function is used.
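A minimal sketch of a self-defined loss of the kind the abstract describes: an MSE variant that up-weights strong echoes. The linear weighting above a 35 dBZ threshold is an illustrative assumption; the paper's exact formulation is not given in the abstract:

```python
import numpy as np

def weighted_mse(y_true, y_pred, threshold=35.0, alpha=0.1):
    # Weight grows with echo intensity above the threshold, so errors on
    # strong echoes contribute more to the loss than errors on weak ones.
    w = 1.0 + alpha * np.maximum(y_true - threshold, 0.0)
    return np.mean(w * (y_true - y_pred) ** 2)

y_true = np.array([10.0, 30.0, 50.0])   # reflectivity in dBZ (hypothetical)
y_pred = np.array([12.0, 28.0, 44.0])
print(weighted_mse(y_true, y_pred))     # the strong-echo error dominates
```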


Genes
2021
Vol 12 (4)
pp. 527
Author(s):
Eran Elhaik
Dan Graur

In the last 15 years or so, soft selective sweep mechanisms have been catapulted from a curiosity of little evolutionary importance to a ubiquitous mechanism claimed to explain most adaptive evolution and, in some cases, most evolution. This transformation was aided by a series of articles by Daniel Schrider and Andrew Kern. Within this series, a paper entitled “Soft sweeps are the dominant mode of adaptation in the human genome” (Schrider and Kern, Mol. Biol. Evolut. 2017, 34(8), 1863–1877) attracted a great deal of attention, in particular in conjunction with another paper (Kern and Hahn, Mol. Biol. Evolut. 2018, 35(6), 1366–1371), for purporting to discredit the Neutral Theory of Molecular Evolution (Kimura 1968). Here, we address an alleged novelty in Schrider and Kern’s paper, i.e., the claim that their study involved an artificial intelligence technique called supervised machine learning (SML). SML is predicated upon the existence of a training dataset in which the correspondence between the input and output is known empirically to be true. Curiously, Schrider and Kern did not possess a training dataset of genomic segments known a priori to have evolved either neutrally or through soft or hard selective sweeps. Thus, their claim of using SML is thoroughly and utterly misleading. In the absence of legitimate training datasets, Schrider and Kern used: (1) simulations that employ many manipulatable variables and (2) a system of data cherry-picking rivaling the worst excesses in the literature. These two factors, in addition to the lack of negative controls and the irreproducibility of their results due to incomplete methodological detail, lead us to conclude that all evolutionary inferences derived from so-called SML algorithms (e.g., S/HIC) should be taken with a huge shovel of salt.


2021
Vol 8 (1)
pp. 205395172110135
Author(s):
Florian Jaton

This theoretical paper considers the morality of machine learning algorithms and systems in the light of the biases that ground their correctness. It begins by presenting biases not as a priori negative entities but as contingent external referents—often gathered in benchmarked repositories called ground-truth datasets—that define what needs to be learned and allow for performance measures. I then argue that ground-truth datasets and their concomitant practices—that fundamentally involve establishing biases to enable learning procedures—can be described by their respective morality, here defined as the more or less accounted experience of hesitation when faced with what pragmatist philosopher William James called “genuine options”—that is, choices to be made in the heat of the moment that engage different possible futures. I then stress three constitutive dimensions of this pragmatist morality, as far as ground-truthing practices are concerned: (I) the definition of the problem to be solved (problematization), (II) the identification of the data to be collected and set up (databasing), and (III) the qualification of the targets to be learned (labeling). I finally suggest that this three-dimensional conceptual space can be used to map machine learning algorithmic projects in terms of the morality of their respective and constitutive ground-truthing practices. Such techno-moral graphs may, in turn, serve as equipment for greater governance of machine learning algorithms and systems.


2021
Author(s):
Yingxian Liu
Cunliang Chen
Hanqing Zhao
Yu Wang
Xiaodong Han

Fluid properties are key factors in predicting single-well productivity, interpreting well tests, and forecasting oilfield recovery, and they directly affect the success of ODP program design. The most accurate and direct acquisition method is downhole sampling. However, not every well has samples, for technical reasons such as excessive well deviation or the high cost involved during the exploration stage. Analogies or empirical formulas therefore have to be adopted in many cases, but extensive oilfield development experience has shown that the errors introduced by these methods are very large. Obtaining fluid physical properties quickly and accurately is thus of great significance. In recent years, with the development and improvement of artificial intelligence and machine learning algorithms, their applications in oilfields have become more and more extensive. This paper proposes a method for predicting crude oil physical properties based on machine learning algorithms, using PVT data from nearly 100 wells in the Bohai Oilfield. 75% of the data is used for training and learning to obtain the prediction model, and the remaining 25% is used for testing. Practice shows that the predictions of the machine learning algorithm are very close to the actual data, with very small errors. Finally, the method was applied to the preliminary plan design of the new BZ29 oilfield, including fluid property prediction for the unsampled sand bodies. The influence of the analogy method on the plan is also compared, providing potential and risk analysis for the design. The method will be applied to more oilfields in the Bohai Sea in the future and has significant promotion value.
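A minimal sketch of the stated workflow, training on 75% of the records and testing on the remaining 25%. The synthetic features, the stand-in target property, and the choice of random forest regressor are illustrative assumptions; the abstract does not name the algorithm used:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # stand-ins for PVT inputs (P, T, GOR, ...)
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)  # target property

# 75% train / 25% test, as described in the abstract.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print(mean_absolute_error(y_te, model.predict(X_te)))
```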

