A CEREBELLAR MODEL CLASSIFIER FOR DATA MINING WITH LINEAR TIME COMPLEXITY

Author(s):  
David Cornforth

Techniques for automated classification need to be efficient when applied to large datasets. Machine learning techniques such as neural networks have been successfully applied to this class of problem, but training times can grow rapidly as the size of the database increases. Desirable features of classification algorithms for large databases include linear time complexity, training in a single pass over the data, and accountability for class assignment decisions. A new training algorithm for classifiers based on the Cerebellar Model Articulation Controller (CMAC) possesses these features. An empirical investigation of this algorithm found it to be superior to the traditional CMAC training algorithm, both in accuracy and in the time required to learn mappings between input vectors and class labels.
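To give a feel for CMAC-style classification, the sketch below implements a hashed tile-coding classifier trained in a single pass over the data; the class name, hashing scheme, and learning rate are illustrative assumptions rather than the paper's actual algorithm.

```python
# A minimal sketch of a CMAC-style classifier with single-pass training
# (illustrative only; not the paper's exact algorithm).
import numpy as np

class CMACClassifier:
    def __init__(self, n_tilings=8, n_bins=8, n_classes=2, table_size=4096):
        self.n_tilings = n_tilings        # number of overlapping tilings
        self.n_bins = n_bins              # resolution of each tiling
        self.table_size = table_size      # size of the hashed weight table
        self.weights = np.zeros((n_classes, table_size))

    def _active_cells(self, x):
        # x is assumed to be scaled to [0, 1); each tiling is offset slightly
        x = np.asarray(x, dtype=float)
        cells = []
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.n_bins)
            idx = tuple(np.floor((x + offset) * self.n_bins).astype(int).tolist())
            cells.append(hash((t, idx)) % self.table_size)  # hash cell coordinates
        return cells

    def partial_fit(self, x, y, lr=0.1):
        # single-pass update: strengthen the true class on the active cells
        for c in self._active_cells(x):
            self.weights[y, c] += lr

    def predict(self, x):
        cells = self._active_cells(x)
        return int(np.argmax(self.weights[:, cells].sum(axis=1)))

# toy usage: one pass over the data gives O(n) training time
rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = (X[:, 0] > X[:, 1]).astype(int)
clf = CMACClassifier(n_classes=2)
for xi, yi in zip(X, y):
    clf.partial_fit(xi, yi)
print("training-set accuracy:",
      np.mean([clf.predict(xi) == yi for xi, yi in zip(X, y)]))
```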

2021, Vol 14 (3), pp. 1-21
Author(s):  
Roy Abitbol ◽  
Ilan Shimshoni ◽  
Jonathan Ben-Dov

The task of assembling fragments in a puzzle-like manner into a composite picture plays a significant role in the field of archaeology, as it supports researchers in their attempts to reconstruct historic artifacts. In this article, we propose a method for matching and assembling pairs of ancient papyrus fragments containing mostly unknown scriptures. Papyrus paper is manufactured from papyrus plants and therefore displays typical thread patterns resulting from the plant's stems. The proposed algorithm is founded on the hypothesis that these thread patterns contain unique local attributes, such that nearby fragments show similar patterns reflecting the continuations of the threads. We posit that these patterns can be exploited using image processing and machine learning techniques to identify matching fragments. The algorithm and system we present support the quick and automated classification of matching pairs of papyrus fragments, as well as the geometric alignment of the pairs against each other. The algorithm consists of a series of steps and is based on deep-learning and machine learning methods. The first step is to decompose the problem of matching fragments into the smaller problem of finding thread continuation matches in local edge areas (squares) between pairs of fragments. This phase is solved using a convolutional neural network that ingests raw images of the edge areas and produces local matching scores. The result of this stage yields very high recall but low precision. Thus, we use these scores to decide whether entire fragment pairs match by establishing an elaborate voting mechanism. We enhance this voting with geometric alignment techniques from which we extract additional spatial information. Finally, we feed all the data collected from these steps into a Random Forest classifier to produce a higher-order classifier capable of predicting whether a pair of fragments is a match. Our algorithm was trained on a batch of fragments excavated from the Dead Sea caves and dated to circa the first century BCE. The algorithm shows excellent results on a validation set of similar origin and condition. We then ran the algorithm against a real-life set of fragments for which we had no prior knowledge or labeling of matches. This test batch is considered extremely challenging due to its poor condition and the small size of its fragments; numerous researchers have sought matches within this batch with very little success. Our algorithm's performance on this batch was suboptimal, returning a relatively large ratio of false positives. However, the algorithm was still useful, eliminating 98% of the possible matches and thus reducing the amount of work needed for manual inspection. Indeed, experts who reviewed the results identified some of the proposed matches as potentially true and referred them for further investigation.
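The final aggregation stage can be pictured with the sketch below, which turns per-square CNN scores for a fragment pair into pair-level voting features and feeds them to a Random Forest; the feature set, threshold, and synthetic scores are assumptions for illustration, not the authors' exact design.

```python
# A minimal sketch of aggregating local CNN matching scores into pair-level
# features for a Random Forest (illustrative assumptions throughout).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_features(square_scores, vote_threshold=0.9):
    # square_scores: CNN matching scores for the local edge squares of one pair
    s = np.asarray(square_scores)
    return [s.mean(),                     # average local matching score
            s.max(),                      # strongest local match
            (s > vote_threshold).sum(),   # number of "votes" above the threshold
            (s > vote_threshold).mean()]  # fraction of voting squares

# hypothetical training data: per-pair score lists plus match / no-match labels
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
score_lists = [rng.random(20) * (0.5 + 0.5 * lbl) for lbl in labels]

X = np.array([pair_features(s) for s in score_lists])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print(clf.predict(X[:5]), labels[:5])
```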


2020, Vol 10 (7), pp. 2406
Author(s):  
Valentín Moreno ◽  
Gonzalo Génova ◽  
Manuela Alejandres ◽  
Anabel Fraga

Our purpose in this research is to develop a method to automatically and efficiently classify web images as Unified Modeling Language (UML) static diagrams, and to produce a computer tool that implements this function. The tool receives a bitmap file (in different formats) as input and reports whether the image corresponds to such a diagram. For pragmatic reasons, we restricted ourselves to the simplest kinds of diagrams that are most useful for automated software reuse: computer-edited 2D representations of static diagrams. The tool does not require that the images be explicitly or implicitly tagged as UML diagrams. The tool extracts graphical characteristics from each image (such as the grayscale histogram, color histogram, and elementary geometric forms) and uses a combination of rules to classify it. The rules are obtained with machine learning techniques (rule induction) from a sample of 19,000 web images manually classified by experts. In this work, we do not consider the textual contents of the images. Our tool reaches nearly 95% agreement with the manually classified instances, improving on the effectiveness of related research. Moreover, using a training dataset 15 times larger than in related work, the time required to process each image and extract its graphical features (0.680 s) is seven times lower.
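A minimal sketch of the feature-extraction step appears below: grayscale and colour histograms computed from a bitmap, with a decision tree standing in for the rule-induction stage; the feature layout, helper names, and file paths are assumptions for illustration, not the authors' tool.

```python
# A minimal sketch of histogram feature extraction for UML-diagram detection,
# with a decision tree as a stand-in for rule induction (assumptions only).
import numpy as np
from PIL import Image
from sklearn.tree import DecisionTreeClassifier

def image_features(path, bins=16):
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    gray = rgb.mean(axis=2)
    feats = list(np.histogram(gray, bins=bins, range=(0, 1), density=True)[0])
    for c in range(3):  # per-channel colour histograms
        feats += list(np.histogram(rgb[..., c], bins=bins, range=(0, 1),
                                   density=True)[0])
    return feats

# hypothetical usage on a manually classified sample (paths and UML/non-UML labels):
# X = np.array([image_features(p) for p in image_paths])
# rules = DecisionTreeClassifier(max_depth=5).fit(X, labels)  # rule-induction stand-in
# print(rules.predict([image_features("candidate.png")]))
```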


2019, Vol 8 (7), pp. 1050
Author(s):  
Meghana Padmanabhan ◽  
Pengyu Yuan ◽  
Govind Chada ◽  
Hien Van Nguyen

Machine learning is often perceived as a sophisticated technology accessible only to highly trained experts. This prevents many physicians and biologists from using this tool in their research. The goal of this paper is to eliminate this outdated perception. We argue that the recent development of automated machine learning techniques enables biomedical researchers to quickly build competitive machine learning classifiers without requiring in-depth knowledge of the underlying algorithms. We study the case of predicting the risk of cardiovascular diseases. To support our claim, we compare automated machine learning techniques against a graduate student using several important metrics, including the total amount of time required to build machine learning models and the final classification accuracy on unseen test datasets. In particular, the graduate student manually builds multiple machine learning classifiers and tunes their parameters for one month using scikit-learn, a popular machine learning library, to obtain the models that perform best on two given, publicly available datasets. We run an automated machine learning library called auto-sklearn on the same datasets. Our experiments find that automated machine learning takes one hour to produce classifiers that perform better than the ones built by the graduate student in one month. More importantly, building this classifier requires only a few lines of standard code. Our findings are expected to change the way physicians see machine learning and to encourage wide adoption of Artificial Intelligence (AI) techniques in clinical domains.
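The "few lines of standard code" can be pictured with the sketch below, which runs auto-sklearn with a one-hour search budget; the stand-in dataset and the specific time limits are assumptions for illustration rather than the paper's exact setup.

```python
# A minimal sketch of an auto-sklearn workflow with a one-hour budget.
import autosklearn.classification
from sklearn.datasets import load_breast_cancer   # stand-in public dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)  # placeholder for the CVD datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,   # total search budget: one hour
    per_run_time_limit=300,         # cap on any single model's fit time
)
automl.fit(X_train, y_train)
print(accuracy_score(y_test, automl.predict(X_test)))
```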


2020, Vol 19 (01), pp. 2040016
Author(s):  
Fahad Alahmari

Data imbalance with respect to the class labels has been recognised as a challenging problem for machine learning techniques, as it has a direct impact on the classification model's performance. In an imbalanced dataset, most of the instances belong to one class, while far fewer instances are associated with the remaining classes. Most machine learning algorithms tend to favour the majority class and ignore the minority classes, leading to classification models that do not generalise. This paper investigates the problem of class imbalance for a medical application related to autism spectrum disorder (ASD) screening, in order to identify the data resampling method that best stabilises classification performance. To achieve this aim, experimental analyses measuring the performance of different oversampling and under-sampling techniques were conducted on a real imbalanced ASD dataset of adults. The results produced by multiple classifiers on the considered datasets showed superior specificity, sensitivity, and precision, among other metrics, when oversampling techniques were adopted in the pre-processing phase.
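A minimal sketch of this kind of pre-processing comparison, using imbalanced-learn's SMOTE and random under-sampling around a single classifier, is shown below; the synthetic data and the classifier choice are stand-ins for the ASD dataset and the multiple classifiers used in the study.

```python
# A minimal sketch of comparing oversampling vs. under-sampling before training
# (synthetic stand-in data; not the study's ASD dataset or classifier set).
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("SMOTE (oversampling)", SMOTE(random_state=0)),
                      ("Random under-sampling", RandomUnderSampler(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)   # resample training data only
    pred = RandomForestClassifier(random_state=0).fit(X_res, y_res).predict(X_te)
    print(name,
          "sensitivity=%.2f" % recall_score(y_te, pred, pos_label=1),
          "specificity=%.2f" % recall_score(y_te, pred, pos_label=0),
          "precision=%.2f" % precision_score(y_te, pred))
```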


This paper presents a comparative study of two supervised machine learning techniques for classification problems. Owing to their real-time processing ability, neural networks have numerous applications in many fields. The support vector machine (SVM) is also a very popular supervised learning algorithm because of its good generalization power. This paper presents a thorough study of these classification algorithms and compares their accuracy and speed, which should help other researchers develop novel algorithms for their applications. The comparison showed that SVM performs better when dealing with high-dimensional and continuous features. The selection and settings of the kernel function are essential for SVM optimality.
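A minimal sketch of such an accuracy-and-speed comparison, using scikit-learn's SVC and MLPClassifier on a stand-in dataset, is shown below; the dataset and hyperparameter choices are assumptions for illustration, not those of the paper.

```python
# A minimal sketch of comparing an SVM and a neural network on accuracy and
# training time (stand-in dataset and settings).
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

for name, clf in [("SVM (RBF kernel)", SVC(kernel="rbf", C=1.0, gamma="scale")),
                  ("Neural network", MLPClassifier(hidden_layer_sizes=(64,),
                                                   max_iter=500, random_state=0))]:
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    print(f"{name}: accuracy={clf.score(X_te, y_te):.3f}, "
          f"train time={time.perf_counter() - start:.2f}s")
```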


2020
Author(s):  
Alexander Francois Danvers ◽  
David Sbarra ◽  
Matthias R. Mehl

Ambulatory assessment methods provide a rich approach for studying daily behavior. Too often, however, these data are analyzed in terms of averages, neglecting the patterning of this behavior over time. This paper describes Recurrence Quantification Analysis (RQA), a non-linear time series technique for analyzing dynamic systems, as a method for analyzing patterns in categorical, intensive longitudinal ambulatory assessment data. We apply RQA to objectively assessed social behavior (e.g., talking to another person) coded from the Electronically Activated Recorder (EAR). Conceptual interpretations of RQA parameters, and an analysis of EAR data in adults going through a marital separation, are provided. Using machine learning techniques to avoid model overfitting, we find that adding RQA parameters to models that include only the average amount of time spent talking (a static measure) improves prediction of four Big Five personality traits: extraversion, neuroticism, conscientiousness, and openness. Our strongest results suggest that a combination of the average amount of time spent talking and four RQA parameters yields an R² = .09 for neuroticism. Neuroticism is shown to be associated with shorter periods of extended conversation (periods of at least 12 minutes), demonstrating the utility of RQA for identifying new relationships between personality and patterns of daily behavior. Materials: https://osf.io/5nkr9/
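A minimal sketch of categorical RQA on a binary talking/not-talking series is shown below, computing recurrence rate and determinism from the recurrence matrix; this follows the generic textbook formulation rather than the authors' exact pipeline or parameter choices.

```python
# A minimal sketch of categorical RQA measures for a binary behavior series
# (generic formulation; not the authors' exact parameters).
import numpy as np

def rqa_measures(x, l_min=2):
    x = np.asarray(x)
    R = (x[:, None] == x[None, :]).astype(int)   # recurrence: same category
    n = len(x)
    rec_rate = R.sum() / (n * n)

    # collect lengths of diagonal line segments (excluding the main diagonal)
    diag_lengths = []
    for k in range(1, n):
        run = 0
        for v in list(np.diagonal(R, offset=k)) + [0]:  # sentinel flushes last run
            if v:
                run += 1
            else:
                if run:
                    diag_lengths.append(run)
                run = 0
    diag_lengths = np.array(diag_lengths)
    recurrent_pts = diag_lengths.sum()
    det = (diag_lengths[diag_lengths >= l_min].sum() / recurrent_pts
           if recurrent_pts else 0.0)
    return {"recurrence_rate": rec_rate, "determinism": det}

# hypothetical EAR-style coding: 1 = talking during a sampled interval
talking = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0])
print(rqa_measures(talking))
```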

