Fast Computation on Massive Data Sets

Author(s):  
Željko Ivezić ◽  
Andrew J. Connolly ◽  
Jacob T. VanderPlas ◽  
Alexander Gray ◽  
...  

This chapter describes basic concepts and tools for tractably performing the computations described in the rest of this book. The need for fast algorithms for such analysis subroutines is becoming increasingly important as modern data sets approach billions of objects. With such data sets, even analysis operations whose computational cost is linearly proportional to the size of the data set present challenges, particularly since statistical analyses are inherently interactive processes, requiring that computations complete within some reasonable human attention span. For more sophisticated machine learning algorithms, the often worse-than-linear runtimes of straightforward implementations quickly become unbearable. The chapter looks at some techniques that can reduce such runtimes in a rigorous manner that does not sacrifice the accuracy of the analysis through unprincipled approximations. This is far more important than simply speeding up calculations: in practice, computational performance and statistical performance can be intimately linked. The ability of a researcher, within his or her effective time budget, to try more powerful models or to search parameter settings for each model in question leads directly to better fits and predictions.
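One representative example of the kind of principled speedup meant here is replacing a brute-force neighbor search with a tree-based index. The sketch below is illustrative only (not code from the chapter); it uses scikit-learn on synthetic data, and the tree returns the same neighbors as the brute-force search without any approximation.

```python
# Illustrative sketch: a space-partitioning index (KD-tree) turns an all-pairs
# neighbor search into a much cheaper query, the kind of rigorous speedup
# discussed in the chapter. Data here are synthetic.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 3))      # synthetic "catalog" of 10^5 points in 3D

# Brute force: distance from every query point to every data point.
brute = NearestNeighbors(n_neighbors=5, algorithm="brute").fit(X)

# KD-tree: prunes most distance computations while giving the exact answer.
tree = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)

query = rng.normal(size=(1_000, 3))
dist_b, _ = brute.kneighbors(query)
dist_t, _ = tree.kneighbors(query)
assert np.allclose(dist_b, dist_t)     # same neighbors, no loss of accuracy
```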

Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Abstract Dealing with the sheer size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slowly. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of the data set, accuracy, computational load (or data redundancy), and straggler tolerance in this framework.
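A toy simulation of the general idea, written for this summary rather than taken from the paper: the least-squares data are premultiplied by random Gaussian encoding blocks so that the gradients returned by whichever workers respond approximate the full gradient, and stragglers are simply dropped each iteration. All sizes, the encoding choice, and the straggler model are assumptions of this sketch.

```python
# Toy encoded least squares: each worker holds one encoded block (S_i A, S_i b);
# per iteration the master aggregates gradients only from non-straggling workers.
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 50
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

n_workers, rows_per_worker = 20, 200     # 20*200 = 4000 encoded rows for 2000 original: 2x redundancy
S = [rng.normal(size=(rows_per_worker, n)) / np.sqrt(n_workers * rows_per_worker)
     for _ in range(n_workers)]
enc = [(Si @ A, Si @ b) for Si in S]     # encoded data held by each worker

L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the full gradient
x = np.zeros(d)
for it in range(300):
    alive = rng.random(n_workers) > 0.3  # ~30% of workers straggle this round and are ignored
    grads = [Ai.T @ (Ai @ x - bi) for (Ai, bi), ok in zip(enc, alive) if ok]
    grad = sum(grads) * (n_workers / max(len(grads), 1))   # rescale for the missing workers
    x -= (0.5 / L) * grad

print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```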


Author(s):  
Aska E. Mehyadin ◽  
Adnan Mohsin Abdulazeez ◽  
Dathar Abas Hasan ◽  
Jwan N. Saeed

The bird classifier is a system equipped with machine learning technology that uses a machine learning method to store and classify bird calls. A bird species can be identified from a recording of its call alone, which makes the data easier for the system to manage. The system also provides species classification resources that allow automated species detection from observations and can teach a machine to recognize and classify species. Undesirable noises are filtered out and the recordings are sorted into data sets, where each sound is run through a noise-suppression filter and a separate classification procedure so that the most useful data set can be processed easily. Mel-frequency cepstral coefficients (MFCCs) are used as features and tested with different algorithms, namely Naïve Bayes, J4.8, and a multilayer perceptron (MLP), to classify bird species. J4.8 achieves the highest accuracy (78.40%), with an elapsed time of 39.4 seconds, and performs best overall.
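A hedged sketch of such a pipeline, not the authors' implementation: MFCC features are extracted per clip and the three classifier families are compared. The audio here is synthetic (two noisy tones standing in for two species), real recordings would instead be loaded with librosa.load, and scikit-learn's DecisionTreeClassifier stands in for WEKA's J4.8 (C4.5).

```python
# MFCC features per recording, then three classifiers compared by cross-validation.
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

sr = 22050
rng = np.random.default_rng(0)

def fake_call(freq):
    """One second of a noisy tone, a stand-in for a filtered bird call."""
    t = np.linspace(0, 1, sr, endpoint=False)
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=sr)

def mfcc_features(clip, n_mfcc=13):
    """Summarize a clip as its mean MFCC vector over time."""
    m = librosa.feature.mfcc(y=clip.astype(np.float32), sr=sr, n_mfcc=n_mfcc)
    return m.mean(axis=1)

X = np.array([mfcc_features(fake_call(f)) for f in [2000] * 20 + [4000] * 20])
y = np.array(["species_a"] * 20 + ["species_b"] * 20)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("Decision tree (J4.8 analogue)", DecisionTreeClassifier()),
                  ("MLP", MLPClassifier(max_iter=2000))]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```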


Author(s):  
Li Yang ◽  
Qi Wang ◽  
Yu Rao

Abstract Film cooling is an important and widely used technology for protecting the hot sections of gas turbines. The last decades have witnessed a fast growth of research and publications in the field of film cooling. However, except for the correlations for single-row film cooling and the Sellers correlation for cooling superposition, generalized models for film cooling under superposition conditions have been rare. Meanwhile, the numerous data obtained for complex hole distributions were not merged or integrated across different sources, and recent new data had no avenue for contributing to a compatible model. The technical barriers that have obstructed the generalization of film cooling models are: a) the lack of a generalizable model; b) the large number of input variables needed to describe film cooling. The present study aimed at establishing a generalizable model to describe multiple-row film cooling over a large parameter space, including hole locations, hole size, hole angles, blowing ratios, etc. The method allowed data measured within different streamwise lengths and over different surface areas to be integrated into a single model, in the form of 1-D sequences. A long short-term memory (LSTM) model was designed to model the local behavior of film cooling. Careful training, testing, and validation were conducted to regress the model. The presented results showed that the method was accurate within the CFD data set generated in this study. The presented method could serve as a base model that allows past and future film cooling research to contribute to a common database. Meanwhile, the model could also be transferred from simulation data sets to experimental data sets using advanced machine learning algorithms in the future.
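A minimal sketch of the kind of sequence model described, with assumed input and output shapes rather than the paper's actual configuration: an LSTM reads a streamwise 1-D sequence of local hole descriptors (hole size, angles, blowing ratio, and so on at each station) and predicts the local film-cooling effectiveness at each station.

```python
# Sketch of an LSTM regressor over 1-D streamwise sequences (shapes are assumptions).
import torch
import torch.nn as nn

class FilmCoolingLSTM(nn.Module):
    def __init__(self, n_features=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)          # local effectiveness per station

    def forward(self, x):                         # x: (batch, stations, n_features)
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)           # (batch, stations)

# Synthetic stand-in data: 32 surfaces, 100 streamwise stations, 6 descriptors each.
x = torch.randn(32, 100, 6)
target = torch.rand(32, 100)

model = FilmCoolingLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.MSELoss()(model(x), target)             # one training step as an illustration
loss.backward()
opt.step()
```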


Author(s):  
André M. Carrington ◽  
Paul W. Fieguth ◽  
Hammad Qazi ◽  
Andreas Holzinger ◽  
Helen H. Chen ◽  
...  

Abstract Background In classification and diagnostic testing, the receiver operating characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. Only part of the ROC curve and AUC are informative, however, when they are used with imbalanced data. Hence, alternatives to the AUC have been proposed, such as the partial AUC and the area under the precision-recall curve. However, these alternatives cannot be as fully interpreted as the AUC, in part because they ignore some information about actual negatives. Methods We derive and propose a new concordant partial AUC and a new partial c statistic for ROC data, as foundational measures and methods to help understand and explain parts of the ROC plot and AUC. Our partial measures are continuous and discrete versions of the same measure, are derived from the AUC and c statistic respectively, are validated as equal to each other, and are validated as equal in summation to whole measures where expected. Our partial measures are tested for validity on a classic ROC example from Fawcett, a variation thereof, and two real-life benchmark data sets in breast cancer: the Wisconsin and Ljubljana data sets. Interpretation of an example is then provided. Results The results show the expected equalities between our new partial measures and the existing whole measures. The example interpretation illustrates the need for our newly derived partial measures. Conclusions The concordant partial area under the ROC curve was proposed and, unlike previous partial measure alternatives, it maintains the characteristics of the AUC. The first partial c statistic for ROC plots was also proposed as an unbiased interpretation for part of an ROC curve. The expected equalities among and between our newly derived partial measures and their existing full measure counterparts are confirmed. These measures may be used with any data set, but this paper focuses on imbalanced data with low prevalence. Future work Future work with our proposed measures may: demonstrate their value for imbalanced data with high prevalence; compare them to other measures not based on areas; and combine them with other ROC measures and techniques.
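For orientation, the sketch below illustrates the whole-curve identity that the paper's partial measures are designed to preserve: the AUC equals the c statistic, the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counted as one half). The concordant partial AUC and partial c statistic themselves are defined in the paper and are not reimplemented here; the data below are synthetic.

```python
# AUC computed from the ROC curve equals the c statistic computed by concordance counting.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                     # binary labels
scores = y * 0.8 + rng.normal(scale=1.0, size=500)   # noisy classifier scores

def c_statistic(labels, s):
    pos, neg = s[labels == 1], s[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

print(roc_auc_score(y, scores), c_statistic(y, scores))   # the two values agree
```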


Geophysics ◽  
2020 ◽  
Vol 85 (4) ◽  
pp. A25-A29
Author(s):  
Lele Zhang

Migration of seismic reflection data leads to artifacts due to the presence of internal multiple reflections. Recent developments have shown that these artifacts can be avoided using Marchenko redatuming or Marchenko multiple elimination. These are powerful concepts, but their implementation comes at a considerable computational cost. We have derived a scheme to image the subsurface of the medium with significantly reduced computational cost and fewer artifacts. This scheme is based on the projected Marchenko equations. The measured reflection response is required as input, and a data set containing primary reflections and nonphysical primary reflections is created. The original and retrieved data sets are migrated, and the migration images are multiplied with each other, after which the square root is taken to give the artifact-reduced image. We present the underlying theory and demonstrate the effectiveness of this scheme with a 2D numerical example.
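A toy illustration of the final combination step only, not the Marchenko scheme itself: two migration images of the same grid are multiplied sample by sample and the square root of the product is taken, which attenuates artifacts that appear in only one of the two images. The sign handling and the random stand-in images are assumptions of this sketch.

```python
# Combine two migration images by taking the square root of their sample-wise product.
import numpy as np

rng = np.random.default_rng(0)
img_measured = rng.normal(size=(200, 300))    # stand-in for the migrated measured data
img_retrieved = rng.normal(size=(200, 300))   # stand-in for the migrated retrieved data

product = img_measured * img_retrieved
combined = np.sign(product) * np.sqrt(np.abs(product))   # artifact-reduced image (sketch)
```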


2016 ◽  
Author(s):  
Julian Zubek ◽  
Dariusz M Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and the Hellinger distance. Contrary to some popular measures, it is not focused on the shape of the decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of a graphical plot, which we call a complexity curve. We use it to propose a new variant of the learning curve plot called the generalisation curve. The generalisation curve is a standard learning curve with the x-axis rescaled according to the data set's complexity curve. It is a classifier performance measure that shows how well the information present in the data is utilised. We perform a theoretical and experimental examination of the properties of the introduced complexity measure and show its relation to the variance component of classification error. We compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining the performance of specific classifiers on these sets. We then apply our methodology to a panel of benchmarks of standard machine learning algorithms on typical data sets, demonstrating how it can be used in practice to gain insights into data characteristics and classifier behaviour. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), making it possible to significantly speed up the learning process without reducing classification accuracy. Associated code is available for download at https://github.com/zubekj/complexity_curve (open-source Python implementation).
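A minimal sketch of the building block named above, not the repository's API (the authors' implementation is at the linked GitHub URL): the Hellinger distance between two discrete probability distributions, which the complexity curve uses to compare an attribute distribution estimated from a data subset against the distribution on the full data set.

```python
# Hellinger distance between two discrete distributions.
import numpy as np

def hellinger(p, q):
    """Hellinger distance between discrete distributions p and q (normalized internally)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Example: attribute distribution on the full data vs. on a small subsample.
full = np.array([0.25, 0.25, 0.25, 0.25])
subsample = np.array([0.40, 0.30, 0.20, 0.10])
print(hellinger(full, subsample))
```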


Sensors ◽  
2020 ◽  
Vol 20 (3) ◽  
pp. 825 ◽  
Author(s):  
Fadi Al Machot ◽  
Mohammed R. Elkobaisi ◽  
Kyandoghere Kyamakya

Due to significant advances in sensor technology, studies of activity recognition have gained interest and maturity in the last few years. Existing machine learning algorithms have demonstrated promising results by classifying activities whose instances have already been seen during training. Activity recognition methods intended for real-life settings should cover a growing number of activities in various domains, whereby a significant part of the instances will not be present in the training data set. However, covering all possible activities in advance is a complex and expensive task. Concretely, we need a method that can extend the learning model to detect unseen activities without prior knowledge of sensor readings for those previously unseen activities. In this paper, we introduce an approach that leverages sensor data to discover new unseen activities which were not present in the training set. We show that sensor readings can lead to promising results for zero-shot learning, whereby the necessary knowledge can be transferred from seen to unseen activities by using semantic similarity. The evaluation conducted on two data sets extracted from the well-known CASAS data sets shows that the proposed zero-shot learning approach achieves high performance in recognizing unseen (i.e., not present in the training data set) new activities.
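A hedged sketch of the general zero-shot recipe described above, not the paper's exact method: a regressor maps sensor features to a semantic embedding space using seen activities, and an unseen activity is then labelled by cosine similarity between the predicted embedding and the embeddings of the unseen activity names. All features and embeddings below are synthetic placeholders.

```python
# Zero-shot activity labelling via a learned mapping into a semantic space.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Placeholder semantic embeddings (e.g., word vectors) for activity labels.
seen_emb = {"cooking": rng.normal(size=10), "sleeping": rng.normal(size=10)}
unseen_emb = {"ironing": rng.normal(size=10), "reading": rng.normal(size=10)}

# Synthetic sensor feature vectors for instances of the seen activities.
X_train = rng.normal(size=(100, 20))
labels = rng.choice(list(seen_emb), size=100)
Y_train = np.stack([seen_emb[l] for l in labels])

reg = Ridge().fit(X_train, Y_train)          # map sensor features -> semantic space

def predict_unseen(x):
    """Assign the unseen activity whose embedding is most similar to the prediction."""
    pred = reg.predict(x.reshape(1, -1))
    names = list(unseen_emb)
    sims = cosine_similarity(pred, np.stack([unseen_emb[n] for n in names]))[0]
    return names[int(sims.argmax())]

print(predict_unseen(rng.normal(size=20)))
```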


Author(s):  
Mahziyar Darvishi ◽  
Omid Ziaee ◽  
Arash Rahmati ◽  
Mohammad Silani

Numerous geometries are available for cellular structures, and selecting a suitable structure that reflects the intended characteristics is cumbersome. Because testing many specimens to determine the mechanical properties of these materials can be time-consuming and expensive, finite element analysis (FEA) is considered an efficient alternative. In this study, we present a method to find a suitable geometry for the intended mechanical characteristics by applying machine learning (ML) algorithms to FEA results for cellular structures. Different cellular structures of a given material are analyzed by FEA, and the results are validated against their corresponding analytical equations. The validated results are used to create a data set for the ML algorithms. Finally, by comparing the results with the correct answers, the most accurate algorithm is identified for the intended application. In our case study, the cellular structures are three structures widely used as bone implants: Cube, Kelvin, and Rhombic dodecahedron, all made of Ti–6Al–4V. The ML algorithms are simple Bayesian classification, K-nearest neighbors, XGBoost, random forest, and an artificial neural network. By comparing the results of these algorithms, the best-performing algorithm is identified.
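An illustrative sketch of the final comparison step, with synthetic features standing in for the FEA-derived data set and three generic class labels standing in for the Cube, Kelvin, and Rhombic dodecahedron structures. XGBoost is omitted to keep the sketch to scikit-learn, but its classifier plugs into the same loop.

```python
# Compare several classifiers on a synthetic stand-in for the FEA-derived data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: mechanical-response features -> one of three structure classes.
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_classes=3, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "K-nearest neighbors": KNeighborsClassifier(),
    "Random forest": RandomForestClassifier(random_state=0),
    "Neural network": MLPClassifier(max_iter=2000, random_state=0),
}
for name, clf in models.items():
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```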

