Improving prediction models applied in systems monitoring natural hazards and machinery

Improving prediction models applied in systems monitoring natural hazards and machineryA method of combining three analytic techniques including regression rule induction, thek-nearest neighbors method and time series forecasting by means of the ARIMA methodology is presented. A decrease in the forecasting error while solving problems that concern natural hazards and machinery monitoring in coal mines was the main objective of the combined application of these techniques. The M5 algorithm was applied as a basic method of developing prediction models. In spite of an intensive development of regression rule induction algorithms and fuzzy-neural systems, the M5 algorithm is still characterized by the generalization ability and unbeatable time of data model creation competitive with other systems. In the paper, two solutions designed to decrease the mean square error of the obtained rules are presented. One consists in introducing into a set of conditional variables the so-called meta-variable (an analogy to constructive induction) whose values are determined by an autoregressive or the ARIMA model. The other shows that limitation of a data set on which the M5 algorithm operates by thek-nearest neighbor method can also lead to error decreasing. Moreover, three application examples of the presented solutions for data collected by systems of natural hazards and machinery monitoring in coal mines are described. In Appendix, results of several benchmark data sets analyses are given as a supplement of the presented results.

Download Full-text

A simple clustering technique to extract subsets of data for function approximation

Journal of Hydroinformatics ◽

10.2166/hydro.2015.065 ◽

2015 ◽

Vol 17 (5) ◽

pp. 719-732

Author(s):

Dulakshi Santhusitha Kumari Karunasingha ◽

Shie-Yui Liong

Keyword(s):

Function Approximation ◽

Prediction Models ◽

Data Extraction ◽

Single Parameter ◽

Subtractive Clustering ◽

Data Sets ◽

Clustering Methods ◽

Clustering Method ◽

Data Set ◽

Functional Relationships

A simple clustering method is proposed for extracting representative subsets from lengthy data sets. The main purpose of the extracted subset of data is to use it to build prediction models (of the form of approximating functional relationships) instead of using the entire large data set. Such smaller subsets of data are often required in exploratory analysis stages of studies that involve resource consuming investigations. A few recent studies have used a subtractive clustering method (SCM) for such data extraction, in the absence of clustering methods for function approximation. SCM, however, requires several parameters to be specified. This study proposes a clustering method, which requires only a single parameter to be specified, yet it is shown to be as effective as the SCM. A method to find suitable values for the parameter is also proposed. Due to having only a single parameter, using the proposed clustering method is shown to be orders of magnitudes more efficient than using SCM. The effectiveness of the proposed method is demonstrated on phase space prediction of three univariate time series and prediction of two multivariate data sets. Some drawbacks of SCM when applied for data extraction are identified, and the proposed method is shown to be a solution for them.

Download Full-text

Determination of Reactivity Ratios from Binary Copolymerization Using the k-Nearest Neighbor Non-Parametric Regression

Polymers ◽

10.3390/polym13213811 ◽

2021 ◽

Vol 13 (21) ◽

pp. 3811

Author(s):

Iosif Sorin Fazakas-Anca ◽

Arina Modrea ◽

Sorin Vlase

Keyword(s):

Experimental Data ◽

Nearest Neighbor ◽

Optimization Method ◽

Reactivity Ratios ◽

Data Sets ◽

K Nearest Neighbor ◽

Integration Algorithm ◽

Data Set ◽

Parametric Regression ◽

Non Parametric

This paper proposes a new method for calculating the monomer reactivity ratios for binary copolymerization based on the terminal model. The original optimization method involves a numerical integration algorithm and an optimization algorithm based on k-nearest neighbour non-parametric regression. The calculation method has been tested on simulated and experimental data sets, at low (<10%), medium (10–35%) and high conversions (>40%), yielding reactivity ratios in a good agreement with the usual methods such as intersection, Fineman–Ross, reverse Fineman–Ross, Kelen–Tüdös, extended Kelen–Tüdös and the error in variable method. The experimental data sets used in this comparative analysis are copolymerization of 2-(N-phthalimido) ethyl acrylate with 1-vinyl-2-pyrolidone for low conversion, copolymerization of isoprene with glycidyl methacrylate for medium conversion and copolymerization of N-isopropylacrylamide with N,N-dimethylacrylamide for high conversion. Also, the possibility to estimate experimental errors from a single experimental data set formed by n experimental data is shown.

Download Full-text

An Incremental Isomap Method for Hyperspectral Dimensionality Reduction and Classification

Photogrammetric Engineering & Remote Sensing ◽

10.14358/pers.87.7.445 ◽

2021 ◽

Vol 87 (6) ◽

pp. 445-455

Author(s):

Yi Ma ◽

Zezhong Zheng ◽

Yutang Ma ◽

Mingcang Zhu ◽

Ran Huang ◽

...

Keyword(s):

Manifold Learning ◽

Nearest Neighbor ◽

Hyperspectral Image ◽

Hyperspectral Data ◽

Training Data ◽

Support Vector ◽

Data Sets ◽

K Nearest Neighbor ◽

Data Set ◽

Data Points

Many manifold learning algorithms conduct an eigen vector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N2). We pres- ent in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature varia- tion algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k–nearest-neighbor classifier and achieving the second best performance with support vector machine.

Download Full-text

Scalable Non-Parametric Methods for Large Data Sets

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch260 ◽

2011 ◽

pp. 1708-1713

Author(s):

V. Suresh Babu ◽

P. Viswanath ◽

Narasimha M. Murty

Keyword(s):

Nearest Neighbor ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Parametric Methods ◽

Clustering Method ◽

Data Set ◽

Computational Burden ◽

Set Size ◽

Non Parametric

Non-parametric methods like the nearest neighbor classifier (NNC) and the Parzen-Window based density estimation (Duda, Hart & Stork, 2000) are more general than parametric methods because they do not make any assumptions regarding the probability distribution form. Further, they show good performance in practice with large data sets. These methods, either explicitly or implicitly estimates the probability density at a given point in a feature space by counting the number of points that fall in a small region around the given point. Popular classifiers which use this approach are the NNC and its variants like the k-nearest neighbor classifier (k-NNC) (Duda, Hart & Stock, 2000). Whereas the DBSCAN is a popular density based clustering method (Han & Kamber, 2001) which uses this approach. These methods show good performance, especially with larger data sets. Asymptotic error rate of NNC is less than twice the Bayes error (Cover & Hart, 1967) and DBSCAN can find arbitrary shaped clusters along with noisy outlier detection (Ester, Kriegel & Xu, 1996). The most prominent difficulty in applying the non-parametric methods for large data sets is its computational burden. The space and classification time complexities of NNC and k-NNC are O(n) where n is the training set size. The time complexity of DBSCAN is O(n2). So, these methods are not scalable for large data sets. Some of the remedies to reduce this burden are as follows. (1) Reduce the training set size by some editing techniques in order to eliminate some of the training patterns which are redundant in some sense (Dasarathy, 1991). For example, the condensed NNC (Hart, 1968) is of this type. (2) Use only a few selected prototypes from the data set. For example, Leaders-subleaders method and l-DBSCAN method are of this type (Vijaya, Murthy & Subramanian, 2004 and Viswanath & Rajwala, 2006). These two remedies can reduce the computational burden, but this can also result in a poor performance of the method. Using enriched prototypes can improve the performance as done in (Asharaf & Murthy, 2003) where the prototypes are derived using adaptive rough fuzzy set theory and as in (Suresh Babu & Viswanath, 2007) where the prototypes are used along with their relative weights. Using a few selected prototypes can reduce the computational burden. Prototypes can be derived by employing a clustering method like the leaders method (Spath, 1980), the k-means method (Jain, Dubes, & Chen, 1987), etc., which can find a partition of the data set where each block (cluster) of the partition is represented by a prototype called leader, centroid, etc. But these prototypes can not be used to estimate the probability density, since the density information present in the data set is lost while deriving the prototypes. The chapter proposes to use a modified leader clustering method called the counted-leader method which along with deriving the leaders preserves the crucial density information in the form of a count which can be used in estimating the densities. The chapter presents a fast and efficient nearest prototype based classifier called the counted k-nearest leader classifier (ck-NLC) which is on-par with the conventional k-NNC, but is considerably faster than the k-NNC. The chapter also presents a density based clustering method called l-DBSCAN which is shown to be a faster and scalable version of DBSCAN (Viswanath & Rajwala, 2006). Formally, under some assumptions, it is shown that the number of leaders is upper-bounded by a constant which is independent of the data set size and the distribution from which the data set is drawn.

Download Full-text

A Research Travelogue on Classification Algorithms using R Programming

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d9014.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 9155-9158

Keyword(s):

Machine Learning ◽

Nearest Neighbor ◽

Statistical Tests ◽

Learning Task ◽

Data Sets ◽

K Nearest Neighbor ◽

Data Set ◽

Domain Experts ◽

R Programming ◽

Training Examples

Classification is a machine learning task which consists in predicting the set association of unclassified examples, whose label is not known, by the properties of examples in a representation learned earlier as of training examples, that label was known. Classification tasks contain a huge assortment of domains and real world purpose: disciplines such as medical diagnosis, bioinformatics, financial engineering and image recognition between others, where domain experts can use the model erudite to sustain their decisions. All the Classification Approaches proposed in this paper were evaluate in an appropriate experimental framework in R Programming Language and the major emphasis is on k-nearest neighbor method which supports vector machines and decision trees over large number of data sets with varied dimensionality and by comparing their performance against other state-of-the-art methods. In this process the experimental results obtained have been verified by statistical tests which support the better performance of the methods. In this paper we have survey various classification techniques of Data Mining and then compared them by using diverse datasets from “University of California: Irvine (UCI) Machine Learning Repository” for acquiring the accurate calculations on Iris Data set.

Download Full-text

Solving the Problem of Assessing Synergy and Antagonism for Non-Traditional Dosing Curve Compounds Using the DE/ZI Method: Application to Nrf2 Activators

Frontiers in Pharmacology ◽

10.3389/fphar.2021.686201 ◽

2021 ◽

Vol 12 ◽

Author(s):

Elizabeth M. Repash ◽

Kaitlin M. Pensabene ◽

Peter M. Palenchar ◽

Aimee L. Eggler

Keyword(s):

Dose Response ◽

Nearest Neighbor ◽

Antioxidant Response ◽

Pharmacological Intervention ◽

Data Sets ◽

Response Curves ◽

Data Set ◽

Dose Response Curves ◽

Small Molecule Modulators ◽

Antagonistic Interactions

Multi-drug combination therapy carries significant promise for pharmacological intervention, primarily better efficacy with less toxicity and fewer side effects. However, the field lacks methodology to assess synergistic or antagonistic interactions for drugs with non-traditional dose response curves. Specifically, our goal was to assess small-molecule modulators of antioxidant response element (ARE)-driven gene expression, which is largely regulated by the Nrf2 transcription factor. Known as Nrf2 activators, this class of compounds upregulates a battery of cytoprotective genes and shows significant promise for prevention of numerous chronic diseases. For example, sulforaphane sourced from broccoli sprouts is the subject of over 70 clinical trials. Nrf2 activators generally have non-traditional dose response curves that are hormetic, or U-shaped. We introduce a method based on the principles of Loewe Additivity to assess synergism and antagonism for two compounds in combination. This method, termed Dose-Equivalence/Zero Interaction (DE/ZI), can be used with traditional Hill-slope response curves, and it also can assess interactions for compounds with non-traditional curves, using a nearest-neighbor approach. Using a Monte-Carlo method, DE/ZI generates a measure of synergy or antagonism for each dosing pair with an associated error and p-value, resulting in a 3D response surface. For the assessed Nrf2 activators, sulforaphane and di-tert-butylhydroquinone, this approach revealed synergistic interactions at higher dosing concentrations consistently across data sets and potential antagonistic interactions at lower concentrations. DE/ZI eliminates the need to determine the best fit equation for a given data set and values experimentally-derived results over formulated fits.

Download Full-text

Software Defect Density Analysis

10.29007/rh9l ◽

2019 ◽

Author(s):

Cuauhtémoc López-Martín

Keyword(s):

Prediction Models ◽

Defect Density ◽

Data Sets ◽

Software Projects ◽

Data Set ◽

Language Generation ◽

Development Platform ◽

Software Defect ◽

Density Analysis ◽

Lower Size

Defect density (DD) is a measure to determine the effectiveness of software processes. DD is defined as the total number of defects divided by the size of the software. Software prediction is an activity of software planning. This study is related to the analysis of attributes of data sets commonly used for building DD prediction models. The data sets of software projects were selected from the International Software Benchmarking Standards Group (ISBSG) Release 2018. The selection criteria were based on attributes such as type of development, development platform, and programming language generation as suggested by the ISBSG. Since a lower size of data set is generated as mentioned criteria are observed, it avoids a good generalization for models. Therefore, in this study, a statistical analysis of data sets was performed with the objective of knowing if they could be pooled instead of using them as separated data sets. Results showed that there was no difference among the DD of new projects nor among the DD of enhancement projects, but there was a difference between the DD of new and enhancement projects. Results suggest that prediction models can separately be constructed for new projects and enhancement projects, but not by pooling new and enhancement ones.

Download Full-text

Implementing Machine Learning Algorithms on Finite Element Analyses Data Sets for Selecting Proper Cellular Structure

International Journal of Applied Mechanics ◽

10.1142/s1758825121500721 ◽

2021 ◽

Author(s):

Mahziyar Darvishi ◽

Omid Ziaee ◽

Arash Rahmati ◽

Mohammad Silani

Keyword(s):

Machine Learning ◽

Finite Element ◽

Nearest Neighbor ◽

Machine Learning Algorithms ◽

Cellular Structures ◽

Data Sets ◽

K Nearest Neighbor ◽

Element Analysis ◽

Data Set ◽

Efficient Alternative

Numerous structure geometries are available for cellular structures, and selecting the suitable structure that reflects the intended characteristics is cumbersome. While testing many specimens for determining the mechanical properties of these materials could be time-consuming and expensive, finite element analysis (FEA) is considered an efficient alternative. In this study, we present a method to find the suitable geometry for the intended mechanical characteristics by implementing machine learning (ML) algorithms on FEA results of cellular structures. Different cellular structures of a given material are analyzed by FEA, and the results are validated with their corresponding analytical equations. The validated results are employed to create a data set used in the ML algorithms. Finally, by comparing the results with the correct answers, the most accurate algorithm is identified for the intended application. In our case study, the cellular structures are three widely used cellular structures as bone implants: Cube, Kelvin, and Rhombic dodecahedron, made of Ti–6Al–4V. The ML algorithms are simple Bayesian classification, K-nearest neighbor, XGBoost, random forest, and artificial neural network. By comparing the results of these algorithms, the best-performing algorithm is identified.

Download Full-text

A Further Investigation on the Application of Critical Pore Size as an Approach for Reservoir Rock Typing

Journal of Energy Resources Technology ◽

10.1115/1.4049735 ◽

2021 ◽

Vol 143 (11) ◽

Author(s):

Mohsen Faramarzi-Palangar ◽

Behnam Sedaee ◽

Mohammad Emami Niri

Keyword(s):

Pore Size ◽

Constant Coefficient ◽

Prediction Models ◽

Critical Role ◽

Development Planning ◽

Reservoir Rock ◽

Data Sets ◽

Data Set ◽

Rock Typing ◽

Correct Definition

Abstract The correct definition of rock types plays a critical role in reservoir characterization, simulation, and field development planning. In this study, we use the critical pore size (linf) as an approach for reservoir rock typing. Two linf relations were separately derived based on two permeability prediction models and then merged together to drive a generalized linf relation. The proposed rock typing methodology includes two main parts: in the first part, we determine an appropriate constant coefficient, and in the second part, we perform reservoir rock typing based on two different scenarios. The first scenario is based on the forming groups of rocks using statistical analysis, and the second scenario is based on the forming groups of rocks with similar capillary pressure curves. This approach was applied to three data sets. In detail, two data sets were used to determine the constant coefficient, and one data set was used to show the applicability of the linf method in comparison with FZI for rock typing.

Download Full-text

Genomic Prediction Enhanced Sparse Testing for Multi-environment Trials

G3 Genes|Genome|Genetics ◽

10.1534/g3.120.401349 ◽

2020 ◽

Vol 10 (8) ◽

pp. 2725-2739 ◽

Cited By ~ 1

Author(s):

Diego Jarquin ◽

Reka Howard ◽

Jose Crossa ◽

Yoseph Beyene ◽

Manje Gowda ◽

...

Keyword(s):

Prediction Accuracy ◽

Prediction Models ◽

Prediction Method ◽

Predictive Ability ◽

Fixed Number ◽

Substantial Part ◽

Data Sets ◽

Environment Interaction ◽

Data Set ◽

Hybrid Data

“Sparse testing” refers to reduced multi-environment breeding trials in which not all genotypes of interest are grown in each environment. Using genomic-enabled prediction and a model embracing genotype × environment interaction (GE), the non-observed genotype-in-environment combinations can be predicted. Consequently, the overall costs can be reduced and the testing capacities can be increased. The accuracy of predicting the unobserved data depends on different factors including (1) how many genotypes overlap between environments, (2) in how many environments each genotype is grown, and (3) which prediction method is used. In this research, we studied the predictive ability obtained when using a fixed number of plots and different sparse testing designs. The considered designs included the extreme cases of (1) no overlap of genotypes between environments, and (2) complete overlap of the genotypes between environments. In the latter case, the prediction set fully consists of genotypes that have not been tested at all. Moreover, we gradually go from one extreme to the other considering (3) intermediates between the two previous cases with varying numbers of different or non-overlapping (NO)/overlapping (O) genotypes. The empirical study is built upon two different maize hybrid data sets consisting of different genotypes crossed to two different testers (T1 and T2) and each data set was analyzed separately. For each set, phenotypic records on yield from three different environments are available. Three different prediction models were implemented, two main effects models (M1 and M2), and a model (M3) including GE. The results showed that the genome-based model including GE (M3) captured more phenotypic variation than the models that did not include this component. Also, M3 provided higher prediction accuracy than models M1 and M2 for the different allocation scenarios. Reducing the size of the calibration sets decreased the prediction accuracy under all allocation designs with M3 being the less affected model; however, using the genome-enabled models (i.e., M2 and M3) the predictive ability is recovered when more genotypes are tested across environments. Our results indicate that a substantial part of the testing resources can be saved when using genome-based models including GE for optimizing sparse testing designs.

Download Full-text