An Instance Space Analysis of Regression Problems

2021 ◽  
Vol 15 (2) ◽  
pp. 1-25
Author(s):  
Mario Andrés Muñoz ◽  
Tao Yan ◽  
Matheus R. Leal ◽  
Kate Smith-Miles ◽  
Ana Carolina Lorena ◽  
...  

The quest for greater insights into algorithm strengths and weaknesses, as revealed when studying algorithm performance on large collections of test problems, is supported by interactive visual analytics tools. A recent advance is Instance Space Analysis, which presents a visualization of the space occupied by the test datasets and of the performance of algorithms across the instance space. The strengths and weaknesses of algorithms can be visually assessed, and the adequacy of the test datasets can be scrutinized through visual analytics. This article presents the first Instance Space Analysis of regression problems in Machine Learning, considering the performance of 14 popular algorithms on 4,855 test datasets from a variety of sources. The two-dimensional instance space is defined by measurable characteristics of regression problems, selected from over 26 candidate features. It enables the similarities and differences between test instances to be visualized, along with the predictive performance of regression algorithms across the entire instance space. The purpose of creating this framework for visual analysis of an instance space is twofold: the capability and suitability of various regression techniques can be assessed, while the bias, diversity, and level of difficulty of the regression problems popularly used by the community are visually revealed. This article shows the applicability of the created regression instance space to provide insights into the strengths and weaknesses of regression algorithms, and the opportunities to diversify the benchmark test instances to support greater insights.
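As a rough sketch of the mechanics described above, Instance Space Analysis summarises each dataset by a vector of measurable meta-features, standardises them, and maps them onto a 2-D plane with a linear projection. The meta-features and projection matrix below are illustrative stand-ins, not the ones selected or learned in the article:

```python
def standardise(rows):
    """Z-score each meta-feature column across all instances."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def project_2d(rows, proj):
    """Map each standardised feature vector to (z1, z2) via a linear projection."""
    return [(sum(x * p for x, p in zip(r, proj[0])),
             sum(x * p for x, p in zip(r, proj[1]))) for r in rows]

# Three toy "datasets", each described by four hypothetical meta-features
# (e.g. feature correlation, output skewness, dimensionality, noise level).
meta_features = [
    [0.9, 0.1, 10.0, 0.05],
    [0.2, 1.5, 200.0, 0.30],
    [0.5, 0.8, 50.0, 0.10],
]
projection = [[0.7, -0.2, 0.1, 0.4],   # invented axes, not the learned projection
              [0.1, 0.6, -0.3, 0.2]]

points = project_2d(standardise(meta_features), projection)
for z1, z2 in points:
    print(f"({z1:+.2f}, {z2:+.2f})")
```

Plotting each dataset at its (z1, z2) coordinate, coloured by the best-performing algorithm, gives the kind of instance-space view the article analyzes.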

Obesity Facts ◽  
2021 ◽  
pp. 1-11
Author(s):  
Marijn Marthe Georgine van Berckel ◽  
Saskia L.M. van Loon ◽  
Arjen-Kars Boer ◽  
Volkher Scharnhorst ◽  
Simon W. Nienhuijs

Introduction: Bariatric surgery results in both intentional and unintentional metabolic changes. In a high-volume bariatric center, extensive laboratory panels are used to monitor these changes pre- and postoperatively. Consecutive measurements of relevant biochemical markers allow exploration of the health state of bariatric patients and comparison of different patient groups. Objective: The objective of this study is to compare biomarker distributions over time between 2 common bariatric procedures, i.e., sleeve gastrectomy (SG) and Roux-en-Y gastric bypass (RYGB), using visual analytics. Methods: Both pre- and postsurgical (6, 12, and 24 months) data of all patients who underwent primary bariatric surgery were collected retrospectively. The distribution and evolution of different biochemical markers were compared before and after surgery using asymmetric beanplots in order to evaluate the effect of primary SG and RYGB. A beanplot is an alternative to the boxplot that allows an easy and thorough visual comparison of univariate data. Results: In total, 1,237 patients (659 SG and 578 RYGB) were included. The sleeve and bypass groups were comparable in terms of age and the prevalence of comorbidities. The mean presurgical BMI and the percentage of males were higher in the sleeve group. The effect of surgery on lowering glycated hemoglobin was similar for both surgery types. After RYGB surgery, the decrease in cholesterol concentration was larger than after SG. The enzymatic activity of aspartate aminotransferase, alanine aminotransferase, and alkaline phosphatase in sleeve patients was higher presurgically but lower postsurgically compared to bypass values. Conclusions: Beanplots allow intuitive visualization of population distributions. Analysis of this large population-based data set using beanplots suggests comparable efficacies of both types of surgery in reducing diabetes. RYGB surgery reduced dyslipidemia more effectively than SG. The trend toward a larger decrease in liver enzyme activities following SG is a subject for further investigation.
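The beanplot idea underlying the analysis can be sketched in plain Python: estimate each group's density with a Gaussian kernel and draw the two estimates on opposite sides of a shared axis, which is what makes the sleeve-vs-bypass comparison easy to read. The values below are synthetic placeholders, not the study's biomarker measurements:

```python
import math

def kde(sample, grid, bandwidth=1.0):
    """Gaussian kernel density estimate of `sample` evaluated on `grid`."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2)
                       for x in sample) for g in grid]

# Synthetic "cholesterol" values for two surgery groups (illustrative only).
sleeve = [5.2, 5.6, 5.9, 6.1, 5.4]
bypass = [4.3, 4.8, 4.6, 5.0, 4.4]

grid = [3.5 + 0.25 * i for i in range(14)]
left = kde(sleeve, grid, bandwidth=0.4)    # left half of the bean
right = kde(bypass, grid, bandwidth=0.4)   # right half of the bean

# A crude text rendering: each row is one grid value, bars grow outward
# from the shared central axis, one group per side.
for g, l, r in zip(grid, left, right):
    print(f"{'#' * round(l * 40):>12s}|{'#' * round(r * 40):<12s} {g:.2f}")
```

In practice a plotting library draws the two half-densities as filled shapes; the asymmetry is the point, since it lets two distributions share one axis.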


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ratanond Koonchanok ◽  
Swapna Vidhur Daulatabad ◽  
Quoseena Mir ◽  
Khairi Reda ◽  
Sarath Chandra Janga

Abstract Background Direct-sequencing technologies, such as Oxford Nanopore’s, are delivering long RNA reads with great efficacy and convenience. These technologies afford an ability to detect post-transcriptional modifications at single-molecule resolution, promising new insights into the functional roles of RNA. However, realizing this potential requires new tools to analyze and explore this type of data. Results Here, we present Sequoia, a visual analytics tool that allows users to interactively explore nanopore sequences. Sequoia combines a Python-based backend with a multi-view visualization interface, enabling users to import raw nanopore sequencing data in the Fast5 format, cluster sequences based on electric-current similarities, and drill down into signals to identify properties of interest. We demonstrate the application of Sequoia by generating and analyzing ~500k reads from direct RNA sequencing data of the human HeLa cell line. We focus on comparing signal features from m6A and m5C RNA modifications as the first step towards building automated classifiers. We show how, through iterative visual exploration and tuning of dimensionality reduction parameters, we can separate modified RNA sequences from their unmodified counterparts. We also document new, qualitative signal signatures that distinguish these modifications from otherwise unmodified RNA bases, which we were able to discover through the visualization. Conclusions Sequoia’s interactive features complement existing computational approaches in nanopore-based RNA workflows. The insights gleaned through visual analysis should help users in developing rationales, hypotheses, and insights into the dynamic nature of RNA. Sequoia is available at https://github.com/dnonatar/Sequoia.
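One step the abstract describes, clustering reads by similarities in their electric-current signals, can be sketched with simple summary features and a two-cluster k-means. The traces and current levels below are invented toy values, not nanopore data or Sequoia's actual pipeline:

```python
def features(trace):
    """Summarise a current trace by its mean and standard deviation."""
    m = sum(trace) / len(trace)
    s = (sum((x - m) ** 2 for x in trace) / len(trace)) ** 0.5
    return (m, s)

def assign(p, c0, c1):
    """Index (0 or 1) of the nearer centroid by squared distance."""
    d0 = (p[0] - c0[0]) ** 2 + (p[1] - c0[1]) ** 2
    d1 = (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2
    return 0 if d0 <= d1 else 1

def kmeans2(points, c0, c1, iters=10):
    """Two-cluster k-means with explicit starting centroids."""
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[assign(p, c0, c1)].append(p)
        if groups[0]:
            c0 = tuple(sum(v) / len(groups[0]) for v in zip(*groups[0]))
        if groups[1]:
            c1 = tuple(sum(v) / len(groups[1]) for v in zip(*groups[1]))
    return [assign(p, c0, c1) for p in points]

# Toy traces: the "modified" reads sit at a higher mean current level.
unmodified = [[80, 82, 81, 79], [78, 80, 79, 81], [81, 80, 82, 80]]
modified = [[95, 97, 96, 94], [96, 95, 97, 95], [94, 96, 95, 97]]
pts = [features(t) for t in unmodified + modified]
labels = kmeans2(pts, pts[0], pts[-1])
print(labels)
```

Real nanopore signals need far richer features and dimensionality reduction, which is exactly the tuning loop the tool exposes interactively.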


2021 ◽  
Vol 11 (11) ◽  
pp. 4751
Author(s):  
Jorge-Félix Rodríguez-Quintero ◽  
Alexander Sánchez-Díaz ◽  
Leonel Iriarte-Navarro ◽  
Alejandro Maté ◽  
Manuel Marco-Such ◽  
...  

Among the knowledge areas in which process mining has had an impact, the audit domain is particularly striking. Traditionally, audits seek evidence in a data sample that allows making inferences about a population. Mistakes are often made when generalizing results from such samples, so anomalies may remain hidden in the unexamined portions of the data; some efforts address these limitations using process-mining-based approaches for fraud detection. To the best of our knowledge, no fraud audit method exists that combines process mining techniques and visual analytics to identify relevant patterns. This paper presents a fraud audit approach based on the combination of process mining techniques and visual analytics. The main advantages are: (i) a method is included that guides the use of the visual capabilities of process mining to detect fraud data patterns during an audit; (ii) the approach can be generalized to any business domain; (iii) well-known process mining techniques are used (dotted chart, trace alignment, fuzzy miner…). The techniques were selected by a group of experts and were extended to enable filtering for contextual analysis, to handle levels of process abstraction, and to facilitate implementation in the area of fraud audits. Based on the proposed approach, we developed a software solution that is currently being used in the financial sector as well as in the telecommunications and hospitality sectors. Finally, for demonstration purposes, we present a real hotel management use case in which we detected suspected fraud behaviors, thus validating the effectiveness of the approach.


2019 ◽  
Vol 19 (1) ◽  
pp. 3-23
Author(s):  
Aurea Soriano-Vargas ◽  
Bernd Hamann ◽  
Maria Cristina F de Oliveira

We present an integrated interactive framework for the visual analysis of time-varying multivariate data sets. As part of our research, we performed in-depth studies concerning the applicability of visualization techniques to obtain valuable insights. We consolidated the considered analysis and visualization methods in one framework, called TV-MV Analytics. TV-MV Analytics effectively combines visualization and data mining algorithms providing the following capabilities: (1) visual exploration of multivariate data at different temporal scales, and (2) a hierarchical small multiples visualization combined with interactive clustering and multidimensional projection to detect temporal relationships in the data. We demonstrate the value of our framework for specific scenarios by studying three use cases that were validated and discussed with domain experts.


Author(s):  
Prashant Rai ◽  
Mathilde Chevreuil ◽  
Anthony Nouy ◽  
Jayant Sen Gupta

This paper aims at handling high-dimensional uncertainty propagation problems by proposing a tensor product approximation method based on regression techniques. The underlying assumption is that the model output functional can be well represented in separated form, as a sum of elementary tensors in the stochastic tensor product space. The proposed method consists of constructing a tensor basis with a greedy algorithm and then computing an approximation in the generated approximation space using regression with sparse regularization. With appropriate regularization techniques, the regression problems are well posed for only a few sample evaluations and provide accurate approximations of model outputs.
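The regression-with-sparse-regularization step can be illustrated with a minimal coordinate-descent lasso on a small polynomial dictionary. This is a generic stand-in under simplified assumptions, not the paper's greedy tensor-basis construction:

```python
def lasso_cd(X, y, lam, iters=200):
    """Coordinate-descent lasso: minimise 0.5*||y - Xw||^2 + lam*||w||_1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        for j in range(d):
            # Partial residual excluding coordinate j.
            r = [y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            # Soft-thresholding update drives small coefficients to exactly zero.
            if rho > lam:
                w[j] = (rho - lam) / z
            elif rho < -lam:
                w[j] = (rho + lam) / z
            else:
                w[j] = 0.0
    return w

# Samples of f(t) = 2*t^2 fitted on the basis [1, t, t^2, t^3]; the sparse
# penalty should keep essentially only the t^2 term.
ts = [i / 10 for i in range(-10, 11)]
X = [[1.0, t, t * t, t ** 3] for t in ts]
y = [2 * t * t for t in ts]
w = lasso_cd(X, y, lam=0.1)
print([round(c, 3) for c in w])
```

The sparsity is what makes the problem well posed from few samples: most basis coefficients are forced to zero, so only the active terms must be estimated.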


2021 ◽  
Author(s):  
Taimur Khan ◽  
Syed Samad Shakeel ◽  
Afzal Gul ◽  
Hamza Masud ◽  
Achim Ebert

Visual analytics has been widely studied in the past decade, both in academia and industry, to improve data exploration, minimize the overall cost, and improve data analysis. In this chapter, we explore the idea of visual analytics in the context of simulation data. This provides the capability not only to explore the data visually but also to apply machine learning models in order to answer high-level questions with respect to scheduling, choosing optimal simulation parameters, finding correlations, etc. More specifically, we examine state-of-the-art tools for performing these tasks. Further, to test and validate our methodology, we followed the human-centered design process to build a prototype tool called ViDAS (Visual Data Analytics of Simulated Data). Our preliminary evaluation study illustrates the intuitiveness and ease of use of our approach with regard to visual analysis of simulated data.


Complexity ◽  
2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Marium Mehmood ◽  
Nasser Alshammari ◽  
Saad Awadh Alanazi ◽  
Fahad Ahmad

The liver is a vital organ of the human body, but detecting liver disease at an early stage is very difficult because its symptoms remain hidden. Liver disease may cause loss of energy or weakness once irregularities in liver function become visible. Cancer is one of the most common diseases of the liver and also the most fatal of all: harmful cells grow uncontrollably inside the liver, and if diagnosed late, it may cause death. Treating liver disease at an early stage is therefore an important issue, as is designing a model for early diagnosis. First, the features that play the most significant part in detecting liver cancer at an early stage should be identified; it is essential to extract a few essential features from thousands of uninformative ones. These features are mined using data mining and soft computing techniques, which give optimized results helpful for disease diagnosis at an early stage. We use feature selection methods to reduce the dataset's features, namely Filter, Wrapper, and Embedded methods. Different regression algorithms are then applied to each of these methods to evaluate the result: Linear Regression, Ridge Regression, LASSO Regression, Support Vector Regression, Decision Tree Regression, Multilayer Perceptron Regression, and Random Forest Regression. We evaluated our results based on the accuracy and error rates generated by these regression algorithms. The results show that, of all the deployed techniques, Random Forest Regression with the Wrapper method performs best, giving the highest R2-score of 0.8923 and the lowest MSE of 0.0618.
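A minimal sketch of the Filter-style selection described above: rank candidate features by absolute Pearson correlation with the target and keep the top-ranked ones before fitting any regressor. The toy data is synthetic, not the liver-disease dataset:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def filter_select(X_cols, y, k):
    """Return indices of the k features most correlated with the target."""
    scores = [(abs(pearson(col, y)), j) for j, col in enumerate(X_cols)]
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Feature 0 drives the target, feature 1 is noise, feature 2 is constant.
f0 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
f1 = [3.1, 0.2, 2.7, 1.4, 0.9, 2.2]
f2 = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
y = [0.1, 2.1, 3.9, 6.2, 8.0, 9.9]   # roughly 2 * f0

selected = filter_select([f0, f1, f2], y, k=1)
print("selected features:", selected)
```

Wrapper methods differ in that they score feature subsets by actually training the downstream regressor, and Embedded methods fold selection into the model's own fitting (as LASSO does).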


Author(s):  
K. Darshana Abeyrathna ◽  
Ole-Christoffer Granmo ◽  
Xuan Zhang ◽  
Lei Jiao ◽  
Morten Goodwin

Relying simply on bitwise operators, the recently introduced Tsetlin machine (TM) has provided competitive pattern classification accuracy in several benchmarks, including text understanding. In this paper, we introduce the regression Tsetlin machine (RTM), a new class of TMs designed for continuous input and output, targeting nonlinear regression problems. In brief, we convert continuous input into a binary representation based on thresholding, and transform the propositional formula formed by the TM into an aggregated continuous output. Our empirical comparison of the RTM with state-of-the-art regression techniques reveals either superior or on-par performance on five datasets. This article is part of the theme issue ‘Harmonizing energy-autonomous computing and intelligence’.
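One plausible reading of the binarization step is a thermometer encoding against a fixed set of thresholds, with the continuous output recovered from a bit count. The thresholds and ranges below are illustrative assumptions, not the RTM's learned parameters:

```python
def binarise(x, thresholds):
    """Thermometer encoding: one bit per threshold, set when x exceeds it."""
    return [1 if x > t else 0 for t in thresholds]

def debinarise(bits, lo, hi):
    """Map a bit count back onto the continuous range [lo, hi]."""
    return lo + (hi - lo) * sum(bits) / len(bits)

thresholds = [0.1 * i for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
code = binarise(0.42, thresholds)
print(code)                                    # 0.42 clears 0.1 through 0.4
print(round(debinarise(code, 0.0, 1.0), 3))
```

The resolution of the recovered output is set by the number of thresholds, which is the trade-off the binary representation introduces.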


2013 ◽  
Vol 22 (05) ◽  
pp. 1360008 ◽  
Author(s):  
PATRICIA J. CROSSNO ◽  
ANDREW T. WILSON ◽  
TIMOTHY M. SHEAD ◽  
WARREN L. DAVIS ◽  
DANIEL M. DUNLAVY

We present a new approach for analyzing topic models using visual analytics. We have developed TopicView, an application for visually comparing and exploring multiple models of text corpora, as a prototype for this type of analysis tool. TopicView uses multiple linked views to visually analyze conceptual and topical content, document relationships identified by models, and the impact of models on the results of document clustering. As case studies, we examine models created using two standard approaches: Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Conceptual content is compared through the combination of (i) a bipartite graph matching LSA concepts with LDA topics based on the cosine similarities of model factors and (ii) a table containing the terms for each LSA concept and LDA topic listed in decreasing order of importance. Document relationships are examined through the combination of (i) side-by-side document similarity graphs, (ii) a table listing the weights for each document's contribution to each concept/topic, and (iii) a full text reader for documents selected in either of the graphs or the table. The impact of LSA and LDA models on document clustering applications is explored through similar means, using proximities between documents and cluster exemplars for graph layout edge weighting and table entries. We demonstrate the utility of TopicView's visual approach to model assessment by comparing LSA and LDA models of several example corpora.
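The concept-to-topic matching described above reduces to cosine similarity between model factor vectors. A minimal sketch, with toy term-weight vectors rather than factors from a real corpus:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_concepts(lsa, lda):
    """For each LSA concept, the index of the most similar LDA topic."""
    return [max(range(len(lda)), key=lambda j: cosine(c, lda[j]))
            for c in lsa]

# Toy term-weight vectors over a shared 4-term vocabulary.
lsa_concepts = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.8, 0.6]]
lda_topics = [[0.1, 0.0, 0.7, 0.7], [0.8, 0.2, 0.1, 0.0]]

print(match_concepts(lsa_concepts, lda_topics))
```

In the tool, these similarities weight the edges of the bipartite graph linking LSA concepts to LDA topics.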


2021 ◽  
pp. 1-24
Author(s):  
G. Kronberger ◽  
F. O. de Franca ◽  
B. Burlacu ◽  
C. Haider ◽  
M. Kommenda

Abstract We investigate the addition of constraints on the function image and its derivatives for the incorporation of prior knowledge in symbolic regression. The approach is called shape-constrained symbolic regression and allows us to enforce, e.g., monotonicity of the function over selected inputs. The aim is to find models which conform to expected behaviour and which have improved extrapolation capabilities. We demonstrate the feasibility of the idea and propose and compare two evolutionary algorithms for shape-constrained symbolic regression: (i) an extension of tree-based genetic programming which discards infeasible solutions in the selection step, and (ii) a two-population evolutionary algorithm that separates the feasible from the infeasible solutions. In both algorithms we use interval arithmetic to approximate bounds for models and their partial derivatives. The algorithms are tested on a set of 19 synthetic and four real-world regression problems. Both algorithms are able to identify models which conform to shape constraints, which is not the case for the unmodified symbolic regression algorithms. However, the predictive accuracy of models with constraints is worse on both the training set and the test set. Shape-constrained polynomial regression produces the best results for the test set but also produces significantly larger models.
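The interval-arithmetic step both algorithms rely on can be sketched directly: evaluate the model and its derivative on intervals, and if the derivative's lower bound is non-negative over the input region, monotonicity is guaranteed there. The expression below is an illustrative example, not one of the benchmark models:

```python
def i_add(a, b):
    """Interval addition."""
    return (a[0] + b[0], a[1] + b[1])

def i_mul(a, b):
    """Interval multiplication: take the min/max of all endpoint products."""
    ps = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(ps), max(ps))

def i_scale(c, a):
    """Multiply an interval by a scalar constant."""
    return (min(c * a[0], c * a[1]), max(c * a[0], c * a[1]))

# Model f(x) = x^2 + 3x on x in [0, 2]; its derivative is f'(x) = 2x + 3.
x = (0.0, 2.0)
f = i_add(i_mul(x, x), i_scale(3, x))       # bounds on f over [0, 2]
df = i_add(i_scale(2, x), (3.0, 3.0))       # bounds on f' over [0, 2]

print("f in", f)       # interval bounds may overestimate the true range
print("f' in", df)
assert df[0] >= 0      # lower bound non-negative => monotone increasing
```

The conservatism of interval bounds is the known price: a feasible model may be discarded because its computed interval overestimates the true range.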

