Visualizing Profiles of Large Datasets of Weighted and Mixed Data

Mathematics, 2021, Vol 9 (8), pp. 891
Author(s): Aurea Grané, Alpha A. Sow-Barry

This work provides a procedure for constructing and visualizing profiles, i.e., groups of individuals with similar characteristics, for weighted and mixed data by combining two classical multivariate techniques: multidimensional scaling (MDS) and the k-prototypes clustering algorithm. The well-known drawback of classical MDS on large datasets is circumvented by selecting a small random sample of the dataset, whose individuals are clustered by means of an adapted version of the k-prototypes algorithm and mapped via classical MDS. Gower’s interpolation formula is then used to project the remaining individuals onto this configuration. Throughout the process, Gower’s distance is used to measure the proximity between individuals. The methodology is illustrated on a real dataset from the Survey of Health, Ageing and Retirement in Europe (SHARE), which was carried out in 19 countries and represents over 124 million aged individuals in Europe. The performance of the method was evaluated through a simulation study, whose results indicate that the new proposal overcomes the high computational cost of classical MDS with low error.
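
A minimal sketch of the sample, cluster, and interpolate pipeline is given below, using the third-party packages gower (Gower distances) and kmodes (k-prototypes). The toy DataFrame, sample size, number of clusters, categorical column indices, and the add-a-point interpolation details are illustrative assumptions, and survey weights (weighted Gower distances) are omitted for brevity.

```python
import numpy as np
import pandas as pd
import gower
from kmodes.kprototypes import KPrototypes

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS of a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                  # doubly centered Gram matrix
    vals, vecs = np.linalg.eigh(G)
    top = np.argsort(vals)[::-1][:k]
    lam, V = vals[top], vecs[:, top]
    return V * np.sqrt(lam), G, lam              # sample coordinates

def gower_interpolate(d2_new, X, G, lam):
    """Project one new individual, given its squared distances to the sample,
    onto the existing configuration (Gower's add-a-point formula, assumed)."""
    return 0.5 * (X.T @ (np.diag(G) - d2_new)) / lam

# toy mixed dataset standing in for the survey data (numeric + categorical)
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"age": rng.integers(50, 90, n),
                   "score": rng.normal(size=n),
                   "country": rng.choice(["ES", "DE", "FR"], n),
                   "smoker": rng.choice(["yes", "no"], n)})

idx = rng.choice(n, size=300, replace=False)              # small random sample
sample, rest = df.iloc[idx], df.drop(df.index[idx])

D = gower.gower_matrix(sample).astype(float)              # Gower distances
X, G, lam = classical_mds(D, k=2)                         # map the sample via MDS
labels = KPrototypes(n_clusters=4, random_state=0).fit_predict(
    sample.values, categorical=[2, 3])                    # cluster the sample

D_rest = gower.gower_matrix(rest, sample).astype(float)   # distances to the sample
coords_rest = np.vstack([gower_interpolate(d ** 2, X, G, lam) for d in D_rest])
print(coords_rest.shape)                                  # remaining individuals mapped
```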

2020, Vol 17 (1), pp. 71-92
Author(s): Stefanos Ougiaroglou, Dimitris Dervos, Georgios Evangelidis

Nowadays, large volumes of training data are available from various data sources and streaming environments. Instance-based classifiers perform adequately when they use only a small subset of such datasets, but larger data volumes introduce a high computational cost that prohibits the timely execution of the classification process. Conventional prototype selection and generation algorithms are also inappropriate for data streams and large datasets. In the past, we proposed prototype generation algorithms that maintain a dynamic set of prototypes and are appropriate for such types of data: the set is dynamic because existing prototypes may be updated, or new prototypes may be appended to it, in the course of processing. Still, repeated generation of new prototypes may result in unpredictably large sets of prototypes. In this paper, we propose a new variation of our algorithm that keeps the prototype set at a convenient and manageable size. This is achieved by removing the weakest prototype whenever a new prototype is generated. The new algorithm has been tested on several datasets. The experimental results reveal that it is as accurate as its predecessor, yet more efficient and noise tolerant.
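
The bounding mechanism can be sketched as follows. The weight and update rules below are illustrative assumptions and do not reproduce the authors' algorithm; they only show the idea of dropping the weakest prototype whenever a new one is created so the set stays within a fixed budget.

```python
import numpy as np

class BoundedPrototypeSet:
    """Toy 1-NN classifier over a dynamic, size-bounded set of prototypes."""

    def __init__(self, max_prototypes=100):
        self.max_prototypes = max_prototypes
        self.prototypes = []          # list of (vector, label, weight)

    def classify(self, x):
        dists = [np.linalg.norm(x - p) for p, _, _ in self.prototypes]
        return self.prototypes[int(np.argmin(dists))][1]

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        if not self.prototypes:
            self.prototypes.append((x, y, 1.0))
            return
        dists = [np.linalg.norm(x - p) for p, _, _ in self.prototypes]
        i = int(np.argmin(dists))
        p, label, w = self.prototypes[i]
        if label == y:
            # instance already well represented: move prototype towards it
            self.prototypes[i] = ((p * w + x) / (w + 1), label, w + 1)
        else:
            # misclassified instance: a new prototype must be generated ...
            if len(self.prototypes) >= self.max_prototypes:
                # ... so first drop the weakest (lowest-weight) prototype
                weakest = int(np.argmin([wt for _, _, wt in self.prototypes]))
                self.prototypes.pop(weakest)
            self.prototypes.append((x, y, 1.0))
```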


2012, Vol 2 (1), pp. 7-9
Author(s): Satinderjit Singh

Median filtering is a commonly used technique in image processing. The main problem of the median filter is its high computational cost: sorting N pixels has temporal complexity O(N·log N), even with the most efficient sorting algorithms. When the median filter must be carried out in real time, a software implementation on general-purpose processors does not usually give good results. This paper presents an efficient algorithm for median filtering with a 3x3 filter kernel that requires only about 9 comparisons per pixel, exploiting spatial coherence between neighboring filter computations. The basic algorithm calculates two medians in one step and reuses sorted slices of three vertical neighboring pixels. An extension of this algorithm for 2D spatial coherence is also examined, which calculates four medians per step.
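
A hedged sketch of the column-reuse idea follows: each vertical triple of pixels is sorted once and shared by the three horizontally adjacent 3x3 windows that contain it, and the median of nine is recovered from the three pre-sorted columns. The comparison count here is not tuned down to the roughly nine comparisons per pixel reported in the paper; the code only illustrates the spatial-coherence principle.

```python
import numpy as np

def sort3(a, b, c):
    # sort three values with three comparisons
    if a > b: a, b = b, a
    if b > c: b, c = c, b
    if a > b: a, b = b, a
    return a, b, c

def median3x3(img):
    h, w = img.shape
    out = img.copy()                                  # borders left unfiltered
    # sort every vertical triple once; each is reused by three windows
    cols = [[sort3(img[y - 1, x], img[y, x], img[y + 1, x])
             for x in range(w)] for y in range(1, h - 1)]
    for yi, row in enumerate(cols):
        y = yi + 1
        for x in range(1, w - 1):
            lo0, me0, hi0 = row[x - 1]
            lo1, me1, hi1 = row[x]
            lo2, me2, hi2 = row[x + 1]
            # median of 9 from three sorted columns:
            # median(max of lows, median of medians, min of highs)
            out[y, x] = sort3(max(lo0, lo1, lo2),
                              sort3(me0, me1, me2)[1],
                              min(hi0, hi1, hi2))[1]
    return out

img = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(np.uint8)
filtered = median3x3(img)
```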


1995, Vol 32 (2), pp. 95-103
Author(s): José A. Revilla, Kalin N. Koev, Rafael Díaz, César Álvarez, Antonio Roldán

One factor in determining the transport capacity of coastal interceptors in Combined Sewer Systems (CSS) is the reduction of Dissolved Oxygen (DO) in coastal waters originating from the overflows. The study of the evolution of DO in coastal zones is complex, and the high computational cost of mathematical models makes the required probabilistic analysis impractical. Alternative methods, based on such mathematical modelling but applied to a limited number of cases, are therefore needed. In this paper, two alternative methods are presented for studying the oxygen deficit resulting from CSS overflows. The first focuses the statistical analysis on the causes of the deficit (the volume discharged); the second concentrates on the effects (the concentrations of oxygen in the sea). Both methods have been applied in a study of the coastal interceptor at Pasajes Estuary (Guipúzcoa, Spain) with similar results.


Author(s): Seyede Vahide Hashemi, Mahmoud Miri, Mohsen Rashki, Sadegh Etedali

This paper carries out sensitivity analyses to study the effect of each design variable on the performance of the self-centering buckling restrained brace (SC-BRB) and the corresponding buckling restrained brace (BRB) without shape memory alloy (SMA) rods. Furthermore, reliability analyses of the BRB and SC-BRB are performed. Considering the high computational cost of simulation methods, three meta-models, namely Kriging, the radial basis function (RBF), and the polynomial response surface method (PRSM), are used to construct the surrogate models. To this end, nonlinear dynamic analyses are conducted on both the BRB and SC-BRB using OpenSees software. The results showed that the SMA area, SMA length ratio, and BRB core area have the largest effect on the failure probability of the SC-BRB. It is concluded that Kriging-based Monte Carlo Simulation (MCS) gives the best performance in estimating the limit state function (LSF) of the BRB and SC-BRB in the reliability analysis procedures. Considering the effect of changing the maximum cyclic loading on the failure probability computation, and comparing the failure probabilities for different LSFs, it is also found that the reliability indices of the SC-BRB were always higher than the corresponding indices determined for the BRB, which confirms the performance superiority of the SC-BRB over the BRB.
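
A minimal, generic sketch of Kriging-based Monte Carlo Simulation for a failure probability is shown below, with scikit-learn's Gaussian process regressor standing in for the Kriging surrogate. The limit state function, random-variable distributions, and sample sizes are placeholders for illustration, not the SC-BRB model of the paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

def g_true(x):
    # placeholder limit state function (failure when g < 0)
    return 3.0 - x[:, 0] ** 2 - 0.5 * x[:, 1]

rng = np.random.default_rng(1)
X_doe = rng.normal(size=(60, 2))            # small design of experiments
y_doe = g_true(X_doe)                       # expensive model runs in practice

kriging = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                                   normalize_y=True).fit(X_doe, y_doe)

X_mc = rng.normal(size=(100_000, 2))        # large, cheap Monte Carlo sample
pf = float(np.mean(kriging.predict(X_mc) < 0.0))   # estimated failure probability
beta = -norm.ppf(pf)                        # corresponding reliability index
print(f"P_f ~ {pf:.4f}, beta ~ {beta:.2f}")
```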


2019, Vol 6 (1)
Author(s): Sumedh Yadav, Mathis Bode

A scalable graphical method is presented for selecting and partitioning datasets for the training phase of a classification task. The heuristic requires a clustering algorithm whose computational cost remains in reasonable proportion to the classification task itself. This step is followed by the construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method consists of two approaches, one for reducing a given training set and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is a significant reduction in training run-time without compromising prediction accuracy. Test results show that both approaches significantly speed up the training task when compared against the state-of-the-art shrinking heuristics available in LIBSVM, and that they closely match or even outperform them in prediction accuracy. A network design is also presented for a partitioning-based distributed training formulation; an additional speed-up in training run-time is observed when compared to a serial implementation of the approaches.
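
One way to read the reduction step is sketched below: cluster the data cheaply, build a nearest-neighbor graph, and keep only points whose neighborhoods mix classes (i.e., those near decision boundaries) before training an SVM. The concrete choices here (k-means, exact instead of approximate nearest neighbors, the purity threshold) are illustrative assumptions, not the authors' heuristic.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def reduce_training_set(X, y, n_clusters=200, k=10, purity=0.95):
    """Keep instances near class boundaries; drop redundant interior points."""
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit(X)
    _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    keep = np.zeros(len(X), dtype=bool)
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        frac = np.bincount(y[members]).max() / len(members)  # majority fraction
        if frac < purity:
            keep[members] = True           # mixed cluster: keep everything
        else:
            for i in members:              # pure cluster: keep boundary points only
                if np.any(y[idx[i]] != y[i]):
                    keep[i] = True
    return X[keep], y[keep]

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_red, y_red = reduce_training_set(X, y)
print(len(X_red), "of", len(X), "instances kept")
SVC(kernel="rbf").fit(X_red, y_red)        # train on the reduced set
```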


Author(s): Yuki Takashima, Toru Nakashika, Tetsuya Takiguchi, Yasuo Ariki

Voice conversion (VC) is a technique for converting only the speaker-specific information in the source speech while preserving the associated phonemic information. Non-negative matrix factorization (NMF)-based VC has been widely researched because of the natural-sounding voice it achieves compared with conventional Gaussian mixture model-based VC. In conventional NMF-VC, models are trained on parallel data, which means the speech data require elaborate pre-processing to generate such data. NMF-VC also tends to yield a large model, since the dictionary matrix stores several parallel exemplars, leading to a high computational cost. In this study, an innovative parallel dictionary-learning method using non-negative Tucker decomposition (NTD) is proposed. The proposed method uses tensor decomposition and decomposes an input observation into a set of mode matrices and one core tensor. The proposed NTD-based dictionary-learning method estimates the dictionary matrix for NMF-VC without using parallel data. The experimental results show that the proposed method outperforms other methods in both parallel and non-parallel settings.
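
For illustration, a non-negative Tucker decomposition can be computed with the TensorLy library as sketched below: the tensor is factored into one core tensor and one non-negative mode matrix per mode. The tensor layout and ranks are illustrative assumptions, and the VC-specific dictionary construction of the paper is not reproduced here.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_tucker

# toy non-negative "spectrogram" tensor: (utterances, frequency bins, frames)
rng = np.random.default_rng(0)
X = tl.tensor(rng.random((20, 128, 64)))

# decompose into a core tensor and non-negative mode matrices
core, factors = non_negative_tucker(X, rank=[10, 32, 16], n_iter_max=200)

# the frequency-mode factor (factors[1]) can then act as a spectral dictionary;
# the reconstruction error gives a sanity check on the decomposition
X_hat = tl.tucker_to_tensor((core, factors))
print("relative error:", float(tl.norm(X - X_hat) / tl.norm(X)))
```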


2006, Vol 04 (03), pp. 639-647
Author(s): Eleazar Eskin, Roded Sharan, Eran Halperin

The common approaches for haplotype inference from genotype data are targeted toward phasing short genomic regions; longer regions are often tackled in a heuristic manner because of the high computational cost. Here, we describe a novel approach for phasing genotypes over long regions, which is based on combining information from local predictions on short, overlapping regions. The phasing is done in a way that maximizes a natural maximum likelihood criterion which, among other things, takes into account the physical distance between neighboring single nucleotide polymorphisms. The approach is very efficient; it has been applied to several large-scale datasets and shown to be successful in two recent benchmarking studies (Zaitlen et al., in press; Marchini et al., in preparation). Our method is publicly available via a webserver at .


2004, Vol 32 (6), pp. 595-606
Author(s): David C. Watson, Andrew J. Howell

Dysfunction in personality disorder symptoms was assessed using multivariate techniques to analyse lay judges' (N = 216) ratings of occupational impairment, social impairment, and personal distress. Factor analysis revealed that ratings of occupational impairment and social impairment loaded onto distinct factors. Personal distress ratings loaded onto two separate factors: high distress and low distress. Multidimensional scaling revealed two dimensions for overall dysfunction among personality disorders: severity of dysfunction and internalization-externalization. The dimensions were independence-dependence and severity of dysfunction for occupational impairment, interpersonal involvement and dominance-submission for social impairment, and internalization-externalization and severity for personal distress.


Symmetry, 2021, Vol 13 (3), pp. 511
Author(s): Syed Mohammad Minhaz Hossain, Kaushik Deb, Pranab Kumar Dhar, Takeshi Koshiba

Proper plant leaf disease (PLD) detection is challenging in complex backgrounds and under different capture conditions. For this reason, a modified adaptive centroid-based segmentation (ACS) is first used to trace the proper region of interest (ROI). Automatic initialization of the number of clusters (K) using modified ACS before recognition increases the scalability of ROI tracing, even for symmetrical features in various plants. Convolutional neural network (CNN)-based PLD recognition models achieve adequate accuracy, but their memory requirements (large numbers of parameters) and high computational cost are pressing issues for memory-restricted mobile and IoT-based devices. Therefore, after tracing the ROIs, three proposed depth-wise separable convolutional PLD (DSCPLD) models, namely segmented modified DSCPLD (S-modified MobileNet), segmented reduced DSCPLD (S-reduced MobileNet), and segmented extended DSCPLD (S-extended MobileNet), are used to represent a constructive trade-off among accuracy, model size, and computational latency. Moreover, we compare our proposed DSCPLD recognition models with state-of-the-art models such as MobileNet, VGG16, VGG19, and AlexNet. Among the segmented DSCPLD models, S-modified MobileNet achieves the best accuracy of 99.55% and F1-score of 97.07%. We also evaluate our DSCPLD models on both full and segmented plant leaf images and conclude that, after applying modified ACS, all models improve in accuracy and F1-score. Furthermore, a new plant leaf dataset containing 6580 images of eight plants was used in the experiments with several depth-wise separable convolution models.
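
A minimal PyTorch sketch of the depth-wise separable convolution block underlying MobileNet-style models such as the DSCPLD variants is given below: a depth-wise 3x3 convolution followed by a point-wise 1x1 convolution, which cuts parameters and multiply-adds relative to a standard convolution. The channel sizes are illustrative, and the block is not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # depth-wise: one 3x3 filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # point-wise: 1x1 convolution mixing channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# standard 3x3 conv: 64*128*9 = 73,728 weights;
# separable block:   64*9 + 64*128 = 8,768 weights (plus batch-norm parameters)
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
block = DepthwiseSeparableConv(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print("standard conv params:", count(standard))
print("separable block params:", count(block))
```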


2021, Vol 12 (4), pp. 118-131
Author(s): Jaya Krishna Raguru, Devi Prasad Sharma

The problem of identifying a seed set composed of K nodes that maximizes influence spread over a social network is known as influence maximization (IM). Past works showed this problem to be NP-hard, and greedy algorithms approximate the optimal spread only to within a factor of about 63%. Moreover, the greedy approach is expensive and suffers from performance issues such as high computational cost. Furthermore, in a network with communities, IM spread is not always certain. In this paper, a heterogeneous influence maximization through community detection (HIMCD) algorithm is proposed. This approach selects the initial seed nodes within communities using various centrality measures, and these seed nodes act as sources for influence spread. Parallel influence maximization is then applied with the aid of the seed node set contained in each community: the graph is partitioned and the IM computations are carried out in a distributed manner. Extensive experiments with two real-world datasets reveal that HIMCD achieves a substantial performance improvement over state-of-the-art techniques.
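
For reference, the greedy baseline to which the roughly 63% (1 − 1/e) guarantee refers can be sketched as below, using Monte Carlo estimation of spread under the independent cascade model. The propagation probability, number of simulations, and example graph are illustrative assumptions.

```python
import random
import networkx as nx

def ic_spread(G, seeds, p=0.05, runs=200):
    """Average number of nodes activated by `seeds` under independent cascade."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in G.neighbors(u):
                    if v not in active and random.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

def greedy_im(G, k=5):
    """Greedy seed selection: add the node with the largest marginal spread gain."""
    seeds = []
    for _ in range(k):
        best = max((n for n in G if n not in seeds),
                   key=lambda n: ic_spread(G, seeds + [n]))
        seeds.append(best)
    return seeds

G = nx.karate_club_graph()
print(greedy_im(G, k=3))
```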

