Data Classification in Complex Networks via Pattern Conformation, Data Importance and Structural Optimization

2017 ◽  
Author(s):  
Murillo G. Carneiro ◽  
Liang Zhao

Most data classification techniques rely only on the physical features of the data (e.g., similarity, distance, or distribution), which makes it difficult for them to detect intrinsic and semantic relations among data items, such as pattern formation. This thesis proposes classification methods based on complex networks that consider not only physical features but also capture structural and dynamical properties of the data through the network representation. The proposed methods draw on concepts of pattern conformation, data importance, and network structural optimization, which are related to complex network theory, learning systems, and bio-inspired optimization. Extensive experiments demonstrate the good performance of our methods compared against representative state-of-the-art methods over a wide range of artificial and real data sets, including applications in domains such as heart disease diagnosis and semantic role labeling.

2018 ◽  
Author(s):  
Adrian Fritz ◽  
Peter Hofmann ◽  
Stephan Majda ◽  
Eik Dahms ◽  
Johannes Dröge ◽  
...  

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second and third generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM


2019 ◽  
Vol 11 (11) ◽  
pp. 1288 ◽  
Author(s):  
Hossein Aghababaee ◽  
Giampaolo Ferraioli ◽  
Laurent Ferro-Famil ◽  
Gilda Schirinzi ◽  
Yue Huang

In the frame of polarimetric synthetic aperture radar (SAR) tomography, the full-rank reconstruction framework has been recognized as a significant technique for fully characterizing superimposed scatterers in a resolution cell. The technique, chiefly characterized by its ability to reconstruct polarimetric scattering patterns, allows physical feature extraction of the scatterers. In this paper, to overcome the limitations of conventional full-rank tomographic techniques in natural environments, a polarimetric estimator with the advantages of super-resolution imaging is proposed. Within the framework of compressive sensing (CS) and sparsity-based reconstruction, the profile of the second-order polarimetric coherence matrix T is recovered. Once the polarimetric coherence matrices of the scatterers are available, the physical features can be extracted using classical polarimetric processing techniques. The objective of this study is to evaluate the performance of the proposed full-rank polarimetric reconstruction by means of a conventional three-component decomposition of T, focusing on the consistency of the vertical resolution and the polarimetric scattering patterns of the scatterers. The outcomes from simulated and two different real data sets confirm that significant improvement can be achieved in reconstruction quality with respect to conventional approaches.


Author(s):  
Т.В. Речкалов ◽  
М.Л. Цымблер

The PAM (Partitioning Around Medoids) algorithm is a partitioning clustering algorithm in which only objects from the input data set (called medoids) are chosen as cluster centers. Medoid-based clustering is used in a wide range of applications: the segmentation of medical and satellite images, the analysis of DNA microarrays and texts, etc. Currently, there are parallel implementations of PAM for GPU and FPGA systems, but none for Intel Many Integrated Core (MIC) accelerators. In this paper, we propose a novel parallel clustering algorithm, PhiPAM, for Intel MIC systems. Computations are parallelized with the OpenMP technology. The algorithm exploits a specialized memory data layout and loop tiling technique, which allow computations to be vectorized efficiently on Intel MIC systems. Experiments performed on real data sets show good scalability of the algorithm.
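For reference, the sequential PAM idea that the paper parallelizes can be sketched as follows (a naive Python version with greedy medoid swaps; PhiPAM's memory layout, tiling, and OpenMP parallelism are not reflected here, and all names are illustrative):

```python
import numpy as np

def pam(points, k, n_iter=100, seed=0):
    """Naive PAM: start from random medoids and greedily swap a medoid
    for a non-medoid whenever the swap lowers the total assignment cost."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Full pairwise distance matrix (the expensive part PhiPAM vectorizes).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)
    cost = dist[:, medoids].min(axis=1).sum()
    for _ in range(n_iter):
        improved = False
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand
                trial_cost = dist[:, trial].min(axis=1).sum()
                if trial_cost < cost:
                    medoids, cost = trial, trial_cost
                    improved = True
        if not improved:
            break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, cost
```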


2015 ◽  
Vol 26 (4) ◽  
pp. 1867-1880
Author(s):  
Ilmari Ahonen ◽  
Denis Larocque ◽  
Jaakko Nevalainen

Outlier detection covers a wide range of methods that aim to identify observations considered unusual. Novelty detection, on the other hand, seeks observations among newly generated test data that are exceptional compared with previously observed training data. In many applications, the general existence of novelty is of more interest than identifying the individual novel observations. For instance, in high-throughput cancer treatment screening experiments, it is meaningful to test whether any new treatment effects are seen compared with existing compounds. Here, we present hypothesis tests for such global-level novelty. The problem is approached through a set of very general assumptions, making it innovative in relation to the current literature. We introduce test statistics capable of detecting novelty. They operate on local neighborhoods, and their null distribution is obtained by the permutation principle. We show that they are valid and able to find different types of novelty, e.g., location and scale alternatives. The performance of the methods is assessed with simulations and with applications to real data sets.
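The permutation principle behind such tests can be sketched as follows (a Python illustration with a simple nearest-neighbour statistic; the statistic and neighbourhood choice are stand-ins, not the authors' exact test statistics):

```python
import numpy as np

def novelty_pvalue(train, test, n_perm=500, seed=0):
    """Global novelty test: the statistic is the mean nearest-neighbour
    distance from test points to the training set; the null distribution
    is obtained by permuting the train/test labels on the pooled sample."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([train, test])
    n_tr = len(train)

    def stat(tr, te):
        d = np.linalg.norm(te[:, None, :] - tr[None, :, :], axis=-1)
        return d.min(axis=1).mean()

    observed = stat(train, test)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        tr, te = pooled[idx[:n_tr]], pooled[idx[n_tr:]]
        if stat(tr, te) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

Under the null (no novelty), the train/test split is exchangeable, which is what justifies the permutation scheme.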


2020 ◽  
Vol 8 (3) ◽  
Author(s):  
Mehdi Djellabi ◽  
Bertrand Jouve ◽  
Frédéric Amblard

Abstract The different approaches developed to analyse the structure of complex networks have generated a large number of studies. In the field of social networks at least, studies mainly address the detection and analysis of communities. In this article, we challenge these approaches and focus on nodes whose meaningful local interactions can reveal the internal organization of communities or the way communities are assembled. We propose an algorithm, ItRich, to identify this type of node, based on the decomposition of a graph into successively less dense layers. Our method is tested on synthetic and real data sets and complements other methods such as community detection or $k$-core decomposition.
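The layer idea is closely related to the $k$-core decomposition the abstract mentions, which can be computed by iterative peeling (a minimal Python sketch, not the ItRich algorithm itself; the adjacency-dict encoding is illustrative):

```python
def core_numbers(adj):
    """Compute k-core numbers by repeatedly removing a minimum-degree node.
    adj maps each node to a list of its neighbours (undirected graph)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        v = min(remaining, key=lambda u: deg[u])
        k = max(k, deg[v])      # core number never decreases during peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining and deg[u] > deg[v]:
                deg[u] -= 1
    return core
```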


Atmosphere ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 644
Author(s):  
Jingyan Huang ◽  
Michael Kwok Po Ng ◽  
Pak Wai Chan

The main aim of this paper is to propose a statistical indicator for wind shear prediction from Light Detection and Ranging (LIDAR) observational data. Accurate wind shear warning signals are particularly important for aviation safety. The main challenges are that wind shear may result from a sustained change of the headwind and that the velocity of wind shear may span a wide range. Traditionally, aviation models based on terrain-induced settings are used to detect wind shear phenomena. Unlike traditional methods, we study a statistical indicator that measures the variation of headwinds across multiple headwind profiles. Because the indicator value is nonnegative, a decision rule based on a one-sided normal distribution is employed to distinguish wind shear cases from non-wind shear cases. Experimental results based on real data sets obtained at the Hong Kong International Airport runway demonstrate that the proposed indicator is quite effective. The prediction performance of the proposed method is better than that of supervised learning methods (LDA, KNN, SVM, and logistic regression). This model could also provide more accurate wind shear warnings for pilots and improve the performance of the Windshear and Turbulence Warning System.
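A one-sided threshold rule of this kind can be sketched as follows (a hypothetical Python illustration using the normal upper quantile as the alert threshold; the mean, standard deviation, and significance level are placeholders, not the paper's fitted values):

```python
from statistics import NormalDist

def wind_shear_alert(indicator, mu0, sigma0, alpha=0.05):
    """One-sided decision rule: raise an alert when the nonnegative
    variation indicator exceeds the (1 - alpha) quantile of the
    indicator's null (non-wind-shear) normal model N(mu0, sigma0^2)."""
    threshold = NormalDist(mu0, sigma0).inv_cdf(1 - alpha)
    return indicator > threshold
```

With alpha = 0.05 and a standard-normal null model, the threshold is roughly 1.645, so only unusually large indicator values trigger an alert.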


2021 ◽  
Vol 9 (1) ◽  
pp. 62-81
Author(s):  
Kjersti Aas ◽  
Thomas Nagler ◽  
Martin Jullum ◽  
Anders Løland

Abstract In this paper, the goal is to explain predictions from complex machine learning models. One method that has become very popular in recent years is Shapley values. The original development of Shapley values for prediction explanation relied on the assumption that the features were independent. If the features are in fact dependent, this may lead to incorrect explanations. Hence, there have recently been attempts to appropriately model and estimate the dependence between the features. Although the previously proposed methods clearly outperform the traditional approach of assuming independence, they have their weaknesses. In this paper, we propose two new approaches for modelling the dependence between the features. Both are based on vine copulas, flexible tools for modelling multivariate non-Gaussian distributions that can characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.
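As a reference point, exact Shapley values can be computed by enumerating all coalitions when the number of features is tiny (a Python sketch of the classical formula; the value function here is illustrative, and practical explainers, including those discussed above, approximate this computation):

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values. value_fn maps a frozenset of feature indices
    to the model's expected prediction when only those features are known.
    Each feature's value is its weighted average marginal contribution."""
    phi = [0.0] * n_features
    players = range(n_features)
    for i in players:
        for r in range(n_features):
            for S in combinations([j for j in players if j != i], r):
                S = frozenset(S)
                # Classical coalition weight |S|! (n - |S| - 1)! / n!
                w = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                     / factorial(n_features))
                phi[i] += w * (value_fn(S | {i}) - value_fn(S))
    return phi
```

For an additive value function, each feature's Shapley value reduces to its own contribution, which makes a handy sanity check.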


2020 ◽  
Vol 36 (1) ◽  
pp. 25-48
Author(s):  
Kiranmoy Chatterjee ◽  
Diganta Mukherjee

Abstract With the possibility of dependence between the sources in a capture-recapture type experiment, identifying the direction of such dependence in a dual system of data collection is vital. This has a wide range of applications, including in the domains of public health, official statistics, and the social sciences. Owing to the insufficiency of data for analyzing a behavioral dependence model in a dual system, our contribution lies in the construction of several strategies that can identify the direction of the underlying dependence between the two lists in the dual system, that is, whether the two lists are positively or negatively dependent. Our proposed classification strategies are quite appealing for improving inference, as evident from the recent literature. Simulation studies are carried out to explore the comparative performance of the proposed strategies. Finally, applications to three real data sets from various fields are illustrated.


2019 ◽  
pp. 58-66
Author(s):  
Máté Nagy ◽  
János Tapolcai ◽  
Gábor Rétvári

Opportunistic data structures are used extensively in big data practice to break down the massive storage space requirements of processing large volumes of information. A data structure is called (singly) opportunistic if it takes advantage of the redundancy in the input in order to store it in information-theoretically minimum space. Yet, efficient data processing requires a separate index alongside the data, whose size often substantially exceeds that of the compressed information. In this paper, we introduce doubly opportunistic data structures that attain the best possible compression not only on the input data but also on the index. We present R3D3, which encodes a bitvector of length n and Shannon entropy H0 to nH0 bits and the accompanying index to nH0(1/2 + O(log C/C)) bits, thus attaining provably minimum space (up to small error terms) on both the data and the index, and supports a rich set of queries to arbitrary positions in the compressed bitvector in O(C) time when C = o(log n). Our R3D3 prototype attains a severalfold space reduction beyond known compression techniques on a wide range of synthetic and real data sets, while supporting operations on the compressed data at comparable speed.
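The nH0 space bound is stated against the empirical zero-order entropy of the bitvector, which is easy to compute directly (a small Python sketch of H0 alone; R3D3's encoding and index are not shown):

```python
from math import log2

def zero_order_entropy(bits):
    """Empirical zero-order Shannon entropy H0 (bits per symbol) of a
    bitvector: H0 = -(p log2 p + (1-p) log2 (1-p)) for the fraction p
    of one-bits. A bitvector of length n then needs about n*H0 bits."""
    n = len(bits)
    ones = sum(bits)
    if ones in (0, n):          # constant vector carries no information
        return 0.0
    p = ones / n
    return -(p * log2(p) + (1 - p) * log2(1 - p))
```

A balanced bitvector has H0 = 1 (incompressible at order zero), while a sparse one has H0 close to 0, which is where opportunistic structures win.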


2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Dawei Zhang ◽  
Fuding Xie ◽  
Dapeng Wang ◽  
Yong Zhang ◽  
Yan Sun

Clustering data has a wide range of applications and has attracted considerable attention in data mining and artificial intelligence. However, it is difficult to find the set of clusters that best fits natural partitions without any class information. In this paper, a method for detecting the optimal cluster number is proposed. The optimal cluster number can be obtained by the proposed method while partitioning the data into clusters with the FCM (fuzzy c-means) algorithm. This overcomes the drawback of the FCM algorithm, which requires the cluster number c to be defined in advance. The method works by converting the fuzzy clustering result into a weighted bipartite network; the optimal cluster number can then be detected by an improved bipartite modularity. Experimental results on artificial and real data sets show the validity of the proposed method.
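The FCM step the method builds on can be sketched as follows (a minimal fuzzy c-means in Python with NumPy; the bipartite-modularity selection of c is not shown, and the parameter names are illustrative):

```python
import numpy as np

def fcm(points, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means: alternate between updating cluster centers
    (membership-weighted means) and fuzzy memberships until convergence."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(points), c))
    u /= u.sum(axis=1, keepdims=True)          # each row is a fuzzy membership
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ points) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        d = np.fmax(d, 1e-12)                  # guard against division by zero
        inv = d ** (-2.0 / (m - 1.0))          # standard membership update
        new_u = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(new_u - u).max() < tol:
            u = new_u
            break
        u = new_u
    return centers, u
```

Hardening memberships with `u.argmax(axis=1)` gives a crisp partition; the paper's contribution is choosing c by converting the fuzzy result into a weighted bipartite network and maximizing an improved bipartite modularity.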

