Global tests for novelty

2015 ◽  
Vol 26 (4) ◽  
pp. 1867-1880
Author(s):  
Ilmari Ahonen ◽  
Denis Larocque ◽  
Jaakko Nevalainen

Outlier detection covers the wide range of methods aiming at identifying observations that are considered unusual. Novelty detection, on the other hand, seeks observations among newly generated test data that are exceptional compared with previously observed training data. In many applications, the general existence of novelty is of more interest than identifying the individual novel observations. For instance, in high-throughput cancer treatment screening experiments, it is meaningful to test whether any new treatment effects are seen compared with existing compounds. Here, we present hypothesis tests for such global level novelty. The problem is approached through a set of very general assumptions, making it innovative in relation to the current literature. We introduce test statistics capable of detecting novelty. They operate on local neighborhoods and their null distribution is obtained by the permutation principle. We show that they are valid and able to find different types of novelty, e.g. location and scale alternatives. The performance of the methods is assessed with simulations and with applications to real data sets.


2018 ◽  
Author(s):  
Adrian Fritz ◽  
Peter Hofmann ◽  
Stephan Majda ◽  
Eik Dahms ◽  
Johannes Dröge ◽  
...  

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second and third generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMSIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM



2015 ◽  
Vol 14 (03) ◽  
pp. 521-533
Author(s):  
M. Sariyar ◽  
A. Borg

Deterministic record linkage (RL) is frequently regarded as a rival to more sophisticated strategies like probabilistic RL. We investigate the effect of combining deterministic linkage with other linkage techniques. For this task, we use a simple deterministic linkage strategy as a preceding filter: a data pair is classified as ‘match' if all values of attributes considered agree exactly, otherwise as ‘nonmatch'. This strategy is separately combined with two probabilistic RL methods based on the Fellegi–Sunter model and with two classification tree methods (CART and Bagging). An empirical comparison was conducted on two real data sets. We used four different partitions into training data and test data to increase the validity of the results. In almost all cases, application of deterministic linkage as a preceding filter leads to better results compared to the omission of such a pre-filter, and overall classification trees exhibited best results. On all data sets, probabilistic RL only profited from deterministic linkage when the underlying probabilities were estimated before applying deterministic linkage. When using a pre-filter for subtracting definite cases, the underlying population of data pairs changes. It is crucial to take this into account for model-based probabilistic RL.



Aviation ◽  
2013 ◽  
Vol 17 (2) ◽  
pp. 52-56 ◽  
Author(s):  
Mykola Kulyk ◽  
Sergiy Dmitriev ◽  
Oleksandr Yakushenko ◽  
Oleksandr Popov

A method of obtaining test and training data sets has been developed. These sets are intended for training a static neural network to recognise individual and double defects in the air-gas path units of a gas-turbine engine. These data are obtained by using operational process parameters of the air-gas path of a bypass turbofan engine. The method allows sets that can project some changes in the technical conditions of a gas-turbine engine to be received, taking into account errors that occur in the measurement of the gas-dynamic parameters of the air-gas path. The operation of the engine in a wide range of modes should also be taken into account.



2020 ◽  
Author(s):  
Yarden Cohen ◽  
David Nicholson ◽  
Alexa Sanchioni ◽  
Emily K. Mallaber ◽  
Viktoriya Skidanova ◽  
...  

AbstractSongbirds have long been studied as a model system of sensory-motor learning. Many analyses of birdsong require time-consuming manual annotation of the individual elements of song, known as syllables or notes. Here we describe the first automated algorithm for birdsong annotation that is applicable to complex song such as canary song. We developed a neural network architecture, “TweetyNet”, that is trained with a small amount of hand-labeled data using supervised learning methods. We first show TweetyNet achieves significantly lower error on Bengalese finch song than a similar method, using less training data, and maintains low error rates across days. Applied to canary song, TweetyNet achieves fully automated annotation of canary song, accurately capturing the complex statistical structure previously discovered in a manually annotated dataset. We conclude that TweetyNet will make it possible to ask a wide range of new questions focused on complex songs where manual annotation was impractical.



2020 ◽  
Vol 36 (4) ◽  
pp. 803-825
Author(s):  
Marco Fortini

AbstractRecord linkage addresses the problem of identifying pairs of records coming from different sources and referred to the same unit of interest. Fellegi and Sunter propose an optimal statistical test in order to assign the match status to the candidate pairs, in which the needed parameters are obtained through EM algorithm directly applied to the set of candidate pairs, without recourse to training data. However, this procedure has a quadratic complexity as the two lists to be matched grow. In addition, a large bias of EM-estimated parameters is also produced in this case, so that the problem is tackled by reducing the set of candidate pairs through filtering methods such as blocking. Unfortunately, the probability that excluded pairs would be actually true-matches cannot be assessed through such methods.The present work proposes an efficient approach in which the comparison of records between lists are minimised while the EM estimates are modified by modelling tables with structural zeros in order to obtain unbiased estimates of the parameters. Improvement achieved by the suggested method is shown by means of simulations and an application based on real data.



Author(s):  
LEV V. UTKIN

A new approach for ensemble construction based on restricting a set of weights of examples in training data to avoid overfitting is proposed in the paper. The algorithm called EPIBoost (Extreme Points Imprecise Boost) applies imprecise statistical models to restrict the set of weights. The updating of the weights within the restricted set is carried out by using its extreme points. The approach allows us to construct various algorithms by applying different imprecise statistical models for producing the restricted set. It is shown by various numerical experiments with real data sets that the EPIBoost algorithm may outperform the standard AdaBoost for some parameters of imprecise statistical models.



Author(s):  
Т.В. Речкалов ◽  
М.Л. Цымблер

Алгоритм PAM (Partitioning Around Medoids) представляет собой разделительный алгоритм кластеризации, в котором в качестве центров кластеров выбираются только кластеризуемые объекты (медоиды). Кластеризация на основе техники медоидов применяется в широком спектре приложений: сегментирование медицинских и спутниковых изображений, анализ ДНК-микрочипов и текстов и др. На сегодня имеются параллельные реализации PAM для систем GPU и FPGA, но отсутствуют таковые для многоядерных ускорителей архитектуры Intel Many Integrated Core (MIC). В настоящей статье предлагается новый параллельный алгоритм кластеризации PhiPAM для ускорителей Intel MIC. Вычисления распараллеливаются с помощью технологии OpenMP. Алгоритм предполагает использование специализированной компоновки данных в памяти и техники тайлинга, позволяющих эффективно векторизовать вычисления на системах Intel MIC. Эксперименты, проведенные на реальных наборах данных, показали хорошую масштабируемость алгоритма. The PAM (Partitioning Around Medoids) is a partitioning clustering algorithm where each cluster is represented by an object from the input dataset (called a medoid). The medoid-based clustering is used in a wide range of applications: the segmentation of medical and satellite images, the analysis of DNA microarrays and texts, etc. Currently, there are parallel implementations of PAM for GPU and FPGA systems, but not for Intel Many Integrated Core (MIC) accelerators. In this paper, we propose a novel parallel PhiPAM clustering algorithm for Intel MIC systems. Computations are parallelized by the OpenMP technology. The algorithm exploits a sophisticated memory data layout and loop tiling technique, which allows one to efficiently vectorize computations with Intel MIC. Experiments performed on real data sets show a good scalability of the algorithm.



2017 ◽  
Author(s):  
S. Hong Lee ◽  
Sam Clark ◽  
Julius H.J. van der Werf

ABSTRACTGenomic prediction is emerging in a wide range of fields including animal and plant breeding, risk prediction in human precision medicine and forensic. It is desirable to establish a theoretical framework for genomic prediction accuracy when the reference data consists of information sources with varying degrees of relationship to the target individuals. A reference set can contain both close and distant relatives as well as ‘unrelated’ individuals from the wider population in the genomic prediction. The various sources of information were modeled as different populations with different effective population sizes (Ne). Both the effective number of chromosome segments (Me) and Ne are considered to be a function of the data used for prediction. We validate our theory with analyses of simulated as well as real data, and illustrate that the variation in genomic relationships with the target is a predictor of the information content of the reference set. With a similar amount of data available for each source, we show that close relatives can have a substantially larger effect on genomic prediction accuracy than lesser related individuals. We also illustrate that when prediction relies on closer relatives, there is less improvement in prediction accuracy with an increase in training data or marker panel density. We release software that can estimate the expected prediction accuracy and power when combining different reference sources with various degrees of relationship to the target, which is useful when planning genomic prediction (before or after collecting data) in animal, plant and human genetics.



Atmosphere ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 644
Author(s):  
Jingyan Huang ◽  
Michael Kwok Po Ng ◽  
Pak Wai Chan

The main aim of this paper is to propose a statistical indicator for wind shear prediction from Light Detection and Ranging (LIDAR) observational data. Accurate warning signal of wind shear is particularly important for aviation safety. The main challenges are that wind shear may result from a sustained change of the headwind and the possible velocity of wind shear may have a wide range. Traditionally, aviation models based on terrain-induced setting are used to detect wind shear phenomena. Different from traditional methods, we study a statistical indicator which is used to measure the variation of headwinds from multiple headwind profiles. Because the indicator value is nonnegative, a decision rule based on one-side normal distribution is employed to distinguish wind shear cases and non-wind shear cases. Experimental results based on real data sets obtained at Hong Kong International Airport runway are presented to demonstrate that the proposed indicator is quite effective. The prediction performance of the proposed method is better than that by the supervised learning methods (LDA, KNN, SVM, and logistic regression). This model would also provide more accurate warnings of wind shear for pilots and improve the performance of Wind shear and Turbulence Warning System.



2021 ◽  
Vol 9 (1) ◽  
pp. 62-81
Author(s):  
Kjersti Aas ◽  
Thomas Nagler ◽  
Martin Jullum ◽  
Anders Løland

Abstract In this paper the goal is to explain predictions from complex machine learning models. One method that has become very popular during the last few years is Shapley values. The original development of Shapley values for prediction explanation relied on the assumption that the features being described were independent. If the features in reality are dependent this may lead to incorrect explanations. Hence, there have recently been attempts of appropriately modelling/estimating the dependence between the features. Although the previously proposed methods clearly outperform the traditional approach assuming independence, they have their weaknesses. In this paper we propose two new approaches for modelling the dependence between the features. Both approaches are based on vine copulas, which are flexible tools for modelling multivariate non-Gaussian distributions able to characterise a wide range of complex dependencies. The performance of the proposed methods is evaluated on simulated data sets and a real data set. The experiments demonstrate that the vine copula approaches give more accurate approximations to the true Shapley values than their competitors.



Sign in / Sign up

Export Citation Format

Share Document