Transient-optimised real-bogus classification with Bayesian Convolutional Neural Networks — sifting the GOTO candidate stream

Large-scale sky surveys have played a transformative role in our understanding of astrophysical transients, only made possible by increasingly powerful machine learning-based filtering to accurately sift through the vast quantities of incoming data generated. In this paper, we present a new real-bogus classifier based on a Bayesian convolutional neural network that provides nuanced, uncertainty-aware classification of transient candidates in difference imaging, and demonstrate its application to the datastream from the GOTO wide-field optical survey. Not only are candidates assigned a well-calibrated probability of being real, but also an associated confidence that can be used to prioritise human vetting efforts and inform future model optimisation via active learning. To fully realise the potential of this architecture, we present a fully-automated training set generation method which requires no human labelling, incorporating a novel data-driven augmentation method to significantly improve the recovery of faint and nuclear transient sources. We achieve competitive classification accuracy (FPR and FNR both below 1%) compared against classifiers trained with fully human-labelled datasets, whilst being significantly quicker and less labour-intensive to build. This data-driven approach is uniquely scalable to the upcoming challenges and data needs of next-generation transient surveys. We make our data generation and model training codes available to the community.
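
The idea of pairing a real-bogus probability with a confidence can be illustrated with a minimal sketch. It assumes the Bayesian network is sampled through repeated stochastic forward passes, and it uses one minus the binary entropy of the mean prediction as the confidence. Both choices, and the function names, are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def summarise_posterior_samples(prob_samples):
    """Summarise repeated stochastic predictions of P(real).

    prob_samples: array of shape (n_samples, n_candidates), one row per
    stochastic forward pass (hypothetical input format).
    Returns the mean probability and a confidence in [0, 1].
    """
    p = np.asarray(prob_samples, dtype=float)
    mean = p.mean(axis=0)
    eps = 1e-12  # avoids log(0) for fully certain predictions
    # Binary entropy (in bits) of the mean prediction: 1 at p=0.5, 0 at p=0 or 1.
    entropy = -(mean * np.log2(mean + eps) + (1 - mean) * np.log2(1 - mean + eps))
    confidence = 1.0 - entropy  # assumed definition, for illustration only
    return mean, confidence


def vetting_order(confidence):
    """Indices of candidates sorted so the least confident come first."""
    return np.argsort(confidence)


samples = np.array([[0.99, 0.5], [0.98, 0.6], [0.99, 0.4]])
mean, confidence = summarise_posterior_samples(samples)
order = vetting_order(confidence)
```

Sorting by ascending confidence puts the most ambiguous candidates at the front of a human vetting queue, which is also a natural selection rule for active learning.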

Accelerating In-Transit Co-Processing for Scientific Simulations Using Region-Based Data-Driven Analysis

Algorithms ◽

10.3390/a14050154 ◽

2021 ◽

Vol 14 (5) ◽

pp. 154

Author(s):

Marcus Walldén ◽

Masao Okita ◽

Fumihiko Ino ◽

Dimitris Drikakis ◽

Ioannis Kokkinakis

Keyword(s):

Large Scale ◽

Data Driven ◽

Data Sets ◽

Output Constraints ◽

Data Driven Approach ◽

Scientific Simulations ◽

Multiple Metrics ◽

In Transit ◽

Multiple Compression ◽

Large Scale Simulations

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can quickly identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of 1.29× in a lossless scenario. The data decompression time was sped up by 2× compared to using a single compression method uniformly.
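
As a rough illustration of choosing a compression method per region from an importance score, the sketch below splits a 1D array into fixed-size regions, scores each by its standard deviation, and picks between two lossless standard-library compressors. The metric, the threshold, and the mapping from importance to compressor are all assumptions for illustration, not the metrics or methods used in the paper.

```python
import lzma
import zlib

import numpy as np


def region_importance(region):
    """Hypothetical importance metric: variability of the values in a region."""
    return float(np.std(region))


def compress_regions(data, region_size, threshold):
    """Compress each region with a method chosen from its importance score."""
    chunks = []
    for start in range(0, data.size, region_size):
        raw = data[start:start + region_size].tobytes()
        if region_importance(data[start:start + region_size]) >= threshold:
            # Important region: fast, light compression (assumed policy).
            chunks.append(("zlib", zlib.compress(raw, 1)))
        else:
            # Unimportant region: slower compressor with a higher ratio.
            chunks.append(("lzma", lzma.compress(raw)))
    return chunks


def decompress_regions(chunks, dtype):
    """Invert compress_regions and reassemble the original array."""
    parts = []
    for method, payload in chunks:
        raw = zlib.decompress(payload) if method == "zlib" else lzma.decompress(payload)
        parts.append(np.frombuffer(raw, dtype=dtype))
    return np.concatenate(parts)


data = np.concatenate([np.zeros(64), np.linspace(0.0, 1.0, 64)])
chunks = compress_regions(data, region_size=64, threshold=0.1)
restored = decompress_regions(chunks, data.dtype)
```

Because both compressors are lossless, the round trip reproduces the input exactly; only the time and size trade-off differs between regions.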

Improving the management of type 2 diabetes through large-scale general practice: the role of a data-driven and technology-enabled education programme

BMJ Open Quality ◽

10.1136/bmjoq-2020-001087 ◽

2021 ◽

Vol 10 (1) ◽

pp. e001087

Author(s):

Tarek F Radwan ◽

Yvette Agyako ◽

Alireza Ettefaghian ◽

Tahira Kamran ◽

Omar Din ◽

...

Keyword(s):

Type 2 Diabetes ◽

Primary Care ◽

Large Scale ◽

Education Programme ◽

Educational Programme ◽

Data Driven ◽

Treatment Targets ◽

Care Processes ◽

Data Driven Approach

A quality improvement (QI) scheme was launched in 2017, covering a large group of 25 general practices working with a deprived registered population. The aim was to improve the measurable quality of care in a population where type 2 diabetes (T2D) care had previously proved challenging. A complex set of QI interventions was co-designed by a team of primary care clinicians, educationalists and managers. These interventions included organisation-wide goal setting, using a data-driven approach, ensuring staff engagement, implementing an educational programme for pharmacists, facilitating web-based QI learning at scale and using methods which ensured sustainability. This programme was used to optimise the management of T2D through improving the eight care processes and three treatment targets which form part of the annual national diabetes audit for patients with T2D. With the implemented improvement interventions, there was significant improvement in all care processes and all treatment targets for patients with diabetes. Achievement of all eight care processes improved by 46.0% (p<0.001) while achievement of all three treatment targets improved by 13.5% (p<0.001). The QI programme provides an example of a data-driven large-scale multicomponent intervention delivered in primary care in ethnically diverse and socially deprived areas.

RAFFI: Accurate and fast familial relationship inference in large scale biobank studies using RaPID

PLoS Genetics ◽

10.1371/journal.pgen.1009315 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1009315

Author(s):

Ardalan Naseri ◽

Junjie Shi ◽

Xihong Lin ◽

Shaojie Zhang ◽

Degui Zhi

Keyword(s):

Large Scale ◽

Association Studies ◽

Scale Up ◽

Data Driven ◽

Genome Wide Association Studies ◽

Inference Method ◽

Genome Wide ◽

Familial Relationship ◽

Kinship Coefficients ◽

Data Driven Approach

Inference of relationships from whole-genome genetic data of a cohort is a crucial prerequisite for genome-wide association studies. Typically, relationships are inferred by computing the kinship coefficients (ϕ) and the genome-wide probability of zero IBD sharing (π0) among all pairs of individuals. Current leading methods are based on pairwise comparisons, which may not scale up to very large cohorts (e.g., sample size >1 million). Here, we propose an efficient relationship inference method, RAFFI. RAFFI leverages the efficient RaPID method to call IBD segments first, then estimates ϕ and π0 from the detected IBD segments. This inference is achieved by a data-driven approach that adjusts the estimation based on phasing quality and genotyping quality. Using simulations, we showed that RAFFI is robust against phasing/genotyping errors, admix events, and varying marker densities, and achieves higher accuracy compared to KING, the current leading method, especially for more distant relatives. When applied to the phased UK Biobank data with ~500K individuals, RAFFI is approximately 18 times faster than KING. We expect RAFFI will offer fast and accurate relatedness inference for even larger cohorts.
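
To make the link between IBD segments and the two statistics concrete, here is a minimal sketch using the textbook relation ϕ = k1/4 + k2/2 and π0 = 1 − k1 − k2, where k1 and k2 are the fractions of the genome shared IBD1 and IBD2. The input format and function name are hypothetical, and the sketch omits the phasing- and genotyping-quality adjustments that the abstract describes.

```python
def kinship_from_ibd(segments, genome_length):
    """Estimate (phi, pi0) for one pair of individuals.

    segments: iterable of (length, state) tuples, where state is 1 for
    IBD1 and 2 for IBD2, with lengths in the same unit as genome_length
    (hypothetical input format).
    """
    ibd1 = sum(length for length, state in segments if state == 1)
    ibd2 = sum(length for length, state in segments if state == 2)
    k1 = ibd1 / genome_length  # fraction of genome shared IBD1
    k2 = ibd2 / genome_length  # fraction of genome shared IBD2
    phi = k1 / 4 + k2 / 2      # kinship coefficient
    pi0 = 1 - k1 - k2          # probability of zero IBD sharing
    return phi, pi0


# Idealised full siblings: half the genome IBD1, a quarter IBD2.
phi, pi0 = kinship_from_ibd([(50.0, 1), (25.0, 2)], genome_length=100.0)
```

For the idealised sibling pair this gives ϕ = 0.25 and π0 = 0.25, the expected values for full siblings.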

Classification of metal ions according to their complexing properties: a data-driven approach

Analytica Chimica Acta ◽

10.1016/s0003-2670(01)01571-9 ◽

2002 ◽

Vol 455 (1) ◽

pp. 131-142 ◽

Cited By ~ 19

Author(s):

Igor V. Pletnev ◽

Vladimir V. Zernov

Keyword(s):

Metal Ions ◽

Data Driven ◽

Data Driven Approach ◽

Complexing Properties

A study of English blends: From structure to meaning and back again

WORD Structure ◽

10.3366/word.2014.0055 ◽

2014 ◽

Vol 7 (1) ◽

pp. 29-54 ◽

Cited By ~ 10

Author(s):

Natalia Beliaeva

Keyword(s):

Data Driven ◽

Morphological Classification ◽

Multifactorial Analysis ◽

Structural Differences ◽

Data Driven Approach ◽

Semantic Properties

This article presents an approach to the resolution of the much discussed problem of morphological classification of blend words and their distinction from such neighbouring morphological categories as clipping compounds. The research focuses on novel coinages and takes a data-driven approach to study the interaction between the form and the meaning of blends/clipping compounds. A multifactorial analysis of formal and semantic properties of these words is undertaken, as a result of which phonological and structural differences between blends and clipping compounds are explained using formal and semantic factors.

Inferring dynamic topology for decoding spatiotemporal structures in complex heterogeneous networks

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1721286115 ◽

2018 ◽

Vol 115 (37) ◽

pp. 9300-9305 ◽

Cited By ~ 10

Author(s):

Shuo Wang ◽

Erik D. Herzog ◽

István Z. Kiss ◽

William J. Schwartz ◽

Guy Bloch ◽

...

Keyword(s):

Network Topology ◽

Large Scale ◽

Data Driven ◽

Electrical Networks ◽

Chemical Oscillators ◽

Complex Interactions ◽

Dynamic Topology ◽

Estimation Problems ◽

Data Driven Approach ◽

Fast Chemical

Extracting complex interactions (i.e., dynamic topologies) has been an essential, but difficult, step toward understanding large, complex, and diverse systems including biological, financial, and electrical networks. However, reliable and efficient methods for the recovery or estimation of network topology remain a challenge due to the tremendous scale of emerging systems (e.g., brain and social networks) and the inherent nonlinearity within and between individual units. We develop a unified, data-driven approach to efficiently infer connections of networks (ICON). We apply ICON to determine topology of networks of oscillators with different periodicities, degree nodes, coupling functions, and time scales, arising in silico, and in electrochemistry, neuronal networks, and groups of mice. This method enables the formulation of these large-scale, nonlinear estimation problems as a linear inverse problem that can be solved using parallel computing. Working with data from networks, ICON is robust and versatile enough to reliably reveal full and partial resonance among fast chemical oscillators, coherent circadian rhythms among hundreds of cells, and functional connectivity mediating social synchronization of circadian rhythmicity among mice over weeks.
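
The step of recasting topology inference as a linear inverse problem can be sketched in a deliberately simplified setting. The example below assumes purely linear coupling, dx/dt = A x, so the connectivity matrix A follows from an ordinary least-squares fit of observed derivatives against observed states. The linear model and the function name are assumptions for illustration; ICON itself handles nonlinear coupling functions, which this sketch does not attempt.

```python
import numpy as np


def infer_coupling(states, derivatives):
    """Least-squares estimate of A in derivatives ≈ states @ A.T.

    states, derivatives: arrays of shape (n_times, n_nodes).
    Entry A[i, j] is the estimated influence of node j on node i.
    """
    coeffs, *_ = np.linalg.lstsq(states, derivatives, rcond=None)
    return coeffs.T


rng = np.random.default_rng(0)
a_true = np.array([[0.0, 1.0, 0.0],
                   [-1.0, 0.0, 0.5],
                   [0.0, 0.0, -0.2]])
states = rng.normal(size=(50, 3))
derivatives = states @ a_true.T  # noise-free synthetic derivatives
a_est = infer_coupling(states, derivatives)
```

Each column of the least-squares problem can be solved independently, one node at a time, which is what makes this kind of formulation amenable to parallel computing.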

Screening performance of abbreviated versions of the UPSIT smell test

10.1101/443127 ◽

2018 ◽

Author(s):

Theresita Joseph ◽

Stephen D. Auger ◽

Luisa Peress ◽

Daniel Rack ◽

Jack Cuzick ◽

...

Keyword(s):

Large Scale ◽

Effective Means ◽

Cost Effective ◽

Data Driven ◽

Screening Performance ◽

Smell Test ◽

Identification Test ◽

Data Driven Approach ◽

Smell Identification ◽

Smell Tests

Background: Hyposmia features in several neurodegenerative conditions, including Parkinson’s disease (PD). The University of Pennsylvania Smell Identification Test (UPSIT) is a widely used screening tool for detecting hyposmia, but is time-consuming and expensive when used on a large scale. Methods: We assessed shorter subsets of UPSIT items for their ability to detect hyposmia in 891 healthy participants from the PREDICT-PD study. Established shorter tests included Versions A and B of both the 4-item Pocket Smell Test (PST) and 12-item Brief Smell Identification Test (BSIT). Using a data-driven approach, we evaluated screening performances of 23,231,378 combinations of 1-7 smell items from the full UPSIT. Results: PST Versions A and B achieved sensitivity/specificity of 76.8%/64.9% and 86.6%/45.9% respectively, whilst BSIT Versions A and B achieved 83.1%/79.5% and 96.5%/51.8% for detecting hyposmia defined by the longer UPSIT. From the data-driven analysis, two optimised sets of 7 smells surpassed the screening performance of the 12-item BSITs (with validation sensitivity/specificities of 88.2%/85.4% and 100%/53.5%). A set of 4 smells (Menthol, Clove, Gingerbread and Orange) had higher sensitivity for hyposmia than PST-A, -B and even BSIT-A (with validation sensitivity 91.2%). The same 4 smells also featured amongst those most commonly misidentified by 44 individuals with PD compared to 891 PREDICT-PD controls and a screening test using these 4 smells would have identified all hyposmic patients with PD. Conclusion: Using abbreviated smell tests could provide a cost-effective means of screening for hyposmia in large cohorts, allowing more targeted administration of the UPSIT or similar smell tests.
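
The exhaustive search over item subsets can be sketched as follows. Each subset is scored by counting correctly identified smells, participants at or below a cutoff are flagged, and sensitivity and specificity are computed against the full-test hyposmia label. The data layout, the cutoff rule, and ranking subsets by Youden's J (sensitivity + specificity − 1) are assumptions for illustration; the abstract does not state the selection criterion.

```python
from itertools import combinations

import numpy as np


def screen_performance(correct, hyposmic, items, cutoff):
    """Sensitivity and specificity of a short test built from `items`.

    correct: bool array (n_participants, n_items), True if identified.
    hyposmic: bool array (n_participants,), label from the full test.
    Assumes both hyposmic and non-hyposmic participants are present.
    """
    score = correct[:, list(items)].sum(axis=1)
    flagged = score <= cutoff
    tp = np.sum(flagged & hyposmic)
    fn = np.sum(~flagged & hyposmic)
    tn = np.sum(~flagged & ~hyposmic)
    fp = np.sum(flagged & ~hyposmic)
    return tp / (tp + fn), tn / (tn + fp)


def best_subset(correct, hyposmic, size, cutoff):
    """Exhaustively pick the item subset maximising Youden's J (assumed criterion)."""
    def youden(items):
        sens, spec = screen_performance(correct, hyposmic, items, cutoff)
        return sens + spec - 1

    return max(combinations(range(correct.shape[1]), size), key=youden)


# Toy data: 4 participants, 3 items; item 0 separates the groups perfectly.
correct = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]], dtype=bool)
hyposmic = np.array([False, False, True, True])
chosen = best_subset(correct, hyposmic, size=1, cutoff=0)
sens, spec = screen_performance(correct, hyposmic, chosen, cutoff=0)
```

A held-out validation split, as reported in the abstract, would be needed in practice, since the best of many millions of subsets will look optimistic on the data used to select it.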

A Method for Identifying Environmental Stimuli and Genes Responsible for Genotype-by-Environment Interactions From a Large-Scale Multi-Environment Data Set

Frontiers in Genetics ◽

10.3389/fgene.2021.803636 ◽

2021 ◽

Vol 12 ◽

Author(s):

Akio Onogi ◽

Daisuke Sekine ◽

Akito Kaga ◽

Satoshi Nakano ◽

Tetsuya Yamada ◽

...

Keyword(s):

Large Scale ◽

Genetic Correlations ◽

Data Driven ◽

Data Sets ◽

Data Set ◽

Environmental Stimuli ◽

Genotype By Environment ◽

Genome Wide ◽

Sowing Dates ◽

Data Driven Approach

It is not yet fully understood which environmental stimuli cause genotype-by-environment (G × E) interactions in real fields, when these interactions occur, and which genes react to the stimuli. Large-scale multi-environment data sets are attractive data sources for these purposes because they potentially cover various environmental conditions. Here we developed a data-driven approach termed Environmental Covariate Search Affecting Genetic Correlations (ECGC) to identify environmental stimuli and genes responsible for the G × E interactions from large-scale multi-environment data sets. ECGC was applied to a soybean (Glycine max) data set that consisted of 25,158 records collected at 52 environments. ECGC illustrated what meteorological factors shaped the G × E interactions in six traits including yield, flowering time, and protein content and when these factors were involved in the interactions. For example, it illustrated the relevance of precipitation around sowing dates and hours of sunshine just before maturity to the interactions observed for yield. Moreover, genome-wide association mapping on the sensitivities to the identified stimuli discovered candidate and known genes responsible for the G × E interactions. Our results demonstrate the capability of data-driven approaches to bring novel insights on the G × E interactions observed in fields.

CN-Probase: A Data-Driven Approach for Large-Scale Chinese Taxonomy Construction

2019 IEEE 35th International Conference on Data Engineering (ICDE) ◽

10.1109/icde.2019.00178 ◽

2019 ◽

Cited By ~ 2

Author(s):

Jindong Chen ◽

Ao Wang ◽

Jiangjie Chen ◽

Yanghua Xiao ◽

Zhendong Chu ◽

...

Keyword(s):

Large Scale ◽

Data Driven ◽

Data Driven Approach

Large-Scale Whale-Call Classification by Transfer Learning on Multi-Scale Waveforms and Time-Frequency Features

Applied Sciences ◽

10.3390/app9051020 ◽

2019 ◽

Vol 9 (5) ◽

pp. 1020 ◽

Cited By ~ 6

Author(s):

Lilun Zhang ◽

Dezhi Wang ◽

Changchun Bao ◽

Yongxian Wang ◽

Kele Xu

Keyword(s):

Transfer Learning ◽

Large Scale ◽

Data Augmentation ◽

Feature Representation ◽

Biological Research ◽

Time Frequency ◽

Feature Representations ◽

Multi Scale ◽

Data Driven Approach

Whale vocal calls contain valuable information and abundant characteristics that are important for classification of whale sub-populations and related biological research. In this study, an effective data-driven approach based on pre-trained Convolutional Neural Networks (CNN) using multi-scale waveforms and time-frequency feature representations is developed in order to perform the classification of whale calls from a large open-source dataset recorded by sensors carried by whales. Specifically, the classification is carried out through a transfer learning approach by using pre-trained state-of-the-art CNN models in the field of computer vision. 1D raw waveforms and 2D log-mel features of the whale-call data are respectively used as the input of CNN models. For raw waveform input, windows are applied to capture multiple sketches of a whale-call clip at different time scales and stack the features from different sketches for classification. When using the log-mel features, the delta and delta-delta features are also calculated to produce a 3-channel feature representation for analysis. In the training, a 4-fold cross-validation technique is employed to reduce the overfitting effect, while the Mix-up technique is also applied to implement data augmentation in order to further improve the system performance. The results show that the proposed method can improve the accuracies by more than 20 percentage points for the classification into 16 whale pods compared with the baseline method using groups of 2D shape descriptors of spectrograms and the Fisher discriminant scores on the same dataset. Moreover, it is shown that classifications based on log-mel features have higher accuracies than those based directly on raw waveforms. A phylogeny graph is also produced to illustrate the relationships among the whale sub-populations.
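
The Mix-up augmentation mentioned above can be sketched in a few lines: each training example is blended with a randomly chosen partner from the same batch, and the one-hot labels are blended with the same weight. The Beta parameter `alpha=0.2`, the batch layout, and the function name are assumptions for illustration; the abstract does not give the settings used.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Blend each example (and its label) with a random partner in the batch.

    x: array (batch, ...) of features; y_onehot: array (batch, n_classes).
    A single mixing weight lam ~ Beta(alpha, alpha) is drawn per batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))  # partner index for every example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix


x = np.arange(12, dtype=float).reshape(4, 3)
y = np.eye(4)
x_mix, y_mix = mixup_batch(x, y, rng=np.random.default_rng(0))
```

The mixed labels remain valid probability vectors, so they can be used directly with a cross-entropy loss.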
