A Simple and Efficient Pipeline for Construction, Merging, Expansion, and Simulation of Large-Scale, Single-Cell Mechanistic Models

2020 ◽  
Author(s):  
Cemal Erdem ◽  
Ethan M. Bensman ◽  
Arnab Mutsuddy ◽  
Michael M. Saint-Antoine ◽  
Mehdi Bouhaddou ◽  
...  

Abstract. The current era of big biomedical data accumulation and availability brings data integration opportunities for leveraging its totality to make new discoveries and/or build clinically predictive models. Black-box statistical and machine learning methods are powerful for such integration, but often cannot provide mechanistic reasoning, particularly on the single-cell level. While single-cell mechanistic models clearly enable such reasoning, they are predominantly “small-scale”, and struggle with the scalability and reusability required for meaningful data integration. Here, we present an open-source pipeline for scalable, single-cell mechanistic modeling from simple, annotated input files that can serve as a foundation for mechanistic data integration. As a test case, we convert one of the largest existing single-cell mechanistic models to this format, demonstrating the robustness and reproducibility of the approach. We show that the model cell line context can be changed with simple replacement of input file parameter values. We next use this new model to test alternative mechanistic hypotheses for the experimental observations that interferon-gamma (IFNG) inhibits epidermal growth factor (EGF)-induced cell proliferation. Model-based analysis suggested, and experiments support, that these observations are better explained by IFNG-induced SOCS1 expression sequestering activated EGF receptors, thereby downregulating AKT activity, as opposed to direct IFNG-induced upregulation of p21 expression. Overall, this new pipeline enables large-scale, single-cell, and mechanistically transparent modeling as a data integration modality complementary to machine learning.
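The core idea, annotated tabular input files compiled into a simulatable reaction model, can be illustrated with a minimal sketch. The file layout, column order, and mass-action assumption below are hypothetical and do not reproduce the pipeline's actual input specification or API.

```python
import numpy as np
from scipy.integrate import solve_ivp

# species.tsv (assumed layout): name <tab> initial_concentration
# reactions.tsv (assumed layout): reactant <tab> product <tab> rate_constant
def load_model(species_path, reactions_path):
    species, x0 = [], []
    with open(species_path) as f:
        for line in f:
            if line.strip():
                name, conc = line.split("\t")
                species.append(name)
                x0.append(float(conc))
    idx = {s: i for i, s in enumerate(species)}
    reactions = []
    with open(reactions_path) as f:
        for line in f:
            if line.strip():
                r, p, k = line.split("\t")
                reactions.append((idx[r], idx[p], float(k)))
    return np.array(x0), reactions

def rhs(t, x, reactions):
    dx = np.zeros_like(x)
    for r, p, k in reactions:  # simple mass-action conversion: reactant -> product
        flux = k * x[r]
        dx[r] -= flux
        dx[p] += flux
    return dx

# x0, rxns = load_model("species.tsv", "reactions.tsv")
# sol = solve_ivp(rhs, (0.0, 3600.0), x0, args=(rxns,), method="LSODA")
```

Changing the cell line context in such a scheme amounts to swapping the parameter values in the input files, leaving the simulation code untouched.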

2021 ◽  
Vol 11 (2) ◽  
pp. 472
Author(s):  
Hyeongmin Cho ◽  
Sangkyun Lee

Machine learning has been proven effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training datasets, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale, high-dimensional data such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale, high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale, high-dimensional datasets.
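To make the random-projection-plus-bootstrapping strategy concrete, here is an illustrative sketch, not the paper's exact measure definitions: each bootstrap subsample is projected to a low dimension, and a Fisher-like between/within-class ratio stands in for separability while the mean within-class scatter stands in for in-class variability.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def quality_scores(X, y, n_boot=10, sample_size=1000, n_components=64, seed=0):
    """X: (n_samples, n_features) numpy array; y: class labels."""
    rng = np.random.default_rng(seed)
    sep, var = [], []
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=True)
        proj = GaussianRandomProjection(n_components, random_state=seed)
        Xp, yb = proj.fit_transform(X[idx]), y[idx]
        mu = Xp.mean(axis=0)
        between = within = 0.0
        for c in np.unique(yb):
            Xc = Xp[yb == c]
            between += len(Xc) * np.sum((Xc.mean(axis=0) - mu) ** 2)
            within += np.sum((Xc - Xc.mean(axis=0)) ** 2)
        sep.append(between / (within + 1e-12))  # Fisher-like separability ratio
        var.append(within / len(Xp))            # mean within-class scatter
    return float(np.mean(sep)), float(np.mean(var))
```

Because only small projected subsamples are touched in each round, the cost stays roughly constant as the full dataset grows, which is the practical benefit the paper targets.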


2021 ◽  
Author(s):  
Aleksandar Kovačević ◽  
Jelena Slivka ◽  
Dragan Vidaković ◽  
Katarina-Glorija Grujić ◽  
Nikola Luburić ◽  
...  

Code smells are structures in code that often have a negative impact on its quality. Manually detecting code smells is challenging, and researchers have proposed many automatic code smell detectors. Most studies propose detectors based on code metrics and heuristics. However, these studies have several limitations, including evaluating the detectors on small-scale case studies and in inconsistent experimental settings. Furthermore, heuristic-based detectors suffer from limitations that hinder their adoption in practice. Thus, researchers have recently started experimenting with machine learning (ML) based code smell detection.

This paper compares the performance of multiple ML-based code smell detection models against multiple traditionally employed metric-based heuristics for the detection of God Class and Long Method code smells. We evaluate the effectiveness of different source code representations for machine learning: traditionally used code metrics and code embeddings (code2vec, code2seq, and CuBERT).

We perform our experiments on the large-scale, manually labeled MLCQ dataset. We consider the binary classification problem: we classify code samples as smelly or non-smelly and use the F1-measure of the minority (smell) class as the measure of performance. In our experiments, the ML classifier trained on CuBERT source code embeddings achieved the best performance for both God Class (F-measure of 0.53) and Long Method detection (F-measure of 0.75). With the help of a domain expert, we perform an error analysis to discuss the advantages of the CuBERT approach.

To the best of our knowledge, this study is the first to evaluate the effectiveness of pre-trained neural source code embeddings for code smell detection. A secondary contribution of our study is the systematic evaluation of multiple heuristic-based approaches on the same large-scale, manually labeled MLCQ dataset.
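The evaluation setup described above can be sketched as follows. The file names, the use of logistic regression, and the embedding source are illustrative assumptions, not the authors' exact configuration; the key points are training on pre-computed code embeddings and scoring with the F1-measure of the minority (smelly) class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# X: (n_samples, embedding_dim) code embeddings (e.g., CuBERT vectors)
# y: 1 = smelly, 0 = non-smelly (minority class is the smell class)
X = np.load("cubert_embeddings.npy")  # hypothetical file of precomputed vectors
y = np.load("labels.npy")             # hypothetical label file

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print("F1 (smell class):", f1_score(y_te, clf.predict(X_te), pos_label=1))
```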


2020 ◽  
Author(s):  
Tom Rowan ◽  
Adrian Butler

In order to enable community groups and other interested parties to evaluate the effects of flood management, water conservation and other hydrological issues, better localised mapping is required. Although some maps are publicly available, many are behind paywalls, especially those with three-dimensional features. In this study, London is used as a test case to evaluate machine learning and rules-based approaches with open-source maps and LiDAR data to create more accurate representations (LOD2) of small-scale areas. Machine learning is particularly well suited to the recognition of local repetitive features such as building roofs and trees, while roads are best identified and mapped using a faster rules-based approach.

In order to create a useful LOD2 representation, a user interface, processing-rules manipulation and an assumption editor have all been incorporated. Features such as randomly assigning sub-terrain features (basements) using Monte-Carlo methods, and artificial sewage representation, enable the user to grow these models from open-source data into useful model inputs. This project is aimed at local-scale hydrological modelling, rainfall-runoff analysis and other local planning applications.

The goal is to provide turn-key data processing for small-scale modelling, which should help advance the installation of SuDS and other water management solutions, as well as having broader uses. The method is designed to enable fast and accurate representations of small-scale features (1 hectare to 1 km²), with larger-scale applications planned for future work. This work forms part of the CAMELLIA project (Community Water Management for a Liveable London) and aims to provide useful tools for local-scale modellers and possibly larger-scale industrial/scientific users.
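The Monte-Carlo assignment of sub-terrain features mentioned above can be illustrated with a minimal sketch: basements are not visible in open map or LiDAR data, so they are assigned to buildings at random. The probability and depth distribution below are placeholder assumptions, not the project's calibrated values.

```python
import numpy as np

def assign_basements(n_buildings, p_basement=0.3, mean_depth_m=2.5, seed=0):
    """Randomly flag buildings as having a basement and draw a depth for each."""
    rng = np.random.default_rng(seed)
    has_basement = rng.random(n_buildings) < p_basement
    depths = np.where(has_basement, rng.normal(mean_depth_m, 0.5, n_buildings), 0.0)
    return has_basement, np.clip(depths, 0.0, None)

# flags, depths = assign_basements(n_buildings=120)
```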


Author(s):  
S. Bhushan ◽  
D. K. Walters ◽  
E. Merzari ◽  
A. Obabko

A dynamic hybrid RANS/LES (DHRL) model has been implemented in the spectral-element solver Nek5000 to reduce computational expense for high-Reynolds-number applications. The model couples a k-ε URANS model and the dynamic Smagorinsky model for LES. The model is validated for plane channel flow at Reτ = 590 using DNS data and compared with LES predictions. The model is then applied to the ANL-MAX case, a test case relevant to nuclear reactor cooling flow simulations. For the channel flow case, DHRL predictions were similar to LES on finer grids, but on coarser grids the former predicted velocity profiles closer to DNS than the latter in the log-layer region. The improved prediction by the DHRL model was attributed to a 30% additional contribution of RANS stresses. For the ANL-MAX case, the URANS simulation predicts quasi-steady flow with dominant large-scale turbulent structures, whereas LES predicts small-scale turbulent structures, resulting in rapid mixing of the cool and warm flow jets. DHRL simulations predict LES mode in the inlet jet region and URANS mode elsewhere, as expected.
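The hybrid idea can be sketched in a purely illustrative way: the modelled stress is a local blend of a RANS contribution and an LES subgrid contribution, with a weighting field derived from the local flow state. This is not the actual DHRL formulation implemented in Nek5000; the weighting choice below (fraction of turbulence production that is resolved) is an assumption for illustration only.

```python
import numpy as np

def blending_weight(prod_resolved, prod_total, eps=1e-12):
    """Hypothetical local weight: fraction of turbulence production that is resolved."""
    return np.clip(prod_resolved / (prod_total + eps), 0.0, 1.0)

def blended_stress(tau_rans, tau_les, alpha):
    """Blend RANS and LES (subgrid) modelled stresses; alpha -> 1 recovers LES mode."""
    alpha = np.clip(alpha, 0.0, 1.0)
    return (1.0 - alpha) * tau_rans + alpha * tau_les
```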


2018 ◽  
Author(s):  
Jong-Eun Park ◽  
Krzysztof Polański ◽  
Kerstin Meyer ◽  
Sarah A. Teichmann

Abstract. Increasing numbers of large-scale single-cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. Therefore, efficient computational tools for combining diverse datasets are crucial for biology in the single-cell genomics era. A number of methods have been developed to assist data integration by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration method. We illustrate the power of BBKNN for dimensionality-reduced visualisation and clustering in multiple biological scenarios, including a massive integrative study over several murine atlases. BBKNN successfully connects cell populations across experimentally heterogeneous mouse scRNA-Seq datasets, revealing global markers of cell type and organ specificity and providing the foundation for inferring the underlying transcription factor network. BBKNN is available at https://github.com/Teichlab/bbknn.
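The batch-balanced idea behind BBKNN can be re-implemented in simplified form (see the repository above for the real tool): neighbours are searched separately within every batch, so all batches contribute equally to the resulting graph. This sketch uses a brute-force sklearn search rather than BBKNN's optimised implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def batch_balanced_neighbours(X_pca, batches, k_per_batch=3):
    """X_pca: (n_cells, n_pcs) reduced data; batches: array of batch labels per cell."""
    n = X_pca.shape[0]
    neighbours = [[] for _ in range(n)]
    for b in np.unique(batches):
        idx_b = np.where(batches == b)[0]
        nn = NearestNeighbors(n_neighbors=min(k_per_batch, len(idx_b))).fit(X_pca[idx_b])
        _, knn = nn.kneighbors(X_pca)        # query every cell against batch b
        for i in range(n):
            neighbours[i].extend(idx_b[knn[i]])
    return neighbours  # feed into a kNN graph for UMAP / clustering
```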


2019 ◽  
Vol 21 (4) ◽  
pp. 1209-1223 ◽  
Author(s):  
Raphael Petegrosso ◽  
Zhuliu Li ◽  
Rui Kuang

Abstract. Single-cell RNA sequencing (scRNA-seq) technologies have enabled large-scale whole-transcriptome profiling of each individual cell in a cell population. A core analysis of scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNA content across single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variation. We review how cell-specific normalization, imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We also introduce more advanced approaches for clustering scRNA-seq transcriptomes in time-series data and multiple cell populations and for detecting rare cell types. Several software packages developed to support cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics. Availability: All source code and data are available at https://github.com/kuanglab/single-cell-review.
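A generic version of the workflow the review covers (cell-wise normalisation, log transform, dimension reduction, then clustering) can be sketched as below; it is not tied to any single package discussed in the article, and the parameter choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_cells(counts, n_pcs=30, n_clusters=8, seed=0):
    """counts: (n_cells, n_genes) raw UMI/read count matrix as a numpy array."""
    size_factors = counts.sum(axis=1, keepdims=True)            # per-cell sequencing depth
    norm = counts / np.maximum(size_factors, 1) * np.median(size_factors)
    logged = np.log1p(norm)                                      # variance stabilisation
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(logged)
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(pcs)
```

The methods surveyed in the review replace or augment individual steps of this skeleton, for example imputing dropouts before the log transform or swapping k-means for graph-based or density-based clustering.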


2019 ◽  
Vol 1 ◽  
pp. 1-2 ◽  
Author(s):  
Izabela Karsznia ◽  
Karolina Sielicka

Abstract. The decision about removing or maintaining an object while changing the level of detail requires taking into account many features of the object itself and its surroundings. Automatic generalization is the optimal way to obtain maps at various scales, based on a single spatial database storing up-to-date information with a high level of spatial accuracy. Researchers agree on the need for fully automating the generalization process (Stoter et al., 2016). Numerous research centres, cartographic agencies as well as commercial companies have undertaken successful attempts at implementing certain generalization solutions (Stoter et al., 2009, 2014, 2016; Regnauld, 2015; Burghardt et al., 2008; Chaundhry and Mackaness, 2008). Nevertheless, an effective and consistent methodology for generalizing small-scale maps has not gained enough attention so far, as most of the conducted research has focused on the acquisition of large-scale maps (Stoter et al., 2016). The presented research aims to fill this gap by exploring new variables that are of key importance in the automatic settlement selection process at small scales. Addressing this issue is an essential step towards proposing new algorithms for effective and automatic settlement selection that will contribute to enriching the sparsely filled small-scale generalization toolbox.

The main idea behind this research is using machine learning (ML) to explore new variables that may be important in automatic settlement generalization at small scales; a minimal sketch of such a classification-based selection appears after this abstract. For automation of the generalization process, cartographic knowledge has to be collected and formalized. So far, a few approaches based on the use of ML have been proposed. One of the first attempts to determine generalization parameters with the use of ML was performed by Weibel et al. (1995); the learning material was the observation of cartographers' manual work. Mustière (1998) tried to identify the optimal sequence of generalization operators for roads using ML. A different approach was presented by Sester (2000), whose goal was to extract cartographic knowledge from spatial data characteristics, especially from the attributes and geometric properties of objects, regularities and repetitive patterns that govern object selection, with the use of decision trees. Lagrange et al. (2000) and Balboa and López (2008) also used ML techniques, namely neural networks, to generalize line objects. Recently, Sester et al. (2018) proposed the application of deep learning to the task of building generalization. As noticed by Sester et al. (2018), these ideas, although interesting, remained proofs of concept only. Moreover, they concerned topographic databases and large-scale maps. Promising results for automatic settlement selection at small scales were reported by Karsznia and Weibel (2018), who used data enrichment and ML to improve the settlement selection process. Thanks to classification models based on decision trees, they explored new variables that are decisive in the settlement selection process. However, they also concluded that there is probably still more “deep knowledge” to be discovered, possibly linked to further variables that were not included in their research. Thus, the motivation for this research is to fill this gap and look for additional, essential variables governing settlement selection at small scales.
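The sketch referenced above shows the classification-based selection idea: a decision tree learns, from labelled examples, whether a settlement should be retained at the target scale. The feature names and the training file are invented for illustration and are not the variables explored in the cited studies; features are assumed to be numerically encoded.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Assumed columns (hypothetical): population, admin_status, road_connections,
# density_of_neighbours, retained (1 = keep at the small scale, 0 = omit)
data = pd.read_csv("settlements_labelled.csv")  # hypothetical training file
X = data[["population", "admin_status", "road_connections", "density_of_neighbours"]]
y = data["retained"]

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
print("CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
tree.fit(X, y)  # the fitted tree can be inspected to read off explicit selection rules
```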


2020 ◽  
Author(s):  
Vita Ayoub ◽  
Carole Delenne ◽  
Patrick Matgen ◽  
Pascal Finaud-Guyot ◽  
Renaud Hostache

In hydrodynamic modelling, the mesh resolution has a strong impact on run time and result accuracy. Coarser meshes allow faster simulations, but often at the cost of accuracy. Conversely, finer meshes offer a better description of complex geometries but require much longer computational times, which makes their use at a large scale challenging. In this context, we aim to assess the potential of a two-dimensional shallow water model with depth-dependent porosity (SW2D-DDP) for flood simulations at a large scale. This modelling approach relies on nesting a sub-grid mesh containing high-resolution topographic and bathymetric data within each computational cell via a so-called depth-dependent storage porosity. It therefore enables faster simulations on rather coarse grids while preserving small-scale topographic information. The July 2007 flood event in the Severn River basin (UK) is used as a test case, for which hydrometric measurements and spatial data are available for evaluation. A sensitivity analysis is carried out to investigate the influence of the porosity on model performance, in comparison with other classical parameters such as boundary conditions.
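The depth-dependent storage porosity concept can be illustrated with a minimal sketch (not the SW2D-DDP implementation): for a coarse computational cell, the porosity at a given water surface elevation is taken as the wet fraction of the high-resolution sub-grid pixels nested in that cell.

```python
import numpy as np

def storage_porosity(subgrid_elevations, water_level):
    """Fraction of the coarse cell's sub-grid pixels lying below the water level."""
    z = np.asarray(subgrid_elevations, dtype=float).ravel()
    return float(np.count_nonzero(water_level > z)) / z.size

# Example: a 10 m coarse cell described by a 1 m DEM patch (hypothetical values)
# phi = storage_porosity(dem_patch, water_level=12.4)  # phi in [0, 1]
```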

