A Simple and Efficient Pipeline for Construction, Merging, Expansion, and Simulation of Large-Scale, Single-Cell Mechanistic Models

2020 ◽  
Author(s):  
Cemal Erdem ◽  
Ethan M. Bensman ◽  
Arnab Mutsuddy ◽  
Michael M. Saint-Antoine ◽  
Mehdi Bouhaddou ◽  
...  

Abstract. The current era of big biomedical data accumulation and availability brings data integration opportunities for leveraging its totality to make new discoveries and/or build clinically predictive models. Black-box statistical and machine learning methods are powerful for such integration, but often cannot provide mechanistic reasoning, particularly on the single-cell level. While single-cell mechanistic models clearly enable such reasoning, they are predominantly “small-scale”, and struggle with the scalability and reusability required for meaningful data integration. Here, we present an open-source pipeline for scalable, single-cell mechanistic modeling from simple, annotated input files that can serve as a foundation for mechanistic data integration. As a test case, we convert one of the largest existing single-cell mechanistic models to this format, demonstrating the robustness and reproducibility of the approach. We show that the model cell line context can be changed with simple replacement of input file parameter values. We next use this new model to test alternative mechanistic hypotheses for the experimental observations that interferon-gamma (IFNG) inhibits epidermal growth factor (EGF)-induced cell proliferation. Model-based analysis suggested, and experiments support, that these observations are better explained by IFNG-induced SOCS1 expression sequestering activated EGF receptors, thereby downregulating AKT activity, as opposed to direct IFNG-induced upregulation of p21 expression. Overall, this new pipeline enables large-scale, single-cell, and mechanistically transparent modeling as a data integration modality complementary to machine learning.
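The core idea, annotated tabular input files compiled into a simulatable reaction model, can be illustrated with a minimal sketch. The file layout, column order, and mass-action assumption below are hypothetical and do not reproduce the pipeline's actual input specification or API.

```python
import numpy as np
from scipy.integrate import solve_ivp

# species.tsv (assumed layout): name <tab> initial_concentration
# reactions.tsv (assumed layout): reactant <tab> product <tab> rate_constant
def load_model(species_path, reactions_path):
    species, x0 = [], []
    with open(species_path) as f:
        for line in f:
            if line.strip():
                name, conc = line.split("\t")
                species.append(name)
                x0.append(float(conc))
    idx = {s: i for i, s in enumerate(species)}
    reactions = []
    with open(reactions_path) as f:
        for line in f:
            if line.strip():
                r, p, k = line.split("\t")
                reactions.append((idx[r], idx[p], float(k)))
    return np.array(x0), reactions

def rhs(t, x, reactions):
    dx = np.zeros_like(x)
    for r, p, k in reactions:  # simple mass-action conversion: reactant -> product
        flux = k * x[r]
        dx[r] -= flux
        dx[p] += flux
    return dx

# x0, rxns = load_model("species.tsv", "reactions.tsv")
# sol = solve_ivp(rhs, (0.0, 3600.0), x0, args=(rxns,), method="LSODA")
```

Changing the cell line context in such a scheme amounts to swapping the parameter values in the input files, leaving the simulation code untouched.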

2021 ◽  
Vol 11 (2) ◽  
pp. 472
Author(s):  
Hyeongmin Cho ◽  
Sangkyun Lee

Machine learning has been proven effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training datasets, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale, high-dimensional data such as images and videos. This paper proposes two data quality measures that can compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale, high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale, high-dimensional datasets.
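To make the random-projection-plus-bootstrapping strategy concrete, here is an illustrative sketch, not the paper's exact measure definitions: each bootstrap subsample is projected to a low dimension, and a Fisher-like between/within-class ratio stands in for separability while the mean within-class scatter stands in for in-class variability.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def quality_scores(X, y, n_boot=10, sample_size=1000, n_components=64, seed=0):
    """X: (n_samples, n_features) numpy array; y: class labels."""
    rng = np.random.default_rng(seed)
    sep, var = [], []
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=True)
        proj = GaussianRandomProjection(n_components, random_state=seed)
        Xp, yb = proj.fit_transform(X[idx]), y[idx]
        mu = Xp.mean(axis=0)
        between = within = 0.0
        for c in np.unique(yb):
            Xc = Xp[yb == c]
            between += len(Xc) * np.sum((Xc.mean(axis=0) - mu) ** 2)
            within += np.sum((Xc - Xc.mean(axis=0)) ** 2)
        sep.append(between / (within + 1e-12))  # Fisher-like separability ratio
        var.append(within / len(Xp))            # mean within-class scatter
    return float(np.mean(sep)), float(np.mean(var))
```

Because only small projected subsamples are touched in each round, the cost stays roughly constant as the full dataset grows, which is the practical benefit the paper targets.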


2021 ◽  
Author(s):  
Aleksandar Kovačević ◽  
Jelena Slivka ◽  
Dragan Vidaković ◽  
Katarina-Glorija Grujić ◽  
Nikola Luburić ◽  
...  

Code smells are structures in code that often have a negative impact on its quality. Manually detecting code smells is challenging, and researchers have proposed many automatic code smell detectors. Most studies propose detectors based on code metrics and heuristics. However, these studies have several limitations, including evaluating the detectors on small-scale case studies and in inconsistent experimental settings. Furthermore, heuristic-based detectors suffer from limitations that hinder their adoption in practice. Thus, researchers have recently started experimenting with machine learning (ML) based code smell detection.

This paper compares the performance of multiple ML-based code smell detection models against multiple traditionally employed metric-based heuristics for the detection of God Class and Long Method code smells. We evaluate the effectiveness of different source code representations for machine learning: traditionally used code metrics and code embeddings (code2vec, code2seq, and CuBERT).

We perform our experiments on the large-scale, manually labeled MLCQ dataset. We consider the binary classification problem: we classify code samples as smelly or non-smelly and use the F1-measure of the minority (smell) class as the measure of performance. In our experiments, the ML classifier trained on CuBERT source code embeddings achieved the best performance for both God Class (F-measure of 0.53) and Long Method detection (F-measure of 0.75). With the help of a domain expert, we perform an error analysis to discuss the advantages of the CuBERT approach.

To the best of our knowledge, this study is the first to evaluate the effectiveness of pre-trained neural source code embeddings for code smell detection. A secondary contribution of our study is the systematic evaluation of multiple heuristic-based approaches on the same large-scale, manually labeled MLCQ dataset.
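The evaluation setup described above can be sketched as follows. The file names, the use of logistic regression, and the embedding source are illustrative assumptions, not the authors' exact configuration; the key points are training on pre-computed code embeddings and scoring with the F1-measure of the minority (smelly) class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# X: (n_samples, embedding_dim) code embeddings (e.g., CuBERT vectors)
# y: 1 = smelly, 0 = non-smelly (minority class is the smell class)
X = np.load("cubert_embeddings.npy")  # hypothetical file of precomputed vectors
y = np.load("labels.npy")             # hypothetical label file

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print("F1 (smell class):", f1_score(y_te, clf.predict(X_te), pos_label=1))
```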


2020 ◽  
Author(s):  
Tom Rowan ◽  
Adrian Butler

In order to enable community groups and other interested parties to evaluate the effects of flood management, water conservation and other hydrological issues, better localised mapping is required. Although some maps are publicly available, many are behind paywalls, especially those with three-dimensional features. In this study, London is used as a test case to evaluate machine learning and rules-based approaches with open-source maps and LiDAR data to create more accurate representations (LOD2) of small-scale areas. Machine learning is particularly well suited to the recognition of local repetitive features such as building roofs and trees, while roads are best identified and mapped using a faster rules-based approach.

In order to create a useful LOD2 representation, a user interface, processing-rules manipulation and an assumption editor have all been incorporated. Features such as randomly assigning sub-terrain features (basements) using Monte-Carlo methods, and artificial sewage representation, enable the user to grow these models from open-source data into useful model inputs. This project is aimed at local-scale hydrological modelling, rainfall-runoff analysis and other local planning applications.

The goal is to provide turn-key data processing for small-scale modelling, which should help advance the installation of SuDS and other water management solutions, as well as having broader uses. The method is designed to enable fast and accurate representations of small-scale features (1 hectare to 1 km²), with larger-scale applications planned for future work. This work forms part of the CAMELLIA project (Community Water Management for a Liveable London) and aims to provide useful tools for local-scale modellers and possibly larger-scale industrial/scientific users.
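The Monte-Carlo assignment of sub-terrain features mentioned above can be illustrated with a minimal sketch: basements are not visible in open map or LiDAR data, so they are assigned to buildings at random. The probability and depth distribution below are placeholder assumptions, not the project's calibrated values.

```python
import numpy as np

def assign_basements(n_buildings, p_basement=0.3, mean_depth_m=2.5, seed=0):
    """Randomly flag buildings as having a basement and draw a depth for each."""
    rng = np.random.default_rng(seed)
    has_basement = rng.random(n_buildings) < p_basement
    depths = np.where(has_basement, rng.normal(mean_depth_m, 0.5, n_buildings), 0.0)
    return has_basement, np.clip(depths, 0.0, None)

# flags, depths = assign_basements(n_buildings=120)
```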


Author(s):  
S. Bhushan ◽  
D. K. Walters ◽  
E. Merzari ◽  
A. Obabko

A dynamic hybrid RANS/LES (DHRL) model has been implemented in the spectral-element solver Nek5000 to reduce computational expense for high-Reynolds-number applications. The model couples a k-ε URANS model and the dynamic Smagorinsky model for LES. The model is validated for plane channel flow at Reτ = 590 using DNS data and compared with LES predictions. The model is then applied to the ANL-MAX case, a test case relevant to nuclear reactor cooling flow simulations. For the channel flow case, DHRL predictions were similar to LES on finer grids, but on coarser grids the former predicted velocity profiles closer to DNS than the latter in the log-layer region. The improved prediction by the DHRL model was attributed to a 30% additional contribution of RANS stresses. For the ANL-MAX case, the URANS simulation predicts quasi-steady flow with dominant large-scale turbulent structures, whereas LES predicts small-scale turbulent structures, resulting in rapid mixing of the cool and warm flow jets. DHRL simulations predict LES mode in the inlet jet region and URANS mode elsewhere, as expected.
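The hybrid idea can be sketched in a purely illustrative way: the modelled stress is a local blend of a RANS contribution and an LES subgrid contribution, with a weighting field derived from the local flow state. This is not the actual DHRL formulation implemented in Nek5000; the weighting choice below (fraction of turbulence production that is resolved) is an assumption for illustration only.

```python
import numpy as np

def blending_weight(prod_resolved, prod_total, eps=1e-12):
    """Hypothetical local weight: fraction of turbulence production that is resolved."""
    return np.clip(prod_resolved / (prod_total + eps), 0.0, 1.0)

def blended_stress(tau_rans, tau_les, alpha):
    """Blend RANS and LES (subgrid) modelled stresses; alpha -> 1 recovers LES mode."""
    alpha = np.clip(alpha, 0.0, 1.0)
    return (1.0 - alpha) * tau_rans + alpha * tau_les
```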


2018 ◽  
Author(s):  
Jong-Eun Park ◽  
Krzysztof Polański ◽  
Kerstin Meyer ◽  
Sarah A. Teichmann

Abstract. Increasing numbers of large-scale single-cell RNA-Seq projects are leading to a data explosion, which can only be fully exploited through data integration. Therefore, efficient computational tools for combining diverse datasets are crucial for biology in the single-cell genomics era. A number of methods have been developed to assist data integration by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration method. We illustrate the power of BBKNN for dimensionality-reduced visualisation and clustering in multiple biological scenarios, including a massive integrative study over several murine atlases. BBKNN successfully connects cell populations across experimentally heterogeneous mouse scRNA-Seq datasets, revealing global markers of cell type and organ specificity and providing the foundation for inferring the underlying transcription factor network. BBKNN is available at https://github.com/Teichlab/bbknn.
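The batch-balanced idea behind BBKNN can be re-implemented in simplified form (see the repository above for the real tool): neighbours are searched separately within every batch, so all batches contribute equally to the resulting graph. This sketch uses a brute-force sklearn search rather than BBKNN's optimised implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def batch_balanced_neighbours(X_pca, batches, k_per_batch=3):
    """X_pca: (n_cells, n_pcs) reduced data; batches: array of batch labels per cell."""
    n = X_pca.shape[0]
    neighbours = [[] for _ in range(n)]
    for b in np.unique(batches):
        idx_b = np.where(batches == b)[0]
        nn = NearestNeighbors(n_neighbors=min(k_per_batch, len(idx_b))).fit(X_pca[idx_b])
        _, knn = nn.kneighbors(X_pca)        # query every cell against batch b
        for i in range(n):
            neighbours[i].extend(idx_b[knn[i]])
    return neighbours  # feed into a kNN graph for UMAP / clustering
```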


2019 ◽  
Vol 21 (4) ◽  
pp. 1209-1223 ◽  
Author(s):  
Raphael Petegrosso ◽  
Zhuliu Li ◽  
Rui Kuang

Abstract. Single-cell RNA sequencing (scRNA-seq) technologies have enabled large-scale whole-transcriptome profiling of each individual cell in a cell population. A core analysis of scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, k-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNA content across single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variation. We review how cell-specific normalization, imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We also introduce more advanced approaches for clustering scRNA-seq transcriptomes in time-series data and multiple cell populations and for detecting rare cell types. Several software packages developed to support cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics. Availability: All source code and data are available at https://github.com/kuanglab/single-cell-review.
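A generic version of the workflow the review covers (cell-wise normalisation, log transform, dimension reduction, then clustering) can be sketched as below; it is not tied to any single package discussed in the article, and the parameter choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_cells(counts, n_pcs=30, n_clusters=8, seed=0):
    """counts: (n_cells, n_genes) raw UMI/read count matrix as a numpy array."""
    size_factors = counts.sum(axis=1, keepdims=True)            # per-cell sequencing depth
    norm = counts / np.maximum(size_factors, 1) * np.median(size_factors)
    logged = np.log1p(norm)                                      # variance stabilisation
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(logged)
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(pcs)
```

The methods surveyed in the review replace or augment individual steps of this skeleton, for example imputing dropouts before the log transform or swapping k-means for graph-based or density-based clustering.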


2019 ◽  
Vol 1 ◽  
pp. 1-2 ◽  
Author(s):  
Izabela Karsznia ◽  
Karolina Sielicka

Abstract. The decision about removing or maintaining an object while changing the level of detail requires taking into account many features of the object itself and its surroundings. Automatic generalization is the optimal way to obtain maps at various scales, based on a single spatial database storing up-to-date information with a high level of spatial accuracy. Researchers agree on the need for fully automating the generalization process (Stoter et al., 2016). Numerous research centres, cartographic agencies as well as commercial companies have undertaken successful attempts at implementing certain generalization solutions (Stoter et al., 2009, 2014, 2016; Regnauld, 2015; Burghardt et al., 2008; Chaundhry and Mackaness, 2008). Nevertheless, an effective and consistent methodology for generalizing small-scale maps has not gained enough attention so far, as most of the conducted research has focused on the acquisition of large-scale maps (Stoter et al., 2016). The presented research aims to fill this gap by exploring new variables that are of key importance in the automatic settlement selection process at small scales. Addressing this issue is an essential step towards proposing new algorithms for effective and automatic settlement selection that will contribute to enriching the sparsely filled small-scale generalization toolbox.

The main idea behind this research is using machine learning (ML) to explore new variables that may be important in automatic settlement generalization at small scales; a minimal sketch of such a classification-based selection appears after this abstract. For automation of the generalization process, cartographic knowledge has to be collected and formalized. So far, a few approaches based on the use of ML have been proposed. One of the first attempts to determine generalization parameters with the use of ML was performed by Weibel et al. (1995); the learning material was the observation of cartographers' manual work. Mustière (1998) tried to identify the optimal sequence of generalization operators for roads using ML. A different approach was presented by Sester (2000), whose goal was to extract cartographic knowledge from spatial data characteristics, especially from the attributes and geometric properties of objects, regularities and repetitive patterns that govern object selection, with the use of decision trees. Lagrange et al. (2000) and Balboa and López (2008) also used ML techniques, namely neural networks, to generalize line objects. Recently, Sester et al. (2018) proposed the application of deep learning to the task of building generalization. As noticed by Sester et al. (2018), these ideas, although interesting, remained proofs of concept only. Moreover, they concerned topographic databases and large-scale maps. Promising results for automatic settlement selection at small scales were reported by Karsznia and Weibel (2018), who used data enrichment and ML to improve the settlement selection process. Thanks to classification models based on decision trees, they explored new variables that are decisive in the settlement selection process. However, they also concluded that there is probably still more “deep knowledge” to be discovered, possibly linked to further variables that were not included in their research. Thus, the motivation for this research is to fill this gap and look for additional, essential variables governing settlement selection at small scales.
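The sketch referenced above shows the classification-based selection idea: a decision tree learns, from labelled examples, whether a settlement should be retained at the target scale. The feature names and the training file are invented for illustration and are not the variables explored in the cited studies; features are assumed to be numerically encoded.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Assumed columns (hypothetical): population, admin_status, road_connections,
# density_of_neighbours, retained (1 = keep at the small scale, 0 = omit)
data = pd.read_csv("settlements_labelled.csv")  # hypothetical training file
X = data[["population", "admin_status", "road_connections", "density_of_neighbours"]]
y = data["retained"]

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
print("CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
tree.fit(X, y)  # the fitted tree can be inspected to read off explicit selection rules
```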


2020 ◽  
Author(s):  
Vita Ayoub ◽  
Carole Delenne ◽  
Patrick Matgen ◽  
Pascal Finaud-Guyot ◽  
Renaud Hostache

In hydrodynamic modelling, the mesh resolution has a strong impact on run time and result accuracy. Coarser meshes allow faster simulations, but often at the cost of accuracy. Conversely, finer meshes offer a better description of complex geometries but require much longer computational times, which makes their use at a large scale challenging. In this context, we aim to assess the potential of a two-dimensional shallow water model with depth-dependent porosity (SW2D-DDP) for flood simulations at a large scale. This modelling approach relies on nesting a sub-grid mesh containing high-resolution topographic and bathymetric data within each computational cell via a so-called depth-dependent storage porosity. It therefore enables faster simulations on rather coarse grids while preserving small-scale topographic information. The July 2007 flood event in the Severn River basin (UK) is used as a test case, for which hydrometric measurements and spatial data are available for evaluation. A sensitivity analysis is carried out to investigate the influence of the porosity on model performance, in comparison with other classical parameters such as boundary conditions.
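The depth-dependent storage porosity concept can be illustrated with a minimal sketch (not the SW2D-DDP implementation): for a coarse computational cell, the porosity at a given water surface elevation is taken as the wet fraction of the high-resolution sub-grid pixels nested in that cell.

```python
import numpy as np

def storage_porosity(subgrid_elevations, water_level):
    """Fraction of the coarse cell's sub-grid pixels lying below the water level."""
    z = np.asarray(subgrid_elevations, dtype=float).ravel()
    return float(np.count_nonzero(water_level > z)) / z.size

# Example: a 10 m coarse cell described by a 1 m DEM patch (hypothetical values)
# phi = storage_porosity(dem_patch, water_level=12.4)  # phi in [0, 1]
```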

