Scaling associative classification for very large datasets

2017 ◽  
Vol 4 (1) ◽  
Author(s):  
Luca Venturini ◽  
Elena Baralis ◽  
Paolo Garza
2020 ◽  
Vol 196 ◽  
pp. 105777
Author(s):  
Jadson Jose Monteiro Oliveira ◽  
Robson Leonardo Ferreira Cordeiro

2001 ◽  
Vol 27 (11) ◽  
pp. 1457-1478 ◽  
Author(s):  
Michael D Beynon ◽  
Tahsin Kurc ◽  
Umit Catalyurek ◽  
Chialin Chang ◽  
Alan Sussman ◽  
...  

2019 ◽  
Vol 35 (19) ◽  
pp. 3608-3616
Author(s):  
Ashley A Superson ◽  
Doug Phelan ◽  
Allyson Dekovich ◽  
Fabia U Battistuzzi

Abstract Motivation: The promise of higher phylogenetic stability through increased dataset sizes within tree of life (TOL) reconstructions has not been fulfilled. Among the many possible causes are changes in species composition (taxon sampling) that could influence the phylogenetic accuracy of the methods by altering the relative weight of the evolutionary histories of each individual species. This effect would be stronger in clades that are represented by few lineages, which is common in many prokaryote phyla. Indeed, phyla with fewer taxa showed the most discordance among recent TOL studies. We implemented an approach to systematically test how the identity of taxa within a larger dataset and the number of taxa included affected the accuracy of phylogenetic reconstruction. Results: Utilizing an empirical dataset within Terrabacteria, we found that even within scenarios consisting of the same number of taxa, the species used strongly affected phylogenetic stability. Furthermore, we found that trees with fewer species were more dissimilar to the tree produced from the full dataset. These results hold even when the tree is composed of many phyla and only one of them is being altered. Thus, the effect of taxon sampling in one group does not seem to be buffered by the presence of many other clades, making this issue relevant even to very large datasets. Our results suggest that a systematic evaluation of phylogenetic stability through taxon resampling is advisable even for very large datasets. Availability and implementation: https://github.com/BlabOaklandU/PATS.git. Supplementary information: Supplementary data are available at Bioinformatics online.


2016 ◽  
Vol 3 (5) ◽  
pp. 160225 ◽  
Author(s):  
Rhodri S. Wilson ◽  
Lei Yang ◽  
Alison Dun ◽  
Annya M. Smyth ◽  
Rory R. Duncan ◽  
...  

Recent advances in optical microscopy have enabled the acquisition of very large datasets from living cells with unprecedented spatial and temporal resolutions. Our ability to process these datasets is now essential to understanding many biological processes. In this paper, we present an automated particle detection algorithm capable of operating in low signal-to-noise fluorescence microscopy environments and handling large datasets. When combined with our particle linking framework, it can provide hitherto intractable quantitative measurements describing the dynamics of large cohorts of cellular components, from organelles to single molecules. We begin by validating the performance of our method on synthetic image data, and then extend the validation to include experimental images with ground truth. Finally, we apply the algorithm to two single-particle-tracking photo-activated localization microscopy biological datasets, acquired from living primary cells at very high temporal rates. Our analysis of the dynamics of very large cohorts of tens of thousands of membrane-associated protein molecules shows that they behave as if caged in nanodomains. We show that the robustness and efficiency of our method provide a tool for the examination of single-molecule behaviour with unprecedented spatial detail and high acquisition rates.


Econometrics ◽  
2015 ◽  
Vol 3 (2) ◽  
pp. 317-338 ◽  
Author(s):  
Sandy Burden ◽  
Noel Cressie ◽  
David Steel

2015 ◽  
Vol 30 (6) ◽  
pp. 1781-1794 ◽  
Author(s):  
Adam J. Clark ◽  
Andrew MacKenzie ◽  
Amy McGovern ◽  
Valliappa Lakshmanan ◽  
Rodger A. Brown

Abstract Moisture boundaries, or drylines, are common over the southern U.S. high plains and are one of the most important airmass boundaries for convective initiation over this region. In favorable environments, drylines can initiate storms that produce strong and violent tornadoes, large hail, lightning, and heavy rainfall. Despite their importance, there are few studies documenting climatological dryline location and frequency, or performing systematic dryline forecast evaluation, which likely stems from difficulties in objectively identifying drylines over large datasets. Previous studies have employed tedious manual identification procedures. This study aims to streamline dryline identification by developing an automated, multiparameter algorithm, which applies image-processing and pattern recognition techniques to various meteorological fields and their gradients to identify drylines. The algorithm is applied to five years of high-resolution 24-h forecasts from Weather Research and Forecasting (WRF) Model simulations valid April–June 2007–11. Manually identified dryline positions, which were available from a previous study using the same dataset, are used as truth to evaluate the algorithm performance. Generally, the algorithm performed very well. High probability of detection (POD) scores indicated that the majority of drylines were identified by the method. However, a relatively high false alarm ratio (FAR) was also found, indicating that a large number of nondryline features were also identified. Preliminary use of random forests (a machine learning technique) significantly decreased the FAR, while minimally impacting the POD. The algorithm lays the groundwork for applications including model evaluation and operational forecasting, and should enable efficient analysis of drylines from very large datasets.
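
The two scores named in the abstract above can be illustrated with a minimal sketch using the standard forecast-verification definitions: POD = hits / (hits + misses) and FAR = false alarms / (hits + false alarms). The `Contingency` class, the function names, and all counts below are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """Counts from comparing detected features against manual 'truth'."""
    hits: int          # true features that the algorithm detected
    misses: int        # true features that the algorithm did not detect
    false_alarms: int  # detected features that are not true features


def probability_of_detection(c: Contingency) -> float:
    """POD: fraction of true features that were detected."""
    observed = c.hits + c.misses
    if observed == 0:
        raise ValueError("POD is undefined when there are no true features")
    return c.hits / observed


def false_alarm_ratio(c: Contingency) -> float:
    """FAR: fraction of detections that were not true features."""
    detected = c.hits + c.false_alarms
    if detected == 0:
        raise ValueError("FAR is undefined when nothing was detected")
    return c.false_alarms / detected


if __name__ == "__main__":
    # Hypothetical counts: a detector alone, then with a post-filter applied.
    raw = Contingency(hits=80, misses=20, false_alarms=120)
    filtered = Contingency(hits=76, misses=24, false_alarms=19)
    for name, c in (("raw", raw), ("filtered", filtered)):
        print(f"{name}: POD={probability_of_detection(c):.2f} "
              f"FAR={false_alarm_ratio(c):.2f}")
```

With these made-up counts, a post-filter that rejects most false detections lowers the FAR sharply while costing only a few hits, which is the kind of trade-off the abstract describes for the random-forest step.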

