Phyloreferences: Tree-Native, Reproducible, and Machine-Interpretable Taxon Concepts

Evolutionary and organismal biology, like other fields in biology, have become inundated with data. At the same time, we are experiencing a surge in broader evolutionary and ecological syntheses for which tree-thinking is the staple for a variety of post-tree analyses. To take full advantage of this wealth of data to discover and understand large-scale evolutionary and ecological patterns, computational data integration, i.e. the use of machines to link data at large scale by shared entities, is crucial. The most common shared entity by which evolutionary and ecological data need to be linked is the taxon to which they belong. In this paper, we propose a set of requirements that a system for defining such taxa should meet for computational data science: taxon definitions should maintain conceptual consistency, be reproducible via a known algorithm, be computationally automatable, and be applicable across the tree of life. We argue that Linnaean names based in Linnaean taxonomy, by far the most prevalent means of linking data to taxa, fail to meet these requirements due to fundamental theoretical and practical shortfalls. We argue that for the purposes of data integration we should instead use phylogenetic clade definitions transformed into formal logic expressions. We call such expressions phyloreferences, and argue that, unlike Linnaean names, they meet all requirements for effective data integration.
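
As a rough illustration of what "reproducible via a known algorithm" can mean, the sketch below resolves a minimum-clade definition (the smallest clade containing two specifier taxa) against a small tree. The tree, the taxon names, and the `resolve_min_clade` helper are hypothetical and chosen only for illustration; the paper describes expressing such definitions in formal logic, which this plain-Python sketch does not attempt.

```python
# Hypothetical toy tree stored as child -> parent; "root" has no parent.
PARENT = {
    "A": "n1", "B": "n1",
    "n1": "n2", "C": "n2",
    "n2": "root", "D": "root",
}

def ancestors(node, parent):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def resolve_min_clade(spec_a, spec_b, parent):
    """Return (mrca, sorted leaves) for the smallest clade holding both specifiers."""
    seen = set(ancestors(spec_a, parent))
    # The first node on spec_b's path that is also above spec_a is their MRCA.
    mrca = next(n for n in ancestors(spec_b, parent) if n in seen)
    internal = set(parent.values())
    leaves = sorted(
        n for n in parent
        if n not in internal and mrca in ancestors(n, parent)
    )
    return mrca, leaves

mrca, members = resolve_min_clade("A", "C", PARENT)
print(mrca, members)
```

Running the same definition against a different tree would return whatever clade that tree supports, which is the sense in which a tree-based definition stays conceptually consistent while its membership can change.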

Connected data landscape of long-term ecological studies: the SPI-Birds data hub

10.32942/osf.io/6gea7 ◽ 2020

Author(s): Antica Culina ◽ Frank Adriaensen ◽ Liam D. Bailey ◽ Malcolm D. Burgess ◽ Anne Charmantier ◽ ...

Keyword(s): Data Integration ◽ Community Involvement ◽ Large Scale ◽ Lessons Learned ◽ Ecological Data ◽ Ecological Processes ◽ Meta Data ◽ Standard Format ◽ Global Issues

The integration and synthesis of data in different areas of science is drastically slowed and hindered by a lack of standards and networking programmes. Long-term studies of individually marked animals are no exception. These studies are especially important because they are instrumental for understanding evolutionary and ecological processes in the wild. Further, their number and global distribution provide a unique opportunity to assess the generality of patterns and to address broad-scale global issues (e.g. climate change). To solve data integration issues and enable a new scale of ecological and evolutionary research based on long-term studies of birds, we have created the SPI-Birds Network and Database (www.spibirds.org), a large-scale initiative that connects data from, and researchers working on, studies of wild populations of individually recognizable (usually ringed) birds. Within a year of its establishment, SPI-Birds counts 120 members working on more than 80 populations, with data concerning breeding attempts of almost a million individual birds over 1700 cumulative years, and counting. SPI-Birds acts as a data hub and a catalogue of studied populations. It prevents data loss, makes data easy to find, use and integrate, and thus facilitates collaboration and synthesis. We provide community-derived data and meta-data standards and improve data integrity, guided by the principles of Findable, Accessible, Interoperable, and Reusable (FAIR) data and aligned with existing metadata languages (e.g. ecological meta-data language). The encouraging community involvement stems from SPI-Birds' decentralized approach: research groups retain full control over data use and their way of data management, while SPI-Birds creates tailored pipelines to convert each unique data format into a standard format. We outline the lessons learned, so that other communities (e.g. those working on other taxa) can adapt our successful model. Creating community-specific hubs (such as ours, COMADRE for animal demography, etc.) will aid much-needed large-scale ecological data integration.
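
The decentralized design described above, in which each group keeps its own format and a tailored pipeline maps it to a shared one, might look roughly like the following sketch. The field names, the two source formats, and the `StandardBrood` record are invented for illustration and are not the SPI-Birds standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StandardBrood:
    """Hypothetical standard record; not the actual SPI-Birds schema."""
    population: str
    year: int
    clutch_size: int

def convert_site_a(row):
    # Site A (hypothetical) stores year and clutch size as strings.
    return StandardBrood("SiteA", int(row["yr"]), int(row["eggs"]))

def convert_site_b(row):
    # Site B (hypothetical) stores an ISO date and a differently named column.
    return StandardBrood("SiteB", int(row["laying_date"][:4]), row["n_eggs"])

# One tailored pipeline per source format, all producing the same record type.
PIPELINES = {"site_a": convert_site_a, "site_b": convert_site_b}

def to_standard(source, rows):
    """Convert rows from one named source into standard records."""
    return [PIPELINES[source](row) for row in rows]

broods = (
    to_standard("site_a", [{"yr": "2019", "eggs": "8"}])
    + to_standard("site_b", [{"laying_date": "2020-04-17", "n_eggs": 6}])
)
print(broods)
```

Keeping one converter per source means a group can change its own storage format without touching anyone else's pipeline; only its converter needs updating.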

When didactics meet data science: process data analysis in large-scale mathematics assessment in France

Large-scale Assessments in Education ◽ 10.1186/s40536-020-00085-y ◽ 2020 ◽ Vol 8 (1)

Author(s): Franck Salles ◽ Reinaldo Dos Santos ◽ Saskia Keskpaik

Keyword(s): Data Analysis ◽ Large Scale ◽ Data Science ◽ Mathematics Assessment ◽ Process Data ◽ Meet Data

Large-scale Data Integration for Facilities Analytics: Challenges and Opportunities

2020 IEEE International Conference on Big Data (Big Data) ◽ 10.1109/bigdata50022.2020.9378440 ◽ 2020

Author(s): Balaje T. Thumati ◽ Halasya Siva Subramania ◽ Rajeev Shastri ◽ Karthik Kalyana Kumar ◽ Nicole Hessner ◽ ...

Keyword(s): Data Integration ◽ Large Scale ◽ Large Scale Data ◽ Challenges And Opportunities ◽ Scale Data

Collective Development of Large Scale Data Science Products via Modularized Assignments

Proceedings of the 51st ACM Technical Symposium on Computer Science Education ◽ 10.1145/3328778.3366961 ◽ 2020

Author(s): Bhavya ◽ Assma Boughoula ◽ Aaron Green ◽ ChengXiang Zhai

Keyword(s): Large Scale ◽ Data Science ◽ Large Scale Data ◽ Collective Development ◽ Scale Data

MuSA: a graphical user interface for multi-OMICs data integration in radiogenomic studies

Scientific Reports ◽ 10.1038/s41598-021-81200-z ◽ 2021 ◽ Vol 11 (1)

Author(s): Mario Zanfardino ◽ Rossana Castaldo ◽ Katia Pane ◽ Ornella Affinito ◽ Marco Aiello ◽ ...

Keyword(s): User Interface ◽ Data Integration ◽ Graphical User Interface ◽ Data Science ◽ Heterogeneous Data ◽ Biological Information ◽ Omics Data ◽ Correlation Clustering ◽ Downstream Analysis ◽ Omics Data Integration

Analysis of large-scale omics data along with biomedical images has gained huge interest for predicting phenotypic conditions towards personalized medicine. Multiple layers of investigation, such as genomics, transcriptomics and proteomics, have led to high dimensionality and heterogeneity of data. Multi-omics data integration can make a meaningful contribution to early diagnosis and to an accurate estimate of prognosis and treatment in cancer. Some multi-layer data structures have been developed to integrate multi-omics biological information, but none of these has been developed and evaluated to include radiomic data. We proposed to use MultiAssayExperiment (MAE) as an integrated data structure to combine multi-omics data, facilitating the exploration of heterogeneous data. We improved the usability of the MAE by developing a Multi-omics Statistical Approaches (MuSA) tool that uses a Shiny graphical user interface to simplify the management and analysis of radiogenomic datasets. The capabilities of MuSA were shown using public breast cancer datasets from the TCGA-TCIA databases. The MuSA architecture is modular and is divided into pre-processing and downstream analysis sections. The pre-processing section allows data filtering and normalization. The downstream analysis section contains modules for data science such as correlation, clustering (i.e., heatmap) and feature selection methods. The results are dynamically shown in MuSA. The MuSA tool provides an easy-to-use way to create, manage and analyze radiogenomic data. The application is specifically designed to guide non-programmer researchers through the different computational steps. Integration analysis is implemented in a modular structure, making MuSA easily extensible open-source software.

Large-Scale Analysis of Gene Expression and Connectivity in the Rodent Brain: Insights through Data Integration

Frontiers in Neuroinformatics ◽ 10.3389/fninf.2011.00012 ◽ 2011 ◽ Vol 5 ◽ Cited By ~ 17

Author(s): Leon French ◽ Powell Patrick Cheng Tan ◽ Paul Pavlidis

Keyword(s): Gene Expression ◽ Data Integration ◽ Large Scale ◽ Scale Analysis ◽ Rodent Brain ◽ Large Scale Analysis

Landscape analysis through pictorial transects in degraded lands.

10.5194/egusphere-egu21-14477 ◽ 2021

Author(s): Juan Antonio Campos ◽ Jaime Villena ◽ Marta M. Moreno ◽ Jesús D. Peco ◽ Mónica Sánchez-Ormeño ◽ ...

Keyword(s): Large Scale ◽ Fine Particles ◽ Plant Cover ◽ Degraded Lands ◽ Ecological Resources ◽ Allochthonous Species ◽ Ecological Patterns ◽ Good Characterization ◽ Photosynthetic Potential ◽ High Degree

Understanding the dynamics of plant populations and their relationship with the characteristics of the terrain (slope, texture, etc.) and with particular phenomena (erosion, pollution, environmental constraints, etc.) that could affect them is crucial for managing regeneration and rehabilitation projects in degraded lands. In recent years, the emphasis has been placed on the observation and assessment of microtopographic drivers, as they lead to large-scale phenomena. All the ecological variables that affect a given area are interconnected, and success in unraveling the ecological patterns of operation relies on a good characterization of all the parameters involved. It is especially interesting to study the natural colonization processes that take place in Mediterranean areas with a high degree of seasonality, where the presence of pollutants and various anthropic actions add to the climatic restrictions. For these degraded areas, we propose a new tool that we have come to call "pictorial transects": one-dimensional artificial transects built from photographs, each covering a small area (2 m²), taken along a line of work (transect), in which one can see the points where ecological resources are generated, stored and lost, and their fluctuation over time. A derivative of these is the "green transect", in which the green color has been discriminated using the open software ImageJ. It is an inexpensive, fast and straightforward pictorial method that can be used to research and monitor the spatial and temporal fluctuation of the potential input of resources (organic matter, water, fine particles, etc.) to the ecosystem. The information obtained from pictorial transects not only covers the measurement of the photosynthetic potential per unit area and the location of the critical points (generation, storage or sink of resources) but also makes it possible to monitor the specific composition of the plant cover. For appropriate use of this methodology, the criteria that determine the direction and length of the different transects must be carefully established beforehand, according to the objectives of the study. For example, a radial transect in a salty pond will give information on the changes in plant cover as one moves away from the center and the salinity decreases. In the same pond, a transect parallel to the shore will give information on changes in the vegetation that do not depend on the degree of salinity. There are some cases in which this method could be very useful, such as the natural colonization of a degraded mine site or assessing the progression of the area affected by allochthonous species or weeds in extensive crops.
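
To make the "green transect" idea concrete, the sketch below computes the share of green pixels in each photograph along a transect. The thresholding rule (green channel exceeding both red and blue by a fixed margin), the margin value, and the toy pixel data are assumptions for illustration only; the abstract states that the authors discriminate green with image-analysis software and does not describe the rule they use.

```python
def green_fraction(pixels, margin=20):
    """Share of pixels whose green channel exceeds red and blue by `margin`.

    `pixels` is a list of (r, g, b) tuples with 0-255 values. The threshold
    rule is an assumption for illustration, not the authors' procedure.
    """
    if not pixels:
        return 0.0
    green = sum(1 for r, g, b in pixels if g - max(r, b) >= margin)
    return green / len(pixels)

def green_transect(photos, margin=20):
    """Green fraction for each photo, in order along the transect."""
    return [green_fraction(p, margin) for p in photos]

# Three tiny hypothetical "photos" along a transect: bare, mixed, vegetated.
bare = [(120, 100, 80)] * 4
mixed = [(120, 100, 80), (60, 140, 50), (70, 150, 60), (110, 105, 90)]
vegetated = [(50, 160, 40)] * 4
profile = green_transect([bare, mixed, vegetated])
print(profile)
```

Repeating the same calculation on photographs taken at later dates would give the temporal fluctuation that the abstract describes.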

Benchmarking driving efficiency using data science techniques applied on large-scale smartphone data

10.12681/eadd/44854 ◽ 2018

Author(s): Δημήτριος Τσελέντης

Keyword(s): Data Envelopment Analysis ◽ Convex Hull ◽ Large Scale ◽ Data Science ◽ Data Envelopment ◽ Using Data

The main objective of this doctoral dissertation is to develop a comprehensive methodological approach for benchmarking driving performance in terms of road safety, at both the trip and the driver level, using data science techniques. The approach relies on defining a performance index that is based on the theory of Data Envelopment Analysis (DEA) and is related to macroscopic behavioural driving characteristics, such as the number of harsh accelerations and decelerations, the time spent using a mobile phone and the time spent exceeding the speed limit. In addition, machine learning models are developed to identify distinct driving profiles based on the evolution of driving performance over time. The proposed approach is applied to real, large-scale driving data collected from smartphones, which are analysed with statistical methods to determine the amount of driving data required for the analysis. The results show that the optimized convex-hull DEA algorithm gives equally accurate and faster results than classical DEA approaches. Furthermore, the methodology makes it possible to identify the least efficient trips in a database, as well as the efficient level of driving metrics that a trip would need to reach to become more efficient in terms of safety. Further clustering of drivers according to their performance over time leads to the identification of three groups of drivers: the average driver, the unstable driver and the least risky driver. The results show that prior knowledge of the user's accident history appears to affect only the composition of the second cluster, that of the most unstable drivers, which comprises the drivers who are less efficient and unstable in terms of safety. It also appears that mobile phone use is not a critical factor in determining a driver's safety performance, since only small differences in this driving characteristic were found between drivers in different performance classes. Moreover, it is shown that a different sampling of driving data is required for each (a) road type, (b) driving characteristic and (c) level of driving aggressiveness in order to collect enough data, obtain a clear picture of driving behaviour and perform an analysis using DEA. The results could be used to provide drivers with personalised feedback on their overall driving performance and its evolution, in order to improve it and reduce the risk of an accident.
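
For intuition about a DEA-based performance index: under constant returns to scale with a single input and a single output, the DEA efficiency of a unit reduces to its output/input ratio divided by the best ratio in the sample. The sketch below covers only that special case. The trips, the choice of distance as output and harsh events as input, and the function name are assumptions for illustration; the dissertation's multi-variable, convex-hull-based algorithm is not reproduced here.

```python
def dea_efficiency_1x1(units):
    """Constant-returns DEA efficiency for single-input, single-output units.

    `units` maps a name to (input, output). Every input must be > 0 and at
    least one output must be > 0, otherwise the ratios are undefined.
    """
    ratios = {name: out / inp for name, (inp, out) in units.items()}
    best = max(ratios.values())
    # A unit on the efficient frontier scores 1.0; all others score below 1.0.
    return {name: ratio / best for name, ratio in ratios.items()}

# Hypothetical trips: (harsh events, kilometres driven).
trips = {"t1": (2, 40.0), "t2": (5, 50.0), "t3": (1, 10.0)}
scores = dea_efficiency_1x1(trips)
print(scores)
```

With several inputs and outputs the same idea requires solving a linear program per unit, which is where a faster frontier computation becomes relevant.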

Feasibility and Evaluation of a Large-Scale External Validation Approach for Patient-Level Prediction in an International Data Network: Validation of models predicting stroke in female patients newly diagnosed with atrial fibrillation.

10.21203/rs.2.11750/v2 ◽ 2020

Author(s): Jenna Marie Reps ◽ Ross Williams ◽ Seng Chan You ◽ Thomas Falconer ◽ Evan Minty ◽ ...

Keyword(s): Atrial Fibrillation ◽ Large Scale ◽ Data Science ◽ Prediction Models ◽ External Validation ◽ Scale Up ◽ R Package ◽ Prognostic Models ◽ Healthcare Data ◽ Patient Level

Objective: To demonstrate how the Observational Health Data Sciences and Informatics (OHDSI) collaborative network and standardization can be used to scale up external validation of patient-level prediction models by enabling validation across a large number of heterogeneous observational healthcare datasets. Materials & Methods: Five previously published prognostic models (ATRIA, CHADS2, CHADS2VASC, Q-Stroke and Framingham) that predict future risk of stroke in patients with atrial fibrillation were replicated using the OHDSI frameworks. A network study was run that enabled the five models to be externally validated across nine observational healthcare datasets spanning three countries and five independent sites. Results: The five existing models could be integrated into the OHDSI framework for patient-level prediction, and they obtained mean c-statistics ranging from 0.57 to 0.63 across the six databases with sufficient data to predict stroke within 1 year of initial atrial fibrillation diagnosis for females with atrial fibrillation. This was comparable with existing validation studies. The validation network study was run across nine datasets within 60 days once the models were replicated. An R package for the study was published at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/ExistingStrokeRiskExternalValidation. Discussion: This study demonstrates the ability to scale up external validation of patient-level prediction models using a collaboration of researchers and a data standardization that enable models to be readily shared across data sites. External validation is necessary to understand the transportability or reproducibility of a prediction model, but without collaborative approaches it can take three or more years for a model to be validated by one independent researcher. Conclusion: In this paper we show that it is possible to both scale up and speed up external validation by showing how validation can be done across multiple databases in less than 2 months. We recommend that researchers developing new prediction models use the OHDSI network to externally validate their models.
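
The c-statistic reported above is the probability that a randomly chosen patient who had the outcome received a higher predicted risk than a randomly chosen patient who did not, with ties counted as one half. A minimal sketch of that calculation follows; the risk scores and outcomes are made up, and the function is illustrative rather than part of the OHDSI R package.

```python
def c_statistic(risks, outcomes):
    """Concordance over all (event, non-event) pairs; ties count as 0.5."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return concordant / (len(events) * len(non_events))

# Made-up predicted risks and observed outcomes (1 = stroke within follow-up).
risks = [0.9, 0.4, 0.4, 0.2, 0.7]
outcomes = [1, 1, 0, 0, 0]
auc = c_statistic(risks, outcomes)
print(auc)
```

A value of 0.5 corresponds to ranking patients no better than chance, which puts the reported range of 0.57 to 0.63 in context.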

Entering the Era of Data Science: Targeted Learning and the Integration of Statistics and Computational Data Analysis

Advances in Statistics ◽ 10.1155/2014/502678 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-19 ◽ Cited By ~ 6

Author(s): Mark J. van der Laan ◽ Richard J. C. M. Starmans

Keyword(s): Big Data ◽ Data Science ◽ State Of The Art ◽ Accurate Estimation ◽ Learning Tools ◽ New Developments ◽ Data Movement ◽ Computational Data ◽ Adaptive Estimators ◽ Data Adaptive

This outlook paper reviews the research of van der Laan’s group on Targeted Learning, a subfield of statistics concerned with the construction of data-adaptive estimators of user-supplied target parameters of the probability distribution of the data, and of corresponding confidence intervals, while relying only on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field that is concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statistical conclusions. We also provide a philosophical and historical perspective on Targeted Learning, relating it to the new developments in Big Data. We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current Big Data movement.
