scientific datasets
Recently Published Documents


TOTAL DOCUMENTS

87
(FIVE YEARS 19)

H-INDEX

11
(FIVE YEARS 3)

2021 ◽  
Author(s):  
David Krasowska ◽  
Julie Bessac ◽  
Robert Underwood ◽  
Jon C. Calhoun ◽  
Sheng Di ◽  
...  

2021 ◽  
Author(s):  
Mandana Mazaheri ◽  
Gregory Kiar ◽  
Tristan Glatard

Algorithms ◽  
2021 ◽  
Vol 14 (10) ◽  
pp. 285
Author(s):  
Hao-Yi Yang ◽  
Zhi-Rong Lin ◽  
Ko-Chih Wang

The use of distribution-based data representation to handle large-scale scientific datasets is a promising approach. Distribution-based approaches often transform a scientific dataset into many distributions, each of which is calculated from a small number of samples. Most proposed parallel algorithms focus on efficiently modeling a single distribution from many input samples, so they may not fit the large-scale scientific data processing scenario, where they cannot utilize computing resources effectively. Histograms and the Gaussian Mixture Model (GMM) are the most popular distribution representations used to model scientific datasets. Therefore, we propose multi-set histogram and GMM modeling algorithms for large-scale scientific data processing. Our algorithms are built from data-parallel primitives to achieve portability across different hardware architectures. We evaluate the performance of the proposed algorithms in detail and demonstrate use cases for scientific data processing.
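
The multi-set idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: it only assumes that many small sample sets can be histogrammed together by mapping each sample to a flattened (set, bin) key and then counting per key, which is the map and reduce-by-key pattern that data-parallel primitive libraries offer. The function name `multi_set_histograms` and its parameters are hypothetical, and plain Python stands in for a real parallel backend.

```python
from collections import Counter


def multi_set_histograms(sets, num_bins, lo, hi):
    """Build one fixed-range histogram per sample set in a single pass.

    Hypothetical helper: `sets` is a list of sample lists, and every
    histogram shares the same range [lo, hi] and the same bin count.
    """
    width = (hi - lo) / num_bins

    def to_key(set_id, value):
        # Map step: compute the bin, clamp it into range, and flatten
        # the (set, bin) pair into a single integer key.
        b = int((value - lo) / width)
        b = min(max(b, 0), num_bins - 1)
        return set_id * num_bins + b

    # One flat stream of keys across all sets, so the work is not
    # limited by the small size of any individual set.
    keys = (to_key(i, v) for i, samples in enumerate(sets) for v in samples)

    # Reduce-by-key step: count how often each key occurs.
    counts = Counter(keys)

    # Unflatten the counts back into one histogram per set.
    return [
        [counts.get(i * num_bins + b, 0) for b in range(num_bins)]
        for i in range(len(sets))
    ]


if __name__ == "__main__":
    print(multi_set_histograms([[0.1, 0.2, 0.9], [0.5, 0.5], []], 2, 0.0, 1.0))
```

Flattening the keys is the design choice that matters here: a parallel implementation can process all samples of all sets as one large array instead of launching one small job per distribution.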


2021 ◽  
Vol 13 (3) ◽  
pp. 1-74
Author(s):  
Manpreet Singh Katari ◽  
Sudarshini Tyagi ◽  
Dennis Shasha

Author(s):  
Juliane Müller ◽  
Boris Faybishenko ◽  
Deborah Agarwal ◽  
Stephen Bailey ◽  
Chongya Jiang ◽  
...  

Author(s):  
Ryan Abernathey ◽  
Tom Augspurger ◽  
Anderson Banihirwe ◽  
Charles C Blackmon-Luca ◽  
Timothy J Crone ◽  
...  

Scientific data has traditionally been distributed via downloads from a data server to a local computer. This way of working suffers from limitations as scientific datasets grow towards the petabyte scale. A “cloud-native data repository,” as defined in this paper, offers several advantages over traditional data repositories: performance, reliability, cost-effectiveness, collaboration, reproducibility, creativity, downstream impacts, and access and inclusion. These objectives motivate a set of best practices for cloud-native data repositories: analysis-ready data, cloud-optimized (ARCO) formats, and loose coupling with data-proximate computing. The Pangeo Project has developed a prototype implementation of these principles by using open-source scientific Python tools. By providing an ARCO data catalog together with on-demand, scalable distributed computing, Pangeo enables users to process big data at rates exceeding 10 GB/s. Several challenges must be resolved in order to realize cloud computing’s full potential for scientific research, such as organizing funding, training users, and enforcing data privacy requirements.


Mathematics ◽  
2020 ◽  
Vol 8 (6) ◽  
pp. 956 ◽  
Author(s):  
Shahryar Rahnamayan ◽  
Sedigheh Mahdavi ◽  
Kalyanmoy Deb ◽  
Azam Asilian Bidgoli

The ranking of multi-metric scientific achievements is a challenging task. For example, the scientific ranking of researchers utilizes two major types of indicators, namely the number of publications and the number of citations. Existing approaches focus on how to select proper indicators, considering either a single indicator or a combination of them. The majority of ranking methods combine several indicators, but these methods face a challenging concern: the assignment of suitable or optimal weights to the targeted indicators. Pareto optimality is a measure of efficiency in multi-objective optimization, which seeks optimal solutions by considering multiple criteria or objectives simultaneously. The performance of the basic Pareto dominance depth ranking strategy decreases as the number of criteria increases (generally speaking, beyond three criteria). In this paper, a new, modified Pareto dominance depth ranking strategy is proposed, which uses dominance metrics obtained from the basic Pareto dominance depth ranking, together with sorted statistical metrics, to rank scientific achievements. It attempts to find the clusters of compared data by using all of the indicators simultaneously. Furthermore, we apply the proposed method to the multi-source ranking resolution problem, which is very common these days; for example, several worldwide institutions rank the world’s universities every year, but their rankings are not consistent. As case studies, the proposed method was used to rank several scientific datasets (i.e., researchers, universities, and countries) as a proof of concept.
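
For readers unfamiliar with the baseline that the abstract above builds on, the following minimal sketch shows the basic Pareto dominance depth ranking only, not the modified strategy the paper proposes. It assumes that higher values are better on every indicator; the function names and the sample (publications, citations) tuples are hypothetical.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every indicator and
    strictly better on at least one (higher is assumed to be better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_depth(points):
    """Return the Pareto front index of each point (1 = non-dominated).

    Fronts are peeled off one at a time: the points that nothing in the
    remaining set dominates form the next front.
    """
    remaining = set(range(len(points)))
    depth = [0] * len(points)
    level = 1
    while remaining:
        front = {
            i
            for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            depth[i] = level
        remaining -= front
        level += 1
    return depth


if __name__ == "__main__":
    # Hypothetical (publications, citations) pairs for five researchers.
    researchers = [(10, 200), (8, 250), (5, 100), (10, 200), (1, 10)]
    print(pareto_depth(researchers))
```

No weights are assigned to the indicators, which is the appeal of the approach. With many indicators, however, most points tend to become mutually non-dominated and land in the first front, which is consistent with the loss of performance the abstract describes.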


2020 ◽  
Vol 14 (2) ◽  
pp. 101013 ◽  
Author(s):  
Tong Zeng ◽  
Longfeng Wu ◽  
Sarah Bratt ◽  
Daniel E. Acuna

2020 ◽  
Author(s):  
Robert Huber ◽  
Anusuriya Devaraju ◽  
Michael Diepenbroek ◽  
Uwe Schindler ◽  
Roland Koppe ◽  
...  

Pressing environmental and societal challenges demand the reuse of data on a much larger scale. Central to improvements on this front are approaches that support structured and detailed descriptions of published data. In general, the reusability of scientific datasets, such as measurements generated by instruments, observations collected in the field, and model simulation outputs, requires information about the contexts through which they were produced. These contexts include the instrumentation, methods, and analysis software used. In current data curation practice, data providers often put significant effort into capturing descriptive metadata about datasets. Nonetheless, metadata about instruments and methods provided by data authors are limited, and in most cases are unstructured.

The ‘Interoperability’ principle of FAIR emphasizes the importance of using formal vocabularies to enable machine-understandability of data and metadata, and of establishing links between data and related research entities to provide their contextual information (e.g., devices and methods). To support FAIR data, PANGAEA is currently elaborating workflows to enrich instrument information of scientific datasets utilizing internal as well as third-party services and ontologies and their identifiers. This abstract presents our ongoing development within the projects FREYA and FAIRsFAIR as follows:

- Integrating the AWI O2A (Observations to Archives) framework and associated suite of tools within PANGAEA’s curatorial workflow, as well as semi-automated ingestion of observatory data.
- Linking data with their observation sources (devices) by recording the persistent identifiers (PID) from the O2A sensor registry system (sensor.awi.de) as part of the PANGAEA instrumentation database.
- Enriching device and method descriptions of scientific data by annotating them with appropriate vocabularies such as the NERC device type and device vocabularies or scientific methodology classifications.

In our contribution we will also outline the challenges to be addressed in enabling FAIR vocabularies of instruments and methods. This includes questions regarding the reliability and trustworthiness of third-party ontologies and services. Further challenges include content synchronisation across linked resources and implications for the FAIRness levels of datasets, such as dependencies on interlinked data sources and vocabularies.

We will show to what extent adapting, harmonizing and controlling the vocabularies used, as well as the identifier systems between data provider and data publisher, improves the findability and reusability of datasets while keeping the curatorial overhead as low as possible. This use case is a valuable example of how improving interoperability through harmonization efforts, though initially problematic and labor intensive, can benefit a multitude of stakeholders in the long run: data users, publishers, research institutes, and funders.


2020 ◽  
Vol 60 (3) ◽  
pp. 1235-1244 ◽  
Author(s):  
Jian Jiang ◽  
Rui Wang ◽  
Menglun Wang ◽  
Kaifu Gao ◽  
Duc Duy Nguyen ◽  
...  
