Nonparametric data science: Testing hypotheses in large complex data

Author(s):  
Sunil Mathur

The 2017 SIS Conference aims to highlight the crucial role of Statistics in Data Science. In this new domain, where ‘meaning’ is extracted from data, the increasing amount of data produced and made available in databases has brought new challenges. Addressing them involves different fields: statistics, machine learning, information and computer science, optimization, and pattern recognition. Together, these make a considerable contribution to the analysis of ‘Big data’, open data, and relational and complex data, both structured and unstructured. The interest is in collecting contributions from the different domains of Statistics on high-dimensional data quality validation, sample extraction, dimensionality reduction, pattern selection, data modelling, hypothesis testing, and confirming conclusions drawn from the data.


REGION ◽  
2020 ◽  
Vol 7 (1) ◽  
pp. 21-34 ◽  
Author(s):  
Jonathan Reades

The proliferation of large, complex spatial data sets presents challenges to the way that regional science—and geography more widely—is researched and taught. Increasingly, it is not ‘just’ quantitative skills that are needed, but computational ones. However, the majority of undergraduate programmes have yet to offer much more than a one-off ‘GIS programming’ class, since such courses are seen as challenging not only for students to take, but for staff to deliver. Using the evaluation criteria of minimal complexity, maximal flexibility, interactivity, utility, and maintainability, we show how the technical features of Jupyter notebooks—particularly when combined with the popularity of Anaconda Python and Docker—enabled us to develop and deliver a suite of three ‘geocomputation’ modules to Geography undergraduates, with some students progressing to data science and analytics roles.


2019 ◽  
Vol 16 (2) ◽  
pp. 729-734
Author(s):  
Krishna P. Chaitanya ◽  
Nagabhushana M. Rao

Data Science has developed into an impressive new scientific field, and related disputes and considerations have sought to address why science in general needs data science. However, few such arguments concern the intrinsic complexities and intelligence in data science. Data science pays attention to the efficient understanding of complex data and business-related problems, and its primary objective is the exploration of those complexities. Among them, environmental complexities are an important factor, and they can be reduced by using algorithms.


2013 ◽  
Author(s):  
Ryan Hafen ◽  
Luke Gosink ◽  
Jason McDermott ◽  
Karin Rodland ◽  
Kerstin Kleese-Van Dam ◽  
...  
Keyword(s):  

2021 ◽  
Author(s):  
Alice Fremand

Open data is not a new concept. Over sixty years ago, in 1959, knowledge sharing was at the heart of the Antarctic Treaty, which included in Article III 1(c) the statement: “scientific observations and results from Antarctica shall be exchanged and made freely available”. At around the same time, the World Data Centre (WDC) system was created to manage and distribute the data collected during the International Geophysical Year (1957-1958), led by the International Council of Science (ICSU), building the foundations of today’s research data management practices.

What about now? The WDC system still exists through the World Data System (WDS). Open data has been endorsed by a majority of funders and stakeholders, technology has dramatically evolved, and the profession of data manager/curator has emerged. Utilising their professional expertise means that their role is far wider than the long-term curation and publication of data sets.

Data managers are involved in all stages of the data life cycle: from data management planning and data accessioning to data publication and re-use. They implement open data policies, help write data management plans, and provide advice on how to manage data during, and beyond the life of, a science project. In liaison with software developers as well as scientists, they are developing new strategies to publish data via data catalogues, via more sophisticated map-based viewer services, or in machine-readable form via APIs. Often, they bring expertise in the field they are working in to better assist scientists in satisfying the Findable, Accessible, Interoperable and Re-usable (FAIR) principles. Recent years have seen the development of a large community of experts who are essential to sharing, discussing and setting new standards and procedures. Data are published to be re-used, and data managers are key to promoting high-quality datasets and participation in large data compilations.

To date, there is no magical formula for FAIR data. The Research Data Alliance is a great platform allowing data managers and researchers to work together to develop and adopt infrastructure that promotes data sharing and data-driven research. However, the challenge of properly describing each data set remains. Today, scientists expect more and more from their data publications and data requests: they want interactive maps, more complex data systems, and the ability to query data, combine data from different sources, and publish them rapidly. By developing new procedures and standards, and looking at new technologies, data managers help set the foundations of data science.
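
The machine-readable access mentioned above can be illustrated with a short sketch of programmatic dataset discovery over HTTP. The catalogue endpoint and the response fields below are hypothetical placeholders rather than any particular data centre's API; the point is only the pattern of querying a catalogue and reading structured metadata back.

```python
# Minimal sketch of machine-readable data discovery via a catalogue API.
# The endpoint URL and response fields are hypothetical placeholders,
# not the API of any specific data centre.
import requests

CATALOGUE_URL = "https://example-data-centre.org/api/datasets"  # hypothetical endpoint

def find_datasets(keyword):
    """Query the (hypothetical) catalogue and return basic dataset metadata."""
    response = requests.get(CATALOGUE_URL, params={"q": keyword}, timeout=30)
    response.raise_for_status()
    records = response.json()  # assume a JSON list of dataset records
    return [
        {"title": rec.get("title"), "doi": rec.get("doi"), "licence": rec.get("licence")}
        for rec in records
    ]

if __name__ == "__main__":
    for dataset in find_datasets("antarctic bathymetry"):
        print(dataset["doi"], "-", dataset["title"])
```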


2016 ◽  
Vol 35 (10) ◽  
pp. 906-909 ◽  
Author(s):  
Brendon Hall

There has been much excitement recently about big data and the dire need for data scientists who possess the ability to extract meaning from it. Geoscientists, meanwhile, have been doing science with voluminous data for years, without needing to brag about how big it is. But now that large, complex data sets are widely available, there has been a proliferation of tools and techniques for analyzing them. Many free and open-source packages now exist that provide powerful additions to the geoscientist's toolbox, much of which used to be available only in proprietary (and expensive) software platforms.
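
As a concrete flavour of the open-source toolbox described above, the sketch below fits a simple classifier to a synthetic, well-log-style table. The abstract does not name specific packages, so NumPy, pandas and scikit-learn are assumed here as typical examples, and the data and labelling rule are entirely made up.

```python
# Illustrative sketch only: NumPy, pandas and scikit-learn are assumed as
# examples of the open-source geoscience toolbox; the "well log" data and
# the two-class label are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Fake log measurements and a toy two-class "facies" label.
logs = pd.DataFrame({
    "gamma_ray": rng.normal(75, 20, n),
    "resistivity": rng.lognormal(1.0, 0.5, n),
    "porosity": rng.uniform(0.05, 0.35, n),
})
facies = (logs["gamma_ray"] > 80).astype(int)  # toy labelling rule

X_train, X_test, y_train, y_test = train_test_split(
    logs, facies, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.2f}")
```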


2020 ◽  
Vol 91 (3) ◽  
pp. 1804-1812 ◽  
Author(s):  
Jonathan MacCarthy ◽  
Omar Marcillo ◽  
Chad Trabant

Abstract Data-intensive research in seismology is experiencing a recent boom, driven in part by large volumes of available data and advances in the growing field of data science. However, there are significant barriers to processing large data volumes, such as long retrieval times from data repositories, complex data management, and limited computational resources. New tools and platforms have reduced the barriers to entry for scientific cluster computing, including the maturation of the commercial cloud as an accessible instrument for research. In this work, we build a customized research cluster in the cloud to test a new workflow for large-scale seismic analysis, in which data are processed as a stream (retrieved on-the-fly and acted upon without storing), with data from the Incorporated Research Institutions for Seismology Data Management Center. We use this workflow to deploy a spectral peak detection algorithm over 5.6 TB of compressed continuous seismic data from 2074 stations of the USArray Transportable Array EarthScope network. Using a 50-node cluster in the cloud, we completed the noise survey in 80 hr, with an average data throughput of 1.7 GB per minute. By varying cluster sizes, we find the scaling of our analysis to be sublinear, due to a combination of algorithmic limitations and data center response times. The cloud-based streaming workflow represents an order-of-magnitude increase in acquisition and processing speed compared to a traditional download-store-process workflow, and offers the additional benefits of employing a flexible, accessible, and widely used computing architecture. It is limited, however, due to its reliance on Internet transfer speeds and data center service capacity, and may not work well for repeated analyses or those for which even higher data throughputs are needed. These research applications will require a new class of cloud-native approaches in which both data and analysis are in the cloud.
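
The streaming idea described above (retrieve data on the fly, act on it, and keep only derived products) can be sketched for a single station with ObsPy's FDSN client and SciPy. This is not the paper's deployment: the study's cluster orchestration and its spectral peak detection algorithm are not reproduced here, and the station (a long-running IU station rather than a Transportable Array one), time window, and peak threshold below are illustrative assumptions.

```python
# Hedged single-station sketch of a "stream, process, discard" workflow.
# The real study distributed a custom spectral peak detector across a
# 50-node cloud cluster; the station, window, and detection settings
# here are only illustrative assumptions.
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from scipy.signal import welch, find_peaks

client = Client("IRIS")  # IRIS DMC FDSN web services

t0 = UTCDateTime("2013-06-01T00:00:00")            # arbitrary example hour
st = client.get_waveforms("IU", "ANMO", "00", "BHZ", t0, t0 + 3600)
st.merge(fill_value=0)
st.detrend("demean")

tr = st[0]
freqs, psd = welch(tr.data, fs=tr.stats.sampling_rate, nperseg=4096)

# Keep only the derived product (candidate peak frequencies), not the raw data.
peaks, _ = find_peaks(psd, prominence=psd.max() * 0.05)  # toy threshold
print("Candidate spectral peaks (Hz):", freqs[peaks])
```

In the study itself, tasks of roughly this shape were fanned out across the cluster, which is where the reported aggregate throughput comes from.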


Author(s):  
Massimo Brescia ◽  
Stefano Cavuoti ◽  
Oleksandra Razim ◽  
Valeria Amaro ◽  
Giuseppe Riccio ◽  
...  

The importance of data-driven science within Astrophysics is constantly increasing, due to the huge amount of multi-wavelength data collected every day, characterized by complex, high-volume information that requires efficient and, as far as possible, automated exploration tools. Furthermore, to accomplish the main and legacy science objectives of future or incoming large and deep survey projects, such as the James Webb Space Telescope (JWST), the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST), and Euclid, a crucial role is played by accurate estimation of photometric redshifts, whose knowledge would permit the detection and analysis of extended and peculiar sources by disentangling low-z from high-z sources, and would contribute to solving modern cosmological discrepancies. The recent photometric redshift data challenges, organized within several survey projects such as LSST and Euclid, pushed the exploitation of observed multi-wavelength, multi-dimensional data and ad hoc simulated data to improve and optimize photometric redshift prediction and statistical characterization, based on both Spectral Energy Distribution (SED) template fitting and machine learning methodologies. They also provided new impetus to the investigation of hybrid and deep learning techniques, aimed at combining the strengths of different methodologies, thus optimizing estimation accuracy and maximizing photometric range coverage, which are particularly important in the high-z regime, where spectroscopic ground truth is scarce. In this context, we summarize what has been learned and proposed in more than a decade of research.
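
As a minimal illustration of the machine learning side of photometric redshift estimation discussed above, the sketch below trains a regressor to map photometric magnitudes and colours to redshift. The synthetic data, feature set, and random forest model are assumptions chosen for illustration; they do not correspond to any of the specific methods, surveys, or data challenges reviewed.

```python
# Minimal, illustrative photo-z sketch: a regressor maps magnitudes and
# colours to redshift. The synthetic data, features, and model are
# assumptions, not any of the specific methods reviewed above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000

z_true = rng.uniform(0.0, 3.0, n)  # fake "spectroscopic" redshifts
# Fake magnitudes in five bands, loosely correlated with redshift plus noise.
mags = np.column_stack([
    20.0 + 1.5 * z_true + rng.normal(0, 0.3, n) + band * 0.2
    for band in range(5)
])
colours = np.diff(mags, axis=1)          # adjacent-band colours
features = np.hstack([mags, colours])

X_train, X_test, z_train, z_test = train_test_split(
    features, z_true, test_size=0.3, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, z_train)
z_phot = model.predict(X_test)

# Normalised median absolute deviation, a quality metric commonly used
# in photo-z comparisons.
dz = (z_phot - z_test) / (1.0 + z_test)
nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))
print(f"NMAD = {nmad:.3f}")
```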

