Saraga: Open Datasets for Research on Indian Art Music

We introduce two large open data collections of Indian Art Music, both its Carnatic and Hindustani traditions, comprising audio from vocal concerts, editorial metadata, and time-aligned melody, rhythm, and structure annotations. Shared under Creative Commons licenses, they currently form the largest annotated data collections available for computational analysis of Indian Art Music. The collections are intended to provide audio and ground truth for several music information research tasks and large-scale data-driven analysis in musicological studies. A part of the Saraga Carnatic collection also has multitrack recordings, making it a valuable collection for research on melody extraction, source separation, automatic mixing, and performance analysis. We describe the tenets and the process of collection, annotation, and organization of the data. We provide easy access to the audio, metadata, and the annotations in the collections through an API, along with a companion website that has example scripts to facilitate access and use of the data. To sustain and grow the collections, we provide a mechanism for both the research and music community to contribute additional data and annotations to the collections. We also present applications with the collections for music education, understanding, exploration, and discovery.

OpenCitations, an infrastructure organization for open scholarship

Quantitative Science Studies ◽

10.1162/qss_a_00023 ◽

2020 ◽

Vol 1 (1) ◽

pp. 428-444 ◽

Cited By ~ 11

Author(s):

Silvio Peroni ◽

David Shotton

Keyword(s):

Semantic Web ◽

Large Scale ◽

Open Data ◽

Data Sets ◽

Citation Data ◽

Semantic Web Technologies ◽

Web Technologies ◽

Creative Commons ◽

Citation Indexes ◽

Description Framework

OpenCitations is an infrastructure organization for open scholarship dedicated to the publication of open citation data as Linked Open Data using Semantic Web technologies, thereby providing a disruptive alternative to traditional proprietary citation indexes. Open citation data are valuable for bibliometric analysis, increasing the reproducibility of large-scale analyses by enabling publication of the source data. Following brief introductions to the development and benefits of open scholarship and to Semantic Web technologies, this paper describes OpenCitations and its data sets, tools, services, and activities. These include the OpenCitations Data Model; the SPAR (Semantic Publishing and Referencing) Ontologies; OpenCitations’ open software of generic applicability for searching, browsing, and providing REST APIs over resource description framework (RDF) triplestores; Open Citation Identifiers (OCIs) and the OpenCitations OCI Resolution Service; the OpenCitations Corpus (OCC), a database of open downloadable bibliographic and citation data made available in RDF under a Creative Commons public domain dedication; and the OpenCitations Indexes of open citation data, of which the first and largest is COCI, the OpenCitations Index of Crossref Open DOI-to-DOI Citations, which currently contains over 624 million bibliographic citations and is receiving considerable usage by the scholarly community.

KNOWLEDGE DISCOVERY IN DATA GRID WITH ADVANCE RESERVATION FOR METEOROLOGY

International Journal of Computer and Communication Technology ◽

10.47893/ijcct.2015.1299 ◽

2015 ◽

pp. 174-179

Author(s):

M. THANGAMANI ◽

J. MALAR VIZHI ◽

G. NANDHINI

Keyword(s):

Large Scale ◽

Weather Forecasting ◽

Meteorological Data ◽

Weather Prediction ◽

Data Grid ◽

Easy Access ◽

Distributed Data ◽

Community Resources ◽

Advance Reservation ◽

Data Collections

In weather forecasting, as in other scientific domains, large-scale and distributed data collections are emerging as critical community resources. Grid technologies make it possible to manage and share such data efficiently. In this paper, we present our approach to constructing a portal-based Meteorological Data Grid System, with a first application in weather forecasting. Our system architecture has three layers. The first layer is a Modeling System that uses Numerical Weather Prediction (NWP) models to generate forecast data automatically in GRIB/NetCDF format. The second layer is a Data Grid, which provides users with secure and easy access to distributed meteorological datasets. It also addresses authentication and authorization for secure transfers, mechanisms for scalable data replication, and technologies for searching for relevant datasets based on metadata provided by users. The third layer is a Grid Portal, built to make the system easy to use. We also allow advance resource reservation, which gives users exclusive access to Grid resources.

Model and Method for Contributor’s Quality Assessment in Community Image Tagging Systems

Information and Control Systems ◽

10.31799/1684-8853-2018-4-45-51 ◽

2018 ◽

pp. 45-51

Author(s):

A. V. Ponomarev

Keyword(s):

Large Scale ◽

Wide Spectrum ◽

Preference Relation ◽

Pairwise Comparison ◽

Ground Truth ◽

Comparison Method ◽

Characteristic Matrix ◽

Image Tagging ◽

Proposed Model

Introduction: Large-scale human-computer systems involving people of various skills and motivation into the information processing process are currently used in a wide spectrum of applications. An acute problem in such systems is assessing the expected quality of each contributor; for example, in order to penalize incompetent or inaccurate ones and to promote diligent ones. Purpose: To develop a method of assessing the expected contributor’s quality in community tagging systems. This method should only use generally unreliable and incomplete information provided by contributors (with ground truth tags unknown). Results: A mathematical model is proposed for community image tagging (including the model of a contributor), along with a method of assessing the expected contributor’s quality. The method is based on comparing tag sets provided by different contributors for the same images, being a modification of pairwise comparison method with preference relation replaced by a special domination characteristic. Expected contributors’ quality is evaluated as a positive eigenvector of a pairwise domination characteristic matrix. Community tagging simulation has confirmed that the proposed method allows you to adequately estimate the expected quality of community tagging system contributors (provided that the contributors' behavior fits the proposed model). Practical relevance: The obtained results can be used in the development of systems based on coordinated efforts of community (primarily, community tagging systems).
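
The eigenvector step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not define the "domination characteristic", so the matrix below is a made-up reciprocal matrix in the style of classic pairwise comparison, and the function name `principal_eigenvector` is hypothetical. The only general fact relied on is that a matrix with strictly positive entries has a positive principal eigenvector, which power iteration can approximate.

```python
def principal_eigenvector(matrix, iterations=200, tol=1e-12):
    """Approximate the positive principal eigenvector of a positive
    square matrix by power iteration; the result is scaled to sum to 1."""
    n = len(matrix)
    vec = [1.0 / n] * n  # start from a uniform guess
    for _ in range(iterations):
        # Multiply the matrix by the current vector.
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        nxt = [x / total for x in nxt]  # renormalize so entries sum to 1
        if max(abs(a - b) for a, b in zip(nxt, vec)) < tol:
            return nxt
        vec = nxt
    return vec


# Hypothetical domination matrix (an assumption, not the paper's data):
# entry [i][j] > 1 means contributor i's tag sets tended to dominate
# contributor j's on the images both of them tagged.
domination = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

quality = principal_eigenvector(domination)
ranking = sorted(range(len(quality)), key=lambda i: quality[i], reverse=True)
print(quality, ranking)
```

In this toy matrix the scores come out proportional to 4 : 2 : 1, so contributor 0 ranks highest. How the real method builds the matrix from tag sets is not stated in the abstract and is not reproduced here.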

A Large-Scale COVID-19 Twitter Chatter Dataset for Open Scientific Research—An International Collaboration

Epidemiologia ◽

10.3390/epidemiologia2030024 ◽

2021 ◽

Vol 2 (3) ◽

pp. 315-324

Author(s):

Juan M. Banda ◽

Ramya Tekumalla ◽

Guanyu Wang ◽

Jingyuan Yu ◽

Tuo Liu ◽

...

Keyword(s):

Large Scale ◽

Social Dynamics ◽

Additional Data ◽

Open Data ◽

Data Sources ◽

Research Projects ◽

Research Groups ◽

The World ◽

Data Source

As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics of such a unique worldwide event in biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter generated from 1 January 2020 to 27 June 2021 at the time of writing. This dataset provides a freely available additional data source for researchers worldwide to conduct a wide and diverse range of research projects, such as epidemiological analyses, emotional and mental responses to social distancing measures, the identification of sources of misinformation, and stratified measurement of sentiment towards the pandemic in near real time, among many others.

Correcting for experiment-specific variability in expression compendia can remove underlying signals

GigaScience ◽

10.1093/gigascience/giaa117 ◽

2020 ◽

Vol 9 (11) ◽

Author(s):

Alexandra J Lee ◽

YoSon Park ◽

Georgia Doing ◽

Deborah A Hogan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Large Scale ◽

Original Signal ◽

Batch Effects ◽

Technical Variability ◽

The Past ◽

Statistical Correction ◽

Before And After ◽

Data Collections ◽

Biological Patterns

Motivation: In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined. Objective: We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments. Method: We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability. Results: The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal. Conclusion: When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.

Building Damage Detection Using U-Net with Attention Mechanism from Pre- and Post-Disaster Remote Sensing Datasets

Remote Sensing ◽

10.3390/rs13050905 ◽

2021 ◽

Vol 13 (5) ◽

pp. 905

Author(s):

Chuyi Wu ◽

Feng Zhang ◽

Junshi Xia ◽

Yichen Xu ◽

Guoqing Li ◽

...

Keyword(s):

Damage Assessment ◽

Large Scale ◽

Binary Classification ◽

Open Data ◽

Building Damage ◽

Attention Mechanism ◽

Large Scale Dataset ◽

Data Program ◽

The Impact ◽

Post Disaster

Building damage status is vital for planning rescue and reconstruction after a disaster, but damage is hard to detect and its level is hard to judge. Most existing studies focus on binary classification, and the attention of the model is distracted. In this study, we propose a Siamese neural network that can localize and classify damaged buildings at the same time. The main parts of this network are a variety of attention U-Nets using different backbones. The attention mechanism enables the network to pay more attention to the effective features and channels, so as to reduce the impact of useless features. We train them using the xBD dataset, which is a large-scale dataset for the advancement of building damage assessment, and compare their balanced F (F1) scores. The scores show that SEresNeXt with an attention mechanism performs best, with an F1 score of 0.787. To improve the accuracy, we fused the results and obtained the best overall F1 score of 0.792. To verify the transferability and robustness of the model, we selected data on two recent disasters from the Maxar Open Data Program to investigate the performance. By visual comparison, the results show that our model is robust and transferable.
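
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of the standard F1 definition (the harmonic mean of precision and recall) for a single class. The abstract does not say how scores are averaged across damage levels or combined with localization, so this is not the paper's evaluation protocol; the function name `f1_score` and the counts are illustrative assumptions.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall for one class."""
    if true_positives == 0:
        # No correct detections: precision and recall are both zero
        # (or undefined), so report 0.0 by convention.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Made-up counts for one damage class, for illustration only.
example = f1_score(true_positives=80, false_positives=20, false_negatives=20)
print(round(example, 3))
```

With these made-up counts, precision and recall are both 0.8, so the F1 score is also 0.8.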

Detection of heat pumps from smart meter and open data

Energy Informatics ◽

10.1186/s42162-020-00124-6 ◽

2020 ◽

Vol 3 (S1) ◽

Author(s):

Andreas Weigert ◽

Konstantin Hopf ◽

Nicolai Weinig ◽

Thorsten Staake

Keyword(s):

Heat Pump ◽

Open Data ◽

Ground Truth ◽

Heat Pumps ◽

Geographical Information ◽

Weather Data ◽

Smart Meters ◽

Ground Truth Data ◽

Thermal Reservoir ◽

Grid Operators

Heat pumps embody solutions that heat or cool buildings effectively and sustainably, with zero emissions at the place of installation. As they pose a significant load on the power grid, knowledge of their existence is crucial for grid operators, e.g., to forecast load and to plan grid operation. Further details, like the thermal reservoir (ground or air source) or the age of a heat pump installation, make possible energy-related services that utility companies can offer in the future (e.g., detecting wrongly calibrated installations, household energy efficiency checks). This study investigates the prediction of heat pump installations, their thermal reservoir, and their age. For this, we obtained a dataset with 397 households in Switzerland, all equipped with smart meters, collected ground truth data on installed heat pumps, and enriched this data with weather data and geographical information. Our investigation replicates the state of the art in the area of heat pump detection and goes beyond it, as we obtain three major findings: First, machine learning can detect the existence of heat pumps with an AUC performance metric of 0.82, their heat reservoir with an AUC of 0.86, and their age with an AUC of 0.73. Second, heat pump existence can be better detected using data from the heating period than from summer. Third, the training data needed to detect the existence of heat pumps need not be large, in terms of either the number of training instances or the observation period.
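
The AUC values reported above can be read as the probability that a randomly chosen positive case (for example, a household with a heat pump) receives a higher classifier score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the labels and scores are made up for illustration, and the function name `auc` is hypothetical rather than anything from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison:
    the fraction of (positive, negative) pairs in which the positive
    example gets the higher score, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data: 1 = household has a heat pump, 0 = it does not.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
print(auc(labels, scores))
```

The double loop is quadratic in the number of examples, which is fine for a dataset of a few hundred households; a rank-based formulation would be preferable for much larger data.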

Network community structure of substorms using SuperMAG magnetometers

Nature Communications ◽

10.1038/s41467-021-22112-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

L. Orr ◽

S. C. Chapman ◽

J. W. Gjerloev ◽

W. Guo

Keyword(s):

Large Scale ◽

Three Dimensional ◽

Easy Access ◽

Directed Network ◽

Coherent System ◽

Dimensional System ◽

Ionospheric Currents ◽

Substorm Current Wedge ◽

Spatially Extended ◽

Current Wedge

Geomagnetic substorms are a global magnetospheric reconfiguration, during which energy is abruptly transported to the ionosphere. Central to this are the auroral electrojets, large-scale ionospheric currents that are part of a larger three-dimensional system, the substorm current wedge. Many, often conflicting, magnetospheric reconfiguration scenarios have been proposed to describe the substorm current wedge evolution and structure. SuperMAG is a worldwide collaboration providing easy access to ground-based magnetometer data. Here we show application of techniques from network science to analyze data from 137 SuperMAG ground-based magnetometers. We calculate a time-varying directed network and perform community detection on the network, identifying locally dense groups of connections. Analysis of 41 substorms exhibits robust structural change from many small, uncorrelated current systems before substorm onset, to a large spatially-extended coherent system, approximately 10 minutes after onset. We interpret this as a strong indication that the auroral electrojet system during substorm expansions is inherently a large-scale phenomenon and is not solely due to many meso-scale wedgelets.

Development of a flow-based planning support system based on open data for the City of Atlanta

Environment and Planning B Urban Analytics and City Science ◽

10.1177/2399808317705881 ◽

2017 ◽

Vol 46 (2) ◽

pp. 207-224

Author(s):

Ge Zhang ◽

Wenwen Zhang ◽

Subhrajit Guhathakurta ◽

Nisha Botchwey

Keyword(s):

Support System ◽

Open Data ◽

Community Planning ◽

Relevant Information ◽

Easy Access ◽

Data Movement ◽

Planning Support ◽

Planning Support System ◽

Analytical Tools ◽

The City

Open data have come of age with many cities, states, and other jurisdictions joining the open data movement by offering relevant information about their communities for free and easy access to the public. Despite the growing volume of open data, their use has been limited in planning scholarship and practice. The bottleneck is often the format in which the data are available and the organization of such data, which may be difficult to incorporate in existing analytical tools. The overall goal of this research is to develop an open data-based community planning support system that can collect related open data, analyze the data for specific objectives, and visualize the results to improve usability. To accomplish this goal, this study undertakes three research tasks. First, it describes the current state of open data analysis efforts in the community planning field. Second, it examines the challenges analysts experience when using open data in planning analysis. Third, it develops a new flow-based planning support system for examining neighborhood quality of life and health for the City of Atlanta as a prototype, which addresses many of these open data challenges.

Lake-effect rains over Lake Victoria and their association with Mesoscale Convective Systems

Journal of Hydrometeorology ◽

10.1175/jhm-d-20-0244.1 ◽

2021 ◽

Author(s):

Sharon E. Nicholson ◽

Douglas Klotter ◽

Adam T. Hartman

Keyword(s):

Large Scale ◽

Lake Victoria ◽

Ground Truth ◽

Mesoscale Convective Systems ◽

Convective Systems ◽

Lake Effect ◽

Mesoscale Convective ◽

Strong Convection ◽

Short Rains ◽

Lake Catchment

This article examined rainfall enhancement over Lake Victoria. Estimates of over-lake rainfall were compared with rainfall in the surrounding lake catchment. Four satellite products were initially tested against estimates based on gauges or water balance models. These included TRMM 3B43, IMERG V06 Final Run (IMERG-F), CHIRPS2, and PERSIANN-CDR. There was agreement among the satellite products for catchment rainfall but a large disparity among them for over-lake rainfall. IMERG-F was clearly an outlier, exceeding the estimate from TRMM 3B43 by 36%. The overestimation by IMERG-F was likely related to passive microwave assessments of strong convection, such as prevails over Lake Victoria. Overall, TRMM 3B43 showed the best agreement with the "ground truth" and was used in further analyses. Over-lake rainfall was found to be enhanced compared to catchment rainfall in all months. During the March-to-May long rains the enhancement varied between 40% and 50%. During the October-to-December short rains the enhancement varied between 33% and 44%. Even during the two dry seasons the enhancement was at least 20% and over 50% in some months. While the magnitude of enhancement varied from month to month, the seasonal cycle was essentially the same for over-lake and catchment rainfall, suggesting that the dominant influence on over-lake rainfall is the large-scale environment. The association with Mesoscale Convective Systems (MCSs) was also evaluated. The similarity of the spatial patterns of rainfall and MCS count each month suggested that these produced a major share of rainfall over the lake. Similarity in interannual variability further supported this conclusion.
