Development of the Dataset Searcher Webapp for finding data on the Belle II computing grid

In any large-scale scientific experiment involving enormous quantities of data, it is crucial that everyone involved has quick and easy access to all the datasets relevant to their research. The Belle II experiment is no exception: by the end of its run time, it is projected to have collected 50 ab⁻¹ of integrated luminosity. Until now, the only method for locating data of interest was to look it up in handwritten tables that needed to be regularly updated. This paper presents a new webapp, built on the DIRAC software framework, which aims to be the new standard not only for locating data but also for storing all of its associated metadata.

MLaaS4HEP: Machine Learning as a Service for HEP

Computing and Software for Big Science ◽

10.1007/s41781-021-00061-3 ◽

2021 ◽

Vol 5 (1) ◽

Author(s):

Valentin Kuznetsov ◽

Luca Giommi ◽

Daniele Bonacorsi

Keyword(s):

Machine Learning ◽

Large Scale ◽

High Energy Physics ◽

Modular Design ◽

High Energy ◽

Easy Access ◽

Data Streaming ◽

Custom Made ◽

Computing Grid ◽

Data Inference

Abstract: Machine Learning (ML) will play a significant role in the success of the upcoming High-Luminosity LHC (HL-LHC) program at CERN. An unprecedented amount of data at the exascale will be collected by LHC experiments in the next decade, and this effort will require novel approaches to train and use ML models. In this paper, we discuss a Machine Learning as a Service pipeline for HEP (MLaaS4HEP) which provides three independent layers: a data streaming layer to read High-Energy Physics (HEP) data in their native ROOT data format; a data training layer to train ML models using distributed ROOT files; and a data inference layer to serve predictions using pre-trained ML models via the HTTP protocol. Such a modular design opens up the possibility of training on data at large scale by reading ROOT files from remote storage facilities, e.g., the World-Wide LHC Computing Grid (WLCG) infrastructure, and feeding the data to the user's favorite ML framework. The inference layer, implemented as TensorFlow as a Service (TFaaS), may provide easy access to pre-trained ML models in existing infrastructure and applications inside or outside of the HEP domain. In particular, we demonstrate the usage of the MLaaS4HEP architecture for a physics use-case, namely, the $t\bar{t}$ Higgs analysis in CMS originally performed using custom-made Ntuples. We provide details on the training of the ML model using distributed ROOT files, discuss the performance of the MLaaS and TFaaS approaches for the selected physics analysis, and compare the results with traditional methods.

Network community structure of substorms using SuperMAG magnetometers

Nature Communications ◽

10.1038/s41467-021-22112-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

L. Orr ◽

S. C. Chapman ◽

J. W. Gjerloev ◽

W. Guo

Keyword(s):

Large Scale ◽

Three Dimensional ◽

Easy Access ◽

Directed Network ◽

Coherent System ◽

Dimensional System ◽

Ionospheric Currents ◽

Substorm Current Wedge ◽

Spatially Extended ◽

Current Wedge

Abstract: Geomagnetic substorms are a global magnetospheric reconfiguration, during which energy is abruptly transported to the ionosphere. Central to this are the auroral electrojets, large-scale ionospheric currents that are part of a larger three-dimensional system, the substorm current wedge. Many, often conflicting, magnetospheric reconfiguration scenarios have been proposed to describe the evolution and structure of the substorm current wedge. SuperMAG is a worldwide collaboration providing easy access to ground-based magnetometer data. Here we apply techniques from network science to analyze data from 137 SuperMAG ground-based magnetometers. We calculate a time-varying directed network and perform community detection on the network, identifying locally dense groups of connections. Analysis of 41 substorms exhibits a robust structural change from many small, uncorrelated current systems before substorm onset to a large, spatially extended coherent system approximately 10 minutes after onset. We interpret this as a strong indication that the auroral electrojet system during substorm expansions is inherently a large-scale phenomenon and is not solely due to many meso-scale wedgelets.

Discrete Dynamic Shortest Path Problems in Transportation Applications: Complexity and Algorithms with Optimal Run Time

Transportation Research Record Journal of the Transportation Research Board ◽

10.3141/1645-21 ◽

1998 ◽

Vol 1645 (1) ◽

pp. 170-175 ◽

Cited By ~ 194

Author(s):

Ismail Chabini

Keyword(s):

Large Scale ◽

Shortest Paths ◽

Transportation Systems ◽

Solution Algorithm ◽

Shortest Path Problems ◽

Discrete Dynamic ◽

Solution Algorithms ◽

Dynamic Network Models ◽

Early Results ◽

Run Time

A solution is provided for what appears to be a 30-year-old problem dealing with the discovery of the most efficient algorithms possible to compute all-to-one shortest paths in discrete dynamic networks. This problem lies at the heart of efficient solution approaches to dynamic network models that arise in dynamic transportation systems, such as intelligent transportation systems (ITS) applications. The all-to-one dynamic shortest paths problem and the one-to-all fastest paths problems are studied. Early results are revisited and new properties are established. The complexity of these problems is established, and solution algorithms optimal for run time are developed. A new and simple solution algorithm is proposed for all-to-one, all departure time intervals, shortest paths problems. It is proved, theoretically, that the new solution algorithm has an optimal run time complexity that equals the complexity of the problem. Computer implementations and experimental evaluations of various solution algorithms support the theoretical findings and demonstrate the efficiency of the proposed solution algorithm. The findings should be of major benefit to research and development activities in the field of dynamic management, in particular real-time management, and to control of large-scale ITSs.
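
The all-to-one problem described in this abstract can be illustrated with a backward-in-time labelling pass. The sketch below is not taken from the paper; it assumes integer departure times `0..horizon-1`, strictly positive integer arc travel times, no waiting at nodes, and travel times frozen at their last value beyond the horizon. The function names and the `arcs` layout are hypothetical.

```python
import heapq

def static_labels(n, arcs, dest, t_last):
    """Dijkstra towards dest using travel times frozen at t_last
    (assumption: the network is static beyond the horizon)."""
    incoming = {j: [] for j in range(n)}
    for (i, j), times in arcs.items():
        incoming[j].append((i, times[t_last]))
    dist = [float("inf")] * n
    dist[dest] = 0
    heap = [(0, dest)]
    while heap:
        d, j = heapq.heappop(heap)
        if d > dist[j]:
            continue  # stale heap entry
        for i, w in incoming[j]:
            if d + w < dist[i]:
                dist[i] = d + w
                heapq.heappush(heap, (dist[i], i))
    return dist

def all_to_one_dynamic(n, arcs, dest, horizon):
    """labels[t][i] = minimum travel time from node i to dest
    when departing from i at time t."""
    last = horizon - 1
    labels = [None] * horizon
    labels[last] = static_labels(n, arcs, dest, last)
    # Travel times are >= 1, so every label needed at time t refers to a
    # strictly later time and has already been computed.
    for t in range(last - 1, -1, -1):
        row = [float("inf")] * n
        row[dest] = 0
        for (i, j), times in arcs.items():
            if i == dest:
                continue
            d = times[t]
            arrival = min(t + d, last)  # clamp to the static region
            row[i] = min(row[i], d + labels[arrival][j])
        labels[t] = row
    return labels

# Hypothetical 3-node network; arcs[(i, j)][t] is the travel time at departure t.
arcs = {(0, 1): [1, 1, 1, 1], (1, 2): [1, 1, 1, 1], (0, 2): [5, 1, 3, 3]}
labels = all_to_one_dynamic(3, arcs, dest=2, horizon=4)
print([row[0] for row in labels])  # -> [2, 1, 2, 2]
```

Each arc is examined once per time interval, so the pass costs on the order of (number of arcs) × (number of intervals) plus one static shortest-path computation. Whether this matches the paper's algorithm in detail is not confirmed by the abstract.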

Peningkatan Kualitas Infrastruktur Permukiman Melalui Pemberdayaan Masyarakat Desa/Kelurahan Berbasis Data Base Digital Keruangan (SPASIAL) Di Kabupaten Wajo

JURNAL TEPAT : Applied Technology Journal for Community Engagement and Services ◽

10.25042/jurnal_tepat.v2i1.40 ◽

2019 ◽

Vol 2 (1) ◽

pp. 22-30

Author(s):

Abdul Rachman Rasyid ◽

Andi Lukman Irwan ◽

Laode Muhammad Asfan Mujahid ◽

Ihsan ◽

Mimi Arifin ◽

...

Keyword(s):

Urban Areas ◽

Large Scale ◽

Creative Industries ◽

Easy Access ◽

Small Scale ◽

Silk Industry ◽

Existing Problems ◽

South Sulawesi ◽

Regional Location ◽

Processing Mechanisms

Wajo Regency is one of the districts that play a role in the development and progress of South Sulawesi Province. Agricultural production facilities will therefore be developed, from processing mechanisms through to the creative industries. Irrigation efforts will be directed at developing large-scale and small-scale rural irrigation through artificial embankments and the revitalization of swamps and lakes. In urban areas, the residential environment will be adjusted, especially near Lake Tempe in the area of Sengkang, the capital of Wajo Regency. The purpose of this study is to find easy access to drinking water for the community and to provide accurate Geographic Information System (GIS)-based data on regional location conditions. The approach used in this activity is a field survey of the existing condition of the location, carried out by assisting the community; increasing knowledge through training or counseling aimed at solving existing problems in the villages / subdistricts of Tempe Subdistrict, Wajo Regency; and training in the use of digital databases related to the profile and potential of the city. The study found that some districts had several problems, namely solid waste systems, road networks, inadequate buildings, and inadequate clean water, especially in Attakae, Maddukelleng, Pattirosompe and Tempe. However, there is potential that can be developed to improve the regional economy, such as the silk industry and the wood industry.

Parallelization of a Commercial Streamline Simulator and Performance on Practical Models

SPE Reservoir Evaluation & Engineering ◽

10.2118/118684-pa ◽

2010 ◽

Vol 13 (03) ◽

pp. 383-390 ◽

Cited By ~ 5

Author(s):

R. P. Batycky ◽

M. Förster ◽

M. R. Thiele ◽

K. Stüben

Keyword(s):

Large Scale ◽

Programming Model ◽

Scaling Law ◽

Independent Solution ◽

Parallel Execution ◽

Water Model ◽

Test Machine ◽

Multicore Architectures ◽

Streamline Simulation ◽

Run Time

Summary: We present the parallelization of a commercial streamline simulator to multicore architectures based on the OpenMP programming model and its performance on various field examples. This work is a continuation of recent work by Gerritsen et al. (2009) in which a research streamline simulator was extended to parallel execution. We identified that the streamline-transport step represents approximately 40-80% of the total run time. It is exactly this step that is straightforward to parallelize, owing to the independent solution of each streamline that is at the heart of streamline simulation. Because we are working with an existing large serial code, we used specialty software to quickly and easily identify variables that required particular handling for implementing the parallel extension. Only a minimal rewrite of existing code was required to extend the streamline-transport step to OpenMP. As part of this work, we also parallelized additional run-time code, including the gravity-line solver and some simple routines required for constructing the pressure matrix. Overall, the run-time fraction of code parallelized ranged from 0.50 to 0.83, depending on the transport physics being considered. We tested our parallel simulator on a variety of large models including SPE 10, Forties (a UK oil/water model), Judy Creek (a Canadian waterflood/water-alternating-gas (WAG) model), and a South American black-oil model. We noted overall speedup factors from 1.8x to 3.3x for eight threads. In terms of real time, this implies that large-scale streamline simulation models as tested here can be simulated in less than 4 hours. We found speedup results to be reasonable when compared with Amdahl's ideal scaling law. Beyond eight threads, we observed minimal speedups because of memory bandwidth limits on our test machine.
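
The comparison with Amdahl's law mentioned in this summary can be reproduced from the figures it quotes. The snippet below is an illustrative calculation only: it plugs the reported parallelized run-time fractions (0.50 and 0.83) and the eight-thread count into the textbook formula S = 1 / ((1 - p) + p / n). The function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Ideal speedup when a fraction of the run time is perfectly
    parallelized over the given number of threads (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)

# Parallelized run-time fractions quoted in the summary, on eight threads.
for p in (0.50, 0.83):
    print(f"p = {p:.2f}: ideal speedup = {amdahl_speedup(p, 8):.2f}")
# p = 0.50: ideal speedup = 1.78
# p = 0.83: ideal speedup = 3.65
```

Under these assumptions the ideal range is roughly 1.8x to 3.7x, which is consistent with the reported speedups of 1.8x to 3.3x being described as reasonable.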

Recent developments in the CCP-EM software suite

Acta Crystallographica Section D Structural Biology ◽

10.1107/s2059798317007859 ◽

2017 ◽

Vol 73 (6) ◽

pp. 469-477 ◽

Cited By ~ 101

Author(s):

Tom Burnley ◽

Colin M. Palmer ◽

Martyn Winn

Keyword(s):

Easy Access ◽

Software Framework ◽

Command Line ◽

Command Line Interface ◽

Software Suite ◽

Computational Support ◽

Recent Developments ◽

User Friendly ◽

Different Levels ◽

Friendly Graphical User Interface

As part of its remit to provide computational support to the cryo-EM community, the Collaborative Computational Project for Electron cryo-Microscopy (CCP-EM) has produced a software framework which enables easy access to a range of programs and utilities. The resulting software suite incorporates contributions from different collaborators by encapsulating them in Python task wrappers, which are then made accessible via a user-friendly graphical user interface as well as a command-line interface suitable for scripting. The framework includes tools for project and data management. An overview of the design of the framework is given, together with a survey of the functionality at different levels. The current CCP-EM suite has particular strength in the building and refinement of atomic models into cryo-EM reconstructions, which is described in detail.

Indefinite Proximity Learning: A Review

Neural Computation ◽

10.1162/neco_a_00770 ◽

2015 ◽

Vol 27 (10) ◽

pp. 2039-2096 ◽

Cited By ~ 30

Author(s):

Frank-Michael Schleif ◽

Peter Tino

Keyword(s):

Large Scale ◽

Data Representation ◽

Easy Access ◽

Proximity Data ◽

Standard Data ◽

Comprehensive Survey ◽

Proximity Measures ◽

The Individual ◽

Efficient Learning ◽

One Class Classification

Efficient learning of a data analysis task strongly depends on the data representation. Most methods rely on (symmetric) similarity or dissimilarity representations by means of metric inner products or distances, providing easy access to powerful mathematical formalisms like kernel or branch-and-bound approaches. Similarities and dissimilarities are, however, often naturally obtained by nonmetric proximity measures that cannot easily be handled by classical learning algorithms. Major efforts have been undertaken to provide approaches that can either directly be used for such data or to make standard methods available for these types of data. We provide a comprehensive survey for the field of learning with nonmetric proximities. First, we introduce the formalism used in nonmetric spaces and motivate specific treatments for nonmetric proximity data. Second, we provide a systematization of the various approaches. For each category of approaches, we provide a comparative discussion of the individual algorithms and address complexity issues and generalization properties. In a summarizing section, we provide a larger experimental study for the majority of the algorithms on standard data sets. We also address the problem of large-scale proximity learning, which is often overlooked in this context and of major importance to make the method relevant in practice. The algorithms we discuss are in general applicable for proximity-based clustering, one-class classification, classification, regression, and embedding approaches. In the experimental part, we focus on classification tasks.

HIERARCHICAL MAPPING FOR HPC APPLICATIONS

Parallel Processing Letters ◽

10.1142/s0129626411000229 ◽

2011 ◽

Vol 21 (03) ◽

pp. 279-299 ◽

Cited By ~ 1

Author(s):

I-HSIN CHUNG ◽

CHE-RUNG LEE ◽

JIAZHENG ZHOU ◽

YEH-CHING CHUNG

Keyword(s):

High Performance ◽

Large Scale ◽

Scale Up ◽

Matrix Multiplication ◽

Spectral Graph Theory ◽

Communication Patterns ◽

Fine Tuning ◽

Mapping Algorithm ◽

Communication Time ◽

Run Time

As high-performance computing systems scale up, mapping the tasks of a parallel application onto physical processors to allow efficient communication becomes one of the critical performance issues. Existing algorithms were usually designed to map applications with regular communication patterns. Their mapping criteria usually overlook the size of communicated messages, which is the primary factor in communication time. In addition, most of their time complexities are too high to process large-scale problems. In this paper, we present a hierarchical mapping algorithm (HMA), which is capable of mapping applications with irregular communication patterns. It first partitions tasks according to their run-time communication information. The tasks that communicate with each other more frequently are regarded as strongly connected. Based on their connectivity strength, the tasks are partitioned into supernodes using algorithms from spectral graph theory. The hierarchical partitioning reduces the mapping algorithm complexity to achieve scalability. Finally, the run-time communication information is used again in fine tuning to explore better mappings. Our experiments show that the mapping algorithm reduces the point-to-point communication time by up to 20% for PDGEMM, a ScaLAPACK matrix multiplication computation kernel, and by up to 7% for AMG2006, a tier 1 application of the Sequoia benchmark.
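
The abstract does not say which spectral technique is used to form supernodes. As one plausible illustration, the sketch below performs a single spectral bisection: it treats communication volume between tasks as edge weights, approximates the Fiedler vector of the graph Laplacian by shifted power iteration, and splits the tasks by sign. The function name, the dense weight matrix layout and the iteration count are all assumptions, not details from the paper.

```python
import math

def fiedler_bisection(weights, iterations=500):
    """Split tasks into two groups by the sign of an approximate Fiedler vector.

    weights[i][j] is the (symmetric, non-negative) communication volume
    between tasks i and j, with a zero diagonal.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # All Laplacian eigenvalues are <= 2 * max degree, so shift*I - L is
    # positive definite and its dominant non-constant eigenvector is the
    # Fiedler vector of L.
    shift = 2.0 * max(degree) + 1.0
    x = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # remove the constant component
        norm = math.sqrt(sum(v * v for v in x))
        if norm == 0.0:
            break
        x = [v / norm for v in x]
        # Multiply by (shift*I - L), where L = D - W.
        x = [(shift - degree[i]) * x[i]
             + sum(weights[i][j] * x[j] for j in range(n))
             for i in range(n)]
    mean = sum(x) / n
    x = [v - mean for v in x]
    group_a = [i for i, v in enumerate(x) if v < 0]
    group_b = [i for i, v in enumerate(x) if v >= 0]
    return group_a, group_b

# Hypothetical traffic matrix: two tightly coupled triples joined by a weak link.
traffic = [
    [0, 10, 10, 0, 0, 0],
    [10, 0, 10, 0, 0, 0],
    [10, 10, 0, 1, 0, 0],
    [0, 0, 1, 0, 10, 10],
    [0, 0, 0, 10, 0, 10],
    [0, 0, 0, 10, 10, 0],
]
print(fiedler_bisection(traffic))
```

Applying such a split recursively would give a hierarchy of task groups; how the paper actually builds and refines its supernodes is not specified in the abstract.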

emiRIT: A text-mining based resource for microRNA information

10.1101/2020.11.05.370593 ◽

2020 ◽

Author(s):

Debarati Roychowdhury ◽

Samir Gupta ◽

Xihan Qin ◽

Cecilia N. Arighi ◽

K. Vijay-Shanker

Keyword(s):

Text Mining ◽

Information Needs ◽

Large Scale ◽

Biological Process ◽

Essential Gene ◽

Mirna Gene ◽

Easy Access ◽

Context Specific ◽

Biological Entities ◽

User Friendly

Abstract

Motivation: microRNAs (miRNAs) are essential gene regulators and their dysregulation often leads to diseases. Easy access to miRNA information is crucial for interpreting generated experimental data, connecting facts across publications, and developing new hypotheses built on previous knowledge. Here, we present emiRIT, a text mining-based resource, which presents miRNA information mined from the literature through a user-friendly interface.

Results: We collected 149,233 miRNA-PubMed ID pairs from Medline between January 1997 and May 2020. emiRIT currently contains miRNA-gene regulation (60,491 relations); miRNA-disease (cancer) (12,300 relations); miRNA-biological process and pathways (23,390 relations); and circulatory miRNAs in extracellular locations (3,782 relations). Biological entities and their relation to miRNAs were extracted from Medline abstracts using publicly available and in-house developed text mining tools, and the entities were normalized to facilitate querying and integration. We built a database and an interface to store and access the integrated data, respectively.

Conclusion: We provide an up-to-date and user-friendly resource to facilitate access to comprehensive miRNA information from the literature on a large scale, enabling users to navigate through different roles of miRNA and examine them in a context specific to their information needs. To assess our resource's information coverage, in the absence of gold standards, we have conducted two case studies focusing on the target and differential expression information of miRNAs in the context of diseases. Database URL: https://research.bioinformatics.udel.edu/emirit/

A Measurable Framework for Run-time Data Sampling in Large-scale Datacenter

2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP) ◽

10.1109/icsidp47821.2019.9173399 ◽

2019 ◽

Author(s):

Hedong Yan ◽

Shilin Wen ◽

Rui Han

Keyword(s):

Large Scale ◽

Data Sampling ◽

Time Data ◽

Run Time
