Scalable Data Mining, Archiving, and Big Data Management for the Next Generation Astronomical Telescopes

Author(s):  
Chris A. Mattmann ◽  
Andrew Hart ◽  
Luca Cinquini ◽  
Joseph Lazio ◽  
Shakeh Khudikyan ◽  
...  

Big data as a paradigm focuses on data volume, velocity, and the number and complexity of data formats and metadata, the information that describes other data. This is nowhere better seen than in the development of software to support next generation astronomical instruments, including the MeerKAT/KAT-7 Square Kilometre Array (SKA) precursor in South Africa, the Low Frequency Array (LOFAR) in Europe, two instruments led in part by the U.S. National Radio Astronomy Observatory (NRAO), namely the Expanded Very Large Array (EVLA) in Socorro, NM, and the Atacama Large Millimeter Array (ALMA) in Chile, and other instruments such as the Large Synoptic Survey Telescope (LSST) to be built in northern Chile. This chapter highlights the big data challenges in constructing data management systems for these astronomical instruments, specifically the challenges of integrating legacy science codes, handling data movement and triage, building flexible science data portals and user interfaces, allowing for flexible technology deployment scenarios, and automatically and rapidly reconciling differences in science data formats and metadata models. The authors discuss these challenges and then suggest open source solutions based on software from the Apache Software Foundation, including Apache Object-Oriented Data Technology (OODT), Tika, and Solr. The authors have leveraged these solutions to build, effectively and expeditiously, many precursor and operational software systems that handle data from these instruments and to prepare for the coming data deluge from those not yet constructed. Their solutions are not specific to the astronomical domain and are already applicable to a number of other science domains, including Earth science, planetary science, and biomedicine.
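
As a rough sketch of how two of these components fit together (using the tika and pysolr Python client libraries; the Solr core name and the file path are hypothetical, and OODT's cataloguing and workflow layers are omitted), the example below lets Tika auto-detect a file's format and extract whatever metadata its parsers recognize, then indexes that metadata into Solr for search. Instrument-specific formats may need a custom Tika parser to yield rich metadata.

```python
# Sketch: detect format and extract metadata with Apache Tika, then
# index the result into Apache Solr for search. Assumes a Tika server
# is reachable (the `tika` package can spawn one) and that a local
# Solr core named "observations" exists (an assumption for this sketch).
import re

import pysolr                # pip install pysolr
from tika import parser      # pip install tika


def ingest(path, solr_url="http://localhost:8983/solr/observations"):
    parsed = parser.from_file(path)           # runs Tika's auto-detection
    metadata = parsed.get("metadata", {}) or {}

    doc = {
        "id": path,
        "content_type": metadata.get("Content-Type", "unknown"),
        # Keep every Tika-extracted field as a Solr dynamic string field,
        # sanitizing key names so they are valid Solr field names.
        **{re.sub(r"[^A-Za-z0-9_]", "_", k) + "_s": str(v)
           for k, v in metadata.items()},
    }
    solr = pysolr.Solr(solr_url, always_commit=True)
    solr.add([doc])
    return doc


if __name__ == "__main__":
    # Hypothetical observation file path, for illustration only.
    print(ingest("/data/evla/observation_001.ms.tar"))
```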


2020 ◽  
Vol 7 (1) ◽  
pp. 3-9
Author(s):  
Yu.N. Bartashevskaya ◽  

The article considers the problem of using Big Data in the modern economy and in public life. The volume and complexity of information are growing rapidly, but current technologies cannot ensure its effective use; the technologies, methods, and practices for working with Big Data lag behind. This imbalance can be addressed by semantic technologies, which take a different, knowledge-based approach to processing and using data. The article argues that, despite the relatively long existence of semantic technologies and semantic networks, many obstacles still hinder their effective application: the availability of semantic content, the availability of ontologies, ontology evolution, scalability, and multilingualism. Since far from all data published on the web is created with semantic markup, and much of it is unlikely ever to be annotated, the availability of semantic content is one of the principal problems. The article distinguishes the semantic network from the Semantic Web and outlines the technologies for developing the latter. The course module of Alfred Nobel University was selected as the subject of study. The composition of an individual module or course is examined in detail: data on the university, the lecturer, the provision of the course and its language of instruction, the skills and abilities acquired, the learning outcomes, and so on. A graph of the course module was built, using Alfred Nobel University as an example, in ontological terms, and its most significant component classes are considered. The main classes and subclasses and their contents are described, and the data types involved (date, text, URL) are indicated. The ontological scheme was converted to RDF, the format required for modelling data in the semantic network and for further research. Prospects for further research are identified: applying the selected knowledge representation model, using a query language, and obtaining and interpreting data from other universities. Keywords: semantic technologies, semantic networks, ontologies, CmapTools, course module graph.
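
To make the RDF modelling step concrete, here is a minimal sketch using the rdflib Python library; the namespace, class, and property names are illustrative placeholders rather than the ontology actually built by the author in CmapTools. It describes one course module as triples, runs a small SPARQL query over the graph, and serializes it to RDF/XML.

```python
# Sketch: a minimal RDF description of a course module and a SPARQL
# query over it. All identifiers below are hypothetical examples.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, XSD

EX = Namespace("http://example.org/nobel-university/")

g = Graph()
g.bind("ex", EX)

module = EX["module/data-science-101"]
g.add((module, RDF.type, EX.CourseModule))
g.add((module, RDFS.label, Literal("Introduction to Data Science")))
g.add((module, EX.taughtBy, EX["lecturer/ivanova"]))
g.add((module, EX.language, Literal("English")))
g.add((module, EX.startDate, Literal("2020-09-01", datatype=XSD.date)))
g.add((module, EX.syllabusURL,
       Literal("http://example.org/syllabus/ds101", datatype=XSD.anyURI)))

# Query: which modules are taught in English, and by whom?
query = """
PREFIX ex: <http://example.org/nobel-university/>
SELECT ?module ?lecturer WHERE {
    ?module a ex:CourseModule ;
            ex:language "English" ;
            ex:taughtBy ?lecturer .
}
"""
for row in g.query(query):
    print(row.module, row.lecturer)

# Serialize the graph to RDF/XML, the format mentioned in the article.
print(g.serialize(format="xml"))
```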


Author(s):  
Ezer Osei Yeboah-Boateng

Big data is characterized by huge datasets generated at a fast rate, in unstructured, semi-structured, and structured formats, with inconsistencies and disparate data types and sources. The challenge is having the right tools to process large datasets within an acceptable timeframe and at a reasonable cost. So how can social media big datasets be harnessed for best-value decision making? The approach adopted was site scraping to collect online data from social media and other websites. The datasets have been harnessed to provide a better understanding of customers' needs and preferences, and have been applied to design targeted campaigns, to optimize business processes, and to improve performance. Using the social media facts and rules, a multivariate value-creation decision model was built to assist executives in creating value, based on improved "knowledge" in a hindsight-foresight-insight continuum about their operations and initiatives, and in making informed decisions. The authors also demonstrate use cases of insights computed as equations that could be leveraged to create sustainable value.
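
A minimal sketch of the scrape-then-aggregate idea is shown below, using the requests and BeautifulSoup libraries; the URL, CSS selector, and keyword list are placeholders rather than the authors' actual pipeline, and real social media platforms generally require their official APIs and compliance with their terms of service.

```python
# Sketch: collect public posts from a (hypothetical) web page and
# aggregate simple keyword signals as a crude input to a
# value-creation decision model. Illustrative only.
from collections import Counter

import requests
from bs4 import BeautifulSoup


def scrape_posts(url):
    """Return the text of elements tagged as posts on a page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # The CSS class below is a placeholder; adjust it to the target site.
    return [p.get_text(strip=True) for p in soup.select(".post-text")]


def value_signals(posts, keywords=("delivery", "price", "support")):
    """Count keyword mentions as a rough proxy for customer concerns."""
    counts = Counter()
    for text in posts:
        lowered = text.lower()
        for kw in keywords:
            counts[kw] += lowered.count(kw)
    return counts


if __name__ == "__main__":
    posts = scrape_posts("https://example.org/brand-mentions")  # placeholder URL
    print(value_signals(posts))
```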


2016 ◽  
Author(s):  
Angéla Olasz ◽  
Binh Nguyen Thai

In recent years, distributed computing has reached many areas of computer science, including geographic and remote sensing information systems. However, distributed data processing solutions have primarily focused on processing simple structured documents rather than complex geospatial data, so migrating current algorithms and data management to a distributed processing environment may require a great deal of effort. In data processing, different aspects must be considered, such as speed, precision, or timeliness, all depending on the data types and processing methods. The volume and variety of available data are growing as never before, rapidly exceeding the capabilities of traditional algorithms and hardware environments for data management and computation. Greater efficiency is required to exploit the information contained in Geospatial Big Data. Most current distributed computing frameworks have important limitations regarding transparent and flexible control over processing (and/or storage) nodes. Hence, this paper presents a prototype for the distribution ("tiling"), aggregation ("stitching"), and processing of Big Geospatial Data, focusing on the distribution and processing of raster data. Furthermore, we introduce our own data and metadata catalogue, which stores the "lifecycle" of datasets and is accessible to users and processes. The data distribution framework places no limitations on the programming environment and can execute scripts (and workflows) written in different languages (e.g. Python, R or C#). It is capable of processing raster, vector, and point cloud data, allowing full control of data distribution and processing. In this paper, the IQLib concept (https://github.com/posseidon/IQLib/) and the background of its practical realization as a prototype are presented, formulated within the IQmulus EU FP7 research and development project (http://www.iqmulus.eu). Further investigations of algorithmic and implementation details are the focus of the oral presentation.
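
The single-machine sketch below illustrates the tiling/stitching pattern using NumPy and a local process pool; the per-tile threshold operation and the random input raster are placeholders, and IQLib itself distributes tiles across processing nodes rather than local processes.

```python
# Sketch: decompose a 2-D raster into tiles, process the tiles in
# parallel, and stitch the results back into a full raster.
import numpy as np
from concurrent.futures import ProcessPoolExecutor


def tile(raster, tile_size):
    """Split a 2-D raster into (row, col, block) tiles."""
    rows, cols = raster.shape
    for r in range(0, rows, tile_size):
        for c in range(0, cols, tile_size):
            yield r, c, raster[r:r + tile_size, c:c + tile_size]


def process(block):
    """Placeholder per-tile computation (here: a simple threshold)."""
    return (block > block.mean()).astype(np.uint8)


def stitch(tiles, shape):
    """Reassemble processed tiles into a full raster."""
    out = np.zeros(shape, dtype=np.uint8)
    for r, c, block in tiles:
        out[r:r + block.shape[0], c:c + block.shape[1]] = block
    return out


if __name__ == "__main__":
    raster = np.random.rand(1000, 1000)   # stand-in for a real raster band
    tiles = list(tile(raster, 256))
    coords = [(r, c) for r, c, _ in tiles]
    blocks = [b for _, _, b in tiles]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process, blocks))
    stitched = stitch(
        [(r, c, b) for (r, c), b in zip(coords, results)], raster.shape)
    print(stitched.shape, stitched.dtype)
```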


2015 ◽  
pp. 1394-1407 ◽  
Author(s):  
Yu-Che Chen ◽  
Tsui-Chuan Hsieh

“Big data” is one of the emerging and critical issues facing government in the digital age. This study first delineates the defining features of big data (volume, velocity, and variety) and proposes a big data typology that is suitable for the public sector. This study then examines the opportunities of big data in generating business analytics to promote better utilization of information and communication technology (ICT) resources and improved personalization of e-government services. Moreover, it discusses the big data management challenges in building appropriate governance structure, integrating diverse data sources, managing digital privacy and security risks, and acquiring big data talent and tools. An effective big data management strategy to address these challenges should develop a stakeholder-focused and performance-oriented governance structure and build capacity for data management and business analytics as well as leverage and prioritize big data assets for performance. In addition, this study illustrates the opportunities, challenges, and strategy for big service data in government with the E-housekeeper program in Taiwan. This brief case study offers insight into the implementation of big data for improving government information and services. This article concludes with the main findings and topics of future research in big data for public administration.


2020 ◽  
Author(s):  
Andrew Conway ◽  
Adam Leadbetter ◽  
Tara Keena

Integration of data management systems is a persistent problem in European projects that span multiple agencies. Months, if not years, of project time are often expended on the integration of disparate database structures, data types, methodologies and outputs. Moreover, this work is usually confined to a single effort, meaning it is needlessly repeated on subsequent projects. The legacy effect of removing these barriers could therefore yield monetary and time savings for all involved, far beyond a single cross-jurisdictional project.

The European Union's INTERREG VA Programme has funded the COMPASS project to better manage marine protected areas (MPAs) in peripheral areas. Involving five organisations spread across two nations, the project has developed a cross-border network for marine monitoring. Three of those organisations are UK-based and bound for Brexit (the Agri-Food and Biosciences Institute, Marine Scotland Science and the Scottish Association for Marine Science). With that network under construction, significant effort has been placed on harmonizing data management processes and procedures between the partners.

A data management quality management framework (DM-QMF) was introduced to guide this harmonization and ensure adequate quality controls would be enforced. As lead partner on data management, the Irish Marine Institute (MI) initially shared guidelines for infrastructure, architecture and metadata. The implementation of those requirements was then left to the other four partners, with the MI acting as facilitator. This led to the following being generated for each process in the project:

Data management plan: information on how and what data were to be generated, as well as where they would be stored.

Flow diagrams: a diagrammatic overview of the flow of data through the project.

Standard operating procedures: detailed explanatory documents on the precise workings of a process.

Data management processes were allowed to evolve naturally out of a need to adhere to this set standard. Organisations were able to work within their operational limitations, without being required to alter their existing procedures, but were encouraged to learn from each other. It was quickly found that there were similarities in processes where previously significant differences had been assumed. This process of sharing data management information has created mutually beneficial synergies and enabled the convergence of procedures within the separate organisations.

The downstream data management synergies that COMPASS has produced have already taken effect. Sister INTERREG VA projects, SeaMonitor and MarPAMM, have felt the benefits: the same data management systems cultivated as part of the COMPASS project are being reused, while the groundwork in creating strong cross-boundary channels of communication and cooperation is saving significant amounts of time in project coordination.

Through data management, personal and institutional relationships have been strengthened, both of which should persist beyond the project terminus in 2021, well into a post-Brexit Europe. The COMPASS project has been an exemplar of how close collaboration can persist and thrive in a changing political environment, in spite of the ongoing uncertainty surrounding Brexit.

