Towards Spatial Data Science: Bridging the Gap between GIS, Cartography and Data Science

2019, Vol 1, pp. 1-2
Author(s):  
Jan Wilkening

Abstract. Data is regarded as the oil of the 21st century, and the concept of data science has received increasing attention in recent years. These trends are mainly driven by the rise of big data – data that is big in terms of volume, variety and velocity. Consequently, data scientists are needed to make sense of these large datasets. Companies struggle to find talented people to solve data science problems. This is not surprising, as employers often expect a skillset that can hardly be found in one person: not only does a data scientist need a solid background in machine learning, statistics and various programming languages, but often also in IT systems architecture, databases and complex mathematics. Above all, she should have strong non-technical domain expertise in her field (see Figure 1).

As it is widely accepted that 80% of data has a spatial component, developments in data science could provide exciting new opportunities for GIS and cartography: cartographers are experts in spatial data visualization, and often also very skilled in statistics, data pre-processing and analysis in general. Cartographers' skill levels often depend on the degree to which university cartography programs focus on the "front end" (visualization) of spatial data and leave the "back end" (modelling, gathering, processing, analysis) to GIScientists. In many university curricula, these front-end and back-end distinctions between cartographers and GIScientists are not clearly defined, and the boundaries are somewhat blurred.

In order to become good data scientists, cartographers and GIScientists need to acquire certain additional skills that often lie beyond their university curricula. These skills include programming, machine learning and data mining. These are important technologies for extracting knowledge from big spatial datasets, and thus the logical advancement of "traditional" geoprocessing, which focuses on "traditional" (small, structured, static) datasets such as shapefiles or feature classes.

To bridge the gap between spatial sciences (such as GIS and cartography) and data science, we need an integrated framework of "spatial data science" (Figure 2).

Spatial sciences focus on causality, using theory-based approaches to explain why things happen in space. In contrast, the scope of data science is to find similar patterns in big datasets with techniques of machine learning and data mining – often without considering spatial concepts (such as topology, spatial indexing, spatial autocorrelation, the modifiable areal unit problem, map projections and coordinate systems, uncertainty in measurement, etc.).

Spatial data science could become the core competency of GIScientists and cartographers who are willing to integrate methods from the data science knowledge stack. Moreover, data scientists could enhance their work by integrating important spatial concepts and tools from GIS and cartography into data science workflows. A non-exhaustive knowledge stack for spatial data scientists, including typical tasks and tools, is given in Table 1.

There are many interesting ongoing projects at the interface of spatial and data science. Examples from the ArcGIS platform include:

- Integration of Python GIS APIs with machine learning libraries, such as scikit-learn or TensorFlow, in Jupyter Notebooks (see the sketch after this abstract)
- Combination of R (advanced statistics and visualization) and GIS (basic geoprocessing, mapping) in ModelBuilder and other automation frameworks
- Enterprise GIS solutions for distributed geoprocessing operations on big, real-time vector and raster datasets
- Dashboards for visualizing real-time sensor data and integrating it with other data sources
- Applications for interactive data exploration
- GIS tools for machine learning tasks such as prediction, clustering and classification of spatial data
- GIS integration for Hadoop

While the discussion about proprietary (ArcGIS) vs. open-source (QGIS) software is beyond the scope of this article, it should be noted that (a) many ArcGIS projects are actually open source and (b) using a complete GIS platform instead of several open-source pieces has several advantages, particularly in efficiency, maintenance and support (see Wilkening et al. (2019) for a more detailed discussion). At any rate, cartography and GIS tools are essential technology blocks for solving the (80% spatial) data science problems of the future.
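As a minimal sketch of the first bullet point above (a Python GIS API combined with scikit-learn in a notebook): the shapefile name wells.shp is a hypothetical stand-in, and geopandas is used here in place of any specific vendor GIS API.

```python
# Sketch: spatial clustering of a point dataset with scikit-learn.
# "wells.shp" is a hypothetical input; geopandas stands in for a GIS API.
import geopandas as gpd
from sklearn.cluster import DBSCAN

# Read the points and project to a metric CRS so that distances
# (and hence the DBSCAN eps parameter) are in meters, not degrees.
points = gpd.read_file("wells.shp").to_crs(epsg=3857)

# Extract coordinates as a plain (n, 2) array for scikit-learn.
coords = [(geom.x, geom.y) for geom in points.geometry]

# Density-based clustering: points with at least 5 neighbors within
# 500 m form a cluster; isolated points are labeled -1 (noise).
points["cluster"] = DBSCAN(eps=500, min_samples=5).fit_predict(coords)

# Back to the cartographic side: a quick plot of the clusters.
points.plot(column="cluster", legend=True)
```

Spatial clustering of this kind is one place where the spatial-science concerns listed earlier (projections, spatial autocorrelation) directly affect a standard data science algorithm: DBSCAN's distance threshold is only meaningful once the data is in a suitable coordinate system.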

Author(s):  
Brian Granger
Fernando Pérez

Project Jupyter is an open-source project for interactive computing widely used in data science, machine learning, and scientific computing. We argue that even though Jupyter helps users perform complex, technical work, Jupyter itself solves problems that are fundamentally human in nature. Namely, Jupyter helps humans to think and tell stories with code and data. We illustrate this by describing three dimensions of Jupyter: interactive computing, computational narratives, and the idea that Jupyter is more than software. We then show the impact of these dimensions on a community of practice in Earth and climate science.


Author(s):  
Sabitha Rajagopal

Data Science employs techniques and theories to create data products. A data product is merely a data application that acquires its value from the data itself and creates more data as a result; it is not just an application with data. Data science involves the methodical study of digital data employing techniques of observation, development, analysis, testing and validation. It tackles real-time challenges by adopting a holistic approach. It 'creates' knowledge about large and dynamic databases, 'develops' methods to manage data and 'optimizes' processes to improve performance. The goal includes vital investigation and innovation in conjunction with functional exploration intended to inform decision-making for individuals, businesses, and governments. This paper discusses the emergence of Data Science and its subsequent developments in the fields of Data Mining and Data Warehousing. The research focuses on the need, challenges, impact, ethics and progress of Data Science. Finally, insights into the subsequent phases in the research and development of Data Science are provided.


Author(s):  
Pushpa Singh
Rajeev Agrawal

This article focuses on the prospects of open-source software and tools for maximizing user expectations in heterogeneous networks. The open-source language Python is used as the software tool in this research work for implementing a machine learning technique for the categorization of user types in a heterogeneous network (HN). A KNN classifier implemented in Python determines the user category in real time, predicting the users available in a particular category in order to maximize profit for a business organization.
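As an illustration of the approach described above, here is a minimal sketch of user categorization with scikit-learn's KNN classifier. The features (bandwidth demand, mobility, session length) and the category labels are hypothetical stand-ins, since the abstract does not publish the paper's actual schema.

```python
# Sketch of KNN-based user categorization in a heterogeneous network.
# Features (Mbps, km/h, minutes) and labels are hypothetical stand-ins.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X_train = np.array([
    [50.0, 10.0, 120.0],  # high-bandwidth, mostly stationary users
    [45.0, 12.0, 110.0],
    [2.0, 60.0, 10.0],    # low-bandwidth, highly mobile users
    [3.0, 55.0, 12.0],
    [10.0, 5.0, 300.0],   # moderate-bandwidth, long-session users
    [12.0, 4.0, 280.0],
])
y_train = ["premium", "premium", "roaming", "roaming", "regular", "regular"]

# Scale features so no single unit dominates the distance metric.
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X_train), y_train)

# Categorize a newly observed user in (near) real time.
new_user = [[48.0, 11.0, 115.0]]
print(knn.predict(scaler.transform(new_user)))  # e.g. ['premium']
```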


2016, Vol 21 (3), pp. 525-547
Author(s):  
Scott Tonidandel
Eden B. King
Jose M. Cortina

Advances in data science, such as data mining, data visualization, and machine learning, are extremely well-suited to address numerous questions in the organizational sciences given the explosion of available data. Despite these opportunities, few scholars in our field have discussed the specific ways in which the lens of our science should be brought to bear on the topic of big data and big data's reciprocal impact on our science. The purpose of this paper is to provide an overview of the big data phenomenon and its potential for impacting organizational science in both positive and negative ways. We identify the biggest opportunities afforded by big data along with the biggest obstacles, and we discuss specifically how we think our methods will be most impacted by the data analytics movement. We also provide a list of resources to help interested readers incorporate big data methods into their existing research. Our hope is that we stimulate interest in big data, motivate future research using big data sources, and encourage the application of associated data science techniques more broadly in the organizational sciences.


Author(s):  
Greg Lawrance
Raphael Parra Hernandez
Khalegh Mamakani
Suraiya Khan
Brent Hills
...  

Introduction: Ligo is an open-source application that provides a framework for managing and executing administrative data linking projects. Ligo provides an easy-to-use web interface that lets analysts select among data linking methods, including deterministic, probabilistic and machine learning approaches, and use these in a documented, repeatable, tested, step-by-step process.

Objectives and Approach: The linking application has two primary functions: identifying common entities within a dataset (de-duplication) and identifying common entities between datasets (linking). The application is being built from the ground up in a partnership between the Province of British Columbia's Data Innovation (DI) Program and Population Data BC, with input from data scientists. The simple web interface allows analysts to streamline the processing of multiple datasets in a straightforward and reproducible manner.

Results: Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms, with a plugin architecture that supports the simple addition of new formulae. Currently, deterministic approaches to linking have been implemented and probabilistic methods are in alpha testing. A fully functional alpha, including deterministic and probabilistic methods, is expected to be ready in September, with a machine learning extension expected soon after.

Conclusion/Implications: Ligo has been designed with enterprise users in mind. The application is intended to make the processes of data de-duplication and linking simple, fast and reproducible. By making the application open source, we encourage feedback and collaboration from across the population research and data science community.
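To make the deterministic approach concrete, here is a sketch of what one deterministic linking rule of the kind Ligo supports might look like. The field names, normalization choices, and match key are hypothetical illustrations; the abstract does not publish Ligo's actual plugin interface.

```python
# Hypothetical sketch of a deterministic record-linkage rule; the fields
# and normalization are illustrative, not Ligo's actual implementation.
import hashlib

def normalize(record):
    """Canonicalize the fields used as a deterministic match key."""
    return (
        record["last_name"].strip().lower(),
        record["first_name"].strip().lower()[:1],  # first initial only
        record["date_of_birth"],                   # ISO yyyy-mm-dd
        record["postal_code"].replace(" ", "").upper(),
    )

def match_key(record):
    """Hash the normalized fields so keys can be compared across
    datasets without exposing the underlying identifiers."""
    return hashlib.sha256("|".join(normalize(record)).encode()).hexdigest()

def link(dataset_a, dataset_b):
    """Return pairs of records that share the same deterministic key."""
    index = {}
    for rec in dataset_a:
        index.setdefault(match_key(rec), []).append(rec)
    return [(a, b) for b in dataset_b for a in index.get(match_key(b), [])]
```

A probabilistic method, by contrast, would score partial agreement across fields rather than require an exact key match, which is why the two approaches are offered as separate, selectable methods.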


Nowadays, Data Mining is used everywhere to extract information from data and, in turn, acquire knowledge for decision making. Data Mining analyzes patterns that are used to extract information and knowledge for making decisions. Many open-source and licensed tools, such as Weka, RapidMiner, KNIME, and Orange, are available for Data Mining and predictive analysis. This paper discusses the different tools available for Data Mining and Machine Learning, followed by a description and the pros and cons of each tool. The article provides details of the supported techniques, such as classification, regression, characterization, discretization, clustering, visualization and feature selection, for these Data Mining and Machine Learning tools. It will help people make efficient decisions and suggests which tool is suitable for a given requirement.
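To ground two of the technique families listed above (classification and clustering), here is a minimal example on the classic Iris dataset. It uses scikit-learn, a scriptable open-source counterpart to the GUI-oriented tools surveyed, rather than one of the four tools themselves.

```python
# Classification and clustering on the Iris dataset with scikit-learn,
# an open-source library counterpart to GUI tools like Weka or Orange.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification: 5-fold cross-validated accuracy of a decision tree.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(f"decision tree accuracy: {scores.mean():.2f}")

# Clustering: unsupervised grouping into three clusters.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(clusters[:10])
```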


2019, Vol 8 (2S11), pp. 2342-2345

TensorFlow is an open-source machine learning library for research and production. TensorFlow offers APIs for beginners and experts to develop for desktop, mobile, web, and cloud. Google's TensorFlow is best known for its deep learning applications. Deep learning excels at pattern recognition/machine perception, and it is being applied to images, video, sound, voice, text and time-series data. It classifies and clusters such data, at times with superhuman accuracy. This can be applied to the recognition of different objects, such as a ball, cat, bottle or car. It can use Android as its platform, using the phone's camera to train on the datasets and recognize different objects in real time.
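As a concrete illustration of the pattern-recognition use case described above, here is a minimal image classifier using TensorFlow's Keras API on the built-in MNIST digit dataset. The object-recognition scenario from the abstract (ball, cat, bottle, car) would use the same structure with a camera-collected dataset and, typically, convolutional layers.

```python
# Minimal TensorFlow/Keras image classifier on the built-in MNIST digits;
# an object recognizer (ball, cat, bottle, car, ...) would use the same
# structure with a different, camera-collected dataset.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),  # one unit per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))  # [test loss, test accuracy]
```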


2021
Author(s):  
Luc Thomès
Rebekka Burkholz
Daniel Bojar

Abstract. As a biological sequence, glycans occur in every domain of life and comprise monosaccharides that are chained together to form oligo- or polysaccharides. While glycans are crucial for most biological processes, existing analysis modalities make it difficult for researchers with limited computational background to include information from these diverse and nonlinear sequences into standard workflows. Here, we present glycowork, an open-source Python package that was designed for the processing and analysis of glycan data by end users, with a strong focus on glycan-related data science and machine learning. Glycowork includes numerous functions to, for instance, automatically annotate glycan motifs and analyze their distributions via heatmaps and statistical enrichment. We also provide visualization methods, routines to interact with stored databases, trained machine learning models, and learned glycan representations. We envision that glycowork can extract further insights from any glycan dataset and demonstrate this with several workflows that analyze glycan motifs in various biological contexts. Glycowork can be freely accessed at https://github.com/BojarLab/glycowork/.
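A minimal sketch of how the motif-annotation workflow described above might be invoked. The module path and function name are assumptions based on the package's documentation and should be checked against the linked repository for the current API.

```python
# Hedged sketch of motif annotation with glycowork; verify the module path
# and signature against https://github.com/BojarLab/glycowork/.
from glycowork.motif.annotate import annotate_dataset  # assumed entry point

# Glycans in IUPAC-condensed notation, the format used throughout glycowork.
glycans = [
    "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc",
    "Gal(b1-4)GlcNAc(b1-2)Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc",
]

# Returns a glycan-by-motif table of annotated substructures, which can
# then feed the heatmap and statistical enrichment routines.
motif_table = annotate_dataset(glycans)
print(motif_table.head())
```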


2021, Vol 73 (04), pp. 41-41
Author(s):  
Doug Lehr

In the 2020 Completions Technology Focus, I stated that digitization will forever change how the most complex problems in our industry are solved. And, despite another severe downturn in the upstream industry, data science continues to provide solutions for complex unconventional well problems.

Casing Damage
Casing collapse is an ongoing problem and almost always occurs in the heel of the well. It prevents passage of frac plugs and milling tools. Forcing a frac plug through the collapsed section damages the plug, predisposing it to failure, which leads to more casing damage and poor stimulation. One team has developed a machine-learning (ML) model showing a positive correlation between zones with high fracturing gradients and collapsed casing. The objective is a predictive tool that enables a completion design that avoids these zones.

Fracture-Driven Interactions (FDIs) Can Be Avoided in Real Time
Pressurized fracturing fluids from one well can communicate with fractures in a nearby well or can intersect that wellbore. Such FDIs can occur while fracturing a child well and can negatively affect production in the parent well. FDIs are caused by well spacing, depletion, or completion design but, until recently, were not quickly diagnosed. Analytics and machine learning now are being used to analyze streaming data sets during a frac job to detect FDIs. A recently piloted detection system alerts the operator in real time, which enables avoidance of FDIs on the fly.

Data Science Provides the Tools
Analyzing casing damage and FDIs is a complex task involving large amounts of data that are already available or easily acquired. Tools such as ML perform the data analysis and enable decision making. Data science is enabling the unconventional "onion" to be peeled many layers at a time.

Recommended additional reading at OnePetro: www.onepetro.org.
SPE 199967 - Artificial Intelligence for Real-Time Monitoring of Fracture-Driven Interactions and Simultaneous Completion Optimization by Hayley Stephenson, Baker Hughes, et al.
SPE 201615 - Novel Completion Design To Bypass Damage and Increase Reservoir Contact: A Middle Magdalena, Central Colombian Case History by Rosana Polo, Oxy, et al.
SPE 202966 - Well Completion Optimization in Canada Tight Gas Fields Using Ensemble Machine Learning by Lulu Liao, Sinopec, et al.
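As an illustration only (the model and data cited under Casing Damage above are the team's own and are not published here), a predictive tool of that kind could start as simply as a classifier relating fracturing gradient to observed collapse, sketched below on synthetic data.

```python
# Illustrative only: a toy classifier relating fracturing gradient to
# casing collapse. The real model and data in the column are proprietary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic zones: fracturing gradient (psi/ft); in this toy setup,
# collapse becomes more likely as the gradient rises.
frac_gradient = rng.uniform(0.6, 1.1, size=200).reshape(-1, 1)
collapsed = (frac_gradient[:, 0] + rng.normal(0, 0.08, 200) > 0.95).astype(int)

model = LogisticRegression().fit(frac_gradient, collapsed)

# Screen a planned completion: flag zones whose predicted collapse risk
# is high so the design can avoid setting plugs there.
candidate_zones = np.array([[0.75], [0.92], [1.05]])
print(model.predict_proba(candidate_zones)[:, 1])
```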

