Ligo: An Open Source Application for the Management and Execution of Administrative Data Linkage

Author(s):  
Greg Lawrance ◽  
Raphael Parra Hernandez ◽  
Khalegh Mamakani ◽  
Suraiya Khan ◽  
Brent Hills ◽  
...  

Introduction: Ligo is an open source application that provides a framework for managing and executing administrative data linking projects. Ligo provides an easy-to-use web interface that lets analysts select among data linking methods, including deterministic, probabilistic and machine learning approaches, and use these in a documented, repeatable, tested, step-by-step process.

Objectives and Approach: The linking application has two primary functions: identifying common entities within a dataset (de-duplication) and identifying common entities between datasets (linking). The application is being built from the ground up in a partnership between the Province of British Columbia’s Data Innovation (DI) Program and Population Data BC, with input from data scientists. The simple web interface allows analysts to streamline the processing of multiple datasets in a straightforward and reproducible manner.

Results: Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms, with a plugin architecture that supports the simple addition of new formulae. Currently, deterministic approaches to linking have been implemented and probabilistic methods are in alpha testing. A fully functional alpha, including deterministic and probabilistic methods, is expected to be ready in September, with a machine learning extension expected soon after.

Conclusion/Implications: Ligo has been designed with enterprise users in mind. The application is intended to make the processes of data de-duplication and linking simple, fast and reproducible. By making the application open source, we encourage feedback and collaboration from across the population research and data science community.
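The abstract does not include code, but the difference between deterministic and probabilistic linking that it refers to can be illustrated with a minimal sketch in pandas. The datasets, field names, and matching rule below are hypothetical and do not reflect Ligo's actual interface or schema.

```python
# Illustrative sketch only: a deterministic, key-based link between two
# hypothetical datasets, in the spirit of the linking methods Ligo exposes.
# Field names and values are made up; this is not Ligo's API.
import pandas as pd

left = pd.DataFrame({
    "id_a": [1, 2, 3],
    "first_name": ["Ana", "Bob", "Carla"],
    "last_name": ["Silva", "Ng", "Jones"],
    "dob": ["1980-01-02", "1975-06-30", "1990-12-11"],
})
right = pd.DataFrame({
    "id_b": [10, 11, 12],
    "first_name": ["ana", "Robert", "Carla"],
    "last_name": ["Silva", "Ng", "Jones"],
    "dob": ["1980-01-02", "1975-06-30", "1990-12-11"],
})

def normalize(df):
    # Deterministic linkage is sensitive to formatting, so normalize first.
    out = df.copy()
    for col in ("first_name", "last_name"):
        out[col] = out[col].str.strip().str.lower()
    return out

# Exact (deterministic) match on the normalized composite key.
links = normalize(left).merge(
    normalize(right), on=["first_name", "last_name", "dob"], how="inner"
)
print(links[["id_a", "id_b"]])
```

In this toy example the exact key fails to link "Bob" to "Robert"; handling such near-matches with similarity scores and thresholds is what the probabilistic and machine learning methods mentioned above are for.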

Author(s):  
Brian Granger ◽  
Fernando Pérez

Project Jupyter is an open-source project for interactive computing widely used in data science, machine learning, and scientific computing. We argue that even though Jupyter helps users perform complex, technical work, Jupyter itself solves problems that are fundamentally human in nature. Namely, Jupyter helps humans to think and tell stories with code and data. We illustrate this by describing three dimensions of Jupyter: interactive computing, computational narratives, and the idea that Jupyter is more than software. We then illustrate the impact of these dimensions on a community of practice in Earth and climate science.


2019 ◽  
Vol 1 ◽  
pp. 1-2
Author(s):  
Jan Wilkening

Abstract. Data is regarded as the oil of the 21st century, and the concept of data science has received increasing attention in recent years. These trends are mainly caused by the rise of big data: data that is big in terms of volume, variety and velocity. Consequently, data scientists are required to make sense of these large datasets, yet companies have problems acquiring talented people to solve data science problems. This is not surprising, as employers often expect a skillset that can hardly be found in one person: not only does a data scientist need a solid background in machine learning, statistics and various programming languages, but often also in IT systems architecture, databases and complex mathematics. Above all, she should have strong non-technical domain expertise in her field (see Figure 1).

As it is widely accepted that 80% of data has a spatial component, developments in data science could provide exciting new opportunities for GIS and cartography: cartographers are experts in spatial data visualization and are often also very skilled in statistics, data pre-processing and analysis in general. Cartographers’ skill levels often depend on the degree to which cartography programs at universities focus on the “front end” (visualization) of spatial data and leave the “back end” (modelling, gathering, processing, analysis) to GIScientists. In many university curricula, these front-end and back-end distinctions between cartographers and GIScientists are not clearly defined, and the boundaries are somewhat blurred.

In order to become good data scientists, cartographers and GIScientists need to acquire certain additional skills that often lie beyond their university curricula. These skills include programming, machine learning and data mining: important technologies for extracting knowledge from big spatial datasets, and thereby the logical advancement of “traditional” geoprocessing, which focuses on “traditional” (small, structured, static) datasets such as shapefiles or feature classes.

To bridge the gap between spatial sciences (such as GIS and cartography) and data science, we need an integrated framework of “spatial data science” (Figure 2).

Spatial sciences focus on causality, using theory-based approaches to explain why things happen in space. In contrast, the scope of data science is to find similar patterns in big datasets with techniques of machine learning and data mining, often without considering spatial concepts (such as topology, spatial indexing, spatial autocorrelation, the modifiable areal unit problem, map projections and coordinate systems, uncertainty in measurement, etc.).

Spatial data science could become the core competency of GIScientists and cartographers who are willing to integrate methods from the data science knowledge stack. Moreover, data scientists could enhance their work by integrating important spatial concepts and tools from GIS and cartography into data science workflows. A non-exhaustive knowledge stack for spatial data scientists, including typical tasks and tools, is given in Table 1.

There are many interesting ongoing projects at the interface of spatial and data science. Examples from the ArcGIS platform include:

- Integration of Python GIS APIs with machine learning libraries, such as scikit-learn or TensorFlow, in Jupyter Notebooks
- Combination of R (advanced statistics and visualization) and GIS (basic geoprocessing, mapping) in ModelBuilder and other automation frameworks
- Enterprise GIS solutions for distributed geoprocessing operations on big, real-time vector and raster datasets
- Dashboards for visualizing real-time sensor data and integrating it with other data sources
- Applications for interactive data exploration
- GIS tools for machine learning tasks such as prediction, clustering and classification of spatial data
- GIS integration for Hadoop

While the discussion about proprietary (ArcGIS) vs. open-source (QGIS) software is beyond the scope of this article, it has to be stated that (a) many ArcGIS projects are actually open source and (b) using a complete GIS platform instead of several open-source pieces has several advantages, particularly in efficiency, maintenance and support (see Wilkening et al. (2019) for a more detailed consideration). At any rate, cartography and GIS tools are the essential technology blocks for solving the (80% spatial) data science problems of the future.
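As a concrete illustration of the first bullet above (combining a Python GIS API with a machine learning library), here is a minimal, hypothetical sketch that uses the open-source geopandas library rather than the ArcGIS Python API, together with scikit-learn; the point coordinates are invented for the example.

```python
# Minimal illustrative sketch: combining a Python GIS library (geopandas)
# with a machine learning library (scikit-learn). The point data is made up.
import geopandas as gpd
from shapely.geometry import Point
from sklearn.cluster import KMeans

# A handful of hypothetical sensor locations (longitude, latitude).
coords = [(13.40, 52.52), (13.41, 52.53), (2.35, 48.86), (2.36, 48.85), (-0.12, 51.51)]
gdf = gpd.GeoDataFrame(geometry=[Point(x, y) for x, y in coords], crs="EPSG:4326")

# Cluster the points into two spatial groups with k-means.
xy = [[geom.x, geom.y] for geom in gdf.geometry]
gdf["cluster"] = KMeans(n_clusters=2, n_init=10).fit_predict(xy)
print(gdf)
```

Note that clustering raw longitude/latitude values ignores map projection and distance distortion, exactly the kind of spatial concept the abstract cautions data scientists not to overlook.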


2021 ◽  
Author(s):  
Luc Thomès ◽  
Rebekka Burkholz ◽  
Daniel Bojar

Abstract. As a biological sequence, glycans occur in every domain of life and comprise monosaccharides that are chained together to form oligo- or polysaccharides. While glycans are crucial for most biological processes, existing analysis modalities make it difficult for researchers with limited computational background to include information from these diverse and nonlinear sequences into standard workflows. Here, we present glycowork, an open-source Python package that was designed for the processing and analysis of glycan data by end users, with a strong focus on glycan-related data science and machine learning. Glycowork includes numerous functions to, for instance, automatically annotate glycan motifs and analyze their distributions via heatmaps and statistical enrichment. We also provide visualization methods, routines to interact with stored databases, trained machine learning models, and learned glycan representations. We envision that glycowork can extract further insights from any glycan dataset and demonstrate this with several workflows that analyze glycan motifs in various biological contexts. Glycowork can be freely accessed at https://github.com/BojarLab/glycowork/.
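As a rough illustration of the kind of motif analysis glycowork automates, the sketch below counts a single disaccharide motif in IUPAC-condensed glycan strings using plain Python. This is not glycowork's API, and the glycans and motif are arbitrary examples; the package's actual annotation functions are far more general.

```python
# Toy illustration of glycan motif counting on IUPAC-condensed strings.
# glycowork's own annotation functions are far more general; this is not
# its API, just a minimal version of the underlying idea.
glycans = [
    "Gal(b1-4)GlcNAc(b1-2)Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc",
    "Fuc(a1-2)Gal(b1-4)GlcNAc",
    "Gal(b1-4)Glc",
]
motif = "Gal(b1-4)GlcNAc"  # a LacNAc-like disaccharide, chosen for the example

for glycan in glycans:
    n = glycan.count(motif)
    print(f"{n} occurrence(s) of {motif} in {glycan}")
```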


2020 ◽  
Author(s):  
Junko Ouchi ◽  
Kanetoshi Hattori

Abstract
Background: The present study aimed to estimate the numbers of short-stay service recipients in all administrative units in Hokkaido from 2020 to 2045 with machine learning approaches and to review the changing trends in the spatial distribution of service recipients with cartograms.
Methods: A machine learning approach was used for the estimation. To develop the estimation model, population data in Japan from 2015 to 2017 were used as input signals, whereas data on the numbers of short-stay service recipients at each level of need for long-term care (levels 1–5) from 2015 to 2017 were used as a supervisory signal. Three models were developed to avoid problems of repeatability. Then, projected population data for Hokkaido every 5 years from 2020 to 2045 were fed into each model to estimate the numbers of service recipients for the 188 administrative units of Hokkaido. The medians of the estimations from the three models were taken as the final results, and the estimates for the 188 administrative units were presented with continuous area cartograms on the map of Hokkaido.
Results: The developed models predicted that the number of service recipients in Hokkaido would peak at 18,016 in 2035 and that the number of people at level 3 in particular would increase. The cartograms for levels 2 and 3 from 2020 to 2030 and for level 3 in 2035 were heavily distorted in several populated areas of Hokkaido, indicating that the majority of service recipients would be concentrated in those areas.
Conclusions: Machine learning approaches can provide estimates of future care demand for each administrative unit in a prefecture in Japan based on past population and care demand data. The results of the present study can be useful when effective allocations of human resources for nursing care in the region are discussed.
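A minimal sketch of the general estimation scheme described above (several independently trained models whose predictions are combined by taking the median) might look as follows; the data, model type, and features are synthetic stand-ins rather than those used in the study.

```python
# Illustrative sketch: train several models on past population data (inputs)
# and service-recipient counts (targets), then take the median of their
# predictions for future years. All data here is synthetic; the study's
# actual model type and features are not reproduced.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(1_000, 50_000, size=(188, 3))            # e.g. population by age band, 2015-2017
y_train = X_train.sum(axis=1) * 0.001 + rng.normal(0, 5, 188)  # recipient counts (toy relationship)
X_future = rng.uniform(1_000, 50_000, size=(188, 3))           # projected population for a future year

# Three models trained with different seeds, addressing repeatability concerns.
preds = [
    RandomForestRegressor(random_state=seed).fit(X_train, y_train).predict(X_future)
    for seed in (0, 1, 2)
]
median_estimate = np.median(np.vstack(preds), axis=0)
print(median_estimate[:5])  # median estimates for the first five administrative units
```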


2019 ◽  
Author(s):  
Gaurav Vishwakarma ◽  
Mojtaba Haghighatlari ◽  
Johannes Hachmann

Machine learning has been emerging as a promising tool in the chemical and materials domain. In this paper, we introduce a framework to automatically perform rational model selection and hyperparameter optimization, which are important concerns for the efficient and successful use of machine learning but have so far largely remained unexplored by this community. The framework features four variations of a genetic algorithm and is implemented in the chemml program package. Its performance is benchmarked against algorithms and packages popular in the data science community, and the results show that our implementation outperforms these methods in terms of both time and accuracy. The effectiveness of our implementation is further demonstrated via a scenario involving multi-objective optimization for model selection.
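For readers unfamiliar with genetic-algorithm hyperparameter search, the sketch below shows a toy version of the idea using scikit-learn. It is not chemml's implementation; the hyperparameters, population size, and mutation scheme are arbitrary choices made for illustration.

```python
# Toy genetic-algorithm hyperparameter search, illustrating the kind of task
# the paper's framework automates. This is NOT chemml's implementation.
import random
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

def fitness(genome):
    # Genome = (n_estimators, max_depth); fitness = cross-validated R^2.
    n_estimators, max_depth = genome
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

def mutate(genome):
    n_estimators, max_depth = genome
    return (max(10, n_estimators + random.randint(-20, 20)),
            max(2, max_depth + random.randint(-2, 2)))

random.seed(0)
population = [(random.randint(10, 100), random.randint(2, 12)) for _ in range(6)]
for _ in range(5):  # five generations
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:3]                                   # selection: keep the fittest half
    population = parents + [mutate(random.choice(parents)) for _ in range(3)]  # mutation

best = max(population, key=fitness)
print("best (n_estimators, max_depth):", best)
```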


Author(s):  
Omar Farooq ◽  
Parminder Singh

Introduction: The emergence of concepts like Big Data, Data Science, Machine Learning (ML), and the Internet of Things (IoT) has expanded the potential of research in today's world. The continuous use of IoT devices and sensors that collect data puts tremendous pressure on the existing IoT network. Materials and Methods: This resource-constrained IoT environment is flooded with data acquired from millions of IoT nodes deployed at the device level. The limited resources of the IoT network have driven researchers towards data management. This paper focuses on data classification at the device level, edge/fog level, and cloud level using machine learning techniques. Results: The data coming from different devices is vast and varied, so it is essential to choose the right approach for classification and analysis; this helps optimize data at the device and edge/fog levels to improve the network's future performance. Conclusion: This paper presents data classification, machine learning approaches, and a proposed mathematical model for the IoT environment.
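A minimal sketch of device- or edge-level classification of sensor readings, in the spirit of the pipeline described above, is given below. The features, labels, and model are hypothetical and do not reproduce the paper's proposed mathematical model.

```python
# Illustrative sketch: classify sensor readings at the edge with a lightweight
# model, forwarding only the interesting readings to the cloud. Features and
# labels are synthetic; this is not the paper's proposed model.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Hypothetical readings: temperature (degrees C), humidity (%), vibration (g).
X = rng.normal(loc=[25.0, 40.0, 0.1], scale=[5.0, 10.0, 0.05], size=(500, 3))
# Toy labeling rule: a reading is "anomalous" when temperature exceeds 30 C.
y = (X[:, 0] > 30).astype(int)

# A shallow decision tree is cheap enough to run on a constrained edge node.
edge_model = DecisionTreeClassifier(max_depth=3).fit(X, y)

new_reading = [[33.0, 45.0, 0.12]]
print("forward to cloud" if edge_model.predict(new_reading)[0] else "handle locally")
```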


2018 ◽  
Vol 115 (50) ◽  
pp. 12638-12645 ◽  
Author(s):  
Sallie Keller ◽  
Gizem Korkmaz ◽  
Carol Robbins ◽  
Stephanie Shipp

Measuring the value of intangibles is not easy, because they are critical but usually invisible components of the innovation process. Today, access to nonsurvey data sources, such as administrative data and repositories captured on web pages, opens opportunities to create intangibles based on new sources of information and capture intangible innovations in new ways. Intangibles include ownership of innovative property and human resources that make a company unique but are currently unmeasured. For example, intangibles represent the value of a company’s databases and software, the tacit knowledge of their workers, and the investments in research and development (R&D) and design. Through two case studies, the challenges and processes to both create and measure intangibles are presented using a data science framework that outlines processes to discover, acquire, profile, clean, link, explore the fitness-for-use, and statistically analyze the data. The first case study shows that creating organizational innovation is possible by linking administrative data across business processes in a Fortune 500 company. The motivation for this research is to develop company processes capable of synchronizing their supply chain end to end while capturing dynamics that can alter the inventory, profits, and service balance. The second example shows the feasibility of measurement of innovation related to the characteristics of open source software through data scraped from software repositories that provide this information. The ultimate goal is to develop accurate and repeatable measures to estimate the value of nonbusiness sector open source software to the economy. This early work shows the feasibility of these approaches.


2018 ◽  
Vol 47 (6) ◽  
pp. 1081-1097 ◽  
Author(s):  
Jeremy Auerbach ◽  
Christopher Blackburn ◽  
Hayley Barton ◽  
Amanda Meng ◽  
Ellen Zegura

We estimate the cost and impact of a proposed anti-displacement program in the Westside of Atlanta (GA) with data science and machine learning techniques. This program intends to fully subsidize property tax increases for eligible residents of neighborhoods where two major urban renewal projects are underway, a stadium and a multi-use trail. We first estimate household-level income eligibility for the program with data science and machine learning approaches applied to publicly available household-level data. We then forecast future property appreciation due to the urban renewal projects using random forests with historic tax assessment data. Combining these projections with household-level eligibility, we estimate the costs of the program for different eligibility scenarios. We find that our household-level data and machine learning techniques result in fewer eligible homeowners but significantly larger program costs than the original analysis, which was based on census and city-level data, due to higher estimated property appreciation rates. Our methods have limitations, namely incomplete data sets, the accuracy of representative income samples, the availability of characteristic training set data for the property tax appreciation model, and challenges in validating the model results. The eligibility estimates and property appreciation forecasts we generated were also incorporated into an interactive tool for residents to determine program eligibility and view their expected increases in home values. Community residents have been involved with this work, which has provided greater transparency, accountability, and impact for the proposed program. Data collected from residents can also correct and update the information, which would increase the accuracy of the program estimates and validate the modeling, leading to a novel application of community-driven data science.
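Under strong simplifying assumptions, the property appreciation forecast described above can be sketched as a random forest regression on parcel-level features. The features and values below are synthetic placeholders, not the study's tax assessment data.

```python
# Illustrative sketch: forecasting next-year assessed value with a random
# forest trained on historic, parcel-level features. All values are synthetic
# placeholders, not the study's tax assessment data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_parcels = 300
# Hypothetical features: prior assessed value ($), distance to trail (km),
# distance to stadium (km), lot size (sq ft).
X = np.column_stack([
    rng.uniform(50_000, 300_000, n_parcels),
    rng.uniform(0.1, 5.0, n_parcels),
    rng.uniform(0.1, 5.0, n_parcels),
    rng.uniform(2_000, 10_000, n_parcels),
])
# Toy target: values appreciate faster for parcels closer to the projects.
y = X[:, 0] * (1.03 + 0.02 / (1 + X[:, 1]) + 0.02 / (1 + X[:, 2]))

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
projected = model.predict(X[:5])
print(np.round(projected / X[:5, 0] - 1, 3))  # projected one-year appreciation rates
```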


2020 ◽  
Author(s):  
Junko Ouchi ◽  
Kanetoshi Hattori

Abstract
Background: For the effective allocation of limited human resources, it is important to have accurate estimates of future elderly care demand at the regional level, especially in areas where aging and population decline vary by region, such as Hokkaido prefecture in Japan.
Objectives: The present study aimed to estimate the numbers of short-stay service recipients in all administrative units in Hokkaido from 2020 to 2045 with machine learning approaches and to review the changing trends in the spatial distribution of service recipients with cartograms.
Methods: A machine learning approach was used for the estimation. To develop the estimation model, population data in Japan from 2015 to 2017 were used as input signals, whereas data on the numbers of short-stay service recipients at each level of need for long-term care (levels 1–5) from 2015 to 2017 were used as a supervisory signal. Three models were developed to avoid problems of repeatability. Then, projected population data for Hokkaido every 5 years from 2020 to 2045 were fed into each model to estimate the numbers of service recipients for the 188 administrative units of Hokkaido. The medians of the estimations from the three models were taken as the final results, and the estimates for the 188 administrative units were presented with continuous area cartograms on the map of Hokkaido.
Results: The developed models predicted that the number of service recipients in Hokkaido would peak at 18,016 in 2035 and that the number of people at level 3 in particular would increase. The cartograms for levels 2 and 3 from 2020 to 2030 and for level 3 in 2035 were heavily distorted in the populated areas of Hokkaido.
Discussion: The large correlation coefficients indicated the accuracy of the estimations by the developed models. The growing number of service recipients, especially at level 3 by 2035, was assumed to be related to the aging of the first baby boomers in Japan. The distortions of the cartograms suggested that the majority of service recipients would be concentrated in the populated areas of Hokkaido. Future allocations of human resources were discussed on the basis of these findings.

