A Multi-Feature Based Automatic Approach to Geospatial Record Linking

Author(s):  
Ying Zhang ◽  
Puhai Yang ◽  
Chaopeng Li ◽  
Gengrui Zhang ◽  
Cheng Wang ◽  
...  

This article describes how geographic information systems (GISs) can enable, enrich and enhance geospatial applications and services. Accurate calculation of the similarity among geospatial entities that belong to different data sources is of great importance for geospatial data linking. At present, most research uses the name or category of an entity to measure the similarity of geographic information. Although the geospatial relationship is significant for measuring geographic similarity, most previous works have ignored it. This article introduces the geospatial relationship and topology, and proposes an approach to compute geospatial record similarity based on multiple features, including geospatial relationships, category and name tags. To improve flexibility and operability, supervised machine learning, such as an SVM, is used to classify pairs of mapping records. The authors test their approach using three sources, namely OpenStreetMap, Google and Wikimapia. The results show that the proposed approach correlates highly with human judgements.
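As a rough illustration of the classification step described above, the following sketch trains an SVM on multi-feature similarity vectors for record pairs. The feature names and values are invented for the example and are not taken from the article itself.

```python
# Hypothetical sketch: classifying pairs of geospatial records as match / non-match
# with an SVM over multiple similarity features (name, category, spatial relation).
import numpy as np
from sklearn.svm import SVC

# Each row: [name_similarity, category_match, spatial_distance_km] (illustrative)
X_train = np.array([
    [0.95, 1.0, 0.02],   # likely the same entity
    [0.90, 1.0, 0.05],
    [0.30, 0.0, 4.10],   # likely different entities
    [0.20, 1.0, 7.50],
])
y_train = np.array([1, 1, 0, 0])  # 1 = matching pair, 0 = non-matching pair

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

# e.g. an OSM record compared against a Wikimapia record
candidate = np.array([[0.88, 1.0, 0.03]])
print(clf.predict(candidate)[0])
```

In practice the feature vector would be built from the similarity measures the article proposes (geospatial relationship, topology, category and name tags) rather than these toy values.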

2019 ◽  
pp. 10-30


2020 ◽  
Vol 12 (01) ◽  
pp. 2050003
Author(s):  
Ahmed Lasisi ◽  
Pengyu Li ◽  
Jian Chen

Highway-rail grade crossing (HRGC) accidents continue to be a major source of transportation casualties in the United States. This can be attributed to increased road and rail operations and/or the lack of adequate safety programs based on comprehensive HRGC accident analysis, among other reasons. The focus of this study is to predict HRGC accidents in a given rail network based on a machine learning analysis of a similar network with cognate attributes. This study improves on past studies that either attempt to predict accidents at a given HRGC or spatially analyze HRGC accidents for a particular rail line. In this study, a case for a hybrid machine learning and geographic information systems (GIS) approach is presented for a large rail network. The study involves the collection and wrangling of relevant data from various sources; exploratory analysis; and supervised machine learning (classification and regression) of HRGC data from 2008 to 2017 in California. The models developed from this analysis were used to make binary predictions [98.9% accuracy and 0.9838 Receiver Operating Characteristic (ROC) score] and quantitative estimations of HRGC casualties in a similar network over the next 10 years. While the results are spatially presented in GIS, this novel hybrid application of machine learning and GIS to HRGC accident analysis will help stakeholders proactively reduce casualties by addressing the major accident causes identified in this study. The paper concludes with a Systems-Action-Management (SAM) approach based on text analysis of HRGC accident risk reports from the Federal Railroad Administration.
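A minimal sketch of the binary-classification side of such a study might look as follows, on entirely synthetic crossing-level data; the study's actual features, model choice and data differ, and a random forest is used here only as a plausible stand-in.

```python
# Illustrative only: binary accident-occurrence classification on synthetic
# crossing features, evaluated with ROC AUC as in the study's reporting style.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: daily train count, highway traffic volume, warning-device flag
X = np.column_stack([
    rng.poisson(20, n),
    rng.integers(100, 20000, n),
    rng.integers(0, 2, n),
])
# Synthetic label: accidents more likely at busy, unprotected crossings
risk = 0.00003 * X[:, 1] + 0.01 * X[:, 0] - 0.3 * X[:, 2]
y = (risk + rng.normal(0, 0.2, n) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"ROC AUC: {auc:.3f}")
```

The GIS component of the hybrid approach (spatially presenting predictions) would sit on top of such a model's per-crossing outputs.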


2021 ◽  
pp. 1-8
Author(s):  
Scott R. Campbell ◽  
Dean M. Resnick ◽  
Christine S. Cox ◽  
Lisa B. Mirel

Record linkage enables survey data to be integrated with other data sources, expanding the analytic potential of both. However, depending on the number of records being linked, the processing time can be prohibitive. This paper describes a case study using a supervised machine learning algorithm known as the Sequential Coverage Algorithm (SCA). The SCA was used to develop the join strategy for two data sources, the National Center for Health Statistics' (NCHS) 2016 National Hospital Care Survey (NHCS) and the Centers for Medicare & Medicaid Services (CMS) Enrollment Database (EDB), during record linkage. Due to the size of the CMS data, common record-joining methods (i.e., blocking) were used to reduce the number of pairs that need to be evaluated while still identifying the vast majority of matches. NCHS conducted a case study examining how the SCA improved the efficiency of blocking. This paper describes how the SCA was used to design the blocking used in this linkage.
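The blocking idea mentioned above can be shown in a few lines: only records that share a blocking key are ever compared, which shrinks the candidate-pair space dramatically. The field names and blocking key below are invented for the example; the NHCS/EDB linkage uses its own variables chosen by the SCA.

```python
# Minimal illustration of blocking for record linkage: candidate pairs are
# generated only within blocks sharing a key (here, birth year + first initial).
from collections import defaultdict
from itertools import product

survey = [
    {"id": "s1", "name": "Ann Smith", "birth_year": 1950},
    {"id": "s2", "name": "Bob Jones", "birth_year": 1962},
]
claims = [
    {"id": "c1", "name": "Ann Smyth", "birth_year": 1950},
    {"id": "c2", "name": "Bill Jonas", "birth_year": 1962},
    {"id": "c3", "name": "Ann Smith", "birth_year": 1981},
]

def block_key(rec):
    return (rec["birth_year"], rec["name"][0])

blocks = defaultdict(lambda: ([], []))
for rec in survey:
    blocks[block_key(rec)][0].append(rec)
for rec in claims:
    blocks[block_key(rec)][1].append(rec)

pairs = [(a["id"], b["id"])
         for left, right in blocks.values()
         for a, b in product(left, right)]
print(pairs)  # only 2 of the 6 possible cross-source pairs survive blocking
```

A good blocking scheme keeps nearly all true matches (here, "Ann Smith"/"Ann Smyth" still pair up) while discarding most non-matches without ever scoring them.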


2015 ◽  
Vol 27 (6) ◽  
pp. 515-528 ◽  
Author(s):  
Ivana Šemanjski

Travel time forecasting is an interesting topic for many ITS services. The increased availability of data collection sensors increases the availability of predictor variables, but also raises substantial processing issues related to this big data availability. In this paper we analyse the potential of big data and supervised machine learning techniques for effectively forecasting travel times. For this purpose we used fused data from three data sources (Global Positioning System vehicle tracks, road network infrastructure data and meteorological data) and four machine learning techniques (k-nearest neighbours, support vector machines, boosting trees and random forest). To evaluate the forecasting results, we compared them across different road classes in terms of absolute values, measured in minutes, and the mean squared percentage error. For road classes with high average speeds and long road segments, the machine learning techniques forecasted travel times with small relative error, while for road classes with low average speeds and short segment lengths this was a more demanding task. All three data sources proved to have a high impact on travel time forecast accuracy, and the best results (taking into account all road classes) were achieved with the k-nearest neighbours and random forest techniques.
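A hedged sketch of the comparison described above, on synthetic fused features standing in for the three data sources; the feature construction, error magnitudes and model settings here are assumptions, not the paper's.

```python
# Comparing k-nearest neighbours and random forest regressors for travel time
# forecasting, scored with the mean squared percentage error (MSPE).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
segment_km = rng.uniform(1.0, 20.0, n)    # road network infrastructure data
free_speed = rng.uniform(30.0, 120.0, n)  # GPS-track-derived average speed
rain = rng.integers(0, 2, n)              # meteorological data (rain flag)
base_min = 60.0 * segment_km / (free_speed * (1.0 - 0.2 * rain))
travel_min = base_min * (1.0 + rng.normal(0.0, 0.05, n))  # observed travel times

X = np.column_stack([segment_km, free_speed, rain])
X_tr, X_te, y_tr, y_te = train_test_split(X, travel_min, test_size=0.3, random_state=1)

results = {}
for name, model in [("kNN", KNeighborsRegressor(n_neighbors=5)),
                    ("RF", RandomForestRegressor(n_estimators=100, random_state=1))]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = float(np.mean(((y_te - pred) / y_te) ** 2) * 100)
    print(f"{name}: MSPE = {results[name]:.1f}%")
```

The relative (percentage) error metric is what makes short, slow segments the harder case: the same absolute error in minutes is a much larger fraction of a short trip.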


2021 ◽  
Author(s):  
Renan M Costa ◽  
Vijay A Dharmaraj ◽  
Ryota Homma ◽  
Curtis L Neveu ◽  
William B Kristan ◽  
...  

A major limitation of large-scale neuronal recordings is the difficulty of locating the same neuron in different subjects, referred to as the "correspondence" issue. This issue stems, at least in part, from the lack of a unique feature that unequivocally identifies each neuron. One promising approach to this problem is the functional neurocartography framework developed by Frady et al. (2016), in which neurons are identified by a semi-supervised machine learning algorithm using a combination of multiple selected features. Here, the framework was adapted to the buccal ganglia of Aplysia. Multiple features were derived from neuronal activity during motor pattern generation, responses to peripheral nerve stimulation, and the spatial properties of each cell. The feature set was optimized based on its potential usefulness in discriminating neurons from each other, and then used to match putatively homologous neurons across subjects with the functional neurocartography software. A matching method was developed based on a cyclic matching algorithm that allows for unsupervised extraction of groups of neurons, thereby enhancing the scalability of the analysis. Cyclic matching was also used to automate the selection of high-quality matches, which allowed for unsupervised implementation of the machine learning algorithm. This study paves the way for investigating the roles of both well-characterized and previously uncharacterized neurons in Aplysia, and helps to adapt this framework to other systems.
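To give a flavour of the cross-subject matching problem, the sketch below pairs neurons from two subjects by minimizing total distance in a multi-feature space. Note this uses plain one-to-one (Hungarian) assignment as a simplified stand-in; the study's cyclic matching algorithm is more elaborate, and the feature values are invented.

```python
# Simplified stand-in for cross-subject neuron matching: pair neurons from two
# subjects by minimizing total feature distance with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Rows = neurons, columns = features (e.g. burst rate, stimulation response, x, y)
subject_a = np.array([[0.90, 0.10, 10.0, 5.0],
                      [0.20, 0.80, 3.0, 12.0],
                      [0.50, 0.50, 7.0, 7.0]])
subject_b = np.array([[0.21, 0.79, 3.2, 11.8],   # resembles A's neuron 1
                      [0.88, 0.12, 9.9, 5.1],    # resembles A's neuron 0
                      [0.52, 0.48, 7.1, 6.9]])   # resembles A's neuron 2

cost = cdist(subject_a, subject_b)        # pairwise feature distances
rows, cols = linear_sum_assignment(cost)  # cost-minimizing one-to-one matching
matches = [(int(i), int(j)) for i, j in zip(rows, cols)]
print(matches)  # → [(0, 1), (1, 0), (2, 2)]
```

The real framework additionally weighs feature reliability and uses match quality to drive the semi-supervised learning loop, which a single assignment step cannot capture.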


2021 ◽  
pp. 1-18
Author(s):  
Aaron Erlich ◽  
Stefano G. Dantas ◽  
Benjamin E. Bagozzi ◽  
Daniel Berliner ◽  
Brian Palmer-Rubin

Political scientists increasingly use supervised machine learning to code multiple relevant labels from a single set of texts. The current “best practice” of individually applying supervised machine learning to each label ignores information on inter-label association(s), and is likely to under-perform as a result. We introduce multi-label prediction as a solution to this problem. After reviewing the multi-label prediction framework, we apply it to code multiple features of (i) access-to-information requests made to the Mexican government and (ii) country-year human rights reports. We find that multi-label prediction outperforms standard supervised learning approaches, even in instances where the correlations among one’s multiple labels are low.
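One common way to exploit inter-label association, shown below on synthetic data, is a classifier chain, where each label's classifier also sees the predictions for earlier labels. This is one standard multi-label technique, offered as an illustration; the paper's own framework and data are not reproduced here.

```python
# Sketch: multi-label prediction with a classifier chain, so that the second
# label's classifier can condition on the first label's prediction.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))            # stand-in for document features
# Two associated labels: label 1 can only be positive when label 0 is
y0 = (X[:, 0] + X[:, 1] > 0).astype(int)
y1 = ((X[:, 2] > 0) & (y0 == 1)).astype(int)
Y = np.column_stack([y0, y1])

chain = ClassifierChain(LogisticRegression(), order=[0, 1], random_state=2)
chain.fit(X[:150], Y[:150])
acc = float((chain.predict(X[150:]) == Y[150:]).mean())
print(f"fraction of test labels predicted correctly: {acc:.2f}")
```

Fitting one independent classifier per label would discard the y0-to-y1 dependency that the chain passes along explicitly.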


2018 ◽  
Vol 12 ◽  
pp. 85-98
Author(s):  
Bojan Kostadinov ◽  
Mile Jovanov ◽  
Emil Stankov

Data collection and machine learning are changing the world. Whether it is medicine, sports or education, companies and institutions are investing a lot of time and money in systems that gather, process and analyse data. Likewise, to improve competitiveness, many countries are making changes to their educational policy by supporting STEM disciplines. Therefore, it is important to put effort into using various data sources to help students succeed in STEM. In this paper, we present a platform that can analyse students' activity on various contest and e-learning systems, combine and process the data, and then present it in ways that are easy to understand. This, in turn, enables teachers and organizers to recognize talented and hardworking students, identify issues, and motivate students to practice and work on the areas where they are weaker.


2020 ◽  
Vol 14 (2) ◽  
pp. 140-159
Author(s):  
Anthony-Paul Cooper ◽  
Emmanuel Awuni Kolog ◽  
Erkki Sutinen

This article builds on previous research around the exploration of the content of church-related tweets. It does so by exploring whether the qualitative thematic coding of such tweets can, in part, be automated by the use of machine learning. It compares three supervised machine learning algorithms to understand how useful each algorithm is at a classification task, based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values which each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time when the high volume of social media data, in this case Twitter data, means that the resource intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.
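A hedged sketch of the classification-and-evaluation setup described above, using Naïve Bayes over bag-of-words features and the same Precision/Recall/F-measure style of reporting. The toy tweets and labels are invented for the example, not drawn from the study's dataset.

```python
# Naïve Bayes classification of short texts, evaluated with precision/recall/F1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

train_texts = [
    "sunday service was wonderful", "blessed by the sermon today",
    "join us for worship tonight", "great football match yesterday",
    "new phone arrived today", "traffic was terrible this morning",
]
train_labels = [1, 1, 1, 0, 0, 0]   # 1 = church-related, 0 = other

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

test_texts = ["the sermon tonight was wonderful", "terrible match this morning"]
true_labels = [1, 0]
pred = clf.predict(vec.transform(test_texts))
p, r, f, _ = precision_recall_fscore_support(true_labels, pred, average="binary")
print(f"Precision={p:.2f} Recall={r:.2f} F1={f:.2f}")
```

At scale, the same pipeline lets a small human-coded sample stand in for the manual coding of an entire tweet corpus, which is the resource saving the article points to.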

