Darwin Core Spatial Processor (DwCSP): a Fast Biodiversity Occurrences Curator

Primary biodiversity data, or occurrence data, are being produced at an increasing rate and are used in numerous studies (Hampton et al. 2013, La Salle et al. 2016). This data avalanche is a remarkable opportunity, but it comes with hurdles. First, few software solutions can handle very large datasets, and those that exist often require significant computer skills (Gaiji et al. 2013), while most biologists are not formally trained in bioinformatics (List et al. 2017). Second, large datasets are heterogeneous because they come from different producers, and they can contain erroneous data (Gaiji et al. 2013). Hence, they need to be curated. In this context, we developed a biodiversity occurrence curator designed to quickly handle large amounts of data through a simple interface: the Darwin Core Spatial Processor (DwCSP). DwCSP does not require the installation or use of third-party software and has a simple graphical user interface that requires no computer knowledge. DwCSP enriches biodiversity occurrence data and supports data quality control through outlier detection. For example, the software can enrich a tabular occurrence file (in Darwin Core format, for instance) with spatial data from polygon files (e.g., Esri shapefiles) or raster files (e.g., GeoTIFF). The speed of the enrichment procedures is ensured through multithreading and optimized spatial access methods (R-Tree indexes). DwCSP can also detect and tag outliers based on their geographic coordinates or environmental variables. The first type of outlier detection uses a computed distance between the occurrence and its nearest neighbors, whereas the second type uses a Mahalanobis distance (Mahalanobis 1936). One hundred thousand occurrences can be processed by DwCSP in less than 20 minutes, and another test on forty million occurrences was completed in a few days on a recent personal computer.
DwCSP has an English interface, including documentation, and will be available as a stand-alone Java Archive (JAR) executable that runs on any computer with a Java environment (version 1.8 or later).
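
As a rough illustration of the two outlier scores named in this abstract, the sketch below computes a mean distance to the k nearest neighbours and a Mahalanobis distance in plain Python. It is not DwCSP's code: the function names, the parameter `k`, the brute-force neighbour search (no R-Tree), the planar Euclidean distance, and the restriction to two environmental variables are all assumptions made to keep the example short.

```python
import math
from statistics import mean


def knn_mean_distance(points, k=3):
    """Mean planar distance from each point to its k nearest neighbours.

    Assumption: coordinates are treated as planar (x, y) pairs and the
    neighbour search is brute force; this is only a sketch of the idea.
    """
    scores = []
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        scores.append(mean(dists[:k]))
    return scores


def mahalanobis_2d(samples):
    """Mahalanobis distance of each two-variable sample from the sample mean.

    Assumption: exactly two variables per sample, so the inverse of the
    2x2 covariance matrix can be written out by hand.
    """
    n = len(samples)
    mx = mean(s[0] for s in samples)
    my = mean(s[1] for s in samples)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in samples:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T for a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances
```

In both cases a larger score marks a more isolated record. Where to place the cut-off for tagging a record as an outlier is a separate decision that this sketch leaves open.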

DBSCOUT: A Density-based Method for Scalable Outlier Detection in Very Large Datasets

2021 IEEE 37th International Conference on Data Engineering (ICDE) ◽

10.1109/icde51399.2021.00011 ◽

2021 ◽

Author(s):

Matteo Corain ◽

Paolo Garza ◽

Abolfazl Asudeh

Keyword(s):

Outlier Detection ◽

Large Datasets ◽

Very Large Datasets

Matters of geoportal interfaces designing

Geodesy and Cartography ◽

10.22389/0016-7126-2019-944-2-46-56 ◽

2019 ◽

Vol 944 (2) ◽

pp. 46-56

Author(s):

S.A. Yamashkin ◽

A.A. Yamashkin ◽

O.A. Zarubin

Keyword(s):

Spatial Data ◽

Interface Design ◽

Third Party ◽

Management Systems ◽

Data Management Systems ◽

Web Interfaces ◽

Graphical Interfaces ◽

Spatial Data Management ◽

Software Modules ◽

Cross Platform

The article presents a detailed analysis of the problem of designing graphical geoportal interfaces. The authors formulate the basic principles for solving problems in this field and give the rationale and a detailed description of each. Emphasis is placed on organizing interface design and development flexibly with future realities in mind, on the human-centricity of the interface design process, on the need for cross-platform adaptive web interfaces, and on a preference for using proprietary and third-party software modules over implementing spatial data management systems. Lists of basic functional and quality requirements for graphical interfaces of geoportals are given. The geoportal “Natural and cultural heritage of Mordovia” is presented as an illustrative example of various implementations of graphical web user interfaces. An experimental assessment of the effectiveness of measures to improve geoportal graphical interfaces is given. It is shown that properly thought-out interfaces of geoportal systems can contribute to solving various kinds of problems in many fields.

Unsupervised dimensionality reduction for very large datasets: Are we going to the right direction?

Knowledge-Based Systems ◽

10.1016/j.knosys.2020.105777 ◽

2020 ◽

Vol 196 ◽

pp. 105777

Author(s):

Jadson Jose Monteiro Oliveira ◽

Robson Leonardo Ferreira Cordeiro

Keyword(s):

Dimensionality Reduction ◽

Large Datasets ◽

Very Large Datasets ◽

The Right

Pairwise likelihood inference for spatial regressions estimated on very large datasets

Spatial Statistics ◽

10.1016/j.spasta.2013.10.001 ◽

2014 ◽

Vol 7 ◽

pp. 21-39 ◽

Cited By ~ 10

Author(s):

Giuseppe Arbia

Keyword(s):

Large Datasets ◽

Likelihood Inference ◽

Pairwise Likelihood ◽

Very Large Datasets

Scalable computation of streamlines on very large datasets

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09 ◽

10.1145/1654059.1654076 ◽

2009 ◽

Cited By ~ 40

Author(s):

Dave Pugmire ◽

Hank Childs ◽

Christoph Garth ◽

Sean Ahern ◽

Gunther H. Weber

Keyword(s):

Large Datasets ◽

Very Large Datasets ◽

Scalable Computation

Situating Politics: Spatial Heterogeneity and the Study of Political History

10.31235/osf.io/gy9fj ◽

2022 ◽

Author(s):

Adam Slez

Keyword(s):

Spatial Data ◽

Quantitative Methods ◽

Political Action ◽

Ethnic Composition ◽

Third Party ◽

Late Nineteenth Century ◽

Use Of Time ◽

Parameter Heterogeneity ◽

Global Regression ◽

Party Support

While quantitative methods are routinely used to examine historical materials, critics take issue with the use of global regression models that attach a single parameter to each predictor, thereby ignoring the effects of time and space, which together define the context in which historical events unfold. This problem can be addressed by allowing for parameter heterogeneity, as highlighted by the proliferation of work on the use of time-varying parameter models. In this paper, I show how this approach can be extended to the case of spatial data using spatially-varying coefficient models, with an eye toward the study of electoral politics, where the use of spatial data is especially common in historical settings. Toward this end, I revisit a critical case in the field of quantitative history: the rise of electoral Populism in the American West in the period between 1890 and 1896. Upending popular narratives about the correlates of third-party support in the late nineteenth century, I show that the association between third-party vote share and traditional predictors such as economic hardship and ethnic composition varied considerably from one place to the next, giving rise to distinct varieties of electoral Populism, a finding that is missed by global models, which mistake the mathematically particular for the historically general. These findings have important theoretical and empirical implications for the study of political action in a world where parameter heterogeneity is increasingly recognized as a standard feature of modern social science.
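
The contrast between a global model and a spatially varying one can be made concrete with a small sketch. The code below is not the model used in the paper: it swaps in a simple kernel-weighted local regression with one predictor, and the function names, the Gaussian kernel, and the `bandwidth` parameter are assumptions chosen for illustration only.

```python
import math


def weighted_slope(x, y, w):
    """Slope b of the weighted least-squares line y = a + b*x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return num / den


def global_slope(x, y):
    """Single slope for all observations (every weight equal to 1)."""
    return weighted_slope(x, y, [1.0] * len(x))


def local_slope(sites, x, y, target, bandwidth):
    """Slope estimated around `target`, down-weighting distant sites.

    Assumption: a Gaussian kernel on planar distance; `bandwidth` is a
    hypothetical tuning parameter, not a value taken from the paper.
    """
    weights = []
    for sx, sy in sites:
        d = math.hypot(sx - target[0], sy - target[1])
        weights.append(math.exp(-0.5 * (d / bandwidth) ** 2))
    return weighted_slope(x, y, weights)
```

With invented data in which the predictor is positively related to the outcome in one region and negatively related in another, `global_slope` can return a value near zero while `local_slope` recovers the opposite signs in each region, which is the kind of pattern a single global parameter would hide.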

Distributed processing of very large datasets with DataCutter

Parallel Computing ◽

10.1016/s0167-8191(01)00099-0 ◽

2001 ◽

Vol 27 (11) ◽

pp. 1457-1478 ◽

Cited By ~ 118

Author(s):

Michael D Beynon ◽

Tahsin Kurc ◽

Umit Catalyurek ◽

Chialin Chang ◽

Alan Sussman ◽

...

Keyword(s):

Distributed Processing ◽

Large Datasets ◽

Very Large Datasets

An implementation and performance analysis of spatial data access methods

[1989] Proceedings. Fifth International Conference on Data Engineering ◽

10.1109/icde.1989.47268 ◽

2003 ◽

Cited By ~ 45

Author(s):

D. Greene

Keyword(s):

Performance Analysis ◽

Spatial Data ◽

Data Access ◽

Access Methods ◽

And Performance

Multidimensional Scaling With Very Large Datasets

Journal of Computational and Graphical Statistics ◽

10.1080/10618600.2018.1470001 ◽

2018 ◽

Vol 27 (4) ◽

pp. 935-939 ◽

Cited By ~ 2

Author(s):

Emmanuel Paradis

Keyword(s):

Multidimensional Scaling ◽

Large Datasets ◽

Very Large Datasets

Object-Relational Spatial Indexing

Spatial Databases ◽

10.4018/978-1-59140-387-6.ch003 ◽

2011 ◽

pp. 49-80

Author(s):

Hans-Peter Kriegel ◽

Martin Pfeifle ◽

Marco Potke ◽

Thomas Seidl ◽

Jost Enderle

Keyword(s):

Spatial Data ◽

Concurrency Control ◽

Buffer Management ◽

Database Systems ◽

Data Types ◽

Spatial Access ◽

Access Methods ◽

Object Relational ◽

Relational Database Systems ◽

Spatial Access Methods

To generate efficient execution plans for queries comprising spatial data types and predicates, the database system has to be equipped with appropriate index structures, query processing methods, and optimization rules. Although available extensible indexing frameworks provide a gateway for seamless integration of spatial access methods into the standard process of query optimization and execution, they do not facilitate the actual implementation of the spatial access method. An internal enhancement of the database kernel is usually not an option for database developers. Embedding a custom, block-oriented index structure into concurrency control, recovery services, and buffer management would require extensive implementation effort and maintenance cost, at the risk of weakening the reliability of the entire system. Server stability can be preserved by delegating index operations to an external process, but this approach induces severe performance bottlenecks due to context switches and inter-process communication. Therefore, we present the paradigm of object-relational spatial access methods, which fits the common relational data model and is highly compatible with the extensible indexing frameworks of existing object-relational database systems, allowing the user to define application-specific access methods.
