Biodiversity information retrieval across networked data sets

2010 ◽  
Vol 62 (4/5) ◽  
pp. 514-522 ◽  
Author(s):  
K.K.S. Sarinder ◽  
L.H.S. Lim ◽  
A.F. Merican ◽  
K. Dimyati
Author(s):  
José Augusto Salim ◽  
Antonio Saraiva

For biologists and biodiversity data managers who are unfamiliar with information science practices of data standardization, the use of complex software to assist in the creation of standardized datasets can be a barrier to sharing data. Since the ratification of the Darwin Core Standard (DwC) (Darwin Core Task Group 2009) by Biodiversity Information Standards (TDWG) in 2009, many datasets have been published and shared through a variety of data portals. In the early stages of biodiversity data sharing, the protocol Distributed Generic Information Retrieval (DiGIR), progenitor of DwC, and later the protocols BioCASe and the TDWG Access Protocol for Information Retrieval (TAPIR) (De Giovanni et al. 2010) were introduced for discovery, search and retrieval of distributed data, simplifying data exchange between information systems. Although these protocols are still in use, they are known to be inefficient for transferring large amounts of data (GBIF 2017). Because of that, in 2011 the Global Biodiversity Information Facility (GBIF) introduced the Darwin Core Archive (DwC-A), which allows more efficient data transfer and has become the preferred format for publishing data in the GBIF network. DwC-A is a structured collection of text files that uses DwC terms to produce a single, self-contained dataset. Many tools for assisting data sharing using DwC-A have been introduced, such as the Integrated Publishing Toolkit (IPT) (Robertson et al. 2014), the Darwin Core Archive Assistant (GBIF 2010) and the Darwin Core Archive Validator. Although these tools promote and facilitate data sharing, many users have difficulties using them, mainly because of the lack of training in information science in the biodiversity curriculum (Convention on Biological Diversity 2012, Enke et al. 2012). Most users, however, are very familiar with spreadsheets for storing and organizing their data, yet the adoption of the available solutions requires data transformation as well as training in information science and, more specifically, biodiversity informatics. For an example of how spreadsheets can simplify data sharing, see Stoev et al. (2016). In order to provide a more "familiar" approach to data sharing using DwC-A, we introduce a new tool as a Google Sheets Add-on. The Add-on, called the Darwin Core Archive Assistant Add-on, can be installed in the user's Google Account from the G Suite Marketplace and used in conjunction with the Google Sheets application. The Add-on assists the mapping of spreadsheet columns/fields to DwC terms (Fig. 1), similar to IPT, but with the advantage that it does not require the user to export the spreadsheet and import it into other software. Additionally, the Add-on facilitates the creation of a star schema in accordance with DwC-A, through the definition of a "CORE_ID" field (e.g. occurrenceID, eventID, taxonID) linking the sheets of a document (Fig. 2). The Add-on also provides an Ecological Metadata Language (EML) (Jones et al. 2019) editor (Fig. 3) with minimal fields to be filled in (i.e., the mandatory fields required by IPT), and helps users generate and share DwC-Archives stored in the user's Google Drive, which can be downloaded as a DwC-A or automatically uploaded to another public storage resource such as a user's Zenodo account (Fig. 4). 
We expect that the Google Sheets Add-on introduced here, in conjunction with IPT, will promote biodiversity data sharing in a standardized format, as it requires minimal training and simplifies the process of data sharing from the user's perspective, mainly for those users who are not familiar with IPT but have historically worked with spreadsheets. Although the DwC-A generated by the Add-on still needs to be published using IPT, the Add-on provides a simpler interface (i.e., a spreadsheet) than IPT for mapping data sets to DwC. Even though the IPT includes many more features than the Darwin Core Archive Assistant Add-on, we expect that the Add-on can be a "starting point" for users unfamiliar with biodiversity informatics before they move on to more advanced data publishing tools. On the other hand, the Zenodo integration allows users to share and cite their standardized data sets without publishing them via IPT, which can be useful for users without access to an IPT installation. Additionally, we are working on new features; future releases will include the automatic generation of globally unique identifiers for shared records, support for additional data standards and DwC extensions, and integration with the GBIF REST API and the IPT REST API.
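To make the star-schema idea concrete, the sketch below assembles a toy Darwin Core Archive in Python: a core occurrence table, a one-to-many extension table joined on the occurrenceID (the "CORE_ID"), and a meta.xml descriptor, zipped into a single file. The file names, columns and records are hypothetical and far simpler than what the Add-on or IPT produce; a real archive would also include an eml.xml metadata document.

```python
# Illustrative sketch: assembling a minimal Darwin Core Archive (DwC-A)
# as a zip of delimited text files plus a meta.xml descriptor.
# File names, columns, and example records are hypothetical.
import zipfile

DWC = "http://rs.tdwg.org/dwc/terms/"

core_rows = [
    # occurrenceID acts as the CORE_ID linking extension records to the core
    ["occ-001", "Aedes aegypti", "2019-05-04"],
    ["occ-002", "Culex quinquefasciatus", "2019-05-06"],
]
ext_rows = [
    ["occ-001", "2.9"],
    ["occ-001", "3.0"],
    ["occ-002", "3.1"],
]

meta_xml = f"""<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="{DWC}Occurrence" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="0" encoding="UTF-8">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="{DWC}scientificName"/>
    <field index="2" term="{DWC}eventDate"/>
  </core>
  <extension rowType="{DWC}MeasurementOrFact" fieldsTerminatedBy="\\t"
             linesTerminatedBy="\\n" ignoreHeaderLines="0" encoding="UTF-8">
    <files><location>measurement.txt</location></files>
    <coreid index="0"/>
    <field index="1" term="{DWC}measurementValue"/>
  </extension>
</archive>
"""

def to_tsv(rows):
    return "\n".join("\t".join(r) for r in rows) + "\n"

with zipfile.ZipFile("dwca.zip", "w") as z:
    z.writestr("occurrence.txt", to_tsv(core_rows))
    z.writestr("measurement.txt", to_tsv(ext_rows))
    z.writestr("meta.xml", meta_xml)
    # A real archive would also carry eml.xml with the dataset metadata.
```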


2020 ◽  
Vol 10 (7) ◽  
pp. 2539 ◽  
Author(s):  
Toan Nguyen Mau ◽  
Yasushi Inoguchi

It is challenging to build a real-time information retrieval system, especially for systems with high-dimensional big data. To structure big data, many hashing algorithms have been proposed that map similar data items to the same bucket in order to speed up search. Locality-Sensitive Hashing (LSH) is a common approach for reducing the dimensionality of a data set by using a family of hash functions and a hash table. The LSH hash table is an additional component that supports the indexing of hash values (keys) for the corresponding data items. We previously proposed the Dynamic Locality-Sensitive Hashing (DLSH) algorithm with a dynamically structured hash table, optimized for storage in main memory and in General-Purpose computation on Graphics Processing Units (GPGPU) memory. This supports the handling of constantly updated data sets, such as song, image, or text databases. The DLSH algorithm works effectively with data sets that are updated at high frequency and is compatible with parallel processing. However, a single GPGPU device is inadequate for processing big data, due to the small memory capacity of GPGPU devices. When using multiple GPGPU devices for searching, we need an effective search algorithm to balance the workload. In this paper, we propose an extension of DLSH to big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system. Different search strategies on multiple DLSH clusters are also proposed to adapt our parallelized system. With significant results in terms of performance and accuracy, we show that DLSH can be applied to real-life dynamic database systems.
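As background for readers less familiar with LSH, the following minimal Python sketch shows the basic single-table, single-machine scheme the paper builds on: random-hyperplane hash functions map similar vectors to the same bucket key, and a query is compared only against its bucket. The dimensions, bit count and data are illustrative; the DLSH-specific dynamic table and multi-GPGPU distribution are not reproduced here.

```python
# Minimal sketch of Locality-Sensitive Hashing with random hyperplanes:
# similar vectors tend to receive the same bit signature (bucket key),
# so a query only needs to scan its own bucket instead of the whole set.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_bits = 128, 16
hyperplanes = rng.normal(size=(n_bits, dim))   # the hash-function family

def lsh_key(vec):
    """Concatenate the signs of n_bits random projections into a bucket key."""
    bits = (hyperplanes @ vec) > 0
    return bits.tobytes()

# Build the hash table (key -> list of item ids).
data = rng.normal(size=(10_000, dim))
table = defaultdict(list)
for item_id, vec in enumerate(data):
    table[lsh_key(vec)].append(item_id)

# Query: only the candidates in the matching bucket are ranked exactly.
query = data[42] + 0.01 * rng.normal(size=dim)
candidates = table[lsh_key(query)]
best = max(candidates, key=lambda i: data[i] @ query /
           (np.linalg.norm(data[i]) * np.linalg.norm(query)))
print(best, len(candidates))
```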


Author(s):  
Sathiyamoorthi V. ◽  
Suresh P. ◽  
Jayapandian N. ◽  
Kanmani P. ◽  
Deva Priya M. ◽  
...  

With an increasing number of web users, the traffic these users generate places a tremendous load on the network, and requests take a long time to reach the web server. The main reason is the distance between the clients making requests and the servers responding to those requests. Using a content delivery network (CDN) is one strategy for minimizing latency, but it incurs additional cost. Alternatively, web caching and preloading are the most viable approaches to this issue. Therefore, a novel web caching strategy called the optimized popularity-aware modified least frequently used (PMLFU) policy is introduced for information retrieval, based on users' past access history and trend analysis. It enhances a proxy-driven web caching system by analyzing user access requests and caching the most popular web pages based on user preferences. Experimental results show that the proposed system can significantly reduce the delay users experience when accessing web pages. The performance of the proposed system is measured in real time using IRCache data sets.
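The exact PMLFU scoring is not spelled out in this summary, so the sketch below only illustrates the general idea of a popularity-aware LFU proxy cache: eviction is driven by access frequency, weighted down for pages that have not been requested recently. The class name, decay formula and parameters are assumptions made for the example, not the authors' policy.

```python
# Hedged sketch of an LFU-style proxy cache whose eviction score combines
# access frequency with a recency-decayed popularity weight. This is a
# generic illustration, not the PMLFU policy itself.
import math
import time

class PopularityAwareLFUCache:
    def __init__(self, capacity=1000, decay=0.001):
        self.capacity = capacity
        self.decay = decay            # how quickly old popularity fades
        self.store = {}               # url -> page content
        self.freq = {}                # url -> access count
        self.last_access = {}         # url -> timestamp of last request

    def _score(self, url, now):
        # Frequency, discounted for pages not requested recently.
        age = now - self.last_access[url]
        return self.freq[url] * math.exp(-self.decay * age)

    def get(self, url, fetch_from_origin):
        now = time.time()
        if url in self.store:                       # cache hit
            self.freq[url] += 1
            self.last_access[url] = now
            return self.store[url]
        page = fetch_from_origin(url)               # cache miss
        if len(self.store) >= self.capacity:        # evict the least "popular" page
            victim = min(self.store, key=lambda u: self._score(u, now))
            for d in (self.store, self.freq, self.last_access):
                d.pop(victim)
        self.store[url] = page
        self.freq[url] = 1
        self.last_access[url] = now
        return page
```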


2011 ◽  
pp. 240-256
Author(s):  
Jörg Ontrup

The chapter shows how modern information retrieval methodologies can open up new possibilities for supporting knowledge management in healthcare. Recent advances in hospital information systems have led to the acquisition of huge quantities of data, often characterized by a high proportion of free narrative text embedded in the electronic health record. We point out how text mining techniques, augmented by novel algorithms that combine artificial neural networks for the semantic organization of non-crisp data with hyperbolic geometry for intuitive navigation in huge data sets, can offer efficient tools that make the medical knowledge in such data collections more accessible to the medical expert by providing context information and links to knowledge buried in medical literature databases.
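As a hedged illustration of the hyperbolic-geometry ingredient, the sketch below computes geodesic distances in the Poincaré disk model and assigns a document point to its nearest map node, the kind of computation a hyperbolic semantic map relies on for navigation. It assumes document and node coordinates already lie inside the unit disk; producing them (e.g. with a hyperbolic self-organizing map) is beyond this sketch, and the node positions are made up for the example.

```python
# Geodesic distance in the Poincaré disk and nearest-node assignment.
# Document and node coordinates are assumed to lie in the open unit disk.
import numpy as np

def poincare_distance(u, v):
    """Hyperbolic distance between two points of the open unit disk."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

# Hypothetical map nodes (semantic prototypes) laid out in the disk.
nodes = np.array([[0.0, 0.0], [0.5, 0.1], [-0.3, 0.6], [0.2, -0.7]])

def nearest_node(doc_point):
    dists = [poincare_distance(doc_point, n) for n in nodes]
    return int(np.argmin(dists)), min(dists)

print(nearest_node(np.array([0.45, 0.05])))   # -> closest semantic region and its distance
```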


2014 ◽  
Vol 13 (1) ◽  
pp. 4074-4081
Author(s):  
Mamoun Suleiman Al Rababaa ◽  
Essam Said Hanandeh

Text categorization is one of the most important tasks in information retrieval and data mining. This paper investigates different variations of vector space models (VSMs) using the KNN algorithm. We used 242 Arabic abstract documents previously used by Hmeidi and Kanaan (1997). The bases of our comparison are the most popular text evaluation measures: recall, precision, and the F1 measure. The experimental results on the Saudi data sets reveal that the cosine coefficient outperformed the Dice and Jaccard coefficients.
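For readers unfamiliar with the compared coefficients, the sketch below gives the standard vector-space forms of the cosine, Dice and Jaccard similarities over term-frequency vectors, together with a simple k-nearest-neighbour vote. The toy vectors and labels are stand-ins; the Arabic corpus and preprocessing used in the paper are not reproduced.

```python
# The three similarity coefficients compared in VSM studies, plus a KNN vote.
import numpy as np
from collections import Counter

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dice(a, b):
    return 2 * (a @ b) / (np.sum(a ** 2) + np.sum(b ** 2))

def jaccard(a, b):
    return (a @ b) / (np.sum(a ** 2) + np.sum(b ** 2) - (a @ b))

def knn_predict(query, docs, labels, k=3, sim=cosine):
    """Rank training documents by similarity and vote among the top k."""
    ranked = sorted(range(len(docs)), key=lambda i: sim(query, docs[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy term-frequency vectors for documents of two categories.
docs = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 2, 0, 3], [0, 1, 1, 4]], float)
labels = ["economics", "economics", "sports", "sports"]
print(knn_predict(np.array([0.0, 1.0, 0.0, 2.0]), docs, labels, sim=jaccard))
```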


2018 ◽  
Vol 44 (3) ◽  
pp. 483-524 ◽  
Author(s):  
Martin Riedl ◽  
Chris Biemann

Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to the identification of meaningful and useful language units. This entails both identifying units composed of several single words that form a multiword expression (MWE), and splitting single-word compounds into their meaningful parts. In this article, we introduce unsupervised and knowledge-free methods for these two tasks. The main novelty of our research is that both methods rely primarily on distributional similarity, of which we use two flavors: a sparse count-based and a dense neural-based distributional semantic model. First, we introduce DRUID, a method for detecting MWEs. The evaluation on MWE-annotated data sets in two languages and on newly extracted evaluation data sets for 32 languages shows that DRUID compares favorably with previous methods that do not utilize distributional information. Second, we present SECOS, an algorithm for decompounding closed compounds. In an evaluation on four dedicated decompounding data sets across four languages and on data sets extracted from Wiktionary for 14 languages, we demonstrate the superiority of our approach over unsupervised baselines, sometimes even matching the performance of previous language-specific and supervised methods. In a final experiment, we show how both decompounding and MWE information can be used in information retrieval. Here, we obtain the best results when combining word information with MWEs and the compound parts in a bag-of-words retrieval set-up. Overall, our methodology paves the way to the automatic detection of lexical units beyond standard tokenization techniques, without language-specific preprocessing steps such as POS tagging.
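A minimal sketch of that final bag-of-words set-up may help: a document's token bag is augmented with detected multiword expressions and with the parts of split compounds before indexing. The tiny MWE list and compound-split table below are hypothetical stand-ins for what DRUID and SECOS would produce.

```python
# Hedged sketch: enriching a bag-of-words representation with MWE units
# and compound parts, as in the retrieval experiment described above.
from collections import Counter

mwe_lexicon = {("information", "retrieval")}                # assumed DRUID output
compound_splits = {"Blumenstrauss": ["Blumen", "Strauss"]}  # assumed SECOS output

def enriched_bag_of_words(tokens):
    bag = Counter(tokens)                                 # plain word level
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) in mwe_lexicon:     # add detected MWEs
            bag[tokens[i] + "_" + tokens[i + 1]] += 1
    for tok in tokens:                                    # add compound parts
        for part in compound_splits.get(tok, []):
            bag[part] += 1
    return bag

print(enriched_bag_of_words(["neural", "information", "retrieval", "Blumenstrauss"]))
```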

