Kurator: Tools for Improving Fitness for Use of Biodiversity Data.

2018 ◽  
Vol 2 ◽  
pp. e26539 ◽  
Author(s):  
Paul J. Morris ◽  
James Hanken ◽  
David Lowery ◽  
Bertram Ludäscher ◽  
James Macklin ◽  
...  

As curators of biodiversity data in natural science collections, we are deeply concerned with data quality, but quality is an elusive concept. An effective way to think about data quality is in terms of fitness for use (Veiga 2016). To use data to manage physical collections, the data must be able to accurately answer questions such as what objects are in the collections, where they are, and where they are from. Some research aggregates data across collections, which involves the exchange of data using standard vocabularies. Some research uses require accurate georeferences, collecting dates, and current identifications. It is well understood that the costs of data capture and data quality improvement increase with increasing time from the original observation. These factors point towards two engineering principles for software that is intended to maintain or enhance data quality: build small modular data quality tests that can be easily assembled into suites to assess the fitness for use of data for some particular need, and produce tools that can be applied by users with a wide range of technical skill levels at different points in the data life cycle. In the Kurator project, we have produced code (e.g. Wieczorek et al. 2017, Morris 2016) that consists of small modules addressing particular data quality tests, which can be incorporated into data management processes as libraries. These modules can be combined into customizable data quality scripts, which can be run on single computers or on scalable architectures, and can be incorporated into other software, run as command line programs, or run as suites of canned workflows through a web interface. Kurator modules can be integrated into early-stage data capture applications, run to help prepare data for aggregation by matching it to standard vocabularies, run for quality control or quality assurance on data sets, and used to report on data quality in terms of a fitness-for-use framework (Veiga et al. 2017).
One of our goals is to provide simple tests that are usable by anyone, anywhere.

2017 ◽  
Author(s):  
Ross Mounce

In this thesis I attempt to gather together a wide range of cladistic analyses of fossil and extant taxa representing a diverse array of phylogenetic groups. I use these data to quantitatively compare the effect of fossil taxa relative to extant taxa in terms of support for relationships, number of most parsimonious trees (MPTs), and leaf stability. In line with previous studies, I find that the effects of fossil taxa are seldom different to those of extant taxa, although I highlight some interesting exceptions. I also use these data to compare the phylogenetic signal within vertebrate morphological data sets, comparing cranial data to postcranial data. Comparisons between molecular data and morphological data have been well explored previously, as have signals between different molecular loci, but comparative signal within morphological data sets is much less commonly characterized, and certainly not across a wide array of clades. With this analysis I show that there are many studies in which the evidence provided by cranial data appears to be significantly incongruent with the postcranial data, more than one would expect from the effect of chance and noise alone. I devise and implement a modification to a rarely used measure of homoplasy that will hopefully encourage its wider usage; previously it had an undesirable bias associated with the distribution of missing data in a dataset, but my modification controls for this. I also present an in-depth and extensive review of the ILD test, noting that it is often misused or reported poorly, even in recent studies. Finally, in attempting to collect data and metadata on a large scale, I uncovered inefficiencies in the research publication system that obstruct re-use of data and scientific progress. I highlight the importance of replication and reproducibility: even simple reanalysis of high-profile papers can turn up some very different results.
Data is highly valuable and thus it must be retained and made available for further re-use to maximize the overall return on research investment.


2007 ◽  
Vol 15 (4) ◽  
pp. 365-386 ◽  
Author(s):  
Yoshiko M. Herrera ◽  
Devesh Kapur

This paper examines the construction and use of data sets in political science. We focus on three interrelated questions: How might we assess data quality? What factors shape data quality? And how can these factors be addressed to improve data quality? We first outline some problems with existing data set quality, including issues of validity, coverage, and accuracy, and we discuss some ways of identifying problems as well as some consequences of data quality problems. The core of the paper addresses the second question by analyzing the incentives and capabilities facing four key actors in a data supply chain: respondents, data collection agencies (including state bureaucracies and private organizations), international organizations, and finally, academic scholars. We conclude by making some suggestions for improving the use and construction of data sets.

“It is a capital mistake, Watson, to theorise before you have all the evidence. It biases the judgment.” —Sherlock Holmes in “A Study in Scarlet”

“Statistics make officials, and officials make statistics.” —Chinese proverb


Micromachines ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 72 ◽  
Author(s):  
Da-Quan Yang ◽  
Bing Duan ◽  
Xiao Liu ◽  
Ai-Qiang Wang ◽  
Xiao-Gang Li ◽  
...  

The ability to detect nanoscale objects is particularly crucial for a wide range of applications, such as environmental protection, early-stage disease diagnosis, and drug discovery. Photonic crystal nanobeam cavity (PCNC) sensors have attracted great attention due to their high quality factors and small mode volumes (large Q/V) and good on-chip integrability with optical waveguides/circuits. In this review, we focus on nanoscale optical sensing based on PCNC sensors, including ultrahigh figure of merit (FOM) sensing, single-nanoparticle trapping, label-free molecule detection, and integrated sensor arrays for multiplexed sensing. We believe that PCNC sensors, featuring an ultracompact footprint, high monolithic integration capability, fast response, and ultrahigh sensitivity, will provide a promising platform for further developing lab-on-a-chip devices for biosensing and other functionalities.


Life ◽  
2021 ◽  
Vol 11 (3) ◽  
pp. 224
Author(s):  
Jaehyun Bae ◽  
Young Jun Won ◽  
Byung-Wan Lee

Diabetic kidney disease (DKD) is one of the most common forms of chronic kidney disease. Its pathogenic mechanism is complex, and it can affect all structures of the kidney. However, conventional approaches to early-stage DKD have focused on changes to the glomerulus. The current standard screening tools for DKD (albuminuria and estimated glomerular filtration rate) are insufficient to reflect early tubular injury. Therefore, many tubular biomarkers have been suggested. Non-albumin proteinuria (NAP) encompasses a wide range of tubular biomarkers and is convenient to measure. We review the clinical meaning of NAP and its significance as a marker for early-stage DKD.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Eleanor F. Miller ◽  
Andrea Manica

Abstract Background Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms, classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial: there are many complexities associated with data extraction, data quality and data handling. Results Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions There is now more genetic information available than ever before, and large meta-datasets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.
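The curation decisions the abstract alludes to can be illustrated with a toy filter over downloaded records. mtDNAcombine itself is an R package; the Python sketch below is not its API, and the record fields, thresholds, and function name are assumptions chosen only to show the kind of quality screening (minimum length, limited ambiguity) that multi-study compilation requires before demographic analysis.

```python
# Toy curation step for multi-study mtDNA records: keep only sequences
# long enough to align reliably across studies and with few enough
# ambiguous bases (N and other IUPAC codes) to be trusted.

def curate(records, min_length=500, max_ambiguous_frac=0.01):
    """Return the subset of records fit for downstream skyline analysis."""
    kept = []
    for rec in records:
        seq = rec["sequence"].upper()
        if len(seq) < min_length:
            continue  # too short to align reliably across studies
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if ambiguous / len(seq) > max_ambiguous_frac:
            continue  # too many ambiguity codes for confident analysis
        kept.append(rec)
    return kept

records = [
    {"accession": "X1", "sequence": "ACGT" * 200},  # 800 bp, clean
    {"accession": "X2", "sequence": "ACGT" * 50},   # 200 bp, too short
    {"accession": "X3", "sequence": "ACGN" * 200},  # 25% ambiguous bases
]
clean = curate(records)
```

In a real pipeline these thresholds would themselves be documented decisions, since they change which studies contribute data to the final compilation.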


2021 ◽  
Vol 10 (12) ◽  
pp. 2627
Author(s):  
Pierre-Edouard Fournier ◽  
Sophie Edouard ◽  
Nathalie Wurtz ◽  
Justine Raclot ◽  
Marion Bechet ◽  
...  

The Méditerranée Infection University Hospital Institute (IHU) is located in a recent building, which houses experts on a wide range of infectious diseases. The IHU strategy is to develop innovative tools, including epidemiological monitoring, point-of-care laboratories, and the ability to mass screen the population. In this study, we review the strategy and guidelines proposed by the IHU and their application to the COVID-19 pandemic, and summarise the various challenges this raises. Early diagnosis enables contagious patients to be isolated and treatment to be initiated at an early stage to reduce the microbial load and contagiousness. In the context of the COVID-19 pandemic, we had to deal with a shortage of personal protective equipment and reagents and a massive influx of patients. Between 27 January 2020 and 5 January 2021, 434,925 nasopharyngeal samples were tested for the presence of SARS-CoV-2. Of them, 12,055 patients with COVID-19 were followed up in our out-patient clinic, and 1888 patients were hospitalised in the Institute. By constantly adapting our strategy to the ongoing situation, the IHU has succeeded in expanding and upgrading its equipment and improving circuits and flows to better manage infected patients.


2021 ◽  
pp. 000276422110216
Author(s):  
Kazimierz M. Slomczynski ◽  
Irina Tomescu-Dubrow ◽  
Ilona Wysmulek

This article proposes a new approach to analyzing protest participation measured in surveys of uneven quality. Because single international survey projects cover only a fraction of the world’s nations in specific periods, researchers increasingly turn to ex-post harmonization of different survey data sets not a priori designed as comparable. However, very few scholars systematically examine the impact of survey data quality on substantive results. We argue that the variation in source data, especially deviations from standards of survey documentation, data processing, and computer files—proposed by methodologists of Total Survey Error, Survey Quality Monitoring, and Fitness for Intended Use—is important for analyzing protest behavior. In particular, we apply the Survey Data Recycling framework to investigate the extent to which indicators of attending demonstrations and signing petitions in 1,184 national survey projects are associated with measures of data quality, controlling for variability in the questionnaire items. We demonstrate that the null hypothesis of no impact of measures of survey quality on indicators of protest participation must be rejected. Measures of survey documentation, data processing, and computer records, taken together, explain over 5% of the intersurvey variance in the proportions of the populations attending demonstrations or signing petitions.


2020 ◽  
Vol 11 (1) ◽  
pp. 241
Author(s):  
Juliane Kuhl ◽  
Andreas Ding ◽  
Ngoc Tuan Ngo ◽  
Andres Braschkat ◽  
Jens Fiehler ◽  
...  

Personalized medical devices adapted to the anatomy of the individual promise greater treatment success for patients, thus increasing the individual value of the product. In order to cater to individual adaptations, however, medical device companies need to be able to handle a wide range of internal processes and components. These are here referred to collectively as the personalization workload. Consequently, support is required in order to evaluate how best to target product personalization. Since the approaches presented in the literature cannot sufficiently meet this demand, this paper introduces a new method that can be used to define an appropriate variety level for a product family, taking into account standardized, variant, and personalized attributes. The new method enables the identification and evaluation of personalizable attributes within an existing product family. The method is based on established steps and tools from the field of variant-oriented product design, and is applied using a flow diverter—an implant for the treatment of aneurysms—as an example product. The personalization relevance and adaptation workload for the product characteristics that constitute the differentiating product properties were analyzed and compared in order to determine a tradeoff between customer value and personalization workload. This will consequently help companies to employ targeted, deliberate personalization when designing their product families by enabling them to factor variety-induced complexity and customer value into their thinking at an early stage, thus allowing them to critically evaluate a personalization project.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Ze Peng ◽  
Yanhong He ◽  
Saroj Parajuli ◽  
Qian You ◽  
Weining Wang ◽  
...  

Abstract Downy mildew (DM), caused by obligate parasitic oomycetes, is a destructive disease for a wide range of crops worldwide. Recent outbreaks of impatiens downy mildew (IDM) in many countries have caused huge economic losses. A system to reveal plant–pathogen interactions in the early stage of infection and quickly assess resistance/susceptibility of plants to DM is desired. In this study, we established an early and rapid system to achieve these goals using impatiens as a model. Thirty-two cultivars of Impatiens walleriana and I. hawkeri were evaluated for their responses to IDM at the cotyledon, first/second pair of true leaves, and mature plant stages. All I. walleriana cultivars were highly susceptible to IDM. While all I. hawkeri cultivars were resistant to IDM from the first true leaf stage onward, many (14/16) were susceptible to IDM at the cotyledon stage. Two cultivars showed resistance even at the cotyledon stage. Histological characterization showed that the resistance mechanism of the I. hawkeri cultivars resembles that in grapevine and type II resistance in sunflower. By integrating full-length transcriptome sequencing (Iso-Seq) and RNA-Seq, we constructed the first reference transcriptome for Impatiens, comprising 48,758 sequences with an N50 length of 2060 bp. Comparative transcriptome and qRT-PCR analyses revealed strong candidate genes for IDM resistance, including three resistance genes orthologous to the sunflower gene RGC203, a potential candidate associated with DM resistance. Our approach of integrating early disease-resistance phenotyping, histological characterization, and transcriptome analysis lays a solid foundation for improving DM resistance in impatiens and may provide a model for other crops.
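The N50 statistic quoted for the reference transcriptome (2060 bp over 48,758 sequences) is the length L such that sequences of length at least L together contain at least half of the total assembly length. A minimal sketch, with toy contig lengths rather than the actual assembly:

```python
# N50: sort contig lengths longest-first and walk down until the
# accumulated length reaches half the total; the length at that point
# is the N50.

def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Toy example: total = 32, so N50 is reached once the running sum hits 16.
value = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

Note that N50 is a weighted median of contig lengths, not a mean, which is why it is preferred for summarizing assembly contiguity.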


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data of the cell cycle. MUREN is implemented as an R package. The code, under license GPL-3, is available on the GitHub platform: github.com/hippo-yf/MUREN and on the conda platform: anaconda.org/hippo-yf/r-muren. Conclusions MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
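The core idea of the pairwise step, estimating a scale factor from the stable "housekeeping-like" core of genes while ignoring the differentially expressed ones, can be illustrated with a trimmed estimator on log-ratios. This is a deliberately simplified stand-in for MUREN's robust least trimmed squares regression, not its actual implementation; the data and trimming fraction are invented for the sketch.

```python
import math

# Pairwise normalization sketch: genes with extreme sample/reference
# log-ratios (the differentially expressed ones) are trimmed away, and
# the scale factor is estimated from the remaining stable core.

def pairwise_scale(sample, reference, trim_frac=0.25):
    """Estimate the log2 scale factor of `sample` relative to `reference`."""
    ratios = sorted(
        math.log2(s / r)
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    k = int(len(ratios) * trim_frac)
    core = ratios[k:len(ratios) - k] if k > 0 else ratios
    return sum(core) / len(core)

def normalize(sample, reference, trim_frac=0.25):
    """Rescale `sample` so its stable core matches `reference`."""
    factor = 2 ** pairwise_scale(sample, reference, trim_frac)
    return [s / factor for s in sample]

# The sample is the reference doubled (a pure depth effect), plus one
# strongly induced gene that the trimming should ignore.
ref = [100, 200, 300, 400, 500, 600, 700, 800]
smp = [200, 400, 600, 800, 1000, 1200, 1400, 12800]
normed = normalize(smp, ref)
```

Because the induced gene is trimmed before the scale factor is estimated, the depth effect is removed while the genuine biological differentiation survives normalization, which mirrors the abstract's point about preserving asymmetric differentiation.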

