GigaScience | ScienceGate

Improved chromosome-level genome assembly of the Glanville fritillary butterfly (Melitaea cinxia) integrating Pacific Biosciences long reads and a high-density linkage map

GigaScience ◽

10.1093/gigascience/giab097 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Olli-Pekka Smolander ◽

Daniel Blande ◽

Virpi Ahola ◽

Pasi Rastas ◽

Jaakko Tanskanen ◽

...

Keyword(s):

Linkage Map ◽

Metapopulation Dynamics ◽

Melitaea Cinxia ◽

Pacific Biosciences ◽

Final Assembly ◽

Long Reads ◽

Glanville Fritillary Butterfly ◽

Gene Models ◽

High Density Linkage Map ◽

Chromosome Level

Abstract Background The Glanville fritillary (Melitaea cinxia) butterfly is a model system for metapopulation dynamics research in fragmented landscapes. Here, we provide a chromosome-level assembly of the butterfly's genome produced from Pacific Biosciences sequencing of a pool of males, combined with a linkage map from population crosses. Results The final assembly size of 484 Mb is an increase of 94 Mb on the previously published genome. Estimation of the completeness of the genome with BUSCO indicates that the genome contains 92–94% of the BUSCO genes in complete and single copies. We predicted 14,810 genes using the MAKER pipeline and manually curated 1,232 of these gene models. Conclusions The genome and its annotated gene models are a valuable resource for future comparative genomics, molecular biology, transcriptome, and genetics studies on this species.

Citation needed? Wikipedia bibliometrics during the first wave of the COVID-19 pandemic

GigaScience ◽

10.1093/gigascience/giab095 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Omer Benjakob ◽

Rona Aviram ◽

Jonathan Aryeh Sobel

Keyword(s):

English Language ◽

Shared Knowledge ◽

Quality Of Information ◽

New Information ◽

Unique Model ◽

Key Topics ◽

Digital Knowledge ◽

Scientific Infrastructure ◽

Insight Into

Abstract Background With the COVID-19 pandemic’s outbreak, millions flocked to Wikipedia for updated information. Amid growing concerns regarding an “infodemic,” ensuring the quality of information is a crucial vector of public health. Investigating whether and how Wikipedia remained up to date and in line with science is key to formulating strategies to counter misinformation. Using citation analyses, we asked which sources informed Wikipedia’s COVID-19–related articles before and during the pandemic’s first wave (January–May 2020). Results We found that coronavirus-related articles referenced trusted media outlets and high-quality academic sources. Regarding academic sources, Wikipedia was found to be highly selective in terms of what science was cited. Moreover, despite a surge in COVID-19 preprints, Wikipedia had a clear preference for open-access studies published in respected journals and made little use of preprints. Building a timeline of English-language COVID-19 articles from 2001–2020 revealed a nuanced trade-off between quality and timeliness. It further showed how pre-existing articles on key topics related to the virus created a framework for integrating new knowledge. Supported by a rigid sourcing policy, this “scientific infrastructure” facilitated contextualization and regulated the influx of new information. Last, we constructed a network of DOI-Wikipedia articles, which showed the landscape of pandemic-related knowledge on Wikipedia and how academic citations create a web of shared knowledge supporting topics like COVID-19 drug development. Conclusions Understanding how scientific research interacts with the digital knowledge-sphere during the pandemic provides insight into how Wikipedia can facilitate access to science. It also reveals how, aided by what we term its “citizen encyclopedists,” it successfully fended off COVID-19 disinformation and how this unique model may be deployed in other contexts.

Inferring microbiota functions from taxonomic genes: a review

GigaScience ◽

10.1093/gigascience/giab090 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Christophe Djemiel ◽

Pierre-Alain Maron ◽

Sébastien Terrat ◽

Samuel Dequiedt ◽

Aurélien Cottin ◽

...

Keyword(s):

High Throughput Sequencing ◽

Objective Evaluation ◽

Relevant Information ◽

Dna Metabarcoding ◽

Scientific Papers ◽

Operational Diagnosis ◽

Functional Inference ◽

The Individual ◽

Gene Information ◽

Reference Genomes

Abstract Deciphering microbiota functions is crucial to predict ecosystem sustainability in response to global change. High-throughput sequencing at the individual or community level has revolutionized our understanding of microbial ecology, leading to the big data era and improving our ability to link microbial diversity with microbial functions. Recent advances in bioinformatics have been key for developing functional prediction tools based on DNA metabarcoding data and using taxonomic gene information. This approach, cheaper in every respect, serves as an alternative to shotgun sequencing. Although these tools are increasingly used by ecologists, an objective evaluation of their modularity, portability, and robustness is lacking. Here, we reviewed 100 scientific papers on functional inference and ecological trait assignment to rank the advantages, specificities, and drawbacks of these tools, using scientific benchmarking. To date, inference tools have been mainly devoted to bacterial functions, and ecological trait assignment tools, to fungal functions. A major limitation is the lack of reference genomes, compared with the human microbiota, especially for complex ecosystems such as soils. Finally, we explore applied research prospects. These tools are promising and already provide relevant information on ecosystem functioning, but the standardized indicators and corresponding repositories that would enable them to be used for operational diagnosis are still lacking.

Chromosome-level genome assembly of the shuttles hoppfish, Periophthalmus modestus

GigaScience ◽

10.1093/gigascience/giab089 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Youngik Yang ◽

Ji Yong Yoo ◽

Sang Ho Baek ◽

Ha Yeun Song ◽

Seonmi Jo ◽

...

Keyword(s):

Genome Assembly ◽

Tandem Repeats ◽

Gene Annotation ◽

Gene Prediction ◽

Gene Families ◽

Nitrogen Excretion ◽

Contact Map ◽

Long Interspersed Nuclear Elements ◽

Simple Repeats

Abstract Background The shuttles hoppfish, Periophthalmus modestus, is a mudskipper, a member of the largest group of amphibious teleost fishes, which are uniquely adapted to live on mudflats. Because mudskippers can survive on land for extended periods by breathing through their skin and through the lining of the mouth and throat, they were evaluated as a model for the evolutionary sea-land transition of Devonian protoamphibians, ancestors of all present tetrapods. Results A total of 39.6, 80.2, 52.9, and 33.3 Gb of Illumina, Pacific Biosciences, 10X linked, and Hi-C data, respectively, was assembled into 1,419 scaffolds with an N50 length of 33 Mb and BUSCO score of 96.6%. The assembly covered 117% of the estimated genome size (729 Mb) and included 23 pseudo-chromosomes anchored by a Hi-C contact map, which corresponded to the top 23 longest scaffolds above 20 Mb and close to the estimated one. Of the genome, 43.8% consisted of various repetitive elements such as DNAs, tandem repeats, long interspersed nuclear elements, and simple repeats. Ab initio and homology-based gene prediction identified 30,505 genes, of which 94% had homology to the 14 Actinopterygii transcriptomes and 89% and 85% to Pfam families and InterPro domains, respectively. Comparative genomics with 15 Actinopterygii species identified 59,448 gene families, of which 12% were found only in P. modestus. Conclusions We present the first high-quality genome assembly and gene annotation of the shuttles hoppfish. It will provide a valuable resource for further studies on sea-land transition, bimodal respiration, nitrogen excretion, osmoregulation, thermoregulation, vision, and mechanoreception.
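The N50 statistic quoted in the abstract above can be illustrated with a minimal sketch. It uses the common definition (the length of the scaffold at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembly length). The function name `n50` and the toy scaffold lengths are assumptions for illustration only, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    summed length of all scaffolds.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in Mb (illustrative, not from the paper).
toy_scaffolds = [40, 30, 20, 5, 5]
print(n50(toy_scaffolds))
```

With these toy values the total is 100 Mb, and the running sum passes 50 Mb at the second-longest scaffold, so the N50 is 30 Mb.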

Halvade somatic: Somatic variant calling with Apache Spark

GigaScience ◽

10.1093/gigascience/giab094 ◽

2022 ◽

Vol 11 (1) ◽

Author(s):

Dries Decap ◽

Louise de Schaetzen van Brienen ◽

Maarten Larmuseau ◽

Pascal Costanza ◽

Charlotte Herzeel ◽

...

Keyword(s):

Best Practices ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Computing Time ◽

Variant Calling ◽

Apache Spark ◽

Normal Sample ◽

Whole Genome ◽

Sequencing Data ◽

Somatic Variant

Abstract Background The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample. Findings We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud. 
Conclusions To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.
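The runtime figures in the Halvade Somatic abstract can be checked with a few lines of arithmetic. The sketch below uses only the three runtimes reported above; the helper name `speedup` and the parallel-efficiency calculation are illustrative additions, not part of the published work.

```python
def speedup(baseline_hours, new_hours):
    """Ratio of the old runtime to the new runtime."""
    return baseline_hours / new_hours


# Runtimes in hours, as reported in the abstract.
ORIGINAL_PIPELINE = 84.5
HALVADE_ONE_NODE = 19.5
HALVADE_16_NODES = 1.36

single_node = speedup(ORIGINAL_PIPELINE, HALVADE_ONE_NODE)
multi_node = speedup(HALVADE_ONE_NODE, HALVADE_16_NODES)
overall = speedup(ORIGINAL_PIPELINE, HALVADE_16_NODES)

# Fraction of ideal linear scaling achieved when going from 1 to 16 nodes.
efficiency = multi_node / 16

print(f"single node vs original pipeline: {single_node:.2f}x")
print(f"16 nodes vs single node: {multi_node:.2f}x")
print(f"16 nodes vs original pipeline: {overall:.2f}x")
print(f"parallel efficiency on 16 nodes: {efficiency:.2f}")
```

From the rounded runtimes, the single-node speedup comes out at about 4.3 and the multi-node speedup at about 14.3, slightly below the 14.4 stated in the abstract; the difference is presumably due to rounding of the reported runtimes.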

Multi-dimensional leaf phenotypes reflect root system genotype in grafted grapevine over the growing season

GigaScience ◽

10.1093/gigascience/giab087 ◽

2021 ◽

Vol 10 (12) ◽

Cited By ~ 1

Author(s):

Zachary N Harris ◽

Mani Awale ◽

Niyati Bhakta ◽

Daniel H Chitwood ◽

Anne Fennell ◽

...

Keyword(s):

Root System ◽

Experimental System ◽

Growing Season ◽

Plant Biology ◽

Shoot System ◽

Distinct Individual ◽

Below Ground ◽

Ground Effects ◽

Horticultural Practice ◽

Broad Understanding

Abstract Background Modern biological approaches generate volumes of multi-dimensional data, offering unprecedented opportunities to address biological questions previously beyond reach owing to small or subtle effects. A fundamental question in plant biology is the extent to which below-ground activity in the root system influences above-ground phenotypes expressed in the shoot system. Grafting, an ancient horticultural practice that fuses the root system of one individual (the rootstock) with the shoot system of a second, genetically distinct individual (the scion), is a powerful experimental system to understand below-ground effects on above-ground phenotypes. Previous studies on grafted grapevines have detected rootstock influence on scion phenotypes including physiology and berry chemistry. However, the extent of the rootstock's influence on leaves, the photosynthetic engines of the vine, and how those effects change over the course of a growing season, are still largely unknown. Results Here, we investigate associations between rootstock genotype and shoot system phenotypes using 5 multi-dimensional leaf phenotyping modalities measured in a common grafted scion: ionomics, metabolomics, transcriptomics, morphometrics, and physiology. Rootstock influence is ubiquitous but subtle across modalities, with the strongest signature of rootstock observed in the leaf ionome. Moreover, we find that the extent of rootstock influence on scion phenotypes and patterns of phenomic covariation are highly dynamic across the season. Conclusions These findings substantially expand previously identified patterns to demonstrate that rootstock influence on scion phenotypes is complex and dynamic, and they underscore that a broad understanding requires volumes of multi-dimensional data that were not previously available.

A high-quality genome and comparison of short- versus long-read transcriptome of the palaearctic duck Aythya fuligula (tufted duck)

GigaScience ◽

10.1093/gigascience/giab081 ◽

2021 ◽

Vol 10 (12) ◽

Cited By ~ 1

Author(s):

Ralf C Mueller ◽

Patrik Ellström ◽

Kerstin Howe ◽

Marcela Uliano-Silva ◽

Richard I Kuo ◽

...

Keyword(s):

Avian Influenza ◽

Genome Assembly ◽

Host Response ◽

Influenza Viruses ◽

Avian Influenza Viruses ◽

High Quality ◽

Pathogenic Avian Influenza ◽

Tufted Duck ◽

Long Read ◽

High Quality Genome

Abstract Background The tufted duck is a non-model organism that experiences high mortality in highly pathogenic avian influenza outbreaks. It belongs to the same bird family (Anatidae) as the mallard, one of the best-studied natural hosts of low-pathogenic avian influenza viruses. Studies in non-model bird species are crucial to disentangle the role of the host response in avian influenza virus infection in the natural reservoir. Such endeavour requires a high-quality genome assembly and transcriptome. Findings This study presents the first high-quality, chromosome-level reference genome assembly of the tufted duck using the Vertebrate Genomes Project pipeline. We sequenced RNA (complementary DNA) from brain, ileum, lung, ovary, spleen, and testis using Illumina short-read and Pacific Biosciences long-read sequencing platforms, which were used for annotation. We found 34 autosomes plus Z and W sex chromosomes in the curated genome assembly, with 99.6% of the sequence assigned to chromosomes. Functional annotation revealed 14,099 protein-coding genes that generate 111,934 transcripts, which implies a mean of 7.9 isoforms per gene. We also identified 246 small RNA families. Conclusions This annotated genome contributes to continuing research into the host response in avian influenza virus infections in a natural reservoir. Our findings from a comparison between short-read and long-read reference transcriptomics contribute to a deeper understanding of these competing options. In this study, both technologies complemented each other. We expect this annotation to be a foundation for further comparative and evolutionary genomic studies, including many waterfowl relatives with differing susceptibilities to avian influenza viruses.

Linking big biomedical datasets to modular analysis with Portable Encapsulated Projects

GigaScience ◽

10.1093/gigascience/giab077 ◽

2021 ◽

Vol 10 (12) ◽

Cited By ~ 1

Author(s):

Nathan C Sheffield ◽

Michał Stolarczyk ◽

Vincent P Reuter ◽

André F Rendeiro

Keyword(s):

Biological Sample ◽

Biological Research ◽

Data Annotation ◽

Data Intensive ◽

Sample Data ◽

Accepted Standard ◽

Modular Analysis ◽

Computing Environments ◽

Definition Of ◽

Project Level

Abstract Background Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, metadata formats from a data provider are often incompatible with requirements of a processing tool. There is no broadly accepted standard to organize metadata across biological projects and bioinformatics tools, restricting the portability and reusability of both annotated datasets and analysis software. Results To address this, we present the Portable Encapsulated Project (PEP) specification, a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEPs include a schema validator framework, allowing formal definition of required metadata attributes for data analysis broadly. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata. Conclusions The PEP specification is an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Links to tools and documentation are available at http://pep.databio.org/.
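The idea of formally declaring required metadata attributes and checking samples against them can be sketched in a few lines. This is a toy required-attribute check in plain Python, not the PEP schema validator and not the API of the authors' Python or R packages; the function name `missing_attributes`, the attribute names, and the sample records are all hypothetical.

```python
def missing_attributes(samples, required):
    """Map each sample name to the required attributes it lacks or leaves empty."""
    problems = {}
    for sample in samples:
        missing = [key for key in required if not sample.get(key)]
        if missing:
            problems[sample.get("sample_name", "<unnamed>")] = missing
    return problems


# Hypothetical sample-level metadata, one dict per biological sample.
samples = [
    {"sample_name": "s1", "organism": "human", "read1": "s1_R1.fastq.gz"},
    {"sample_name": "s2", "organism": "mouse"},
]

# Hypothetical list of attributes that a processing tool declares as required.
required = ["sample_name", "organism", "read1"]

print(missing_attributes(samples, required))
```

Here the second sample is reported because it has no `read1` attribute, which is the kind of mismatch between data provider and processing tool that the abstract describes.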

Erratum to: An overview of the National COVID-19 Chest Imaging Database: data quality and cohort analysis

GigaScience ◽

10.1093/gigascience/giab083 ◽

2021 ◽

Vol 10 (12) ◽

Author(s):

Dominic Cushnan ◽

Oscar Bennett ◽

Rosalind Berka ◽

Ottavia Bertolli ◽

Ashwin Chopra ◽

...

Keyword(s):

Data Quality ◽

Cohort Analysis ◽

Chest Imaging

Interpretable network propagation with application to expanding the repertoire of human proteins that interact with SARS-CoV-2

GigaScience ◽

10.1093/gigascience/giab082 ◽

2021 ◽

Vol 10 (12) ◽

Cited By ~ 1

Author(s):

Jeffrey N Law ◽

Kyle Akers ◽

Nure Tasnina ◽

Catherine M Della Santina ◽

Shay Deutsch ◽

...

Keyword(s):

Broad Class ◽

Interaction Network ◽

Emerging Viruses ◽

Protein Protein Interaction ◽

Functional Relationships ◽

Manual Adjustment ◽

Network Propagation ◽

Human Proteins ◽

Adjustment Of Parameters ◽

Protein Protein Interaction Network

Abstract Background Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction. Results We design a network propagation framework with 2 novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents. Conclusions We examine how our provenance-tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly emerging viruses.
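Provenance tracing of the kind described above relies on the fact that many propagation schemes are linear in their input. The sketch below uses a generic random walk with restart as a stand-in; the abstract does not specify the authors' algorithm, so this is not their method. The graph, the seed nodes, the restart weight `alpha`, and the function name `propagate` are all assumptions for illustration. Because the update is linear, propagating from each seed separately yields per-seed contributions that sum to the score obtained from all seeds together.

```python
def propagate(adj, restart, alpha=0.5, iters=200):
    """Iterate s <- (1 - alpha) * restart + alpha * W s.

    W spreads the score of each node evenly over its neighbours
    (W[v][u] = 1 / degree(u) when u and v are adjacent).
    """
    score = dict(restart)
    for _ in range(iters):
        score = {
            v: (1 - alpha) * restart[v]
            + alpha * sum(score[u] / len(adj[u]) for u in adj[v])
            for v in adj
        }
    return score


# Hypothetical undirected path graph A - B - C - D - E.
adj = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
seeds = ["A", "E"]  # stand-ins for experimentally observed inputs

# Score from all seeds at once.
total = propagate(adj, {v: 1.0 if v in seeds else 0.0 for v in adj})

# Provenance: propagate from each seed alone to get its contribution.
contrib = {
    s: propagate(adj, {v: 1.0 if v == s else 0.0 for v in adj}) for s in seeds
}

for node in adj:
    parts = ", ".join(f"{s}: {contrib[s][node]:.3f}" for s in seeds)
    print(f"{node}: total {total[node]:.3f} ({parts})")
```

In this toy graph the score of node B is dominated by its direct neighbour A, which mirrors the abstract's observation that the largest contribution to a top-ranking prediction tends to come from a direct neighbour.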

GigaScience
Latest Publications

Published By Oxford University Press
