CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language

GigaScience ◽  
2019 ◽  
Vol 8 (7) ◽  
Author(s):  
Michael Kotliar ◽  
Andrey V Kartashov ◽  
Artem Barski

Abstract
Background: Massive growth in the amount of research data and computational analysis has led to increased use of pipeline managers in biomedical computational research. However, each of the >100 such managers uses its own way to describe pipelines, leading to difficulty porting workflows to different environments and therefore poor reproducibility of computational studies. For this reason, the Common Workflow Language (CWL) was recently introduced as a specification for platform-independent workflow description, and work began to transition existing pipelines and workflow managers to CWL.
Findings: Herein, we present CWL-Airflow, a package that adds support for CWL to the Apache Airflow pipeline manager. CWL-Airflow uses CWL version 1.0 specification and can run workflows on stand-alone MacOS/Linux servers, on clusters, or on a variety of cloud platforms. A sample CWL pipeline for processing of chromatin immunoprecipitation sequencing data is provided.
Conclusions: CWL-Airflow will provide users with the features of a fully fledged pipeline manager and the ability to execute CWL workflows anywhere Airflow can run, from a laptop to a cluster or cloud environment. CWL-Airflow is available under Apache License, version 2.0 (Apache-2.0), and can be downloaded from https://barski-lab.github.io/cwl-airflow, https://scicrunch.org/resolver/RRID:SCR_017196.

2018 ◽  
Author(s):  
Michael Kotliar ◽  
Andrey V. Kartashov ◽  
Artem Barski

Abstract
Background: Massive growth in the amount of research data and computational analysis has led to increased utilization of pipeline managers in biomedical computational research. However, each of more than 100 such managers uses its own way to describe pipelines, leading to difficulty porting workflows to different environments and therefore poor reproducibility of computational studies. For this reason, the Common Workflow Language (CWL) was recently introduced as a specification for platform-independent workflow description, and work began to transition existing pipelines and workflow managers to CWL.
Findings: Here, we present CWL-Airflow, an extension for the Apache Airflow pipeline manager supporting CWL. CWL-Airflow utilizes CWL v1.0 specification and can be used to run workflows on standalone MacOS/Linux servers, on clusters, or on a variety of cloud platforms. A sample CWL pipeline for processing of ChIP-Seq data is provided.
Conclusions: CWL-Airflow will provide users with the features of a fully-fledged pipeline manager and the ability to execute CWL workflows anywhere Airflow can run, from a laptop to a cluster or cloud environment.
Availability: CWL-Airflow is available under Apache license v.2 and can be downloaded from https://barski-lab.github.io/cwl-airflow, http://doi.org/10.5281/zenodo.2669582, RRID: SCR_017196.


2021 ◽  
Vol 17 (10) ◽  
pp. e1009423
Author(s):  
Maxwell W. Libbrecht ◽  
Rachel C. W. Chan ◽  
Michael M. Hoffman

Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These algorithms take as input epigenomic datasets, such as chromatin immunoprecipitation-sequencing (ChIP-seq) measurements of histone modifications or transcription factor binding. They partition the genome and assign a label to each segment such that positions with the same label exhibit similar patterns of input data. SAGA algorithms discover categories of activity such as promoters, enhancers, or parts of genes without prior knowledge of known genomic elements. In this sense, they generally act in an unsupervised fashion like clustering algorithms, but with the additional simultaneous function of segmenting the genome. Here, we review the common methodological framework that underlies these methods, review variants of and improvements upon this basic framework, and discuss the outlook for future work. This review is intended for those interested in applying SAGA methods and for computational researchers interested in improving upon them.


Genetics ◽  
2019 ◽  
Vol 212 (3) ◽  
pp. 729-742 ◽  
Author(s):  
Lena Annika Street ◽  
Ana Karina Morao ◽  
Lara Heermans Winterkorn ◽  
Chen-Yu Jiao ◽  
Sarah Elizabeth Albritton ◽  
...  

Condensins are evolutionarily conserved protein complexes that are required for chromosome segregation during cell division and genome organization during interphase. In Caenorhabditis elegans, a specialized condensin, which forms the core of the dosage compensation complex (DCC), binds to and represses X chromosome transcription. Here, we analyzed DCC localization and the effect of DCC depletion on histone modifications, transcription factor binding, and gene expression using chromatin immunoprecipitation sequencing and mRNA sequencing. Across the X, the DCC accumulates at accessible gene regulatory sites in active chromatin and not heterochromatin. The DCC is required for reducing the levels of activating histone modifications, including H3K4me3 and H3K27ac, but not repressive modification H3K9me3. In X-to-autosome fusion chromosomes, DCC spreading into the autosomal sequences locally reduces gene expression, thus establishing a direct link between DCC binding and repression. Together, our results indicate that DCC-mediated transcription repression is associated with a reduction in the activity of X chromosomal gene regulatory elements.


2020 ◽  
Vol 12 (11) ◽  
pp. 1953-1960
Author(s):  
Andrey A Yurchenko ◽  
Hans Recknagel ◽  
Kathryn R Elmer

Abstract: Squamate reptiles exhibit high variation in their phenotypic traits and geographical distributions and are therefore fascinating taxa for evolutionary and ecological research. However, genomic resources are very limited for this group of species, consequently inhibiting research efforts. To address this gap, we assembled a high-quality genome of the common lizard, Zootoca vivipara (Lacertidae), using a combination of high coverage Illumina (shotgun and mate-pair) and PacBio sequencing data, coupled with RNAseq data and genetic linkage map generation. The 1.46-Gb genome assembly has a scaffold N50 of 11.52 Mb with N50 contig size of 220.4 kb and only 2.96% gaps. A BUSCO analysis indicates that 97.7% of the single-copy Tetrapoda orthologs were recovered in the assembly. In total, 19,829 gene models were annotated to the genome using a combination of ab initio and homology-based methods. To improve the chromosome-level assembly, we generated a high-density linkage map from wild-caught families and developed a novel analytical pipeline to accommodate multiple paternity and unknown father genotypes. We successfully anchored and oriented almost 90% of the genome on 19 linkage groups. This annotated and oriented chromosome-level reference genome represents a valuable resource to facilitate evolutionary studies in squamate reptiles.
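
For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is conventionally computed: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the paper or its pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Illustrative helper (not from the paper): sort lengths from longest
    to shortest and return the length at which the running sum first
    reaches half of the total.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy example with made-up lengths, not data from the assembly.
    print(n50([2, 3, 4, 5, 6]))
```

The same function applies to either contigs or scaffolds; only the input list of lengths changes.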


Genes ◽  
2020 ◽  
Vol 11 (4) ◽  
pp. 397
Author(s):  
Dadong Deng ◽  
Xihong Tan ◽  
Kun Han ◽  
Ruimin Ren ◽  
Jianhua Cao ◽  
...  

The development of the placental fold, which increases the maternal–fetal interacting surface area, is of primary importance for the growth of the fetus throughout the whole pregnancy. However, the mechanisms involved remain to be fully elucidated. Increasing evidence has revealed that long non-coding RNAs (lncRNAs) are a new class of RNAs with regulatory functions and could be epigenetically regulated by histone modifications. In this study, 141 lncRNAs (including 73 up-regulated and 68 down-regulated lncRNAs) were identified to be differentially expressed in the placentas of pigs during the establishment and expanding stages of placental fold development. The differentially expressed lncRNAs and genes (DElncRNA-DEgene) co-expression network analysis revealed that these differentially expressed lncRNAs (DElncRNAs) were mainly enriched in pathways of cell adhesion, cytoskeleton organization, epithelial cell differentiation and angiogenesis, indicating that the DElncRNAs are related to the major events that occur during placental fold development. In addition, we integrated the RNA-seq (RNA sequencing) data with the ChIP-seq (chromatin immunoprecipitation sequencing) data of H3K4me3/H3K27ac produced from the placental samples of pigs from the two stages (gestational days 50 and 95). The analysis revealed that the changes in H3K4me3 and/or H3K27ac levels were significantly associated with the changes in the expression levels of 37 DElncRNAs. Furthermore, several H3K4me3/H3K27ac-lncRNAs were characterized to be significantly correlated with genes functionally related to placental development. Thus, this study provides new insights into understanding the mechanisms for the placental development of pigs.


2019 ◽  
Author(s):  
Aaron P. Ragsdale ◽  
Simon Gravel

Abstract: Linkage disequilibrium is used to infer evolutionary history and to identify regions under selection or associated with a given trait. In each case, we require accurate estimates of linkage disequilibrium from sequencing data. Unphased data presents a challenge because the co-occurrence of alleles at different loci is ambiguous. Commonly used estimators for the common statistics r² and D² exhibit large and variable upward biases that complicate interpretation and comparison across cohorts. Here, we show how to find unbiased estimators for a wide range of two-locus statistics, including D², for both single and multiple randomly mating populations. These provide accurate estimates over three orders of magnitude in LD. We also use these estimators to construct an estimator for r² that is less biased than commonly used estimators, but nevertheless argue for using rather than r² for population size estimates.
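
As background for the statistics named above, here is a minimal sketch of the standard textbook definitions of D, D² and r² computed from known (phased) population frequencies. It is not the unbiased estimator from the paper, which targets unphased sample data; the function name `ld_stats` and its parameters are illustrative assumptions.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D^2, r^2) from true haplotype and allele frequencies.

    Illustrative helper using the textbook definitions, assuming phased
    population frequencies rather than unphased sample counts:
      D   = p_AB - p_A * p_B
      r^2 = D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    """
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either locus is monomorphic")
    return d, d * d, d * d / denom


if __name__ == "__main__":
    # Alleles A and B always co-occur, each at frequency 0.5.
    print(ld_stats(0.5, 0.5, 0.5))
```

Substituting sample frequencies into these formulas gives the kind of plug-in estimate whose bias the abstract discusses.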

