A Sequence-Ready BAC Clone Contig of a 2.2-Mb Segment of Human Chromosome 1q24

Human chromosomal region 1q24 encodes two cloned disease genes and lies within large genetic inclusion intervals for several disease genes that have yet to be identified. We have constructed a single bacterial artificial chromosome (BAC) clone contig that spans over 2 Mb of 1q24 and consists of 78 clones connected by 100 STSs. The average density of mapped STSs is one of the highest described for a multimegabase region of the human genome. The contig was efficiently constructed by generating STSs from clone ends, followed by library walking. Distance information was added by determining the insert sizes of all clones, and expressed sequence tags (ESTs) and genes were incorporated to create a partial transcript map of the region, providing candidate genes for local disease loci. The gene order and content of the region provide insight into ancient duplication events that have occurred on proximal 1q. The stage is now set for further elucidation of this interesting region through large-scale sequencing. [The sequence data described in this paper have been submitted to GenBank under accession nos. G42259–G42312 and G42330–G42335.]
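
The marker density quoted in this abstract can be checked with simple arithmetic. The sketch below assumes the 2.2-Mb span given in the title (the abstract itself says only "over 2 Mb") and assumes evenly spaced markers; the function name is illustrative and does not come from the paper.

```python
def average_spacing_kb(span_mb: float, n_markers: int) -> float:
    """Average distance between markers in kilobases, assuming even spacing."""
    return span_mb * 1000.0 / n_markers

# Figures from the abstract: 100 STSs and 78 clones; the 2.2-Mb span is from the title.
sts_spacing = average_spacing_kb(2.2, 100)
clones_per_mb = 78 / 2.2

print(f"about one STS every {sts_spacing:.0f} kb")
print(f"about {clones_per_mb:.1f} clones per Mb")
```

Under these assumptions the contig carries roughly one STS per 22 kb.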


Association Analysis and Meta-Analysis of Multi-allelic Variants for Large Scale Sequence Data

10.1101/197913 ◽

2017 ◽

Author(s):

Xiaowei Zhan ◽

Sai Chen ◽

Yu Jiang ◽

Mengzhen Liu ◽

William G. Iacono ◽

...

Keyword(s):

Large Scale ◽

Rare Variants ◽

Sequence Data ◽

Meta Analysis ◽

Joint Modeling ◽

Allelic Variants ◽

Association Analyses ◽

Link Type ◽

Gene Level ◽

The Impact

Motivation: There is great interest in understanding the impact of rare variants in human diseases using large sequence datasets. In deep sequence datasets of >10,000 samples, ∼10% of the variant sites are observed to be multi-allelic. Many of the multi-allelic variants have been shown to be functional and disease relevant. Proper analysis of multi-allelic variants is critical to the success of a sequencing study, but existing methods do not properly handle multi-allelic variants and can produce highly misleading association results. Results: We propose novel methods to encode multi-allelic sites, conduct single variant and gene-level association analyses, and perform meta-analysis for multi-allelic variants. We evaluated these methods through extensive simulations and the study of a large meta-analysis of ∼18,000 samples on the cigarettes-per-day phenotype. We showed that our joint modeling approach provided an unbiased estimate of genetic effects, greatly improved the power of single variant association tests, and enhanced gene-level tests over existing approaches. Availability: Software packages implementing these methods are available at https://github.com/zhanxw/rvtests and http://genome.sph.umich.edu/wiki/RareMETAL.
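
The abstract does not spell out how multi-allelic sites are encoded. As a rough illustration only, the sketch below shows one generic representation: a separate dosage (0, 1, or 2) for each alternate allele of a diploid genotype, so that all alternate alleles at a site can enter a model jointly. This is an assumption made for illustration, not necessarily the authors' encoding or the behaviour of their software; the function name is hypothetical. Allele indices follow the VCF convention (0 for the reference, 1 and up for alternates).

```python
from typing import List, Tuple

def allele_dosages(genotype: Tuple[int, int], n_alt: int) -> List[int]:
    """Return one dosage (0, 1, or 2) per alternate allele for a diploid genotype.

    Alleles are numbered as in VCF: 0 is the reference, 1..n_alt are alternates.
    """
    dosages = [0] * n_alt
    for allele in genotype:
        if allele < 0 or allele > n_alt:
            raise ValueError(f"allele index {allele} out of range")
        if allele > 0:
            dosages[allele - 1] += 1
    return dosages

# A site with two alternate alleles; sample genotypes 0/1, 1/2, 2/2, 0/0.
samples = [(0, 1), (1, 2), (2, 2), (0, 0)]
matrix = [allele_dosages(g, n_alt=2) for g in samples]
print(matrix)
```

Each row of `matrix` holds one sample and each column one alternate allele, which is the shape a joint regression on all alternate alleles would need.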


A novel high-accuracy genome assembly method utilizing a high-throughput workflow

10.1101/2020.11.26.400507 ◽

2020 ◽

Author(s):

Qingdong Zeng ◽

Wenjin Cao ◽

Liping Xing ◽

Guowei Qin ◽

Jianhui Wu ◽

...

Keyword(s):

High Throughput ◽

Genome Assembly ◽

Large Scale ◽

Sequence Data ◽

Physical Map ◽

Bac Library ◽

High Accuracy ◽

Biological Research ◽

Bac Clone ◽

Assembly Method

Across domains of biological research using genome sequence data, high-quality reference genome sequences are essential for characterizing genetic variation and understanding the genetic basis of phenotypes. However, the construction of genome assemblies for various species is often hampered by complexities of genome organization, especially repetitive and complex sequences, leading to mis-assembly and missing regions. Here, we describe a high-throughput gold standard genome assembly workflow using a large-scale bacterial artificial chromosome (BAC) library with a refined two-step pooling strategy and the Lamp assembler algorithm. This strategy minimizes the laborious processes of physical map construction and clone-by-clone sequencing, enabling inexpensive sequencing of several thousand BAC clones. By applying this strategy with a minimum tiling path BAC clone library for the short arm of chromosome 2D (2DS) of bread wheat, 98% of BAC sequences, covering 92.7% of the 2DS chromosome, were assembled correctly for this species with a highly complex and repetitive genome. We also identified 48 large mis-assemblies in the reference wheat genome assembly (IWGSC RefSeq v1.0) and corrected these large mis-assemblies in addition to filling 92.2% of the gaps in RefSeq v1.0. Our 2DS assembly represents a new benchmark for the assembly of complex genomes with both high accuracy and efficiency.


Molecular diagnoses in the congenital malformations caused by ciliopathies cohort of the 100,000 Genomes Project

Journal of Medical Genetics ◽

10.1136/jmedgenet-2021-108065 ◽

2021 ◽

pp. jmedgenet-2021-108065

Author(s):

Sunayna Best ◽

Jenny Lord ◽

Matthew Roche ◽

Christopher M Watson ◽

James A Poulter ◽

...

Keyword(s):

Congenital Malformations ◽

Large Scale ◽

Sequence Data ◽

Joubert Syndrome ◽

Molecular Diagnostic ◽

Disease Genes ◽

Inherited Disorders ◽

Human Phenotype ◽

Diagnostic Strategies ◽

Pathogenic Variants

Background: Primary ciliopathies represent a group of inherited disorders due to defects in the primary cilium, the ‘cell’s antenna’. The 100,000 Genomes Project was launched in 2012 by Genomics England (GEL), recruiting National Health Service (NHS) patients with eligible rare diseases and cancer. Sequence data were linked to Human Phenotype Ontology (HPO) terms entered by recruiting clinicians. Methods: Eighty-three prescreened probands were recruited to the 100,000 Genomes Project suspected to have congenital malformations caused by ciliopathies in the following disease categories: Bardet-Biedl syndrome (n=45), Joubert syndrome (n=14) and ‘Rare Multisystem Ciliopathy Disorders’ (n=24). We implemented a bespoke variant filtering and analysis strategy to improve molecular diagnostic rates for these participants. Results: We determined a research molecular diagnosis for n=43/83 (51.8%) probands. This is 19.3% higher than previously reported by GEL (n=27/83 (32.5%)). A high proportion of diagnoses are due to variants in non-ciliopathy disease genes (n=19/43, 44.2%) which may reflect difficulties in clinical recognition of ciliopathies. n=11/83 probands (13.3%) had at least one causative variant outside the tiers 1 and 2 variant prioritisation categories (GEL’s automated triaging procedure), which would not be reviewed in standard 100,000 Genomes Project diagnostic strategies. These include four structural variants and three predicted to cause non-canonical splicing defects. Two unrelated participants have biallelic likely pathogenic variants in LRRC45, a putative novel ciliopathy disease gene. Conclusion: These data illustrate the power of linking large-scale genome sequence to phenotype information. They demonstrate the value of research collaborations in order to maximise interpretation of genomic data.
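
The diagnostic-yield figures in this abstract can be reproduced from the stated counts. The short sketch below recomputes them; the variable and function names are illustrative. It also shows that the quoted 19.3% improvement is a difference in percentage points (51.8 minus 32.5), not a relative increase.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)

# Recruitment categories and counts as given in the Methods.
cohort = {
    "Bardet-Biedl syndrome": 45,
    "Joubert syndrome": 14,
    "Rare Multisystem Ciliopathy Disorders": 24,
}
total = sum(cohort.values())            # 83 probands

research_yield = pct(43, total)         # research molecular diagnoses
gel_yield = pct(27, total)              # diagnoses previously reported by GEL
uplift_points = pct(43 - 27, total)     # difference, in percentage points
non_ciliopathy = pct(19, 43)            # diagnoses in non-ciliopathy disease genes
outside_tiers = pct(11, total)          # causative variant outside tiers 1 and 2

print(total, research_yield, gel_yield, uplift_points, non_ciliopathy, outside_tiers)
```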


Comparative Sequence of Human and Mouse BAC Clones from the mnd2 Region of Chromosome 2p13

Genome Research ◽

10.1101/gr.9.1.53 ◽

1999 ◽

Vol 9 (1) ◽

pp. 53-61 ◽

Cited By ~ 9

Author(s):

Wonhee Jang ◽

Axin Hua ◽

Sandra V. Spilson ◽

Webb Miller ◽

Bruce A. Roe ◽

...

Keyword(s):

Genomic Dna ◽

Genomic Sequence ◽

Sequence Data ◽

Lysyl Oxidase ◽

Neuromuscular Disorder ◽

Bac Clone ◽

Link Type ◽

Sequence Elements ◽

Human And Mouse ◽

Mouse Genomic

The mnd2 mutation on mouse chromosome 6 produces a progressive neuromuscular disorder. To determine the gene content of the 400-kb mnd2 nonrecombinant region, we sequenced 108 kb of mouse genomic DNA and 92 kb of human genomic sequence from the corresponding region of chromosome 2p13.3. Three genes with the indicated sizes and intergenic distances were identified: D6Mm5e (⩾81 kb)–787 bp–DOK (2 kb)–845 bp–LOR2 (⩾6 kb). D6Mm5e is expressed in many tissues at very low abundance and the predicted 526-residue protein contains no known functional domains. DOK encodes the p62dok rasGAP binding protein involved in signal transduction. LOR2 encodes a novel lysyl oxidase-related protein of 757 amino acid residues. We describe a simple search protocol for identification of conserved internal exons in genomic sequence. Evolutionary conservation proved to be a useful criterion for distinguishing between authentic exons and artifactual products obtained by exon amplification, RT–PCR, and 5′ RACE. Conserved noncoding sequence elements longer than 80 bp with ⩾75% nucleotide sequence identity comprise ∼1% of the genomic sequence in this region. Comparative analysis of this human and mouse genomic DNA sequence was an efficient method for gene identification and is independent of developmental stage or quantitative level of gene expression. [The sequence data described in this paper have been submitted to the GenBank data library under the following accession numbers: AC003061, mouse BAC clone 245c12; AC003065, human BAC clone h173(E10); AF053368, mouse Lor2 cDNA; AF084363, 108-kb contig from mouse BAC 245c12; AF084364, mouse D6Mm5e cDNA.]
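
The thresholds quoted for conserved noncoding elements (longer than 80 bp, at least 75% nucleotide identity) can be expressed as a simple filter. The sketch below is a minimal illustration that assumes the two sequences are already aligned without gaps and are of equal length. The abstract does not say which alignment procedure the authors used, and the function names here are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length, ungapped aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)

def is_conserved_element(human: str, mouse: str,
                         min_len: int = 80, min_identity: float = 75.0) -> bool:
    """Apply the abstract's thresholds: longer than 80 bp and at least 75% identity."""
    return len(human) > min_len and percent_identity(human, mouse) >= min_identity

# Toy aligned pair: 100 bp with 20 mismatches, i.e. 80% identity.
human = "A" * 100
mouse = "A" * 80 + "C" * 20
print(percent_identity(human, mouse), is_conserved_element(human, mouse))
```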


Computer-Based Methods for the Mouse Full-Length cDNA Encyclopedia: Real-Time Sequence Clustering for Construction of a Nonredundant cDNA Library

Genome Research ◽

10.1101/gr.145701 ◽

2001 ◽

Vol 11 (2) ◽

pp. 281-289

Author(s):

Hideaki Konno ◽

Yoshifumi Fukunishi ◽

Kazuhiro Shibata ◽

Masayoshi Itoh ◽

Piero Carninci ◽

...

Keyword(s):

Cdna Library ◽

Large Scale ◽

Sequence Data ◽

Full Length ◽

Cdna Libraries ◽

Full Length Cdna ◽

Link Type ◽

Computer Based ◽

End Sequences ◽

Press Time

We developed computer-based methods for constructing a nonredundant mouse full-length cDNA library. Our cDNA library construction process comprises assessment of library quality, sequencing the 3′ ends of inserts and clustering, and completing a re-array to generate a nonredundant library from a redundant one. After the cDNA libraries are generated, we sequence the 5′ ends of the inserts to check the quality of the library; then we determine the sequencing priority of each library. Selected libraries undergo large-scale sequencing of the 3′ ends of the inserts and clustering of the tag sequences. After clustering, the nonredundant library is constructed from the original libraries, which have redundant clones. All libraries, plates, clones, sequences, and clusters are uniquely identified, and all information is saved in the database according to this identifier. At press time, our system has been in place for the past two years; we have clustered 939,725 3′ end sequences into 127,385 groups from 227 cDNA libraries/sublibraries (see http://genome.gse.riken.go.jp/). [The sequence data described in this paper have been submitted to the DDBJ data library under accession nos. AV00011–AV175734, AV204013–AV382295, and BB561685–BB609425.]
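
The re-array step described in this abstract, building a nonredundant library from clustered clones, can be pictured as choosing one representative clone per cluster. The sketch below is a toy illustration only: the clone IDs, the cluster assignments, and the "first clone seen" selection rule are all assumptions, since the abstract does not state how representatives are chosen. The final lines compute the average cluster size implied by the reported totals.

```python
from typing import Dict, List

def pick_representatives(clone_to_cluster: Dict[str, int]) -> List[str]:
    """Keep the first clone seen for each cluster, yielding a nonredundant set."""
    chosen: Dict[int, str] = {}
    for clone, cluster in clone_to_cluster.items():
        chosen.setdefault(cluster, clone)
    return list(chosen.values())

# Hypothetical clone IDs and cluster assignments, for illustration only.
assignments = {"cloneA": 1, "cloneB": 1, "cloneC": 2, "cloneD": 3, "cloneE": 2}
representatives = pick_representatives(assignments)
print(representatives)

# Redundancy implied by the abstract's totals: sequences divided by groups.
mean_cluster_size = 939_725 / 127_385
print(f"{mean_cluster_size:.2f} 3' end sequences per cluster on average")
```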


Dynamic Planning of Bicycle Stations in Dockless Public Bicycle-sharing System Using Gated Graph Neural Network

ACM Transactions on Intelligent Systems and Technology ◽

10.1145/3446342 ◽

2021 ◽

Vol 12 (2) ◽

pp. 1-22

Author(s):

Jianguo Chen ◽

Kenli Li ◽

Keqin Li ◽

Philip S. Yu ◽

Zeng Zeng

Keyword(s):

Large Scale ◽

Sequence Data ◽

Location Prediction ◽

City Management ◽

Dynamic Planning ◽

Graph Sequence ◽

Graph Modeling ◽

The Government ◽

Location Clustering ◽

Station Location

Benefiting from convenient cycling and flexible parking locations, the Dockless Public Bicycle-sharing (DL-PBS) network has become increasingly popular in many countries. However, redundant and low-utility stations waste public urban space and raise maintenance costs for DL-PBS vendors. In this article, we propose a Bicycle Station Dynamic Planning (BSDP) system to dynamically provide the optimal bicycle station layout for the DL-PBS network. The BSDP system contains four modules: bicycle drop-off location clustering, bicycle-station graph modeling, bicycle-station location prediction, and bicycle-station layout recommendation. In the bicycle drop-off location clustering module, candidate bicycle stations are clustered from each spatio-temporal subset of the large-scale cycling trajectory records. In the bicycle-station graph modeling module, a weighted digraph model is built based on the clustering results, and inferior stations with low station revenue and utility are filtered out. Then, graph models across time periods are combined to create a graph sequence model. In the bicycle-station location prediction module, a gated graph neural network (GGNN) model is trained on the graph sequence data and dynamically predicts bicycle stations in the next period. In the bicycle-station layout recommendation module, the predicted bicycle stations are fine-tuned according to the government urban management plan, which ensures that the recommended station layout is conducive to city management, vendor revenue, and user convenience. Experiments on actual DL-PBS networks verify the effectiveness, accuracy, and feasibility of the proposed BSDP system.
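
The graph modeling step can be pictured with a small sketch: trips between candidate stations become weighted directed edges, and stations that see too little traffic are dropped. Everything specific here is an assumption made for illustration. The abstract does not define station revenue or utility, so total in-plus-out trip count stands in for utility, and the station IDs, threshold, and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]

def build_digraph(trips: Iterable[Edge]) -> Dict[Edge, int]:
    """Weighted digraph: edge weight is the number of trips from origin to destination."""
    return dict(Counter(trips))

def filter_low_utility(graph: Dict[Edge, int], min_trips: int) -> Dict[Edge, int]:
    """Drop stations whose total in-plus-out trip count is below min_trips, with their edges."""
    usage: Counter = Counter()
    for (src, dst), weight in graph.items():
        usage[src] += weight
        usage[dst] += weight
    keep = {station for station, n in usage.items() if n >= min_trips}
    return {e: w for e, w in graph.items() if e[0] in keep and e[1] in keep}

# Hypothetical trips between three candidate stations in one time period.
trips = [("S1", "S2"), ("S1", "S2"), ("S2", "S1"), ("S3", "S1")]
graph = build_digraph(trips)
filtered = filter_low_utility(graph, min_trips=2)
print(filtered)
```

Repeating this for successive time periods would give the ordered list of graphs that the abstract calls a graph sequence.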


Proteomic profiling dataset of chemical perturbations in multiple biological backgrounds

Scientific Data ◽

10.1038/s41597-021-01008-4 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Deborah O. Dele-Oni ◽

Karen E. Christianson ◽

Shawn B. Egri ◽

Alvaro Sebastian Vaca Jacome ◽

Katherine C. DeRuff ◽

...

Keyword(s):

Large Scale ◽

Cell Model ◽

Cellular Responses ◽

Proteomic Profiling ◽

Reduced Representation ◽

Link Type ◽

Original Dataset ◽

Quality Control Metrics ◽

Biological Insight ◽

Chromatin Profiling

While gene expression profiling has traditionally been the method of choice for large-scale perturbational profiling studies, proteomics has emerged as an effective tool in this context for directly monitoring cellular responses to perturbations. We previously reported a pilot library containing 3400 profiles of multiple perturbations across diverse cellular backgrounds in the reduced-representation phosphoproteome (P100) and chromatin space (Global Chromatin Profiling, GCP). Here, we expand our original dataset to include profiles from a new set of cardiotoxic compounds and from astrocytes, an additional neural cell model, totaling 5300 proteomic signatures. We describe filtering criteria and quality control metrics used to assess and validate the technical quality and reproducibility of our data. To demonstrate the power of the library, we present two case studies where data is queried using the concept of “connectivity” to obtain biological insight. All data presented in this study have been deposited to the ProteomeXchange Consortium with identifiers PXD017458 (P100) and PXD017459 (GCP) and can be queried at https://clue.io/proteomics.


The Evolution of Life Modes in Stictidaceae, with Three Novel Taxa

Journal of Fungi ◽

10.3390/jof7020105 ◽

2021 ◽

Vol 7 (2) ◽

pp. 105

Author(s):

Vinodhini Thiyagaraja ◽

Robert Lücking ◽

Damien Ertz ◽

Samantha C. Karunarathna ◽

Dhanushka N. Wanasinghe ◽

...

Keyword(s):

New Species ◽

Large Scale ◽

Sequence Data ◽

Large Subunit ◽

Monte Carlo Sampling ◽

Small Subunit ◽

Sensu Stricto ◽

Internal Transcribed Spacers ◽

Character State ◽

Phenotypic Data

Ostropales sensu lato is a large group comprising both lichenized and non-lichenized fungi, with several lineages expressing optional lichenization where individuals of the same fungal species exhibit either saprotrophic or lichenized lifestyles depending on the substrate (bark or wood). Greatly variable phenotypic characteristics and large-scale phylogenies have led to frequent changes in the taxonomic circumscription of this order. Ostropales sensu lato is currently split into Graphidales, Gyalectales, Odontotrematales, Ostropales sensu stricto, and Thelenellales. Ostropales sensu stricto is now confined to the family Stictidaceae, which includes a large number of species that are poorly known, since they usually have small fruiting bodies that are rarely collected, and thus, their taxonomy remains partly unresolved. Here, we introduce a new genus Ostropomyces to accommodate a novel lineage related to Ostropa, which is composed of two new species, as well as a new species of Sphaeropezia, S. shangrilaensis. Maximum likelihood and Bayesian inference analyses of mitochondrial small subunit spacers (mtSSU), large subunit nuclear rDNA (LSU), and internal transcribed spacers (ITS) sequence data, together with phenotypic data documented by detailed morphological and anatomical analyses, support the taxonomic affinity of the new taxa in Stictidaceae. Ancestral character state analysis did not resolve the ancestral nutritional status of Stictidaceae with confidence using Bayes traits, but a saprotrophic ancestor was indicated as most likely in a Bayesian binary Markov Chain Monte Carlo sampling (MCMC) approach. Frequent switching in nutritional modes between lineages suggests that lifestyle transition played an important role in the evolution of this family.


Whole-genome sequence data suggests environmental adaptation of Ethiopian sheep populations

Genome Biology and Evolution ◽

10.1093/gbe/evab014 ◽

2021 ◽

Author(s):

Pamela Wiener ◽

Christelle Robert ◽

Abulgasim Ahbara ◽

Mazdak Salavati ◽

Ayele Abebe ◽

...

Keyword(s):

High Altitude ◽

Environmental Variables ◽

Large Scale ◽

Sequence Data ◽

Strong Association ◽

Environmental Adaptation ◽

Whole Genome Sequence ◽

Single Nucleotide Variants ◽

High Altitude Adaptation ◽

Altitude Adaptation

Great progress has been made over recent years in the identification of selection signatures in the genomes of livestock species. This work has primarily been carried out in commercial breeds for which the dominant selection pressures are associated with artificial selection. As agriculture and food security are likely to be strongly affected by climate change, a better understanding of environment-imposed selection on agricultural species is warranted. Ethiopia is an ideal setting to investigate environmental adaptation in livestock due to its wide variation in geo-climatic characteristics and the extensive genetic and phenotypic variation of its livestock. Here, we identified over three million single nucleotide variants across 12 Ethiopian sheep populations and applied landscape genomics approaches to investigate the association between these variants and environmental variables. Our results suggest that environmental adaptation for precipitation-related variables is stronger than that related to altitude or temperature, consistent with large-scale meta-analyses of selection pressure across species. The set of genes showing association with environmental variables was enriched for genes highly expressed in human blood and nerve tissues. There was also evidence of enrichment for genes associated with high-altitude adaptation, although no strong association was identified with hypoxia-inducible-factor (HIF) genes. One of the strongest altitude-related signals was for a collagen gene, consistent with previous studies of high-altitude adaptation. Several altitude-associated genes also showed evidence of adaptation with temperature, suggesting a relationship between responses to these environmental factors. These results provide a foundation to investigate further the effects of climatic variables on small ruminant populations.


Comparative physical mapping reveals features of microsynteny between Glycine max, Medicago truncatula, and Arabidopsis thaliana

Genome ◽

10.1139/g03-106 ◽

2004 ◽

Vol 47 (1) ◽

pp. 141-155 ◽

Cited By ~ 38

Author(s):

H H Yan ◽

J Mudge ◽

D-J Kim ◽

R C Shoemaker ◽

D R Cook ◽

...

Keyword(s):

Arabidopsis Thaliana ◽

Glycine Max ◽

Medicago Truncatula ◽

Physical Mapping ◽

Large Scale ◽

Sequence Similarity ◽

Artificial Chromosome ◽

Bac Contig ◽

Cross Hybridization ◽

Bac Contigs

To gain insight into genomic relationships between soybean (Glycine max) and Medicago truncatula, eight groups of bacterial artificial chromosome (BAC) contigs, together spanning 2.60 million base pairs (Mb) in G. max and 1.56 Mb in M. truncatula, were compared through high-resolution physical mapping combined with sequence and hybridization analysis of low-copy BAC ends. Cross-hybridization among G. max and M. truncatula contigs uncovered microsynteny in six of the contig groups and extensive microsynteny in three. Between G. max homoeologous (within genome duplicate) contigs, 85% of coding and 75% of noncoding sequences were conserved at the level of cross-hybridization. By contrast, only 29% of sequences were conserved between G. max and M. truncatula, and some kilobase-scale rearrangements were also observed. Detailed restriction maps were constructed for 11 contigs from the three highly microsyntenic groups, and these maps suggested that sequence order was highly conserved between G. max duplicates and generally conserved between G. max and M. truncatula. One instance of homoeologous BAC contigs in M. truncatula was also observed and examined in detail. A sequence similarity search against the Arabidopsis thaliana genome sequence identified up to three microsyntenic regions in A. thaliana for each of two of the legume BAC contig groups. Together, these results confirm previous predictions of one recent genome-wide duplication in G. max and suggest that M. truncatula also experienced ancient large-scale genome duplications. Key words: Glycine max, Medicago truncatula, Arabidopsis thaliana, conserved microsynteny, genome duplication.
