SVJAM: Joint Analysis of Structural Variants Using Linked Read Sequencing Data

2021 ◽  
Author(s):  
Mustafa Hakan Gunturkun ◽  
Flavia Villani ◽  
Vincenza Colonna ◽  
David Ashbrook ◽  
Robert W Williams ◽  
...  

Linked-read whole genome sequencing methods, such as 10x Chromium, attach a unique molecular barcode to each high molecular weight DNA molecule. The samples are then sequenced using short-read technology. During analysis, sequence reads sharing the same barcode are aligned to adjacent genomic locations. The pattern of barcode sharing between genomic regions allows the discovery of large structural variants (SVs) in the range of 1 Kb to a few Mb. Most SV calling methods for these data, such as LongRanger, analyze one sample at a time and often produce inconsistent results for the same genomic location across multiple samples. We developed a method, SVJAM, for joint calling of SVs, using data from 152 members of the BXD family of recombinant inbred strains of mice. Our method first collects candidate SV regions from single-sample analysis, such as those produced by LongRanger. We then retrieve barcode overlap data from all samples for each region. These data are organized as a high-dimensional matrix, whose dimension is then reduced using principal component analysis. Samples projected onto the two-dimensional space formed by the first two principal components form two or three clusters based on their genotype, representing the reference, alternative, or heterozygotic alleles. We developed a novel distance measure for hierarchical clustering and rotate the axes to find the optimal clustering results. We also developed an algorithm to decide whether the pattern of sample distribution is best fitted with one, two, or three genotypes. For each sample, we calculate its membership score for each genotype. We compared results produced by SVJAM with those of LongRanger and a few methods that rely on PacBio or Oxford Nanopore data. In a comparison of SVJAM calls with SVs detected using long-read sequencing data for the DBA/2J strain, we found that our results recovered many SVs missed by LongRanger. We also found that many SVs called by LongRanger were assigned an incorrect SV type. Our algorithm also consistently identified heterozygotic regions.
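
The projection-then-clustering idea in this abstract can be illustrated with a minimal sketch. It is not the SVJAM implementation: the toy matrix, the function names, and the use of single-linkage Euclidean clustering (a stand-in for the authors' novel distance measure, which the abstract does not specify) are all assumptions made for illustration. The axis rotation and the choice between one, two, or three genotypes are not covered.

```python
import math
import random


def center(rows):
    """Subtract each column's mean so PCA sees variation, not offsets."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_axis(rows, iters=200, seed=0):
    """Power iteration on X^T X: direction of largest variance."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in rows[0]]
    start = math.sqrt(dot(v, v))
    v = [x / start for x in v]
    for _ in range(iters):
        scores = [dot(row, v) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        if norm == 0:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v


def project_2d(rows):
    """Project samples onto the first two principal components."""
    x = center(rows)
    pc1 = leading_axis(x)
    s1 = [dot(row, pc1) for row in x]
    # Deflate: remove the PC1 direction, then find the next axis.
    rest = [[a - s * p for a, p in zip(row, pc1)] for row, s in zip(x, s1)]
    pc2 = leading_axis(rest, seed=1)
    s2 = [dot(row, pc2) for row in rest]
    return list(zip(s1, s2))


def agglomerate(points, k):
    """Single-linkage hierarchical clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical barcode-overlap counts: rows are samples, columns are bins.
toy = (
    [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]                  # reference-like
    + [[10, 9, 11, 10], [9, 10, 10, 11], [11, 10, 9, 10]]       # heterozygote-like
    + [[20, 21, 19, 20], [21, 20, 20, 19], [19, 20, 21, 20]]    # alternative-like
)
labels = agglomerate(project_2d(toy), k=3)
```

In this toy case the three groups separate almost entirely along the first principal component, so any reasonable linkage recovers them; real barcode-overlap matrices are far noisier.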

2021 ◽  
Author(s):  
Casia Nursyifa ◽  
Anna Bruniche-Olsen ◽  
Genis Garcia-Erill ◽  
Rasmus Heller ◽  
Anders Albrechtsen

Being able to assign sex to individuals and identify autosomal and sex-linked scaffolds is essential in most population genomic analyses. Non-model organisms often have genome assemblies at scaffold level and lack characterization of sex-linked scaffolds. Previous methods to identify sex and sex-linked scaffolds have relied on, e.g., sequence similarity between the non-model organism and a closely related species, or prior knowledge about the sex of the samples. In the latter case, the difference in depth of coverage between the autosomes and the sex chromosomes is used. Here we present "Sex Assignment Through Coverage" (SATC), a method to identify sample sex and sex-linked scaffolds from NGS data. The method only requires a scaffold-level reference assembly and sampling of both sexes with whole genome sequencing (WGS) data. We use the sequencing depth distribution across scaffolds to jointly identify: i) male and female individuals and ii) sex-linked scaffolds. This is achieved by projecting the scaffold depths into a low-dimensional space using principal component analysis (PCA), followed by Gaussian mixture clustering. We demonstrate the applicability of our method using data from five mammal species and a bird species complex. The method is open source and freely available at https://github.com/popgenDK/SATC
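
The Gaussian mixture step can be illustrated with a simplified sketch. This is not the SATC code: it skips the PCA projection and instead fits a two-component, one-dimensional Gaussian mixture by expectation-maximization to the normalized depth of a single hypothetical sex-linked scaffold. The depth values, function names, and variance floor are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(values, iters=100, min_var=1e-4):
    """EM for a two-component 1D Gaussian mixture.

    Component 0 starts at the smallest value, component 1 at the largest.
    Returns the component means and per-sample responsibilities.
    """
    lo, hi = min(values), max(values)
    means = [lo, hi]
    variances = [max(min_var, ((hi - lo) / 4) ** 2)] * 2
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], variances[k])
                 for k in range(2)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            variances[k] = max(
                min_var,
                sum(r[k] * (x - means[k]) ** 2
                    for r, x in zip(resp, values)) / nk,
            )
    return means, resp


# Hypothetical data: (depth on candidate scaffold, genome-wide mean depth).
raw = [(14.4, 30.0), (15.6, 30.0), (10.0, 20.0), (12.75, 25.0),
       (30.6, 30.0), (19.6, 20.0), (25.0, 25.0), (29.1, 30.0)]
normalized = [scaffold / genome for scaffold, genome in raw]

means, resp = fit_two_gaussians(normalized)
# Group 0 = roughly half depth (heterogametic sex), group 1 = full depth.
groups = [0 if r[0] > r[1] else 1 for r in resp]
```

Which group corresponds to males depends on the sex-determination system: the half-depth group is male in XY species such as mammals and female in ZW species such as birds.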


1984 ◽  
Vol 195 (1-2) ◽  
pp. 153-158 ◽  
Author(s):  
Andras Gal ◽  
Jean-Louis Nahon ◽  
Gérard Lucotte ◽  
José M. Sala-Trepat

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Carlos G. Urzúa-Traslaviña ◽  
Vincent C. Leeuwenburgh ◽  
Arkajyoti Bhattacharya ◽  
Stefan Loipfinger ◽  
Marcel A. T. M. van Vugt ◽  
...  

Abstract: The interpretation of high throughput sequencing data is limited by our incomplete functional understanding of coding and non-coding transcripts. Reliably predicting the function of such transcripts can overcome this limitation. Here we report the use of a consensus independent component analysis and guilt-by-association approach to predict over 23,000 functional groups comprised of over 55,000 coding and non-coding transcripts using publicly available transcriptomic profiles. We show that, compared to using Principal Component Analysis, Independent Component Analysis-derived transcriptional components enable more confident functionality predictions, improve predictions when new members are added to the gene sets, and are less affected by gene multi-functionality. Predictions generated using human or mouse transcriptomic data are made available for exploration in a publicly available web portal.


2018 ◽  
Vol 37 (10) ◽  
pp. 1233-1252 ◽  
Author(s):  
Jonathan Hoff ◽  
Alireza Ramezani ◽  
Soon-Jo Chung ◽  
Seth Hutchinson

In this article, we present methods to optimize the design and flight characteristics of a biologically inspired bat-like robot. In previous work, we designed the topological structure for the wing kinematics of this robot; here we present methods to optimize the geometry of this structure, and to compute actuator trajectories such that its wingbeat pattern closely matches biological counterparts. Our approach is motivated by recent studies on biological bat flight that have shown that the salient aspects of wing motion can be accurately represented in a low-dimensional space. Although bats have over 40 degrees of freedom (DoFs), our robot possesses several biologically meaningful morphing specializations. We use principal component analysis (PCA) to characterize the two most dominant modes of biological bat flight kinematics, and we optimize our robot’s parametric kinematics to mimic these. The method yields a robot that is reduced from five degrees of actuation (DoAs) to just three, and that actively folds its wings within a wingbeat period. As a result of mimicking synergies, the robot produces an average net lift improvement of 89% over the same robot when its wings cannot fold.


2018 ◽  
Author(s):  
Pier Francesco Palamara ◽  
Jonathan Terhorst ◽  
Yun S. Song ◽  
Alkes L. Price

Abstract: Interest in reconstructing demographic histories has motivated the development of methods to estimate locus-specific pairwise coalescence times from whole-genome sequence data. We developed a new method, ASMC, that can estimate coalescence times using only SNP array data, and is 2-4 orders of magnitude faster than previous methods when sequencing data are available. We were thus able to apply ASMC to 113,851 phased British samples from the UK Biobank, aiming to detect recent positive selection by identifying loci with unusually high density of very recent coalescence times. We detected 12 genome-wide significant signals, including 6 loci with previous evidence of positive selection and 6 novel loci, consistent with coalescent simulations showing that our approach is well-powered to detect recent positive selection. We also applied ASMC to sequencing data from 498 Dutch individuals (Genome of the Netherlands data set) to detect background selection at deeper time scales. We observed highly significant correlations between average coalescence time inferred by ASMC and other measures of background selection. We investigated whether this signal translated into an enrichment in disease and complex trait heritability by analyzing summary association statistics from 20 independent diseases and complex traits (average N=86k) using stratified LD score regression. Our background selection annotation based on average coalescence time was strongly enriched for heritability (p = 7×10^−153) in a joint analysis conditioned on a broad set of functional annotations (including other background selection annotations), meta-analyzed across traits; SNPs in the top 20% of our annotation were 3.8x enriched for heritability compared to the bottom 20%. These results underscore the widespread effects of background selection on disease and complex trait heritability.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jonas Meisner ◽  
Anders Albrechtsen ◽  
Kristian Hanghøj

Abstract. Background: Identification of selection signatures between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to the need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis for inference of the selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data. Materials and methods: We have extended two principal component analysis based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure. Results: Here, we present two selection statistics which we have implemented in the framework. These methods account for genotype uncertainty, opening the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high quality called genotypes. Conclusion: We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with those obtained from high quality genotype data. Moreover, we demonstrate that they outperform selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.


Author(s):  
Hiteshwari Sabrol ◽  
Satish Kumar

Plant disease recognition is one of the successful and important applications of image processing, able to provide accurate and useful information for the timely prediction and control of plant diseases. In this study, wavelet-based features are computed from RGB images of late blight infected and healthy plants. The extracted features are submitted to Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), and Independent Component Analysis (ICA) to reduce the dimensionality of the feature data before classification. Plant images are classified into two classes, i.e. late blight infected or healthy. The Euclidean distance measure is used to compute the distance between these two classes in the training and testing datasets for tomato late blight recognition and classification. Finally, the three component analysis methods are compared for late blight recognition accuracy. Kernel Principal Component Analysis (KPCA) yielded an overall recognition accuracy of 96.4%.
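
A Euclidean-distance classifier over reduced features can be sketched as follows. The abstract does not say exactly how the distance is turned into a decision, so this example assumes a nearest-centroid rule; the two-dimensional feature vectors stand in for wavelet features after dimensionality reduction and are invented for illustration.

```python
import math


def centroid(vectors):
    """Mean feature vector of one class."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(sample, centroids):
    """Return the label whose class centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


# Hypothetical reduced features (e.g. two components per image).
train = {
    "late_blight": [[0.9, 0.2], [0.8, 0.3], [1.0, 0.1]],
    "healthy": [[0.1, 0.8], [0.2, 0.9], [0.0, 0.7]],
}
centroids = {label: centroid(vectors) for label, vectors in train.items()}

prediction = classify([0.85, 0.25], centroids)
```

The same `classify` call works unchanged whichever of PCA, KPCA, or ICA produced the reduced features, which is what makes a side-by-side accuracy comparison straightforward.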


Antibiotics ◽  
2021 ◽  
Vol 10 (10) ◽  
pp. 1274
Author(s):  
Michelle Li ◽  
Kyle Wang ◽  
Ashley Tang ◽  
Aaron Tang ◽  
Andrew Chen ◽  
...  

Salmonella spp. and Escherichia coli (E. coli) are two of the deadliest foodborne pathogens in the US. Genes involved in antimicrobial resistance, virulence, and stress response enable these pathogens to increase their pathogenicity. This study aims to examine the genes detected in both outbreak and non-outbreak Salmonella spp. and E. coli by analyzing the data from the National Center for Biotechnology Information (NCBI) Pathogen Detection Isolates Browser database. A multivariate statistical analysis was conducted on the genes detected in isolates of outbreak Salmonella spp., non-outbreak Salmonella spp., outbreak E. coli, and non-outbreak E. coli. The genes from the data were projected onto a two-dimensional space through principal component analysis. Hierarchical clustering was then used to quantify the relationship between the genes in the dataset. Most of the outlier genes identified in E. coli isolates are virulence genes, while outlier genes identified in Salmonella spp. are mainly involved in stress response. Gene epeA, which encodes a high-molecular-weight serine protease autotransporter of Enterobacteriaceae (SPATE) protein, along with subA and subB that encode cytotoxic activity, may contribute to the pathogenesis of outbreak E. coli. The iro operon and ars operon may play a role in the ecological success of the epidemic clones of Salmonella spp. Concurrent relationships between esp and ter operons in E. coli and pco and sil operons in Salmonella spp. are found. Stress-response genes (asr, golT, golS), virulence gene (sinH), and antimicrobial resistance genes (mdsA and mdsB) in Salmonella spp. also show a concurrent relationship. All these findings provide helpful information for experiment design to combat outbreaks of E. coli and Salmonella spp.


2021 ◽  
Author(s):  
Andrei Slabodkin ◽  
Maria Chernigovskaya ◽  
Ivana Mikocziova ◽  
Rahmad Akbar ◽  
Lonneke Scheffer ◽  
...  

The process of recombination between variable (V), diversity (D), and joining (J) immunoglobulin (Ig) gene segments determines an individual's naive Ig repertoire, and consequently (auto)antigen recognition. VDJ recombination follows probabilistic rules that can be modeled statistically. So far, it remains unknown whether VDJ recombination rules differ between individuals. If these rules differed, identical (auto)antigen-specific Ig sequences would be generated with individual-specific probabilities, signifying that the available Ig sequence space is individual-specific. We devised a sensitivity-tested distance measure that enables inter-individual comparison of VDJ recombination models. We discovered, accounting for several sources of noise as well as allelic variation in Ig sequencing data, that not only unrelated individuals but also human monozygotic twins and even inbred mice possess statistically distinguishable immunoglobulin recombination models. This suggests that, in addition to genetic, there is also non-genetic modulation of VDJ recombination. We demonstrate that population-wide individualized VDJ recombination can result in orders of magnitude of difference in the probability to generate (auto)antigen-specific Ig sequences. Our findings have implications for immune receptor-based individualized medicine approaches relevant to vaccination, infection, and autoimmunity.


2020 ◽  
Author(s):  
Lei Zhang

Abstract. Objective: Using surface electromyography (sEMG) to recognize movement intention can enable control of an artificial hand or a robot, and can support rehabilitation training for hemiplegia or muscle weakness. However, sEMG signals are weak and susceptible to external interference, so current research focuses on identifying certain types of movements, and recognition accuracy drops greatly once the subjects are changed. This study proposes a classification method in which the subject can choose optional movements of the forearm. Methods: Two sEMG sensors were used, and a 9-axis attitude sensor was added to the wrist. Eight different subjects participated in the experiment, and each selected 5 movements. The sEMG sensors were attached to the extensor pollicis brevis and the extensor digitorum. The sEMG features were Standard Deviation (SD) and Power Spectrum Density (PSD); the attitude sensor features were angle and angular acceleration in three-dimensional space, and the integral of angular acceleration. The movements were classified using Linear Discriminant Analysis (LDA), K-Nearest Neighbor (KNN), Decision Tree (DT), and Ensemble (En) algorithms. The results of using the sEMG signals alone, using the attitude sensor signals alone, and combining the two were compared. Analysis of variance was conducted on the average accuracy. The feature dimension was reduced by Principal Component Analysis (PCA), and the results with and without PCA were compared. Results: Combining the two types of sensors improved recognition compared with using the sEMG sensors or the attitude sensor alone. KNN performed best, reaching 95.0%. The results using PCA were more stable. Conclusion: The method could be used across different subjects, and the user could select the movements autonomously. Significance: This method can improve the adaptability of sEMG-based movement intention recognition, and is significant for popularizing the use of sEMG to control manipulators or prostheses and for rehabilitation training.
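
As a rough illustration of the SD feature and the KNN classifier named in the Methods, the sketch below computes one standard-deviation feature per sEMG channel and classifies a query window by majority vote among its nearest neighbours. The signal windows, the movement names `grip` and `extend`, and the choice of k = 3 are hypothetical; the study's PSD and attitude-sensor features are omitted.

```python
import math
from collections import Counter
from statistics import pstdev


def features(channel_a, channel_b):
    """One standard-deviation feature per sEMG channel."""
    return (pstdev(channel_a), pstdev(channel_b))


def knn_predict(train, query, k=3):
    """Majority vote among the k training items closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical windows: "grip" is strong on channel A, "extend" on channel B.
train = [
    (features([0.9, -0.8, 1.0, -1.1], [0.1, -0.1, 0.1, -0.1]), "grip"),
    (features([1.1, -1.0, 0.9, -1.0], [0.2, -0.1, 0.1, -0.1]), "grip"),
    (features([0.8, -0.9, 1.0, -0.9], [0.1, -0.2, 0.1, -0.1]), "grip"),
    (features([0.1, -0.1, 0.2, -0.1], [1.0, -0.9, 0.9, -1.0]), "extend"),
    (features([0.1, -0.2, 0.1, -0.1], [0.9, -1.1, 1.0, -0.8]), "extend"),
    (features([0.2, -0.1, 0.1, -0.1], [1.0, -1.0, 0.8, -0.9]), "extend"),
]

query = features([1.0, -0.9, 0.8, -1.0], [0.2, -0.1, 0.1, -0.2])
predicted = knn_predict(train, query)
```

Because each user supplies their own labelled windows, the same procedure applies whichever movements a subject chooses, which is the flexibility the abstract emphasizes.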

