Lep-Anchor: automated construction of linkage map anchored haploid genomes

2020
Vol 36 (8)
pp. 2359-2364
Author(s):
Pasi Rastas

Abstract

Motivation: Linkage mapping provides a practical way to anchor de novo genome assemblies into chromosomes and to detect chimeric or otherwise erroneous contigs. Such anchoring improves with a higher number of markers and individuals, as long as the mapping software can handle all the information. The recent software Lep-MAP3 can robustly construct linkage maps from millions of genotyped markers on thousands of individuals, providing optimal maps for genome anchoring. For such large datasets, an automated and robust genome-anchoring tool is especially valuable and can significantly reduce the intensive computational and manual work involved.

Results: Here, we present Lep-Anchor (LA), a software tool that anchors genome assemblies automatically using dense linkage maps. As its main novelty, it takes into account the uncertainty of linkage map positions caused by low-recombination regions, cross type or poor mapping-data quality. Furthermore, it can automatically detect and cut chimeric contigs, and it can use contig–contig, single-read or alternative genome assembly alignments as additional information on contig order and orientation and to collapse haplotype contigs. We demonstrate the performance of LA using real data and show that it outperforms ALLMAPS in anchoring completeness and speed. In accuracy, LA and ALLMAPS are about equal, but ALLMAPS achieves this at the expense of lower completeness. The software Chromonomer was faster than the other two methods but has major limitations and lower accuracy. We also show that with additional information, such as contig–contig and read alignments, anchoring completeness can be improved by up to 70% without significant loss of accuracy. Based on simulated data, we conclude that anchoring accuracy can be improved by utilizing information about map-position uncertainty. Here, accuracy is the fraction of contigs in the correct orientation and completeness is the number of contigs with an inferred orientation.

Availability and implementation: Lep-Anchor is available with the source code under the GNU General Public License from http://sourceforge.net/projects/lep-anchor. All the scripts and code used to produce the reported results are included with Lep-Anchor.
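
To make the anchoring idea concrete, here is a minimal sketch of inferring a contig's orientation from the linkage map positions of its markers, with map positions given as intervals to model the uncertainty the abstract mentions. The data structures and interval-voting rule are hypothetical illustrations, not Lep-Anchor's actual algorithm:

# Minimal sketch: orient a contig by comparing marker order within the
# contig to marker positions (in cM) on the linkage map. Map positions
# are intervals to reflect uncertainty in low-recombination regions.
# Hypothetical data structures, not Lep-Anchor's code.

def orient_contig(markers):
    """markers: list of (pos_in_contig_bp, (cm_lo, cm_hi)) tuples."""
    markers = sorted(markers)  # by physical position within the contig
    forward = backward = 0
    for i in range(len(markers) - 1):
        (_, (lo1, hi1)), (_, (lo2, hi2)) = markers[i], markers[i + 1]
        if lo2 > hi1:          # map intervals strictly increase -> forward
            forward += 1
        elif lo1 > hi2:        # strictly decrease -> reverse
            backward += 1
        # overlapping intervals are uninformative and cast no vote
    if forward > backward:
        return "+"
    if backward > forward:
        return "-"
    return "?"                 # orientation cannot be inferred

contig = [(12_000, (0.0, 0.0)), (48_000, (0.0, 1.2)), (90_000, (2.5, 2.5))]
print(orient_contig(contig))   # "+"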

2021
Vol 17 (11)
pp. e1009631
Author(s):
Raquel Linheiro
John Archer

With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of the underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. To demonstrate CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species: Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes, the latter two being high-quality, well-established de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs and representing two adult D. melanogaster whole-body samples, were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools. In a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
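
As a toy illustration of the graph complexity being classified, the sketch below builds a de Bruijn graph from a few reads and checks for nodes with more than one outgoing edge, the branching that creates ambiguous paths. It is a minimal sketch under simplified assumptions, not CStone's implementation:

from collections import defaultdict

# Toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers. A node
# with more than one outgoing edge creates ambiguous paths, the kind of
# structure CStone's contig labels summarize (sketch only).

def build_graph(reads, k):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def has_ambiguous_paths(graph):
    return any(len(succ) > 1 for succ in graph.values())

reads = ["ATGGCGT", "GGCGTGC", "GGCGTTA"]   # reads branch after GGCGT
g = build_graph(reads, k=4)
print(has_ambiguous_paths(g))               # True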


Author(s):
Aleksey V. Zimin
Steven L. Salzberg

Abstract The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8–15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to “polish” the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.
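
The "polishing by calling alternatives" idea can be sketched as follows: tally the bases that aligned short reads support at each draft position and call a well-supported alternative over the draft base. The pileup representation and thresholds below are hypothetical simplifications, not POLCA's actual alignment and variant-calling pipeline:

from collections import Counter

# Sketch of consensus polishing: at each draft position, tally the
# bases placed there by aligned short reads and replace the draft base
# when a well-supported alternative dominates. Illustrative only.

def polish(draft, pileup, min_depth=5, min_frac=0.8):
    """pileup: list (one entry per draft position) of observed bases."""
    out = []
    for base, column in zip(draft, pileup):
        counts = Counter(column)
        if counts:
            alt, n = counts.most_common(1)[0]
            if len(column) >= min_depth and n / len(column) >= min_frac:
                base = alt          # strong read support: call the alternative
        out.append(base)
    return "".join(out)

draft = "ACGTA"
pileup = [["A"] * 6, ["C"] * 6, ["T"] * 6, ["T"] * 6, ["A"] * 6]
print(polish(draft, pileup))        # ACTTA: position 3 corrected G -> T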


2020
Vol 21 (S21)
Author(s):
Zicheng Zhao
Yingxiao Zhou
Shuai Wang
Xiuqing Zhang
Changfa Wang
...

Abstract

Background: Genome assembly is fundamental for de novo genome analysis. Hybrid assembly, which utilizes various sequencing technologies, increases both contiguity and accuracy. But such approaches require extra, costly sequencing effort, while the information provided by the millions of existing whole-genome sequencing datasets has not been fully utilized for the task of scaffolding. Genetic recombination patterns in population data indicate non-random association among alleles at different loci and can therefore provide physical-distance signals to guide scaffolding.

Results: In this paper, we propose LDscaff for draft genome assembly, incorporating the linkage disequilibrium information present in population data. We evaluated the performance of our method with both simulated data and real data. We simulated scaffolds by splitting the pig reference genome and then reassembled them, introducing gaps between scaffolds ranging from 0 to 100 kb. The genome misassembly rate is 2.43% when there is no gap. We then applied our method to refine the giant panda genome and the donkey genome, both assembled purely from NGS data. After LDscaff treatment, the resulting panda assembly has a scaffold N50 of 3.6 Mb, 2.5 times larger than the original N50 (1.3 Mb). The re-assembled donkey assembly has an improved N50 length of 32.1 Mb, up from 23.8 Mb.

Conclusions: Our method effectively improves assemblies using existing re-sequencing data and is a potential alternative to existing assemblers that require the collection of new data.
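
The distance signal that LDscaff exploits can be illustrated with the standard r-squared measure of linkage disequilibrium, which decays with physical distance between loci. The toy haplotype matrix below is illustrative only and does not reproduce LDscaff's scaffolding model:

# Sketch of the signal behind LD-based scaffolding: r^2 between loci
# decays with physical distance, so pairwise r^2 computed from
# population haplotypes hints at how far apart two scaffold ends are.

def r_squared(hap_a, hap_b):
    """hap_a, hap_b: 0/1 alleles per haplotype at two loci."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a and b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom else 0.0

# Two tightly linked loci: alleles almost always co-occur -> high r^2.
locus1 = [1, 1, 0, 0, 1, 0, 1, 0]
locus2 = [1, 1, 0, 0, 1, 0, 0, 0]
print(round(r_squared(locus1, locus2), 3))    # 0.6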


Genetics
1996
Vol 143 (1)
pp. 517-529
Author(s):
Kuldeep Singh
D S Multani
Gurdev S Khush

Abstract Secondary trisomics and telotrisomics representing the 12 chromosomes of rice were isolated from the progenies of primary trisomics. A large population of each primary trisomic was grown. Plants showing variation in gross morphology compared to the primary trisomics and disomic sibs were selected and analyzed cytologically at diakinesis and pachytene. Secondary trisomics for both arms of chromosomes 1, 2, 6, 7 and 11 and for one arm of chromosomes 4, 5, 8, 9 and 12 were identified. Telotrisomics for the short arms of chromosomes 1, 8, 9 and 10 and for the long arms of chromosomes 2, 3 and 5 were isolated. These secondary and telotrisomics were characterized morphologically and for breeding behavior. Secondary trisomics 2n + 1S · 1S, 2n + 1L · 1L, 2n + 2S · 2S, 2n + 2L · 2L, 2n + 6S · 6S, 2n + 6L · 6L and 2n + 7L · 7L are highly sterile, and 2n + 1L · 1L, 2n + 2L · 2L and 2n + 7L · 7L do not set any seed even upon backcrossing. Telotrisomics are fertile and vigorous. Genetic segregation of 43 marker genes was studied in the F2 or backcross progenies. On the basis of segregation data, these genes were delimited to specific chromosome arms. The correct orientation of 10 linkage groups was determined, and centromere positions on nine linkage groups were approximated. A revised linkage map of rice is presented.
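
The segregation logic behind delimiting genes to chromosome arms can be sketched with a chi-square goodness-of-fit test: a recessive marker carried on the trisomic arm departs from the disomic 3:1 F2 ratio. The counts below are invented for illustration and are not the paper's data:

# Sketch of the segregation test: a recessive marker on the extra arm
# departs from the disomic 3:1 F2 ratio, so a goodness-of-fit test
# against 3:1 flags linkage to that arm. Illustrative counts only.

def chi_square_3_to_1(dominant, recessive):
    total = dominant + recessive
    exp_dom, exp_rec = total * 3 / 4, total * 1 / 4
    return ((dominant - exp_dom) ** 2 / exp_dom
            + (recessive - exp_rec) ** 2 / exp_rec)

# chi^2 > 3.84 rejects the 3:1 ratio at the 5% level (1 d.f.),
# consistent with the marker lying on the trisomic arm.
print(round(chi_square_3_to_1(560, 80), 2))   # 53.33: strong departure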


Metabolites
2021
Vol 11 (4)
pp. 214
Author(s):
Aneta Sawikowska
Anna Piasecka
Piotr Kachlicki
Paweł Krajewski

Peak overlapping is a common problem in chromatography, mainly in the case of complex biological mixtures such as metabolite extracts. Because different compounds with similar chromatographic properties can co-elute, peak separation becomes challenging. In this paper, two computational methods of separating peaks, applied for the first time to large chromatographic datasets, are described, compared, and experimentally validated. The methods lead from raw observations to data that can form inputs for statistical analysis. First, in both methods, data are normalized by sample mass, the baseline is removed, retention-time alignment is conducted, and peaks are detected. Then, in the first method, clustering is used to separate overlapping peaks, whereas in the second method, functional principal component analysis (FPCA) is applied for the same purpose. Simulated data and experimental results are used as examples to present both methods and to compare them. Real data were obtained in a study of metabolomic changes in barley (Hordeum vulgare) leaves under drought stress. The results suggest that both methods are suitable for the separation of overlapping peaks, but the additional advantage of FPCA is the possibility of assessing the variability of individual compounds present within the same peaks of different chromatograms.
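
As a sketch of the FPCA idea, the example below treats each chromatogram's intensity over one retention-time window as a sampled curve and computes principal components over a set of such curves; when two compounds co-elute, the leading components capture their varying proportions. Synthetic Gaussian peaks stand in for real chromatograms, and the paper's full pipeline (normalization, baseline removal, alignment) is not reproduced:

import numpy as np

# Each chromatogram's intensity over a peak window is one sampled
# curve; principal components of the curve set capture the main modes
# of variation, which can separate co-eluting compounds (sketch only).

t = np.linspace(0, 1, 200)
rng = np.random.default_rng(0)
peak1 = np.exp(-((t - 0.45) / 0.05) ** 2)   # two co-eluting components
peak2 = np.exp(-((t - 0.55) / 0.05) ** 2)
# 30 chromatograms mixing the two components in varying proportions
curves = np.array([a * peak1 + b * peak2
                   for a, b in rng.uniform(0.5, 2.0, size=(30, 2))])

centered = curves - curves.mean(axis=0)      # centre the curves
s = np.linalg.svd(centered, compute_uv=False)
explained = s ** 2 / np.sum(s ** 2)
print(np.round(explained[:3], 3))  # two non-trivial components, rest ~0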


2021
Vol 10 (7)
pp. 435
Author(s):
Yongbo Wang
Nanshan Zheng
Zhengfu Bian

Since pairwise registration is a necessary step for the seamless fusion of point clouds from neighboring stations, a closed-form solution to planar feature-based registration of LiDAR (Light Detection and Ranging) point clouds is proposed in this paper. Based on the Plücker coordinate-based representation of linear features in three-dimensional space, a quad-tuple-based representation of planar features is introduced, which makes it possible to directly determine the difference between any two planar features. Dual quaternions are employed to represent the spatial transformation, and operations between dual quaternions and the quad-tuple-based representation of planar features are given, with which an error norm is constructed. Based on L2-norm minimization, detailed derivations of the proposed solution are explained step by step. Two experiments were designed in which both simulated and real data were used to verify the correctness and feasibility of the proposed solution. With the simulated data, the calculated registration results were consistent with the pre-established parameters, which verifies the correctness of the presented solution. With the real data, the calculated registration results were consistent with the results calculated by iterative methods. Two conclusions can be drawn from the experiments: (1) the proposed solution does not require any initial estimates of the unknown parameters, which assures its stability and robustness; (2) using dual quaternions to represent the spatial transformation greatly reduces the additional constraints in the estimation process.
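
A minimal sketch of the quad-tuple representation: a plane is stored as (n, d) with unit normal n and offset d, a rigid motion (R, t) maps it to (Rn, d + (Rn)·t), and two planar features can then be compared directly as 4-vectors, which is the kind of difference an error norm can be built from. This illustrates the representation only and does not reproduce the paper's dual-quaternion derivation:

import numpy as np

# Plane as quad tuple (n, d): n . x = d for points x on the plane.
# Under a rigid motion (R, t) the plane maps to (R n, d + (R n) . t).

def transform_plane(n, d, R, t):
    n_new = R @ n
    return n_new, d + n_new @ t

def plane_difference(plane_a, plane_b):
    (na, da), (nb, db) = plane_a, plane_b
    return np.linalg.norm(np.append(na - nb, da - db))  # 4-vector residual

theta = np.pi / 6                       # rotate 30 degrees about z
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
t = np.array([0.5, -0.2, 1.0])
n, d = np.array([0.0, 0.0, 1.0]), 2.0   # plane z = 2
n2, d2 = transform_plane(n, d, R, t)
print(n2, d2)                                  # normal unchanged, d = 3.0
print(plane_difference((n2, d2), (n, 3.0)))    # 0.0: zero residual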


2021
Vol 11 (11)
pp. 5043
Author(s):
Xi Chen
Bo Kang
Jefrey Lijffijt
Tijl De Bie

Many real-world problems can be formalized as predicting links in a partially observed network. Examples include Facebook friendship suggestions, the prediction of protein–protein interactions, and the identification of hidden relationships in a crime network. Several link prediction algorithms, notably those recently introduced using network embedding, are capable of doing this by relying only on the observed part of the network. Often, whether two nodes are linked can be queried, albeit at a substantial cost (e.g., by questionnaires, wet-lab experiments, or undercover work). Such additional information can improve link prediction accuracy, but owing to the cost, the queries must be made with due consideration. We therefore argue that an active learning approach is of great potential interest, and we developed ALPINE (Active Link Prediction usIng Network Embedding), a framework that identifies the most useful link status to query by estimating the improvement in link prediction accuracy to be gained by querying it. We propose several query strategies for use in combination with ALPINE, inspired by the optimal experimental design and active learning literature. Experimental results on real data not only showed that ALPINE is scalable and boosts link prediction accuracy with far fewer queries, but also shed light on the relative merits of the strategies, providing actionable guidance for practitioners.
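
A minimal sketch of one plausible query strategy in this spirit: among unobserved node pairs, query the one whose predicted link probability is most uncertain (closest to 0.5). This is a generic uncertainty-sampling heuristic for illustration, not ALPINE's actual estimate of accuracy improvement:

# Uncertainty-sampling sketch for active link prediction: spend the
# costly query on the pair whose predicted status is least certain.

def most_uncertain_pair(predictions):
    """predictions: dict mapping (u, v) -> predicted link probability."""
    return min(predictions, key=lambda pair: abs(predictions[pair] - 0.5))

preds = {("a", "b"): 0.92, ("a", "c"): 0.48, ("b", "d"): 0.10}
print(most_uncertain_pair(preds))   # ('a', 'c'): worth a query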


2021
Vol 22 (1)
Author(s):
Camilo Broc
Therese Truong
Benoit Liquet

Abstract

Background: The increasing number of genome-wide association studies (GWAS) has revealed several loci that are associated with multiple distinct phenotypes, suggesting the existence of pleiotropic effects. Highlighting these cross-phenotype genetic associations could help to identify and understand the common biological mechanisms underlying some diseases. Common approaches test the association between genetic variants and multiple traits at the SNP level. In this paper, we propose a novel gene- and pathway-level approach for the case where several independent GWAS on independent traits are available. The method is based on a generalization of sparse group Partial Least Squares (sgPLS) that takes into account groups of variables, with a Lasso penalization that links all the independent datasets. This method, called joint-sgPLS, convincingly detects signal at both the variable level and the group level.

Results: Our method has the advantage of producing a global, readable model while coping with the architecture of the data. It can outperform traditional methods and provides wider insight in terms of a priori information. We compared the performance of the proposed method to other benchmark methods on simulated data and gave an example of application to real data, with the aim of highlighting common susceptibility variants for breast and thyroid cancers.

Conclusion: The joint-sgPLS shows interesting properties for detecting signal. As an extension of PLS, the method is suited to data with a large number of variables. The choice of Lasso penalization copes with architectures of grouped variables and multiple observation sets. Furthermore, although the method has been applied here to a genetic study, its formulation is adapted to any data with a large number of variables and a known a priori structure, in other application fields as well.
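
The group-sparsity mechanism underlying sgPLS-style penalties can be sketched with block soft-thresholding, which shrinks each group of coefficients (e.g., all SNPs of one gene) and zeroes out whole groups with weak combined signal. This is a toy illustration of the penalty's effect, not the joint-sgPLS estimator:

import numpy as np

# Block soft-thresholding: the proximal operator of the group-Lasso
# penalty. Groups whose joint norm falls below the threshold are
# removed entirely; the rest are shrunk (sketch of the mechanism only).

def group_soft_threshold(w, lam):
    norm = np.linalg.norm(w)
    if norm <= lam:
        return np.zeros_like(w)     # whole group (gene) is zeroed out
    return (1 - lam / norm) * w     # otherwise shrink the group

gene_a = np.array([0.9, -1.2, 0.4])   # strong joint signal: kept, shrunk
gene_b = np.array([0.05, -0.03])      # weak signal: removed entirely
for w in (gene_a, gene_b):
    print(group_soft_threshold(w, lam=0.5))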


2021
Vol 11 (2)
pp. 582
Author(s):
Zean Bu
Changku Sun
Peng Wang
Hang Dong

Calibration between multiple sensors is a fundamental procedure for data fusion. To address the problems of large errors and tedious operation, we present a novel method for calibration between a light detection and ranging (LiDAR) sensor and a camera. We designed a calibration target: an arbitrary triangular pyramid with a chessboard pattern on each of its three planes. The target contains both 3D and 2D information, which can be utilized to obtain the intrinsic parameters of the camera and the extrinsic parameters of the system. In the proposed method, the world coordinate system is established through the triangular pyramid. We extract the equations of the triangular pyramid planes to find the relative transformation between the two sensors. A single capture from the camera and LiDAR is sufficient for calibration, and errors are reduced by minimizing the distance between points and planes. Furthermore, accuracy can be increased with more captures. We carried out experiments on simulated data with varying degrees of noise and numbers of frames. Finally, the calibration results were verified on real data through incremental validation and analysis of the root mean square error (RMSE), demonstrating that our calibration method is robust and provides state-of-the-art performance.
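
One step the method relies on, extracting a plane equation from LiDAR points on a pyramid face and scoring the fit by point-to-plane RMSE, can be sketched as follows. The least-squares plane fit via SVD and the synthetic points are illustrative assumptions, not the paper's full calibration pipeline:

import numpy as np

# Fit a plane to LiDAR points on one pyramid face by least squares
# (SVD of the centered points), then score the fit by point-to-plane
# RMSE. Synthetic noisy points stand in for a real scan.

def fit_plane(points):
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                    # direction of least variance
    return normal, normal @ centroid   # plane: normal . x = d

def point_to_plane_rmse(points, normal, d):
    return np.sqrt(np.mean((points @ normal - d) ** 2))

rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, size=(100, 2))
z = 0.3 * xy[:, 0] - 0.2 * xy[:, 1] + 1.5 + rng.normal(0, 0.01, 100)
pts = np.column_stack([xy, z])
n, d = fit_plane(pts)
print(round(point_to_plane_rmse(pts, n, d), 4))   # close to the 0.01 noise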

