On enhancing variation detection through pan-genome indexing

2015 ◽  
Author(s):  
Daniel Valenzuela ◽  
Niko Välimäki ◽  
Esa Pitkänen ◽  
Veli Mäkinen

Detection of genomic variants is commonly conducted by aligning a set of reads sequenced from an individual to the reference genome of the species and analyzing the resulting read pileup. Typically, this process finds a subset of variants already reported in databases and additional novel variants characteristic of the sequenced individual. Most of the effort in the literature has been put into the alignment problem on a single reference sequence, although our gathered knowledge on species such as human is pan-genomic: We know most of the common variation in addition to the reference sequence. There have been some efforts to exploit pan-genome indexing, where the most widely adopted approach is to build an index structure on a set of reference sequences containing observed variation combinations. The enhancement in alignment accuracy when using pan-genome indexing has been demonstrated in experiments, but so far the above multiple references pan-genome indexing approach has not been tested on its final goal, that is, in enhancing variation detection. This is the focus of this article: We study a generic approach to add variation detection support on top of the multiple references pan-genomic indexing approach. Namely, we study the read pileup on a multiple alignment of reference genomes, and propose a heaviest path algorithm to extract a new recombined reference sequence. This recombined reference sequence can then be utilized in any standard read alignment and variation detection workflow. We demonstrate that the approach enhances variation detection on realistic data sets.
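The abstract does not spell out how the heaviest path is scored, so the following is only a minimal sketch of the general idea under stated assumptions: each cell of the multiple alignment carries a weight equal to the number of reads piled up on it, and moving from one reference row to another between adjacent columns costs a fixed penalty. The function name `heaviest_path`, the `coverage` matrix, and the `switch_penalty` parameter are hypothetical and are not taken from the article.

```python
from typing import List, Tuple


def heaviest_path(
    alignment: List[str],
    coverage: List[List[int]],
    switch_penalty: int = 1,
) -> Tuple[List[int], str]:
    """Pick one row per alignment column so that total read coverage,
    minus a penalty for each row switch, is maximal (assumed scoring)."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])

    # best[r]: weight of the heaviest path that ends in row r at the current column
    best = [coverage[r][0] for r in range(n_rows)]
    back: List[List[int]] = []  # back[c - 1][r]: row used at column c - 1

    for c in range(1, n_cols):
        top = max(range(n_rows), key=lambda r: best[r])
        new_best, pointers = [], []
        for r in range(n_rows):
            stay = best[r]
            jump = best[top] - switch_penalty
            if stay >= jump:
                prev, weight = r, stay
            else:
                prev, weight = top, jump
            new_best.append(weight + coverage[r][c])
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace back from the heaviest end point
    row = max(range(n_rows), key=lambda r: best[r])
    path = [row]
    for pointers in reversed(back):
        row = pointers[row]
        path.append(row)
    path.reverse()

    # The recombined reference is the path's characters with gaps dropped
    recombined = "".join(
        alignment[r][c] for c, r in enumerate(path) if alignment[r][c] != "-"
    )
    return path, recombined


if __name__ == "__main__":
    refs = ["ACGT", "ATCA"]
    cov = [[5, 0, 5, 0], [0, 5, 0, 9]]
    print(heaviest_path(refs, cov, switch_penalty=1))
```

With a small penalty the path is free to recombine the two toy references column by column; with a large penalty it stays on the single best-supported row. The resulting string could then be handed to a standard alignment and variant calling workflow, as the abstract describes.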

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Aysun Urhan ◽  
Thomas Abeel

Abstract Coronavirus disease 2019 (COVID-19) emerged in December 2019 when the first case was reported in Wuhan, China and turned into a pandemic with 27 million cases (September 9th). Currently, there are over 95,000 complete genome sequences of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus causing COVID-19, in public databases, accompanying a growing number of studies. Nevertheless, there is still much to learn about the viral population variation when the virus is evolving as it continues to spread. We have analyzed SARS-CoV-2 genomes to identify the most variant sites, as well as the stable, conserved ones in samples collected in the Netherlands until June 2020. We identified the most frequent mutations in different geographies. We also performed a phylogenetic study focused on the Netherlands to detect novel variants emerging in the late stages of the pandemic and forming local clusters. We investigated the S and N proteins on SARS-CoV-2 genomes in the Netherlands and found the most variant and stable sites to guide development of diagnostic assays and vaccines. We observed that while the SARS-CoV-2 genome has accumulated mutations, diverging from the reference sequence, the variation landscape is dominated by four mutations globally, suggesting the current reference does not represent the virus samples circulating currently. In addition, we detected novel variants of SARS-CoV-2 almost unique to the Netherlands that form localized clusters and region-specific sub-populations indicating community spread. We explored SARS-CoV-2 variants in the Netherlands until June 2020 within a global context; our results provide insight into the viral population diversity for localized efforts in tracking the transmission of COVID-19, as well as sequence-based approaches in diagnostics and therapeutics.
We emphasize that little diversity is observed globally in recent samples despite the increased number of mutations relative to the established reference sequence. We suggest sequence-based analyses should opt for a consensus representation to adequately cover the genomic variation observed to speed up diagnostics and vaccine design.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common for a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. Then the pairwise intermediates are integrated based on a linear model that adjusts the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt the robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The goodness of normalization emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by a single-cell data set of the cell cycle. MUREN is implemented as an R package. The code, under the GPL-3 license, is available on GitHub at github.com/hippo-yf/MUREN and on conda at anaconda.org/hippo-yf/r-muren. Conclusions MUREN performs the RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations are used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.


2018 ◽  
Vol 8 (12) ◽  
pp. 2421 ◽  
Author(s):  
Chongya Song ◽  
Alexander Pons ◽  
Kang Yen

In the field of network intrusion, malware usually evades anomaly detection by disguising malicious behavior as legitimate access. Therefore, detecting these attacks from network traffic has become a challenge in this adversarial setting. In this paper, an enhanced Hidden Markov Model, called the Anti-Adversarial Hidden Markov Model (AA-HMM), is proposed to effectively detect evasion patterns, using the Dynamic Window and Threshold techniques to achieve adaptive, anti-adversarial, and online-learning abilities. In addition, a concept called Pattern Entropy is defined and acts as the foundation of AA-HMM. We evaluate the effectiveness of our approach employing two well-known benchmark data sets, NSL-KDD and CTU-13, in terms of the common performance metrics and the algorithm’s adaptation and anti-adversary abilities.


2021 ◽  
Author(s):  
Pradeep Ruperao ◽  
Nepolean Thirunavukkarasu ◽  
Prasad Gandham ◽  
Sivasubramani S. ◽  
Govindaraj M ◽  
...  

Abstract Sorghum (Sorghum bicolor L.) is one of the most important food crops in the arid and rainfed production ecologies. It is a part of resilient farming and is projected as a smart crop to overcome the food and nutritional challenges in the developing world. The development and characterisation of the sorghum pan-genome will provide insight into genome diversity and functionality, supporting sorghum improvement. We built a sorghum pan-genome using reference genomes as well as 354 genetically diverse sorghum accessions belonging to different races. We explored the structural and functional characteristics of the pan-genome and explain its utility in supporting genetic gain. The newly-developed pan-genome has a total of 35,719 genes, a core genome of 16,821 genes and an average of 32,795 genes in each cultivar. The variable genes are enriched with environment responsive genes and classify the sorghum accessions according to their race. We show that 53% of genes display presence-absence variation, and some of these variable genes are predicted to be functionally associated with drought traits. Using more than two million SNPs from the pan-genome, association analysis identified 398 SNPs significantly associated with important agronomic traits, of which, 92 were in genes. Drought gene expression analysis identified 1,788 genes that are functionally linked to different conditions, of which 79 were absent from the reference genome assembly. This study provides comprehensive genomic diversity resources in sorghum which can be used in genome assisted crop improvement.
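The split between core genes and genes showing presence-absence variation can be illustrated with a small sketch. It assumes the usual convention that a core gene is one present in every accession and a variable gene is one missing from at least one; the function name `classify_genes` and the toy accession and gene identifiers are hypothetical and do not come from the study.

```python
from typing import Dict, Set, Tuple


def classify_genes(presence: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split a pan-genome into core and variable genes.

    presence maps each accession name to the set of genes found in it.
    """
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)          # every gene seen in any accession
    core = set.intersection(*gene_sets)    # genes present in all accessions
    variable = pan - core                  # genes with presence-absence variation
    return core, variable


if __name__ == "__main__":
    toy = {
        "acc1": {"g1", "g2", "g3"},
        "acc2": {"g1", "g3"},
        "acc3": {"g1", "g3", "g4"},
    }
    core, variable = classify_genes(toy)
    pan_size = len(core) + len(variable)
    print(sorted(core), sorted(variable), len(variable) / pan_size)
```

On the toy input, two of the four pan-genome genes are variable, which is the same kind of ratio the abstract reports as 53% for sorghum.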


2021 ◽  
pp. 08-14
Author(s):  
Nafea ali majeed ◽  
Khalid Hameed Zaboon ◽  
...  

Technology has recently become an important part of our lives, and it is employed alongside medicine, space science, agriculture, industry, and more. Storing information on servers and in the cloud has become a requirement. It is a global force that has transformed people's lives through the availability of various web applications that serve billions of websites every day. However, many types of attack can target the internet, and there is a need to recognize, classify, and protect against these types of attack. Due to its important global role, it has become important to ensure that web applications are secure, accurate, and of high quality. One of the basic problems found on the Web is DDoS attacks. In this work, the review classifies and delineates attack types, test characteristics, evaluation techniques, evaluation methods, and test data sets used in the proposed Strategic Strategy methodology. Finally, this work offers guidance and possible targets for overcoming the most dangerous type of cyber-attack, the DDoS attack.


2019 ◽  
Vol 12 (1) ◽  
Author(s):  
Yin Li ◽  
Min Tu ◽  
Yaping Feng ◽  
Wenqin Wang ◽  
Joachim Messing

Abstract Background Sorghum bicolor (L.) is an important bioenergy source. The stems of sweet sorghum function as carbon sinks and accumulate large amounts of sugars and lignocellulosic biomass and considerable amounts of starch, therefore providing a model of carbon allocation and accumulation for other bioenergy crops. While omics data sets for sugar accumulation have been reported in different genotypes, the common features of primary metabolism in sweet genotypes remain unclear. To obtain a cohesive and comparative picture of carbohydrate metabolism between sorghum genotypes, we compared the phenotypes and transcriptome dynamics of sugar-accumulating internodes among three different sweet genotypes (Della, Rio, and SIL-05) and two non-sweet genotypes (BTx406 and R9188). Results Field experiments showed that Della and Rio had similar dynamics and internode patterns of sugar concentration, albeit with otherwise distinct phenotypes. Interestingly, cellulose synthases for primary cell wall and key genes in starch synthesis and degradation were coordinately upregulated in sweet genotypes. Sweet sorghums maintained active monolignol biosynthesis compared to the non-sweet genotypes. Comparative RNA-seq results support the role of candidate Tonoplast Sugar Transporter gene (TST), but not the Sugars Will Eventually be Exported Transporter genes (SWEETs) in the different sugar accumulations between sweet and non-sweet genotypes. Conclusions Comparisons of the expression dynamics of carbon metabolic genes across the RNA-seq data sets identify several candidate genes with contrasting expression patterns between sweet and non-sweet sorghum lines, including genes required for cellulose and monolignol synthesis (CesA, PTAL, and CCR), starch metabolism (AGPase, SS, SBE, and G6P-translocator SbGPT2), and sucrose metabolism and transport (TPP and TST2).
The common transcriptome features of primary metabolism identified here suggest the metabolic networks contributing to carbon sink strength in sorghum internodes, prioritize the candidate genes for manipulating carbon allocation for bioenergy purposes, and provide a comparative and cohesive picture of the complexity of carbon sink strength in sorghum stem.


2000 ◽  
Vol 18 (9) ◽  
pp. 1088-1096 ◽  
Author(s):  
J. M. Holt ◽  
A. P. van Eyken

Abstract. The recent availability of substantial data sets taken by the EISCAT Svalbard Radar allows several important tests to be made on the determination of convection patterns from incoherent scatter radar results. During one 30-h period, the Svalbard Radar made 15 min scans combining local field-aligned observations with two low-elevation positions selected to intersect the two beams of the Common Programme Four experiment being simultaneously conducted by the EISCAT VHF radar at Tromsø. The common volume results from the two radars are compared. The plasma convection velocities determined independently by the two radars are shown to agree very closely, and the combined three-dimensional velocity data are used to test the common assumption of negligible field-aligned flow in this regime. Key words: Ionosphere (auroral ionosphere; polar ionosphere) - Magnetospheric physics (plasma convection)


2010 ◽  
Vol 44-47 ◽  
pp. 3574-3578
Author(s):  
Ai Guo Li ◽  
Chi Zhang ◽  
Jiu Long Zhang ◽  
Zhen Hai Zhang

A new multi-dimensional index structure called RSR-tree is proposed, which is based on the RS-tree. In the RSR-tree, index records of a leaf node are split to ensure the sequence ordering of index records in a leaf node, which effectively reduces the addressing cost of I/O operations when reading data files. The entries of a non-leaf node are split to decrease the overlap between sibling nodes, which effectively reduces the time needed to read data from data files. Experimental results on different data sets show that, compared to the RS-tree, the RSR-tree has better overall performance with regard to tree building and querying. Querying performance is improved without incurring extra cost.


2011 ◽  
Vol 2011 ◽  
pp. 1-9 ◽  
Author(s):  
Dieter Devlaminck ◽  
Bart Wyns ◽  
Moritz Grosse-Wentrup ◽  
Georges Otte ◽  
Patrick Santens

Motor-imagery-based brain-computer interfaces (BCIs) commonly use the common spatial pattern filter (CSP) as a preprocessing step before feature extraction and classification. The CSP method is a supervised algorithm and therefore needs subject-specific training data for calibration, which is very time consuming to collect. In order to reduce the amount of calibration data that is needed for a new subject, one can apply multitask (from now on called multisubject) machine learning techniques to the preprocessing phase. Here, the goal of multisubject learning is to learn a spatial filter for a new subject based on its own data and that of other subjects. This paper outlines the details of the multitask CSP algorithm and shows results on two data sets. In certain subjects a clear improvement can be seen, especially when the number of training trials is relatively low.


Acta Numerica ◽  
2001 ◽  
Vol 10 ◽  
pp. 313-355 ◽  
Author(s):  
Markus Hegland

Methods for knowledge discovery in data bases (KDD) have been studied for more than a decade. New methods are required owing to the size and complexity of data collections in administration, business and science. They include procedures for data query and extraction, for data cleaning, data analysis, and methods of knowledge representation. The part of KDD dealing with the analysis of the data has been termed data mining. Common data mining tasks include the induction of association rules, the discovery of functional relationships (classification and regression) and the exploration of groups of similar data objects in clustering. This review provides a discussion of and pointers to efficient algorithms for the common data mining tasks in a mathematical framework. Because of the size and complexity of the data sets, efficient algorithms and often crude approximations play an important role.

