sequence length
Recently Published Documents


TOTAL DOCUMENTS

665
(FIVE YEARS 248)

H-INDEX

39
(FIVE YEARS 5)

Author(s):  
Shanru Zuo ◽  
Yihu Yi ◽  
Chen Wang ◽  
Xueguang Li ◽  
Mingqing Zhou ◽  
...  

Extrachromosomal circular DNA (eccDNA) is a type of double-stranded circular DNA that is derived and free from chromosomes. It has a strong heterogeneity in sequence, length, and origin and has been identified in both normal and cancer cells. Although many studies suggested its potential roles in various physiological and pathological procedures including aging, telomere and rDNA maintenance, drug resistance, and tumorigenesis, the functional relevance of eccDNA remains to be elucidated. Recently, due to technological advancements, accumulated evidence highlighted that eccDNA plays an important role in cancers by regulating the expression of oncogenes, chromosome accessibility, genome replication, immune response, and cellular communications. Here, we review the features, biogenesis, physiological functions, potential functions in cancer, and research methods of eccDNAs with a focus on some open problems in the field and provide a perspective on how eccDNAs evolve specific functions out of the chaos in cells.


2021 ◽  
Vol 33 (6) ◽  
pp. 238-245
Author(s):  
Seongsik Park ◽  
Kyunghoi Kim

In this study, we carried out a case study to predict the dissolved oxygen (DO) concentration of the Nakdong River estuary with an LSTM model. We aimed to identify the optimal model conditions and appropriate predictors for DO prediction, treating model parameters and predictors as separate cases. In the model parameter cases, Epoch = 300 and Sequence length = 1 gave higher accuracy than the other conditions. In the predictor cases, accuracy was highest when DO and temperature were used as predictors, owing to the high correlation between DO concentration and temperature. From these results, we identified appropriate model conditions and predictors for predicting DO concentration in the Nakdong River estuary.
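
To make the "Sequence length" parameter concrete, here is a minimal sketch of how a time series is usually cut into fixed-length input windows before it is fed to a recurrent model such as an LSTM. The function name `make_windows` and the sample DO values are hypothetical and do not come from the study, and the authors' actual preprocessing may differ.

```python
def make_windows(series, seq_len):
    """Split a series into (inputs, targets) for next-step prediction.

    Each input holds `seq_len` consecutive observations; the target is
    the observation that immediately follows that window.
    """
    if seq_len < 1 or seq_len >= len(series):
        raise ValueError("seq_len must be between 1 and len(series) - 1")
    inputs = [list(series[i:i + seq_len]) for i in range(len(series) - seq_len)]
    targets = list(series[seq_len:])
    return inputs, targets


# Hypothetical DO readings (mg/L), for illustration only.
do_readings = [8.1, 7.9, 7.4, 7.6, 8.0]

# Sequence length 1: each prediction sees only the previous reading.
x1, y1 = make_windows(do_readings, 1)

# Sequence length 3: each prediction sees the three previous readings.
x3, y3 = make_windows(do_readings, 3)
```

With a sequence length of 1 the model sees a single previous observation per prediction, so longer windows trade more context against fewer training samples.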


2021 ◽  
Author(s):  
Mohammad Davoud Ghafari ◽  
Iraj Rasooli ◽  
Khosro Khajeh ◽  
Bahareh Dabirmanesh ◽  
Mohammadreza Ghafari ◽  
...  

Predicting the phase transition temperature (Tt) of elastin-like polypeptides (ELPs) is not trivial because it depends on a complex set of variables such as composition, sequence length, hydrophobic and hydrophilic character, the sequence order in the fused proteins, and the linker and trailer constructs. In this paper, two unique quantitative models are presented for predicting the Tt of a family of ELPs that can be fused to different proteins, linkers, and trailers. Besides high accuracy and speed, the models have the advantage of not requiring multiple software packages, peptide information such as a PDB file, or knowledge of the secondary and tertiary structures of the proteins. One model predicts the Tt values of fused ELPs from the protein, linker, and trailer features with R2 = 99%. The other predicts the Tt value from the fused protein feature with R2 = 96%. For greater reliability, our method is augmented with artificial intelligence (AI) to generate similar proteins: a Generative Adversarial Network (GAN) is used to create fake proteins and similar values. The experimental results show that our strategy for predicting Tt is reliable on large data.


2021 ◽  
Vol 13 (1) ◽  
pp. 3
Author(s):  
Jorge Silvestre ◽  
Miguel de Santiago ◽  
Anibal Bregon ◽  
Miguel A. Martínez-Prieto ◽  
Pedro C. Álvarez-Esteban

Predictable operations are the basis of efficient air traffic management. In this context, accurately estimating the arrival time at the destination airport is fundamental to making tactical decisions about an optimal schedule of landing and take-off operations. In this paper, we evaluate different deep learning models based on LSTM architectures for predicting the estimated time of arrival of commercial flights, mainly using surveillance data from the OpenSky Network. We observed that the number of previous flight states used to make the prediction has a great influence on the accuracy of the estimation, independently of the architecture. The best model, with an input sequence length of 50, achieved an MAE of 3.33 min and an RMSE of 5.42 min on the test set, with MAE values of 5.67 and 2.13 min at 90 and 15 min before the end of the flight, respectively.
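
For readers unfamiliar with the two error metrics quoted above, the following sketch computes MAE and RMSE with their standard definitions. The arrival-time values are invented for illustration and are not taken from the paper.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average magnitude of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more than MAE."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Hypothetical remaining flight times in minutes (illustrative only).
actual_minutes = [10.0, 20.0, 30.0]
predicted_minutes = [12.0, 18.0, 33.0]

print(round(mae(actual_minutes, predicted_minutes), 3))
print(round(rmse(actual_minutes, predicted_minutes), 3))
```

RMSE is never smaller than MAE on the same data, so a gap between the two (as with 5.42 min against 3.33 min) indicates that some predictions miss by much more than the typical error.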


2021 ◽  
Vol 119 (1) ◽  
pp. e2109649118
Author(s):  
David H. Brookes ◽  
Amirali Aghazadeh ◽  
Jennifer Listgarten

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model’s interpretable parameters—sequence length, alphabet size, and assumed interactions between sequence positions—on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.


2021 ◽  
Author(s):  
Michael Vincent Westbury ◽  
Eline D Lorenzen

(1) Within evolutionary biology, mitochondrial genomes (mitogenomes) provide useful insights at both population and species levels. Several approaches are available to assemble mitogenomes. However, most are not suitable for divergent, extinct species, due to the requirement of a reference mitogenome from a conspecific or close relative, and relatively high-quality DNA. (2) Iterative mapping can overcome the lack of a close reference sequence, and has been applied to an array of extinct species. Despite its widespread use, the accuracy of the reconstructed assemblies is yet to be comprehensively assessed. Here, we investigated the influence of mapping software (BWA or MITObim), parameters, and bait reference phylogenetic distance on the accuracy of the reconstructed assembly using two simulated datasets: (i) spotted hyena and various mammalian bait references, and (ii) southern cassowary and various avian bait references. Specifically, we assessed the accuracy of results through pairwise distance (PWD) to the reference conspecific mitogenome, number of incorrectly inserted base pairs (bp), and total length of the reconstructed assembly. (3) We found large discrepancies in the accuracy of reconstructed assemblies using different mapping software, parameters, and bait references. PWD to the reference conspecific mitogenome, which reflected the level of incorrect base calls, was consistently higher with BWA than MITObim. The same was observed for the number of incorrectly inserted bp. In contrast, the total sequence length was lower. Overall, the most accurate results were obtained with MITObim using mismatch values of 3 or 5, and the phylogenetically closest bait reference sequence. Accuracy could be further improved by combining results from multiple bait references. (4) We present the first comprehensive investigation of how mapping software, parameters, and bait reference influence mitogenome reconstruction from ancient DNA through iterative mapping. Our study provides information on how mitogenomes are best reconstructed from divergent, short-read data. By obtaining the most accurate reconstruction possible, one can be more confident as to the reliability of downstream analyses, and the evolutionary inferences made from them.
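
The abstract does not state how PWD was computed. One common definition, assumed here, is the proportion of differing sites between two aligned sequences, skipping positions with gaps or ambiguous bases. The function name and the toy sequences below are hypothetical.

```python
def pairwise_distance(seq_a, seq_b):
    """Proportion of mismatched sites between two aligned sequences.

    Assumption: sites where either sequence has a gap ('-') or an
    ambiguous base ('N') are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue  # skip uninformative sites
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


# Toy aligned fragments (not real mitogenome data).
reference = "ACGTACGT"
assembly = "ACGAAC-T"
print(pairwise_distance(reference, assembly))
```

Under this definition a higher PWD to the conspecific reference means more incorrect base calls in the reconstructed assembly, which matches how the abstract interprets the measure.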


2021 ◽  
Author(s):  
Christoph Flamm ◽  
Julia Wielach ◽  
Michael T. Wolfinger ◽  
Stefan Badelt ◽  
Ronny Lorenz ◽  
...  

Machine learning (ML) and in particular deep learning techniques have gained popularity for predicting structures from biopolymer sequences. An interesting case is the prediction of RNA secondary structures, where well established biophysics based methods exist. These methods even yield exact solutions under certain simplifying assumptions. Nevertheless, the accuracy of these classical methods is limited and has seen little improvement over the last decade. This makes it an attractive target for machine learning and consequently several deep learning models have been proposed in recent years. In this contribution we discuss limitations of current approaches, in particular due to biases in the training data. Furthermore, we propose to study capabilities and limitations of ML models by first applying them on synthetic data that can not only be generated in arbitrary amounts, but are also guaranteed to be free of biases. We apply this idea by testing several ML models of varying complexity. Finally, we show that the best models are capable of capturing many, but not all, properties of RNA secondary structures. Most severely, the number of predicted base pairs scales quadratically with sequence length, even though a secondary structure can only accommodate a linear number of pairs.
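
The final point can be checked with simple counting: because each nucleotide pairs with at most one partner, a secondary structure on n nucleotides contains at most n // 2 base pairs, whereas the number of candidate position pairs grows quadratically. The helper names below are hypothetical and the sketch ignores further constraints such as minimum hairpin loop size.

```python
def max_base_pairs(n):
    """Upper bound on base pairs: every nucleotide pairs at most once."""
    return n // 2


def possible_pairs(n):
    """Number of distinct position pairs (i, j) with i < j."""
    return n * (n - 1) // 2


def exceeds_linear_bound(predicted_pairs, n):
    """True if a prediction has more pairs than any structure can hold."""
    return len(predicted_pairs) > max_base_pairs(n)


for n in (50, 100, 200):
    print(n, max_base_pairs(n), possible_pairs(n))
```

Doubling the sequence length doubles the bound but roughly quadruples the number of candidate pairs, so a model whose predicted pair count grows quadratically must eventually output physically impossible structures.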


2021 ◽  
Vol 22 (24) ◽  
pp. 13485
Author(s):  
Elena S. Babaylova ◽  
Alexander V. Gopanenko ◽  
Alexey E. Tupikin ◽  
Marsel R. Kabilov ◽  
Alexey A. Malygin ◽  
...  

Protein uL5 (formerly called L11) is an integral component of the large (60S) subunit of the human ribosome, and its deficiency in cells leads to the impaired biogenesis of 60S subunits. Using RNA interference, we reduced the level of uL5 in HEK293T cells threefold, which caused an almost proportional decrease in the content of the fraction corresponding to 80S ribosomes, without a noticeable reduction in the level of polysomes. By RNA sequencing of total mRNA and polysome-fraction mRNA from uL5-deficient and control cells, we identified hundreds of differentially expressed genes (DEGs) at the transcriptome and translatome levels and revealed dozens of genes with altered translational efficiency (GATEs). Transcriptionally up-regulated DEGs were mainly associated with rRNA processing, pre-mRNA splicing, translation and DNA repair, while down-regulated DEGs were genes of membrane proteins; the type of regulation depended on the GC content in the 3′ untranslated regions of DEG mRNAs. Whether a GATE was up-regulated or down-regulated was determined by the coding sequence length of its mRNA. Our findings suggest that the effects observed in uL5-deficient cells result from an insufficiency of translationally active ribosomes caused by a deficiency of 60S subunits.


Symmetry ◽  
2021 ◽  
Vol 13 (12) ◽  
pp. 2385
Author(s):  
Xue Sun ◽  
Chao-Chin Wu ◽  
Yan-Fang Liu

In the field of computational biology, sequence alignment is a very important methodology. BLAST, provided by the National Center for Biotechnology Information (NCBI) in the USA, is a very common bioinformatics tool for performing sequence alignment. The BLAST server receives tens of thousands of queries every day on average. Among the procedures of BLAST, the hit detection process, whose core data structure is a lookup table, is the most time-consuming. In the latest work, a lightweight BLASTP on CUDA GPU with a hybrid query-index table was proposed for query sequences shorter than 512, which effectively improved query efficiency. According to the reported protein sequence length distribution, about 90% of sequences have a length equal to or smaller than 1024. In this paper, we propose an improved lightweight BLASTP to speed up hit detection for longer query sequences. The maximum supported query length is increased from 512 to 1024. As a result, one more bit is required to encode each sequence position. To meet this requirement, an extended hybrid query-index table (EHQIT) is proposed to accommodate three sequence positions in a four-byte table entry, so that a single memory access is sufficient to retrieve all the position information as long as the number of hits is equal to or smaller than three. Moreover, if there are more than three hits for a possible word, all the position information is stored in contiguous table entries, which eliminates branch divergence and reduces the memory space needed for pointers to the overflow buffer. A square symmetric scoring matrix, Blosum62, is used to determine the relative score of matching two characters in a sequence alignment. The experimental results show that for queries shorter than 512 our improved lightweight BLASTP outperforms the original lightweight BLASTP with a speedup of 1.2 on average. When the number of hit overflows increases, the speedup can be as high as two. For queries shorter than 1024, our improved lightweight BLASTP provides speedups ranging from 1.56 to 3.08 over CUDA-BLAST. In short, the improved lightweight BLASTP can replace the original one because it supports longer query sequences and provides better performance.
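
The bit arithmetic behind the table entry can be sketched as follows. A position below 1024 needs 10 bits, so three positions occupy 30 of the 32 bits in a four-byte entry. The layout below, including the use of the two remaining bits as a hit count, is an assumption made for illustration and is not necessarily the EHQIT layout used in the paper.

```python
POS_BITS = 10                      # 2**10 = 1024 possible positions
POS_MASK = (1 << POS_BITS) - 1     # 0x3FF
COUNT_SHIFT = 3 * POS_BITS         # hypothetical: top 2 bits hold the hit count


def pack_entry(positions):
    """Pack up to three query positions into one 32-bit integer."""
    if len(positions) > 3:
        raise ValueError("more than three hits need overflow storage")
    entry = len(positions) << COUNT_SHIFT
    for slot, pos in enumerate(positions):
        if not 0 <= pos <= POS_MASK:
            raise ValueError("position does not fit in 10 bits")
        entry |= pos << (slot * POS_BITS)
    return entry


def unpack_entry(entry):
    """Recover the list of positions stored in a packed entry."""
    count = entry >> COUNT_SHIFT
    return [(entry >> (slot * POS_BITS)) & POS_MASK for slot in range(count)]


entry = pack_entry([5, 1023, 512])
print(hex(entry), unpack_entry(entry))
```

Because the whole entry fits in one 32-bit word, reading it costs a single memory access, which is the property the abstract highlights for words with three or fewer hits.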


2021 ◽  
Vol 118 (52) ◽  
pp. e2116269118
Author(s):  
Sizhen Li ◽  
He Zhang ◽  
Liang Zhang ◽  
Kaibo Liu ◽  
Boxiang Liu ◽  
...  

The constant emergence of COVID-19 variants reduces the effectiveness of existing vaccines and test kits. Therefore, it is critical to identify conserved structures in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes as potential targets for variant-proof diagnostics and therapeutics. However, the algorithms to predict these conserved structures, which simultaneously fold and align multiple RNA homologs, scale at best cubically with sequence length and are thus infeasible for coronaviruses, which possess the longest genomes (∼30,000 nt) among RNA viruses. As a result, existing efforts on modeling SARS-CoV-2 structures resort to single-sequence folding as well as local folding methods with short window sizes, which inevitably neglect long-range interactions that are crucial in RNA functions. Here we present LinearTurboFold, an efficient algorithm for folding RNA homologs that scales linearly with sequence length, enabling unprecedented global structural analysis on SARS-CoV-2. Surprisingly, on a group of SARS-CoV-2 and SARS-related genomes, LinearTurboFold’s purely in silico prediction not only is close to experimentally guided models for local structures, but also goes far beyond them by capturing the end-to-end pairs between 5′ and 3′ untranslated regions (UTRs) (∼29,800 nt apart) that match perfectly with a purely experimental work. Furthermore, LinearTurboFold identifies undiscovered conserved structures and conserved accessible regions as potential targets for designing efficient and mutation-insensitive small-molecule drugs, antisense oligonucleotides, small interfering RNAs (siRNAs), CRISPR-Cas13 guide RNAs, and RT-PCR primers. LinearTurboFold is a general technique that can also be applied to other RNA viruses and full-length genome studies and will be a useful tool in fighting the current and future pandemics.
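
A rough calculation shows why cubic scaling is prohibitive at genome length. The sketch below compares how much the work grows under cubic and linear scaling when moving from a short RNA to a coronavirus genome. It ignores constant factors and memory, and the 300 nt baseline is a hypothetical choice for illustration.

```python
def cost_growth(n_short, n_long, exponent):
    """Factor by which an O(n**exponent) algorithm's work increases."""
    return (n_long / n_short) ** exponent


short_rna = 300        # hypothetical short RNA length in nt
coronavirus = 30_000   # approximate genome length quoted in the abstract

print(cost_growth(short_rna, coronavirus, 3))  # cubic scaling
print(cost_growth(short_rna, coronavirus, 1))  # linear scaling
```

A 100-fold increase in length multiplies cubic-time work by a factor of one million but linear-time work by only 100, which is the gap a linear-scaling method closes.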

