sequence file
Recently Published Documents


TOTAL DOCUMENTS: 9 (five years: 4)

H-INDEX: 4 (five years: 1)

2021 ◽  
Vol 2021 ◽  
pp. 1-8
Author(s):  
Furqan Awan ◽  
Muhammad Muddassir Ali ◽  
Muhammad Hamid ◽  
Muhammad Huzair Awan ◽  
Muhammad Hassan Mushtaq ◽  
...  

The main aim of this study was to develop a set of functions that can analyze genomic data with reduced time and memory consumption. Epi-gene is presented as a solution to the problems of handling large sequence files and long computation times. It requires little time and minimal programming skill to work with a large number of genomes. In the current study, some features of the Epi-gene R package are described and illustrated using a dataset of 14 Aeromonas hydrophila genomes. The package also includes joining, relabeling, and conversion functions for handling FASTA-formatted sequences. Various Epi-gene functions were used to calculate the subsets of core genes, accessory genes, and unique genes, and heat maps and phylogenetic genome trees were also constructed. This whole procedure was completed in less than 30 minutes. The package currently works only on Windows operating systems, and it also uses functions from other packages available in the R computing environment, such as dplyr and ggtree.
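The core/accessory/unique partitioning described above can be sketched in a few lines. This is an illustrative reimplementation of the idea, not Epi-gene's actual R code; the gene and genome names are invented.

```python
# Hypothetical sketch of pan-genome partitioning, the task performed by
# Epi-gene's subsetting functions. Gene names and genome IDs are invented.

def partition_pan_genome(genomes):
    """Split genes into core (present in every genome), unique (present
    in exactly one genome), and accessory (everything else)."""
    all_genes = set().union(*genomes.values())
    n = len(genomes)
    core, accessory, unique = set(), set(), set()
    for gene in all_genes:
        count = sum(gene in genes for genes in genomes.values())
        if count == n:
            core.add(gene)
        elif count == 1:
            unique.add(gene)
        else:
            accessory.add(gene)
    return core, accessory, unique

genomes = {
    "AH-1": {"gyrB", "recA", "aerA"},
    "AH-2": {"gyrB", "recA", "hlyA"},
    "AH-3": {"gyrB", "recA", "aerA", "ascV"},
}
core, accessory, unique = partition_pan_genome(genomes)
print(sorted(core))       # ['gyrB', 'recA']
print(sorted(accessory))  # ['aerA']
print(sorted(unique))     # ['ascV', 'hlyA']
```

Membership counting over sets keeps the logic independent of genome size, which matches the abstract's emphasis on low time and memory consumption.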


Author(s):  
Anisha P Rodrigues ◽  
Roshan Fernandes ◽  
P. Vijaya ◽  
Satish Chander

The Hadoop Distributed File System (HDFS) was developed to efficiently store and handle vast quantities of files in a distributed environment over a cluster of computers. A Hadoop cluster is formed from commodity hardware, which is inexpensive and readily available. However, storing a large number of small files in HDFS consumes excessive memory and degrades performance, because each small file places a heavy load on the NameNode. The efficiency of indexing and accessing small files on HDFS can therefore be improved by several techniques, such as archive files, New Hadoop Archive (New HAR), CombineFileInputFormat (CFIF), and sequence file generation. The archive file combines small files into single blocks, the New HAR file combines smaller files into a single large file, the CFIF module merges multiple files into a single split using the NameNode, and the sequence file combines all the small files into a single sequence. Indexing and accessing of small files in HDFS were evaluated using performance metrics such as processing time and memory usage. The experiments show that the sequence file generation approach is the most efficient of the compared approaches, with a file access time of 1.5 s, memory usage of 20 KB in multi-node mode, and a processing time of 0.1 s.
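The sequence-file idea can be illustrated with a toy container: many small files packed into one blob of key/value records, with a single in-memory index replacing one NameNode entry per file. This is a minimal sketch of the concept, not the real Hadoop SequenceFile binary format; the record layout and file names are invented.

```python
import io
import struct

# Toy "sequence file": pack many small files into one container keyed by
# filename. Each record is (key length, value length, key, value); the
# index maps a name to its record offset so reading is a single seek.

def pack(files):
    """files: dict of filename -> bytes. Returns (blob, index)."""
    buf = io.BytesIO()
    index = {}
    for name, data in files.items():
        key = name.encode()
        index[name] = (buf.tell(), len(key), len(data))
        buf.write(struct.pack(">II", len(key), len(data)))  # 8-byte header
        buf.write(key)
        buf.write(data)
    return buf.getvalue(), index

def read_one(blob, index, name):
    """Fetch one small file's bytes straight from its recorded offset."""
    offset, klen, vlen = index[name]
    start = offset + 8 + klen  # skip header and key
    return blob[start:start + vlen]

blob, index = pack({"a.txt": b"alpha", "b.txt": b"beta"})
print(read_one(blob, index, "b.txt"))  # b'beta'
```

Storing one merged blob instead of thousands of tiny files is exactly what relieves the NameNode memory pressure the abstract measures.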


2021 ◽  
Author(s):  
Christopher Bennett ◽  
Micah Thornton ◽  
Chanhee Park ◽  
Gervaise Henry ◽  
Yun Zhang ◽  
...  

With the vast improvements in sequencing technologies and the increased number of protocols, sequencing is finding more applications to answer complex biological problems. Thus, the amount of publicly available sequencing data has tremendously increased in repositories such as SRA, EGA, and ENCODE. With any large online database, there is a critical need to accurately document study metadata, such as the source protocol and organism. In some cases, this metadata may not be systematically verified by the hosting sites, which may negatively influence future studies. Here we present SeqWho, a program designed to heuristically assess the quality of sequencing files and reliably classify the organism and protocol type. This is done with an alignment-free algorithm that leverages a Random Forest classifier to learn from native biases in k-mer frequencies and repeat sequence identities between different sequencing technologies and species. Here, we show that our method can accurately and rapidly distinguish between human and mouse, between nine different sequencing technologies, and between both together in 98.32%, 97.86%, and 96.38% of high-confidence calls, respectively. This demonstrates that SeqWho is a powerful method for reliably checking the identity of the sequencing files used in any pipeline and illustrates the program's ability to leverage k-mer biases.
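The alignment-free feature extraction behind this kind of classifier can be sketched simply: a sequence is summarized as a normalized count vector over all 4^k possible k-mers, which a Random Forest then consumes. The sketch below shows only this feature step (the classifier itself is omitted), and the function name and example sequence are invented.

```python
from collections import Counter
from itertools import product

# Sketch of SeqWho-style alignment-free features: normalized k-mer
# frequencies over the full 4^k DNA k-mer alphabet, in a fixed order so
# vectors from different files are directly comparable.

def kmer_frequencies(seq, k=3):
    """Return a length-4^k vector of k-mer frequencies for a DNA string."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in kmers), 1)  # avoid division by zero
    return [counts[m] / total for m in kmers]

vec = kmer_frequencies("ACGTACGTACGT", k=2)
print(len(vec))  # 16 features for k = 2
```

Because species and sequencing technologies bias which k-mers appear, these vectors carry enough signal for a tree ensemble to separate them without any alignment.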


2019 ◽  
Author(s):  
Thomas C. Terwilliger ◽  
Steven J. Ludtke ◽  
Randy J. Read ◽  
Paul D. Adams ◽  
Pavel V. Afonine

A density modification procedure for improving maps produced by single-particle electron cryo-microscopy is presented. The theoretical basis of the method is identical to that of maximum-likelihood density modification, previously used to improve maps from macromolecular X-ray crystallography. Two key differences from applications in crystallography are that the errors in Fourier coefficients are largely in the phases in crystallography but in both phases and amplitudes in electron cryo-microscopy, and that half-maps with independent errors are available in electron cryo-microscopy. These differences lead to a distinct approach for combination of information from starting maps with information obtained in the density modification process. The applicability of density modification theory to electron cryo-microscopy was evaluated using half-maps for apoferritin at a resolution of 3.1 Å and a matched 1.8 Å reference map. Error estimates for the map obtained by density modification were found to closely agree with true errors as estimated by comparison with the reference map. The density modification procedure was applied to a set of 104 datasets where half-maps, a full map and a model all had been deposited. The procedure improved map-model correlation and increased the visibility of details in the maps. The procedure requires two unmasked half-maps and a sequence file or other source of information on the volume of the macromolecule that has been imaged.


2012 ◽  
Vol 68 (12) ◽  
pp. 1622-1631 ◽  
Author(s):  
Jaclyn Bibby ◽  
Ronan M. Keegan ◽  
Olga Mayans ◽  
Martyn D. Winn ◽  
Daniel J. Rigden

Protein ab initio models predicted from sequence data alone can enable the elucidation of crystal structures by molecular replacement. However, the calculation of such ab initio models is typically computationally expensive. Here, a computational pipeline based on the clustering and truncation of cheaply obtained ab initio models for the preparation of structure ensembles is described. Clustering is used to select models and to quantitatively predict their local accuracy, allowing rational truncation of predicted inaccurate regions. The resulting ensembles, with or without rapidly added side chains, solved 43% of all test cases, with an 80% success rate for all-α proteins. A program implementing this approach, AMPLE, is included in the CCP4 suite of programs. It requires only the input of a FASTA sequence file and a diffraction data file. It carries out the modelling using locally installed Rosetta, creates search ensembles and automatically performs molecular replacement and model rebuilding.
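The truncation step can be sketched as follows: residues whose positions vary most across a cluster of models are predicted to be inaccurate and are dropped before molecular replacement. This is an illustrative toy with 1-D coordinates and an invented variance threshold, not AMPLE's actual ensemble-preparation code.

```python
from statistics import pvariance

# Illustrative sketch of AMPLE-style truncation: keep the residues whose
# positions agree most across a cluster of ab initio models. The "models"
# here are invented 1-D per-residue coordinates, not real structures.

def truncate_by_variance(models, keep_fraction=0.5):
    """models: list of equal-length coordinate lists (one per model).
    Return indices of the keep_fraction of residues with the lowest
    cross-model positional variance, in sequence order."""
    n_res = len(models[0])
    variances = [pvariance([m[i] for m in models]) for i in range(n_res)]
    n_keep = max(1, int(n_res * keep_fraction))
    lowest = sorted(range(n_res), key=lambda i: variances[i])[:n_keep]
    return sorted(lowest)

models = [
    [1.0, 2.0, 9.0, 4.0],
    [1.0, 2.5, 3.0, 4.2],
    [1.0, 1.5, 6.0, 3.8],
]
print(truncate_by_variance(models))  # [0, 3] -- the two most consistent residues
```

Ranking by cross-model variance is the quantitative stand-in for "predicted local accuracy": regions where cheap models disagree are exactly the ones unlikely to help a molecular-replacement search.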


2004 ◽  
Vol 48 (1) ◽  
pp. 159-182 ◽  
Author(s):  
Kenneth A. Koch ◽  
Dennis G. Brave
