FASTC: a file format for multi‐character sequence data

Cladistics ◽  
2019 ◽  
Vol 35 (5) ◽  
pp. 573-575
Author(s):  
Ward C. Wheeler ◽  
Alexander J. Washburn
2020 ◽  
Author(s):  
Youri Hoogstrate ◽  
Guido Jenster ◽  
Harmen J. G. van de Werken

Abstract Background: The FASTA file format, used to store polymeric sequence data, has been a bioinformatics file standard for decades. The relatively large files require additional files, beyond the scope of the original format, to identify sequences and provide random access. Currently, multiple compressors have been developed to archive and restore FASTA files, but these lack direct access to targeted content or metadata of the archive. Moreover, these solutions are not directly backwards compatible with FASTA files, resulting in limited software integration. Results: We designed a Linux-based toolkit using Filesystem in Userspace (FUSE) that virtualises the content of DNA, RNA and protein FASTA archives into the filesystem. This guarantees in-sync virtualised metadata files and offers fast random-access decompression using Zstandard (zstd). The toolkit, FASTAFS, can track all system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy. Conclusions: FASTAFS is a user-friendly, easy-to-deploy, backwards-compatible, general-purpose solution to store and access compressed FASTA files, since it offers file system access to FASTA files as well as in-sync metadata files through file virtualisation. Using virtual filesystems as an in-between layer offers the possibility to design format conversion without the need to rewrite code into different languages while preserving compatibility. Code Availability: https://github.com/yhoogstrate/fastafs
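The "additional files" that the background refers to are typically side indexes that record where each sequence starts, so a reader can seek directly to it. The following is a minimal sketch of that idea, not code from FASTAFS; the function name `build_index` and the returned layout are assumptions made for illustration.

```python
import io


def build_index(handle):
    """Map each sequence name to (byte offset of its first base, base count).

    `handle` is a binary file-like object holding FASTA text.
    Hypothetical helper for illustration only.
    """
    index = {}
    name = None
    while True:
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            # The name is the first whitespace-delimited token of the header.
            fields = line[1:].split()
            name = fields[0].decode() if fields else ""
            # tell() now points at the first byte after the header line.
            index[name] = [handle.tell(), 0]
        elif name is not None:
            # Count bases only, ignoring the line terminator.
            index[name][1] += len(line.strip())
    return {key: tuple(value) for key, value in index.items()}


fasta = io.BytesIO(b">chr1 test\nACGT\nAC\n>chr2\nGG\n")
index = build_index(fasta)
```

With such an index, a reader can `seek()` to the stored offset instead of scanning the whole file. The index is a separate file that has to be regenerated whenever the FASTA file changes, which is the synchronisation problem the abstract says virtualised metadata files avoid.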


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Youri Hoogstrate ◽  
Guido W. Jenster ◽  
Harmen J. G. van de Werken

Abstract Background: The FASTA file format, used to store polymeric sequence data, has been a bioinformatics file standard for decades. The relatively large files require additional files, beyond the scope of the original format, to identify sequences and to provide random access. Multiple compressors have been developed to archive and restore FASTA files, but these lack direct access to targeted content or metadata of the archive. Moreover, these solutions are not directly backwards compatible with FASTA files, resulting in limited software integration. Results: We designed a Linux-based toolkit that virtualises the content of DNA, RNA and protein FASTA archives into the filesystem by using filesystem in userspace. This guarantees in-sync virtualised metadata files and offers fast random-access decompression using bit encodings plus Zstandard (zstd). The toolkit, FASTAFS, can track all its system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy. The file compression ratios were comparable but not superior to other state-of-the-art archival tools, despite the innovative random access feature implemented in FASTAFS. Conclusions: FASTAFS is a user-friendly, easy-to-deploy, backwards-compatible, general-purpose solution to store and access compressed FASTA files, since it offers file system access to FASTA files as well as in-sync metadata files through file virtualisation. Using virtual filesystems as an in-between layer offers format conversion without the need to rewrite code into different programming languages while preserving compatibility.
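To illustrate why a bit encoding makes random access cheap, here is a minimal sketch of a generic 2-bit nucleotide packing. It is an assumption-laden example, not the FASTAFS on-disk format: the code table, bit order, and function names are hypothetical, ambiguous bases such as N and lower-case masking are not handled, and the Zstandard stage is omitted entirely.

```python
# Hypothetical 2-bit code table: four bases fit in one byte.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an ACGT-only string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Base i lives in byte i // 4, at bit offset 2 * (i % 4).
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def base_at(packed, i):
    """Decode the base at position i without touching any other byte."""
    return BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11]


def unpack(packed, start, end):
    """Decode the half-open range [start, end) of bases."""
    return "".join(base_at(packed, i) for i in range(start, end))


packed = pack("ACGTAC")
region = unpack(packed, 2, 5)
```

Because every base occupies a fixed number of bits, the byte that holds position `i` follows from arithmetic alone, so a subsequence can be decoded without reading the data before it.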


2019 ◽  
Vol 44 (4) ◽  
pp. 930-942
Author(s):  
Geraldine A. Allen ◽  
Luc Brouillet ◽  
John C. Semple ◽  
Heidi J. Guest ◽  
Robert Underhill

Abstract—Doellingeria and Eucephalus form the earliest-diverging clade of the North American Astereae lineage. Phylogenetic analyses of both nuclear and plastid sequence data show that the Doellingeria-Eucephalus clade consists of two main subclades that differ from current circumscriptions of the two genera. Doellingeria is the sister group to E. elegans, and the Doellingeria + E. elegans subclade in turn is sister to the subclade containing all remaining species of Eucephalus. In the plastid phylogeny, the two subclades are deeply divergent, a pattern that is consistent with an ancient hybridization event involving ancestral species of the Doellingeria-Eucephalus clade and an ancestral taxon of a related North American or South American group. Divergence of the two Doellingeria-Eucephalus subclades may have occurred in association with northward migration from South American ancestors. We combine these two genera under the older of the two names, Doellingeria, and propose 12 new combinations (10 species and two varieties) for all species of Eucephalus.

