scholarly journals AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome

2021 ◽  
Author(s):  
Toshimichi Ikemura ◽  
Yuki Iwasaki ◽  
Kennosuke Wada ◽  
Yoshiko Wada ◽  
Takashi Abe
2021 ◽  
Author(s):  
Toshimichi Ikemura ◽  
Yuki Iwasaki ◽  
Kennosuke Wada ◽  
Yoshiko Wada ◽  
Takashi Abe

In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and the use of artificial intelligence (AI) suitable for big data analysis has become increasingly important. Unsupervised AI that can reveal novel knowledge from big data without prior knowledge or particular models is highly desirable for analyses of genome sequences, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain the data mining by the BLSOM: unsupervised and explainable AI. As a specific target, we first selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of the viral genome sequences have been accumulated via worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4~6-mers) allowed separation into known clades, but longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. In the case of 15-mers, there is mostly one copy in the genome; thus, 15-mers appeared after the epidemic start could be connected to mutations. Because BLSOM is an explainable AI, BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explained BLSOMs for various topics. The tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explained BLSOMs for various eukaryotes, such as fishes, frogs and Drosophila species, and found a high separation ability among closely related species. When analyzing the human genome, we found evident enrichments in transcription factor-binding sequences (TFBSs) in centromeric and pericentromeric heterochromatin regions. The tDNAs (tRNA genes) were separated by the corresponding amino acid.


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Chao-Hsin Chen ◽  
Chao-Yu Pan ◽  
Wen-chang Lin

Abstract The completion of human genome sequences and the advancement of next-generation sequencing technologies have engendered a clear understanding of all human genes. Overlapping genes are usually observed in compact genomes, such as those of bacteria and viruses. Notably, overlapping protein-coding genes do exist in human genome sequences. Accordingly, we used the current Ensembl gene annotations to identify overlapping human protein-coding genes. We analysed 19,200 well-annotated protein-coding genes and determined that 4,951 protein-coding genes overlapped with their adjacent genes. Approximately a quarter of all human protein-coding genes were overlapping genes. We observed different clusters of overlapping protein-coding genes, ranging from two genes (paired overlapping genes) to 22 genes. We also divided the paired overlapping protein-coding gene groups into four subtypes. We found that the divergent overlapping gene subtype had a stronger expression association than did the subtypes of 5ʹ-tandem overlapping and 3ʹ-tandem overlapping genes. The majority of paired overlapping genes exhibited comparable coincidental tissue expression profiles; however, a few overlapping gene pairs displayed distinctive tissue expression association patterns. In summary, we have carefully examined the genomic features and distributions about human overlapping protein-coding genes and found coincidental expression in tissues for most overlapping protein-coding genes.


2013 ◽  
Vol 11 (2) ◽  
pp. 77-85 ◽  
Author(s):  
Ben C. Shirley ◽  
Eliseos J. Mucaki ◽  
Tyson Whitehead ◽  
Paul I. Costea ◽  
Pelin Akan ◽  
...  

2016 ◽  
Vol 48 (1) ◽  
Author(s):  
Mekki Boussaha ◽  
Pauline Michot ◽  
Rabia Letaief ◽  
Chris Hozé ◽  
Sébastien Fritz ◽  
...  

2010 ◽  
Vol 11 (8) ◽  
pp. R88 ◽  
Author(s):  
Martin G Reese ◽  
Barry Moore ◽  
Colin Batchelor ◽  
Fidel Salas ◽  
Fiona Cunningham ◽  
...  

2020 ◽  
Author(s):  
Inácio Gomes Medeiros ◽  
André Salim Khayat ◽  
Beatriz Stransky ◽  
Sidney Emanuel Batista dos Santos ◽  
Paulo Pimentel de Assumpção ◽  
...  

Abstract This protocol aims to describe the building of a database of SARS-CoV-2 targets for siRNA approaches. Starting from the virus reference genome, we will derive sequences from 18 to 21nt-long and verify their similarity against the human genome and coding and non-coding transcriptome, as well as genomes from related viruses. We will also calculate a set of thermodynamic features for those sequences and will infer their efficiencies using three different predictors. The protocol has two main phases: at first, we align sequences against reference genomes. In the second one, we extract the features. The first phase varies in terms of duration, depending on computational power from the running machine and the number of reference genomes. Despite that, the second phase lasts about thirty minutes of execution, also depending on the number of cores of running machine. The constructed database aims to speed the design process by providing a broad set of possible SARS-CoV-2 sequences targets and siRNA sequences.


2017 ◽  
Vol 26 (16) ◽  
pp. 4145-4157 ◽  
Author(s):  
Michael D. Martin ◽  
Flora Jay ◽  
Sergi Castellano ◽  
Montgomery Slatkin

Sign in / Sign up

Export Citation Format

Share Document