Design and development of learning model for compression and processing of deoxyribonucleic acid genome sequence

Author(s):  
Raveendra Gudodagi ◽  
Rayapur Venkata Siva Reddy ◽  
Mohammed Riyaz Ahmed

Owing to the substantial volume of human genome sequence data files (ranging from 30 to 200 GB), genomic data compression has received considerable traction, and storage costs are one of the major problems faced by genomics laboratories. This calls for modern data compression technology that reduces storage requirements without compromising the reliability of operation. There have been few attempts to solve this problem independently of both hardware and software. A systematic analysis of associations between genes provides techniques for recognizing operative connections among genes and their respective products, as well as insights into essential biological events that are most important for understanding health and disease phenotypes. This research proposes a reliable and efficient deep learning system for learning embedded projections that combine gene interactions and gene expression for prediction, and compares the deep embeddings to strong baselines. In this paper we perform data processing operations, predict gene function, reconstruct the gene ontology, and predict gene interactions. The three major steps of genomic data compression are extraction of data, storage of data, and retrieval of the data. Hence, we propose a deep learning approach based on computational optimization techniques that is efficient in all three stages of data compression.
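The abstract does not include an implementation. As a hedged illustration of the embedding idea it describes (combining gene-interaction features and gene-expression profiles through learned projections for gene-function prediction), the sketch below uses PyTorch; the module names, layer sizes, and feature dimensions are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class GeneEmbeddingNet(nn.Module):
    """Toy model: project expression profiles and interaction-graph features
    into a shared embedding, then predict gene-function labels.
    All sizes are illustrative assumptions, not the paper's architecture."""
    def __init__(self, n_expr_features, n_interaction_features, embed_dim=64, n_functions=10):
        super().__init__()
        self.expr_proj = nn.Sequential(nn.Linear(n_expr_features, embed_dim), nn.ReLU())
        self.inter_proj = nn.Sequential(nn.Linear(n_interaction_features, embed_dim), nn.ReLU())
        self.head = nn.Linear(2 * embed_dim, n_functions)

    def forward(self, expr, inter):
        # Concatenate the two learned projections and score each function class.
        z = torch.cat([self.expr_proj(expr), self.inter_proj(inter)], dim=-1)
        return self.head(z)

# Usage with random stand-in data (8 genes, 100 expression features, 50 interaction features).
model = GeneEmbeddingNet(n_expr_features=100, n_interaction_features=50)
logits = model(torch.randn(8, 100), torch.randn(8, 50))
print(logits.shape)  # torch.Size([8, 10])
```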

2018 ◽  
Vol 49 (1) ◽  
pp. 433-456 ◽  
Author(s):  
Annabel C. Beichman ◽  
Emilia Huerta-Sanchez ◽  
Kirk E. Lohmueller

Genome sequence data are now being routinely obtained from many nonmodel organisms. These data contain a wealth of information about the demographic history of the populations from which they originate. Many sophisticated statistical inference procedures have been developed to infer the demographic history of populations from this type of genomic data. In this review, we discuss the different statistical methods available for inference of demography, providing an overview of the underlying theory and logic behind each approach. We also discuss the types of data required and the pros and cons of each method. We then discuss how these methods have been applied to a variety of nonmodel organisms. We conclude by presenting some recommendations for researchers looking to use genomic data to infer demographic history.
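Many of the inference methods surveyed in such reviews summarize genomic variation with the site frequency spectrum (SFS). As a small, hedged illustration (not tied to any particular method discussed), the snippet below tallies a folded SFS from a toy genotype matrix.

```python
import numpy as np

def folded_sfs(genotypes):
    """Folded site frequency spectrum from a (sites x haplotypes) 0/1 matrix.
    Illustrative only; real pipelines parse VCFs and handle missing data."""
    n = genotypes.shape[1]                      # number of haplotypes sampled
    counts = genotypes.sum(axis=1)              # derived-allele count per site
    minor = np.minimum(counts, n - counts)      # fold: minor-allele count
    sfs = np.bincount(minor, minlength=n // 2 + 1)
    return sfs[1:]                              # drop the monomorphic class

rng = np.random.default_rng(0)
haps = (rng.random((1000, 20)) < 0.1).astype(int)   # toy data: 1000 sites, 20 haplotypes
print(folded_sfs(haps))
```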


2019 ◽  
Vol 2 (3) ◽  
pp. 39
Author(s):  
Simon Meyer Lauritsen ◽  
Mads Ellersgaard Kalør ◽  
Emil Lund Kongsgaard ◽  
Bo Thiesson

Background: Sepsis is a clinical condition involving an extreme inflammatory response to an infection, and is associated with high morbidity and mortality. Without intervention, this response can progress to septic shock, organ failure and death. Every hour that treatment is delayed, mortality increases. Early identification of sepsis is therefore important for a positive outcome. Methods: We constructed predictive models for sepsis detection and performed a register-based cohort study on patients from four Danish municipalities. We used event sequences of raw electronic health record (EHR) data from 2013 to 2017, where each event consists of three elements: a timestamp, an event category (e.g. medication code), and a value. In total, we consider 25,622 positive (SIRS criteria) sequences and 25,622 negative sequences with a total of 112 million events distributed across 64 different hospital units. The number of potential predictor variables in raw EHR data easily exceeds 10,000 and can be challenging for predictive modeling due to this large volume of sparse, heterogeneous events. Traditional approaches have dealt with this complexity by curating a limited number of variables of importance; a labor-intensive process that may discard a vast majority of the information. In contrast, we consider a deep learning system constructed as a combination of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. Importantly, our system learns representations of the key factors and interactions from the raw event sequence data itself. Results: Our model predicts sepsis with an AUROC score of 0.8678 at 11 hours before actual treatment was started, outperforming all currently deployed approaches. At other prediction times, the model yields the following AUROC scores: 15 min: 0.9058, 3 hours: 0.8803, 24 hours: 0.8073. Conclusion: We have presented a novel approach for early detection of sepsis that has more true positives and fewer false negatives than existing alarm systems, without introducing domain knowledge into the model. Importantly, the model does not require changes in the daily workflow of healthcare professionals at hospitals, as it is based on data that are routinely captured in the EHR. This also enables real-time prediction, as healthcare professionals enter the raw events in the EHR.
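The CNN-plus-LSTM pattern described above can be sketched as follows; this is a minimal PyTorch illustration in which the vocabulary size, embedding dimension, filter count, and kernel width are arbitrary assumptions, not the study's configuration.

```python
import torch
import torch.nn as nn

class EventCNNLSTM(nn.Module):
    """Toy CNN+LSTM over embedded EHR event sequences.
    Dimensions are illustrative assumptions, not the paper's settings."""
    def __init__(self, n_event_codes=10000, embed_dim=32, conv_channels=64, lstm_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_event_codes, embed_dim)
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(conv_channels, lstm_hidden, batch_first=True)
        self.out = nn.Linear(lstm_hidden, 1)  # one sepsis logit per sequence

    def forward(self, event_ids):
        x = self.embed(event_ids)             # (batch, time, embed_dim)
        x = self.conv(x.transpose(1, 2))      # Conv1d expects (batch, channels, time)
        x = torch.relu(x).transpose(1, 2)     # back to (batch, time, channels)
        _, (h, _) = self.lstm(x)              # h: (num_layers, batch, hidden)
        return self.out(h[-1])

model = EventCNNLSTM()
logits = model(torch.randint(0, 10000, (4, 200)))  # 4 sequences of 200 events
print(torch.sigmoid(logits).shape)                 # torch.Size([4, 1])
```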


2019 ◽  
Vol 14 (2) ◽  
pp. 123-129
Author(s):  
Muhammad Sardaraz ◽  
Muhammad Tahir

Background: Biological sequence data have increased at a rapid rate due to advancements in sequencing technologies and a reduction in the cost of sequencing. The huge increase in these data presents significant challenges to researchers. In addition to meaningful analysis, data storage is also a challenge: the increase in data production is outpacing storage capacity. Data compression is used to reduce the size of data and thus reduces storage requirements as well as transmission cost over the internet. Objective: This article presents a novel compression algorithm (FCompress) for Next Generation Sequencing (NGS) data in FASTQ format. Method: The proposed algorithm uses bit manipulation and dictionary-based compression for base compression. Headers are compressed with reference-based compression, whereas quality scores are compressed with Huffman coding. Results: The proposed algorithm is validated with experimental results on real datasets. The results are compared with both general-purpose and specialized compression programs. Conclusion: The proposed algorithm produces a better compression ratio in a time comparable to that of other algorithms.
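The details of FCompress are not reproduced in the abstract. As a hedged illustration of one ingredient it mentions, bit-level packing of bases, the snippet below stores A/C/G/T in 2 bits each; the function names are hypothetical and this is not the FCompress algorithm itself.

```python
# Illustrative 2-bit packing of DNA bases; NOT the FCompress algorithm.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def pack_bases(seq):
    """Pack an ACGT string into bytes, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))   # left-align a partial final chunk
        out.append(byte)
    return bytes(out)

def unpack_bases(data, n_bases):
    """Inverse of pack_bases for a sequence of known length."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])

seq = "ACGTACGTGG"
packed = pack_bases(seq)
print(len(seq), "bases ->", len(packed), "bytes;", unpack_bases(packed, len(seq)) == seq)
```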


2021 ◽  
Author(s):  
Dmitrii Kriukov ◽  
Nikita Koritskiy ◽  
Igor Kozlovskii ◽  
Mark Zaretckii ◽  
Mariia Bazarevich ◽  
...  

The increasing interest in chromatin conformation inside the nucleus and the availability of genome-wide experimental data make it possible to develop computational methods that can increase the quality of the data and thus overcome the limitations of high experimental costs. Here we develop a deep-learning approach for increasing Hi-C data resolution by incorporating additional information about the genome sequence. In this approach, we utilize two different deep-learning algorithms: an image-to-image model, which enhances Hi-C resolution by itself, and a sequence-to-image model, which uses additional information about the underlying genome sequence for further resolution improvement. Both models are combined with a simple head model that provides a more accurate enhancement of the initial low-resolution Hi-C data. The code is freely available in a GitHub repository: https://github.com/koritsky/DL2021 HI-C
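The authors' repository contains the actual models; purely as a conceptual sketch of the "head" that fuses the two branches, the snippet below combines an enhanced contact map and a sequence-predicted map with a small convolutional head. Channel counts and kernel sizes are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Toy head fusing an enhanced Hi-C map (image-to-image branch) with a map
    predicted from sequence (sequence-to-image branch). Sizes are assumptions."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, hic_enhanced, seq_predicted):
        # Stack the two single-channel contact maps and predict the refined map.
        x = torch.cat([hic_enhanced, seq_predicted], dim=1)
        return self.fuse(x)

head = FusionHead()
refined = head(torch.rand(1, 1, 128, 128), torch.rand(1, 1, 128, 128))
print(refined.shape)  # torch.Size([1, 1, 128, 128])
```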


2020 ◽  
pp. bjophthalmol-2020-317825
Author(s):  
Yonghao Li ◽  
Weibo Feng ◽  
Xiujuan Zhao ◽  
Bingqian Liu ◽  
Yan Zhang ◽  
...  

Background/aims: To apply deep learning technology to develop an artificial intelligence (AI) system that can identify vision-threatening conditions in high myopia patients based on optical coherence tomography (OCT) macular images. Methods: In this cross-sectional, prospective study, a total of 5505 qualified OCT macular images obtained from 1048 high myopia patients admitted to Zhongshan Ophthalmic Centre (ZOC) from 2012 to 2017 were selected for the development of the AI system. The independent test dataset included 412 images obtained from 91 high myopia patients recruited at ZOC from January 2019 to May 2019. We adopted the InceptionResnetV2 architecture to train four independent convolutional neural network (CNN) models to identify the following four vision-threatening conditions in high myopia: retinoschisis, macular hole, retinal detachment and pathological myopic choroidal neovascularisation. Focal Loss was used to address class imbalance, and optimal operating thresholds were determined according to the Youden Index. Results: In the independent test dataset, the areas under the receiver operating characteristic curves were high for all conditions (0.961 to 0.999). Our AI system achieved sensitivities equal to or even better than those of retina specialists as well as high specificities (greater than 90%). Moreover, our AI system provided a transparent and interpretable diagnosis with heatmaps. Conclusions: We used OCT macular images for the development of CNN models to identify vision-threatening conditions in high myopia patients. Our models achieved reliable sensitivities and high specificities comparable to those of retina specialists and may be applied for large-scale high myopia screening and patient follow-up.
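The abstract notes that operating thresholds were chosen by the Youden Index (J = sensitivity + specificity − 1). A minimal sketch of that selection step, using scikit-learn and random stand-in scores rather than the study's data, is shown below.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                      # equivalent to sensitivity + specificity - 1
    return thresholds[np.argmax(j)]

# Toy example: random scores standing in for CNN output probabilities.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = np.clip(labels * 0.3 + rng.random(200) * 0.7, 0, 1)
print("Operating threshold:", youden_threshold(labels, scores))
```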


Author(s):  
Amnon Koren ◽  
Dashiell J Massey ◽  
Alexa N Bracci

Motivation: Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results: We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and implementation: TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information: Supplementary data are available at Bioinformatics online.
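TIGER itself is available at the repository above; the snippet below is only a schematic illustration of the coverage-based idea the abstract describes: bin read depth along a chromosome, normalize to the mean, and smooth, so that relatively over-represented (early-replicating) regions stand out. Bin size, smoothing window, and the toy data are assumptions.

```python
import numpy as np

def coverage_profile(read_positions, chrom_length, bin_size=100_000):
    """Schematic coverage-based replication-timing profile.
    NOT the TIGER implementation; see the authors' repository for that."""
    bins = np.arange(0, chrom_length + bin_size, bin_size)
    depth, _ = np.histogram(read_positions, bins=bins)
    ratio = depth / depth.mean()                    # relative copy number per bin
    kernel = np.ones(5) / 5                         # simple moving-average smoothing
    return np.convolve(ratio, kernel, mode="same")  # higher values ~ earlier replication

# Toy data: reads concentrated near the chromosome start mimic early-replicating loci.
rng = np.random.default_rng(1)
reads = rng.triangular(0, 0, 10_000_000, size=50_000)
print(coverage_profile(reads, 10_000_000)[:5])
```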


2020 ◽  
Vol 101 ◽  
pp. 209
Author(s):  
R. Baskaran ◽  
B. Ajay Rajasekaran ◽  
V. Rajinikanth
Keyword(s):  

Endoscopy ◽  
2020 ◽  
Author(s):  
Alanna Ebigbo ◽  
Robert Mendel ◽  
Tobias Rückert ◽  
Laurin Schuster ◽  
Andreas Probst ◽  
...  

Background and aims: The accurate differentiation between T1a and T1b Barrett's cancer has both therapeutic and prognostic implications but is challenging even for experienced physicians. We trained an artificial intelligence (AI) system based on deep artificial neural networks (deep learning) to differentiate between T1a and T1b Barrett's cancer on white-light images. Methods: Endoscopic images from three tertiary care centres in Germany were collected retrospectively. A deep learning system was trained and tested using the principles of cross-validation. A total of 230 white-light endoscopic images (108 T1a and 122 T1b) were evaluated with the AI system. For comparison, the images were also classified by experts specialized in endoscopic diagnosis and treatment of Barrett's cancer. Results: The sensitivity, specificity, F1 score and accuracy of the AI system in differentiating between T1a and T1b cancer lesions were 0.77, 0.64, 0.73 and 0.71, respectively. There was no statistically significant difference between the performance of the AI system and that of the human experts, whose sensitivity, specificity, F1 score and accuracy were 0.63, 0.78, 0.67 and 0.70, respectively. Conclusion: This pilot study demonstrates the first multicentre application of an AI-based system for predicting submucosal invasion in endoscopic images of Barrett's cancer. The AI system scored on par with international experts in the field, but more work is necessary to improve the system and apply it to video sequences and in a real-life setting. Nevertheless, the correct prediction of submucosal invasion in Barrett's cancer remains challenging for both experts and AI.
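For readers less familiar with the reported metrics, the snippet below shows how sensitivity, specificity, and F1 are derived from binary confusion-matrix counts; the counts used are illustrative stand-ins, not the study's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall for the positive class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

# Illustrative counts only (not taken from the paper).
print(binary_metrics(tp=80, fp=30, tn=78, fn=42))
```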

