Biological Data Classification and Analysis Using Convolutional Neural Network

2020 ◽  
Vol 10 (10) ◽  
pp. 2459-2465
Author(s):  
Iftikhar Ahmad ◽  
Muhammad Javed Iqbal ◽  
Mohammad Basheri

The volume of data gathered from ongoing biological and clinical studies is increasing at an exponential rate. This biological data mainly comprises DNA gene sequences, proteins, and a variety of proteomics and genetic disease data. Additionally, DNA microarray data is available for the early diagnosis and prediction of various types of cancer. This data may hold vital information about genes, their structure, and their biological function. The huge volume and constant growth of the extracted biological data have opened several challenges. Many bioinformatics and machine learning models have been developed, but they fail to address key challenges in the efficient and accurate analysis of complex biological data, such as data on genetic diseases. Reliably and robustly classifying the extracted data into classes based on the information hidden in the sample data also remains an open problem. This research work focuses on overcoming major challenges in accurate protein classification. Motivated by the success of deep learning models in natural language processing, it treats protein sequences as a language. The learning ability and overall classification performance of the proposed system can be validated with deep learning classification models. The proposed system can classify the mentioned datasets more accurately than previous approaches. In-depth analysis of multifaceted biological data may also help in the early diagnosis of diseases caused by gene mutations and in overcoming emerging challenges in the development of large-scale healthcare systems.
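The idea of treating protein sequences as a language can be illustrated with a small sketch. The abstract does not state how sequences are tokenized, so everything here is an assumption for illustration: overlapping 3-mers serve as "words", the function names are hypothetical, and the two toy sequences are made up. The resulting integer ids are the kind of input an embedding layer followed by a convolutional network would typically consume.

```python
from typing import Dict, List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers ("words")."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(sequences: List[str], k: int = 3) -> Dict[str, int]:
    """Map every k-mer seen in the corpus to an integer id; id 0 is reserved."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for token in kmer_tokens(seq, k):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(sequence: str, vocab: Dict[str, int], k: int = 3, length: int = 8) -> List[int]:
    """Encode a sequence as a fixed-length id list.

    Unknown k-mers and padding positions both map to the reserved id 0.
    """
    ids = [vocab.get(tok, 0) for tok in kmer_tokens(sequence, k)][:length]
    return ids + [0] * (length - len(ids))


# Toy sequences invented for this example (not taken from the paper).
toy_sequences = ["MKTAYIAK", "MKTLLVAG"]
vocab = build_vocab(toy_sequences)
encoded = [encode(s, vocab) for s in toy_sequences]
```

Both toy sequences begin with the same 3-mer, so their encodings share the first id; this shared-vocabulary property is what lets a language-style model pick up recurring sequence motifs.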


Author(s):  
Wenjia Cai ◽  
Jie Xu ◽  
Ke Wang ◽  
Xiaohong Liu ◽  
Wenqin Xu ◽  
...  

Abstract Anterior segment eye diseases account for a significant proportion of presentations to eye clinics worldwide, including diseases associated with corneal pathologies, anterior chamber abnormalities (e.g. blood or inflammation) and lens diseases. The construction of an automatic tool for the segmentation of anterior segment eye lesions will greatly improve the efficiency of clinical care. With research on artificial intelligence progressing in recent years, deep learning models have shown their superiority in image classification and segmentation. Training and evaluating deep learning models requires a large amount of expertly annotated data; however, such data are relatively scarce in the domain of medicine. Herein, the authors developed a new medical image annotation system, called EyeHealer. It is a large-scale anterior eye segment dataset with both eye structures and lesions annotated at the pixel level. Comprehensive experiments were conducted to verify its performance in disease classification and eye lesion segmentation. The results showed that semantic segmentation models outperformed medical segmentation models. This paper describes the establishment of the system for automated classification and segmentation tasks. The dataset will be made publicly available to encourage future research in this area.


2019 ◽  
Author(s):  
Mojtaba Haghighatlari ◽  
Gaurav Vishwakarma ◽  
Mohammad Atif Faiz Afzal ◽  
Johannes Hachmann

We present a multitask, physics-infused deep learning model to accurately and efficiently predict refractive indices (RIs) of organic molecules, and we apply it to a library of 1.5 million compounds. We show that it outperforms earlier machine learning models by a significant margin, and that incorporating known physics into data-derived models provides valuable guardrails. Using a transfer learning approach, we augment the model to reproduce results consistent with higher-level computational chemistry training data, but with a considerably reduced number of corresponding calculations. Prediction errors of machine learning models are typically smallest for commonly observed target property values, consistent with the distribution of the training data. However, since our goal is to identify candidates with unusually large RI values, we propose a strategy to boost the performance of our model in the remoter areas of the RI distribution: We bias the model with respect to the under-represented classes of molecules that have values in the high-RI regime. By adopting a metric popular in web search engines, we evaluate our effectiveness in ranking top candidates. We confirm that the models developed in this study can reliably predict the RIs of the top 1,000 compounds, and are thus able to capture their ranking. We believe that this is the first study to develop a data-derived model that ensures the reliability of RI predictions by model augmentation in the extrapolation region on such a large scale. These results underscore the tremendous potential of machine learning in facilitating molecular (hyper)screening approaches on a massive scale and in accelerating the discovery of new compounds and materials, such as organic molecules with high-RI for applications in opto-electronics.
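The abstract does not name the search-engine metric it adopts. As an illustration only, the sketch below uses normalized discounted cumulative gain (NDCG), one widely used ranking metric from web search, and assumes the true RI value itself serves as the relevance score. The function names and the toy RI values are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: items lower in the ranking contribute less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(true_values: Sequence[float], predicted_values: Sequence[float], k: int) -> float:
    """NDCG@k, ranking by predicted value and scoring by true value.

    Assumes non-negative true values, which holds for refractive indices.
    """
    # Indices of the candidates sorted by predicted value, best first.
    order = sorted(range(len(predicted_values)), key=lambda i: predicted_values[i], reverse=True)
    ranked_true = [true_values[i] for i in order[:k]]
    # Best achievable ordering: sort by the true values themselves.
    ideal_true = sorted(true_values, reverse=True)[:k]
    ideal = dcg(ideal_true)
    return dcg(ranked_true) / ideal if ideal > 0 else 0.0


# Made-up RI values for five hypothetical molecules.
true_ri = [1.45, 1.80, 1.62, 1.95, 1.50]
predicted_ri = [1.48, 1.75, 1.70, 1.90, 1.52]
score = ndcg_at_k(true_ri, predicted_ri, k=3)
```

In this toy case the predictions are numerically off but preserve the order of the top three molecules, so the score is 1.0. That is the property a ranking metric rewards when the goal is to shortlist high-RI candidates rather than to minimize average error.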


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Juncai Li ◽  
Xiaofei Jiang

Molecular property prediction is an essential task in drug discovery. Most deep learning approaches focus either on designing novel molecular representations or on combining them with advanced models. However, researchers pay less attention to the potential benefits of massive unlabeled molecular data (e.g., ZINC). The task is made more challenging by the limited scale of labeled data. Recent advances in pretrained models for natural language processing motivate viewing a drug molecule, to some extent, as language. In this paper, we investigate how to adapt the pretrained BERT model to extract useful molecular substructure information for molecular property prediction. We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with a pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale prediction BERT model is pretrained to generate embeddings of molecular substructures using four million unlabeled drug SMILES (i.e., ZINC 15 and ChEMBL 27). The pretrained BERT model can then be fine-tuned on various molecular property prediction tasks. To examine the performance of the proposed Mol-BERT, we conduct several experiments on four widely used molecular datasets. Compared with traditional and state-of-the-art baselines, the results show that Mol-BERT can outperform current sequence-based methods and achieve an improvement of at least 2% in ROC-AUC score on the Tox21, SIDER, and ClinTox datasets.


2021 ◽  
Author(s):  
Jaydip Sen ◽  
Sidra Mehtab ◽  
Gourab Nath

Prediction of the future movement of stock prices has been the subject of many research works. On one hand, proponents of the Efficient Market Hypothesis claim that stock prices cannot be predicted; on the other hand, there are propositions illustrating that, if appropriately modeled, stock prices can be predicted with a high level of accuracy. There is also a gamut of literature on technical analysis of stock prices, where the objective is to identify patterns in stock price movements and profit from them. In this work, we propose a hybrid approach for stock price prediction using five deep learning-based regression models. We select the NIFTY 50 index values of the National Stock Exchange (NSE) of India over the period from December 29, 2014 to July 31, 2020. Based on the NIFTY data from December 29, 2014 to December 28, 2018, we build two regression models using convolutional neural networks (CNNs) and three regression models using long short-term memory (LSTM) networks for predicting the open values of the NIFTY 50 index records for the period from December 31, 2018 to July 31, 2020. We adopt a multi-step prediction technique with walk-forward validation. The parameters of the five deep learning models are optimized using the grid-search technique so that the validation losses of the models stabilize with an increasing number of epochs in the model training, and the training and validation accuracies converge. Extensive results are presented on various metrics for all the proposed regression models. The results indicate that while both CNN and LSTM-based regression models are very accurate in forecasting the NIFTY 50 open values, the CNN model that uses the previous one week’s data as the input is the fastest in its execution. On the other hand, the encoder-decoder convolutional LSTM model that uses the previous two weeks’ data as the input is found to be the most accurate in its forecasting results.
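Multi-step prediction with walk-forward validation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a persistence forecaster stands in for the CNN and LSTM models, the horizon of two steps and the toy values are made up, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Fold = Tuple[List[float], List[float]]  # (predicted, actual) for one forecast window


def walk_forward(
    series: Sequence[float],
    n_train: int,
    horizon: int,
    forecast: Callable[[Sequence[float], int], List[float]],
) -> List[Fold]:
    """Forecast `horizon` steps, then reveal the true values and move forward."""
    history = list(series[:n_train])
    results: List[Fold] = []
    for start in range(n_train, len(series) - horizon + 1, horizon):
        # The forecast is made from data strictly before the window.
        predicted = forecast(history, horizon)
        actual = list(series[start:start + horizon])
        results.append((predicted, actual))
        # Only now are the true values added to the history for the next window.
        history.extend(actual)
    return results


def persistence_forecast(history: Sequence[float], horizon: int) -> List[float]:
    """Placeholder model: repeat the last observed value for every step."""
    return [history[-1]] * horizon


def rmse(results: List[Fold]) -> float:
    """Root mean squared error pooled over all forecast windows."""
    errors = [(p - a) ** 2 for pred, act in results for p, a in zip(pred, act)]
    return (sum(errors) / len(errors)) ** 0.5


# Made-up "open" values; a real run would use the index data described above.
toy_open_values = [100.0, 101.0, 103.0, 102.0, 104.0, 107.0, 106.0, 108.0, 110.0, 109.0]
folds = walk_forward(toy_open_values, n_train=4, horizon=2, forecast=persistence_forecast)
error = rmse(folds)
```

Swapping `persistence_forecast` for a trained model's prediction function keeps the evaluation loop unchanged, which makes it straightforward to compare several models on identical forecast windows.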


Author(s):  
Janjanam Prabhudas ◽  
C. H. Pradeep Reddy

The enormous increase in available information, together with the growing computational abilities of machines, has enabled innovative applications in natural language processing through machine learning models. This chapter surveys trends in natural language processing that employ machine learning models in the context of text summarization. It is organized to help researchers understand the technical perspectives on feature representation and the associated models to consider before applying them to language-oriented tasks. Further, the chapter reviews the primary deep learning models, their applications, and their performance in the context of language processing. The primary focus of the chapter is to illustrate the technical research findings and gaps in deep learning-based text summarization, along with state-of-the-art deep learning models for text summarization.


Author(s):  
Yilin Yan ◽  
Jonathan Chen ◽  
Mei-Ling Shyu

Stance detection is an important research direction that attempts to automatically determine the attitude (positive, negative, or neutral) of the author of a text (such as a tweet) towards a target. A number of frameworks have been proposed using deep learning techniques that show promising results in application domains such as automatic speech recognition and computer vision, as well as natural language processing (NLP). This article presents a novel deep learning-based fast stance detection framework for bipolar affinities on Twitter. Millions of tweets regarding Clinton and Trump were produced per day on Twitter during the 2016 United States presidential election campaign, and the campaign is therefore used as a test case because of its significant and unique counter-factual properties. In addition, stance detection can be used to infer the political tendency of the general public. Experimental results show that the proposed framework achieves high accuracy when compared to several existing stance detection methods.


2018 ◽  
Vol 7 (2.14) ◽  
pp. 5726
Author(s):  
Oumaima Hourrane ◽  
El Habib Benlahmar ◽  
Ahmed Zellou

Sentiment analysis is one of the newer and more engaging areas of natural language processing, having emerged alongside community sites on the web. Taking advantage of the amount of information now available, research and industry have been seeking ways to automatically analyze the sentiments expressed in texts. The challenges for this task are the ambiguity of human language and the lack of labeled data. To address these issues, sentiment analysis has been combined with deep learning, since deep learning models are effective due to their automatic learning capability. In this paper, we provide a comparative study on the IMDB movie review dataset. We compare word embeddings and several deep learning models on sentiment analysis and give broad empirical outcomes for those keen on taking advantage of deep learning for sentiment analysis in real-world settings.


2020 ◽  
Vol 34 (7) ◽  
pp. 717-730 ◽  
Author(s):  
Matthew C. Robinson ◽  
Robert C. Glen ◽  
Alpha A. Lee

Abstract Machine learning methods may have the potential to significantly accelerate drug discovery. However, the increasing rate of new methodological approaches being published in the literature raises the fundamental question of how models should be benchmarked and validated. We reanalyze the data generated by a recently published large-scale comparison of machine learning models for bioactivity prediction and arrive at a somewhat different conclusion. We show that the performance of support vector machines is competitive with that of deep learning methods. Additionally, using a series of numerical experiments, we question the relevance of area under the receiver operating characteristic curve as a metric in virtual screening. We further suggest that area under the precision–recall curve should be used in conjunction with the receiver operating characteristic curve. Our numerical experiments also highlight challenges in estimating the uncertainty in model performance via scaffold-split nested cross validation.
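The point about ROC curves versus precision-recall curves can be made concrete with a toy virtual screen. The sketch below is an illustration under assumptions, not the authors' experiment: the data are invented (2 actives among 100 compounds), the function names are hypothetical, and average precision is used as a common summary of the precision-recall curve.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random active outscores a random inactive (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each active."""
    pairs = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    ranked: List[int] = [y for _, y in pairs]
    hits, total = 0, 0.0
    for rank, y in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Invented screen: 100 compounds scored 100 down to 1, actives at ranks 5 and 10.
n_compounds = 100
scores = [float(n_compounds - i) for i in range(n_compounds)]
labels = [1 if rank in (5, 10) else 0 for rank in range(1, n_compounds + 1)]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

Here the ROC-AUC is about 0.94 while the average precision is only 0.2: the ranking looks excellent by one metric even though four of the top five compounds are inactive. With heavily imbalanced data, as is typical in virtual screening, reporting both gives a more complete picture.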

