Local Epigenomic Data Are More Informative than Local Genome Sequence Data in Predicting Enhancer-Promoter Interactions Using Neural Networks

Genes ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 41 ◽  
Author(s):  
Mengli Xiao ◽  
Zhong Zhuang ◽  
Wei Pan

Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulation and for discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform better than, or almost as well as, using only epigenomic data. However, most, if not all, of these previous studies randomly split enhancer-promoter pairs into training, tuning, and test data, which has recently been pointed out to be problematic: because enhancers (and promoters) are duplicated or overlap across many enhancer-promoter pairs in EPI data, such random splitting does not yield independent training, tuning, and test sets, resulting in model over-fitting and over-estimated predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning methods.
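The design correction described above amounts to splitting by enhancer (or promoter) rather than by pair. A minimal sketch of that idea, using scikit-learn's group-aware splitting on a hypothetical EPI table (column names and toy data are assumptions, not the authors' pipeline):

```python
# Minimal sketch (not the authors' code): grouping enhancer-promoter pairs by
# enhancer so that duplicated/overlapping enhancers never span train and test.
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical EPI table: one row per enhancer-promoter pair.
pairs = pd.DataFrame({
    "enhancer_id": ["E1", "E1", "E2", "E3", "E3", "E4"],
    "promoter_id": ["P1", "P2", "P1", "P3", "P4", "P2"],
    "label":       [1, 0, 1, 0, 1, 0],          # interacting or not
})
X = np.arange(len(pairs)).reshape(-1, 1)        # placeholder for sequence/epigenomic features

# Random splitting would let E1 (or E3) fall in both train and test;
# splitting by group keeps all pairs sharing an enhancer together.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, pairs["label"], groups=pairs["enhancer_id"]))
print("train enhancers:", set(pairs.loc[train_idx, "enhancer_id"]))
print("test enhancers: ", set(pairs.loc[test_idx, "enhancer_id"]))
```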

Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 21
Author(s):  
Yury Rodimkov ◽  
Evgeny Efimenko ◽  
Valentin Volokitin ◽  
Elena Panova ◽  
Alexey Polovinkin ◽  
...  

When entering the phase of big data processing and statistical inference in experimental physics, the efficient use of machine learning methods may require optimal data preprocessing and, in particular, an optimal balance between detail and noise. In experimental studies of strong-field quantum electrodynamics with intense lasers, this balance concerns data binning for the observed distributions of particles and photons. Here we analyze the effect of binning on different machine learning methods (Support Vector Machine (SVM), Gradient Boosting Trees (GBT), Fully-Connected Neural Network (FCNN), and Convolutional Neural Network (CNN)) using numerical simulations that mimic the expected properties of upcoming experiments. We see that binning can crucially affect the performance of SVM and GBT and, to a lesser extent, of FCNN and CNN. This can be interpreted as the latter methods being able to effectively learn the optimal binning and discard unnecessary information. Nevertheless, given limited training sets, the results indicate that efficiency can be increased by optimizing the binning scale along with other hyperparameters. We present specific measurements of accuracy that can be useful for the planning of experiments in this research area.
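Treating the binning scale as a tunable hyperparameter, as suggested above, can be illustrated with a small scikit-learn pipeline. The simulated "events", class labels, and bin counts below are placeholders, not the paper's setup:

```python
# Illustrative sketch (assumed data): the number of histogram bins used to summarize
# each simulated particle spectrum is tuned jointly with the SVM hyperparameters.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
classes = rng.integers(0, 2, size=200)                       # two hypothetical experimental regimes
raw = [rng.normal(loc=c, scale=1.0, size=500) for c in classes]  # 500 observed "events" per sample
y = classes

def bin_events(samples, n_bins=32):
    """Convert each event list into a fixed-length histogram (the binning step)."""
    return np.array([np.histogram(s, bins=n_bins, range=(-4, 5))[0] for s in samples])

pipe = Pipeline([
    ("bin", FunctionTransformer(bin_events)),                # binning as a pipeline stage
    ("scale", StandardScaler()),
    ("svm", SVC()),
])
grid = GridSearchCV(pipe, {"bin__kw_args": [{"n_bins": b} for b in (8, 32, 128)],
                           "svm__C": [0.1, 1, 10]}, cv=3)
grid.fit(raw, y)
print(grid.best_params_)
```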


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Sorayya Rezayi ◽  
Niloofar Mohammadzadeh ◽  
Hamid Bouraghi ◽  
Soheila Saeedi ◽  
Ali Mohammadpour

Background. Leukemia is a fatal cancer in both children and adults and is divided into acute and chronic forms. Acute lymphoblastic leukemia (ALL) is a subtype of this cancer, and its early diagnosis can have a significant impact on treatment. Computational intelligence-oriented techniques can help physicians identify and classify ALL rapidly. Materials and Method. In this study, the dataset was collected from a CodaLab competition on classifying leukemic cells versus normal cells in microscopic images. Two well-known deep learning networks, the residual neural network ResNet-50 and VGG-16, were employed; both were trained with our assigned parameters rather than with stored pretrained weights, and the weights and learning parameters were adjusted. In addition, a convolutional network with ten convolutional layers and 2 × 2 max-pooling layers with stride 2 was proposed, and six common machine learning techniques were developed to classify acute lymphoblastic leukemia into two classes. Results. The validation accuracies (the mean accuracy of training and test networks over 100 training cycles) of ResNet-50, VGG-16, and the proposed convolutional network were 81.63%, 84.62%, and 82.10%, respectively. Among the applied machine learning methods, the lowest accuracy was obtained with the multilayer perceptron (27.33%) and the highest with random forest (81.72%). Conclusion. This study showed that the proposed convolutional neural network achieves good accuracy in the diagnosis of ALL. In comparison with various convolutional neural networks and machine learning methods for diagnosing this disease, the proposed network achieved good performance and fast execution. It is less complex than the two pretrained networks and can be employed by pathologists and physicians in clinical systems for leukemia diagnosis.
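A hedged sketch of a plain binary-classification CNN in the spirit of the one described above (ten convolutional layers, 2 × 2 max pooling with stride 2); the filter counts, input size, and exact layer arrangement are assumptions, not the authors' architecture:

```python
# Sketch only: a ten-conv-layer CNN with 2x2/stride-2 max pooling for two-class
# (leukemic vs. normal cell) image classification. Sizes and filter counts are assumed.
import tensorflow as tf
from tensorflow.keras import layers

def build_small_cnn(input_shape=(128, 128, 3)):
    model = tf.keras.Sequential([tf.keras.Input(shape=input_shape)])
    filters = [32, 32, 64, 64, 128, 128, 256, 256, 512, 512]   # ten convolutional layers
    for i, f in enumerate(filters):
        model.add(layers.Conv2D(f, 3, padding="same", activation="relu"))
        if i % 2 == 1:
            model.add(layers.MaxPooling2D(pool_size=2, strides=2))  # 2x2 pooling, stride 2
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(1, activation="sigmoid"))            # ALL vs. normal
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_small_cnn()
model.summary()
```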


2021 ◽  
Vol 11 (10) ◽  
pp. 4499
Author(s):  
Mei-Ling Huang ◽  
Yun-Zhi Li

Major League Baseball (MLB) is the highest level of professional baseball in the world and accounts for some of the most popular international sporting events. Many scholars have conducted research on predicting the outcomes of MLB matches, but the accuracy of such baseball game predictions has remained low. Therefore, deep learning and machine learning methods were used to build models for predicting the outcomes (win/loss) of MLB matches and to investigate the differences between the models in terms of performance. Match data of the 30 teams during the 2019 MLB season, with either only the starting pitcher or all pitchers in the pitcher category, were collected to compare prediction accuracy. A one-dimensional convolutional neural network (1DCNN), a traditional machine learning artificial neural network (ANN), and a support vector machine (SVM) were used to predict match outcomes, with fivefold cross-validation to evaluate model performance. Before feature selection, the highest prediction accuracies were 93.4%, 93.91%, and 93.90% with the 1DCNN, ANN, and SVM models, respectively; after feature selection, the highest accuracies were 94.18% and 94.16% with the ANN and SVM models, respectively. The prediction results obtained with the three models were similar, and the prediction accuracies were much higher than those reported in related studies. Moreover, a 1DCNN was used for the first time to predict the outcome of MLB matches, and it achieved a prediction accuracy similar to that of the machine learning methods.
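A minimal sketch of the fivefold cross-validation protocol used above, applied to an SVM and a small ANN on tabular win/loss data; the feature matrix and labels are synthetic stand-ins, not the study's dataset:

```python
# Sketch only: fivefold cross-validation of two of the compared model families.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))          # stand-in for per-game pitching/batting features
y = rng.integers(0, 2, size=500)        # win/loss labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = [("SVM", make_pipeline(StandardScaler(), SVC())),
          ("ANN", make_pipeline(StandardScaler(),
                                MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)))]
for name, clf in models:
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```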


2019 ◽  
Vol 11 (23) ◽  
pp. 2801 ◽  
Author(s):  
Yonghong Zhang ◽  
Taotao Ge ◽  
Wei Tian ◽  
Yuei-An Liou

Debris flows have always been a serious problem in mountainous areas. Research on the assessment of debris flow susceptibility (DFS) is useful for preventing and mitigating debris flow risks. The main purpose of this work is to study the DFS in the Shigatse area of Tibet using machine learning methods, after assessing the main triggering factors of debris flows. Remote sensing and a geographic information system (GIS) are used to obtain datasets of topography, vegetation, human activity, and soil factors for local debris flows. The imbalance among debris flow susceptibility levels in the datasets is addressed with the Borderline-SMOTE method. Five machine learning methods, i.e., back propagation neural network (BPNN), one-dimensional convolutional neural network (1D-CNN), decision tree (DT), random forest (RF), and extreme gradient boosting (XGBoost), have been used to analyze and fit the relationship between debris flow triggering factors and occurrence, and to evaluate the weight of each triggering factor. ANOVA and Tukey HSD tests revealed that the XGBoost model exhibited the best mean accuracy (0.924) under ten-fold cross-validation, significantly better than that of the BPNN (0.871), DT (0.816), and RF (0.901); however, its performance did not significantly differ from that of the 1D-CNN (0.914). This is also the first comparison between XGBoost and 1D-CNN methods in a DFS study. The DFS maps have been verified with five evaluation metrics: precision, recall, F1 score, accuracy, and area under the curve (AUC). Experiments show that XGBoost achieves the best scores and that the factors with the greatest impact on debris flows are aspect, annual average rainfall, profile curvature, and elevation.
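The combination of Borderline-SMOTE oversampling with ten-fold cross-validated XGBoost can be sketched as follows; the synthetic features, class ratio, and model parameters are assumptions rather than the study's configuration:

```python
# Illustrative sketch: Borderline-SMOTE applied inside each CV fold (via imblearn's
# Pipeline) before fitting XGBoost, evaluated with ten-fold cross-validation.
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))                  # stand-in for the triggering-factor features
y = (rng.random(1000) < 0.15).astype(int)       # imbalanced: few debris-flow-prone cells

pipe = Pipeline([
    ("smote", BorderlineSMOTE(random_state=0)),
    ("xgb", XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                          eval_metric="logloss")),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"mean 10-fold accuracy: {scores.mean():.3f}")
```

Putting the sampler inside the pipeline ensures oversampling is applied only to each training fold, which avoids leaking synthetic samples into the validation folds.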


2021 ◽  
Author(s):  
Benedikt Schulz ◽  
Sebastian Lerch

We conduct a systematic and comprehensive comparison of state-of-the-art postprocessing methods for ensemble forecasts of wind gusts. The compared approaches range from well-established techniques to novel neural network-based methods. Our study is based on a 6-year dataset of forecasts from the convection-permitting COSMO-DE ensemble prediction system, with hourly lead times up to 21 hours and forecasts of 57 meteorological variables, and corresponding observations from 175 weather stations over Germany. We find that simpler methods such as ensemble model output statistics (EMOS), member-by-member postprocessing, and a novel isotonic distributional regression approach, which use ensemble forecasts of wind gusts as their sole input, already yield improvements in mean CRPS of up to 40% compared to the raw ensemble predictions. These results can be substantially improved upon by more complex machine learning methods such as gradient boosting-based extensions of EMOS, quantile regression forests, and variants of neural network-based approaches that are capable of incorporating additional information from the large variety of available predictor variables.
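The mean CRPS used as the headline score above is straightforward to compute for a parametric predictive distribution. The sketch below uses the closed form for a Gaussian only to illustrate the calculation; the study's gust-specific EMOS models may use other distribution families, and the numbers are made up:

```python
# Sketch of the verification metric: closed-form CRPS of a Gaussian predictive
# distribution N(mu, sigma^2) against an observation (lower is better).
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, obs):
    z = (obs - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0) + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

# Hypothetical postprocessed gust forecasts (m/s) at three stations and the verifying gusts.
mu = np.array([12.0, 18.5, 9.3])
sigma = np.array([2.1, 3.0, 1.7])
obs = np.array([13.4, 15.9, 9.0])
print("mean CRPS:", crps_gaussian(mu, sigma, obs).mean())
```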


2018 ◽  
Vol 8 (9) ◽  
pp. 1573 ◽  
Author(s):  
Vladimir Kulyukin ◽  
Sarbajit Mukherjee ◽  
Prakhar Amlathe

Electronic beehive monitoring extracts critical information on colony behavior and phenology without invasive beehive inspections and transportation costs. As an integral component of electronic beehive monitoring, audio beehive monitoring has the potential to automate the identification of various stressors for honeybee colonies from beehive audio samples. In this investigation, we designed several convolutional neural networks and compared their performance with four standard machine learning methods (logistic regression, k-nearest neighbors, support vector machines, and random forests) in classifying audio samples from microphones deployed above landing pads of Langstroth beehives. On a dataset of 10,260 audio samples where the training and testing samples were separated from the validation samples by beehive and location, a shallower raw audio convolutional neural network with a custom layer outperformed three deeper raw audio convolutional neural networks without custom layers and performed on par with the four machine learning methods trained to classify feature vectors extracted from raw audio samples. On a more challenging dataset of 12,914 audio samples where the training and testing samples were separated from the validation samples by beehive, location, time, and bee race, all raw audio convolutional neural networks performed better than the four machine learning methods and a convolutional neural network trained to classify spectrogram images of audio samples. A trained raw audio convolutional neural network was successfully tested in situ on a low voltage Raspberry Pi computer, which indicates that convolutional neural networks can be added to a repertoire of in situ audio classification algorithms for electronic beehive monitoring. The main trade-off between deep learning and standard machine learning is between feature engineering and training time: while the convolutional neural networks required no feature engineering and generalized better on the second, more challenging dataset, they took considerably more time to train than the machine learning methods. To ensure the replicability of our findings and to provide performance benchmarks for interested research and citizen science communities, we have made public our source code and our curated datasets.
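A hedged sketch of the raw-audio convolutional network idea compared above: a 1-D CNN that takes waveform segments directly, with no hand-crafted features. The sample rate, segment length, class count, and layer sizes are assumptions, not the authors' architecture:

```python
# Sketch only: a small raw-audio 1-D CNN for classifying beehive audio samples.
import tensorflow as tf
from tensorflow.keras import layers

SAMPLE_RATE = 22050          # assumed sample rate; ~2-second audio segments
N_CLASSES = 3                # assumed number of audio classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2 * SAMPLE_RATE, 1)),           # raw waveform, no feature engineering
    layers.Conv1D(16, kernel_size=64, strides=4, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(32, kernel_size=32, strides=2, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(64, kernel_size=16, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```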


2020 ◽  
Vol 4 (3) ◽  
pp. 476-481
Author(s):  
HR. Wibi Bagas N ◽  
Evang Mailoa ◽  
Hindriyanto Dwi Purnomo

Fruit is the part of a flowering plant produced by the pollination of pistils and stamens. Fruits vary widely in shape and color and may occur as single, aggregate, or compound fruits. This study concerns the development of an application for detecting 10 fruits to support the agricultural sector. The data used in this study are images of 10 fruits, namely Mangosteen, Delicious, Star Fruit, Water Guava, Kiwi, Pear, Pineapple, Salak, Dragon Fruit, and Strawberry. Training and testing were performed with the CNN-based YOLOv3 machine learning method, supported by the Darknet-53 neural network. The analysis was conducted on 2,333 images from 10 classes. Training was carried out for up to 5,000 iterations, with weights stored at checkpoints. Detection of the 10 fruits was implemented on Google Colaboratory and evaluated on images in two tests. Detection accuracy exceeded 90% for each fruit in the first test and averaged 70% in the second test on images outside the test data.
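Running a trained YOLOv3/Darknet-53 detector of this kind can be sketched with OpenCV's dnn module; the configuration and weight file names, the test image, and the thresholds are placeholders, and this is not the authors' code:

```python
# Hedged sketch: loading trained YOLOv3 (Darknet-53 backbone) weights with OpenCV
# and running detection on a single fruit image.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3-fruit.cfg", "yolov3-fruit_final.weights")
out_layers = net.getUnconnectedOutLayersNames()

img = cv2.imread("test_fruit.jpg")
blob = cv2.dnn.blobFromImage(img, scalefactor=1 / 255.0, size=(416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(out_layers)

# Each detection row: [cx, cy, w, h, objectness, class scores...]
for det in np.vstack(outputs):
    scores = det[5:]
    class_id, confidence = int(np.argmax(scores)), float(np.max(scores))
    if confidence > 0.5:
        print(f"class {class_id} detected with confidence {confidence:.2f}")
```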


Energies ◽  
2021 ◽  
Vol 14 (15) ◽  
pp. 4595
Author(s):  
Parisa Asadi ◽  
Lauren E. Beckingham

X-ray CT imaging provides a 3D view of a sample and is a powerful tool for investigating the internal features of porous rock. Reliable phase segmentation in these images is highly necessary but, like any other digital rock imaging technique, is time-consuming, labor-intensive, and subjective. Combining 3D X-ray CT imaging with machine learning methods that can simultaneously consider several extracted features in addition to color attenuation is a promising and powerful approach for reliable phase segmentation. Machine learning-based phase segmentation of X-ray CT images enables faster data collection and interpretation than traditional methods. This study investigates the performance of several filtering techniques with three machine learning methods and a deep learning method to assess the potential for reliable feature extraction and pixel-level phase segmentation of X-ray CT images. Features were first extracted from images using well-known filters and from the second convolutional layer of the pre-trained VGG16 architecture. Then, K-means clustering, Random Forest, and Feed-Forward Artificial Neural Network methods, as well as the modified U-Net model, were applied to the extracted input features. The models' performances were then compared and contrasted to determine the influence of the machine learning method and input features on reliable phase segmentation. The results showed that considering more feature dimensions is promising, with all classification algorithms achieving high accuracies ranging from 0.87 to 0.94. The feature-based Random Forest demonstrated the best performance among the machine learning models, with an accuracy of 0.88 for Mancos and 0.94 for Marcellus. The U-Net model with a linear combination of focal and dice loss also performed well, with accuracies of 0.91 and 0.93 for Mancos and Marcellus, respectively. In general, considering more features provided promising and reliable segmentation results that are valuable for analyzing the composition of dense samples, such as shales, which are significant unconventional reservoirs for oil recovery.
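A minimal sketch of the feature-based workflow described above: per-pixel features taken from an early pre-trained VGG16 convolutional layer (here 'block1_conv2', which preserves the input resolution) classified with a Random Forest. The image, labels, and layer choice are assumptions for illustration only:

```python
# Sketch only: VGG16 early-layer features + Random Forest for pixel-level phase segmentation.
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
feat_model = tf.keras.Model(inputs=vgg.input,
                            outputs=vgg.get_layer("block1_conv2").output)

rng = np.random.default_rng(0)
image = (rng.random((1, 128, 128, 3)) * 255).astype("float32")   # stand-in for an RGB-stacked CT slice
image = tf.keras.applications.vgg16.preprocess_input(image)
labels = rng.integers(0, 3, size=(128, 128))                     # stand-in per-pixel phase labels

features = feat_model.predict(image)[0]        # shape (128, 128, 64): 64 features per pixel
X = features.reshape(-1, features.shape[-1])   # one row per pixel
y = labels.reshape(-1)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)
print("training pixel accuracy:", clf.score(X, y))
```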


Author(s):  
Mehmet Şahin ◽  
Murat Uçar

In this study, a comparative analysis for predicting sports attendance demand is presented based on econometric, artificial intelligence, and machine learning methodologies. Data from more than 20,000 games from three major leagues, namely the National Basketball Association (NBA), National Football League (NFL), and Major League Baseball (MLB), were used for training and testing the approaches. The relevant literature was examined to determine the most useful variables as potential regressors in forecasting. To reveal the most effective approach, three scenarios containing seven cases were constructed. In the first scenario, each league was evaluated separately. In the second scenario, the three possible combinations of league pairings were evaluated, while in the third scenario, all three leagues were evaluated together. The performance evaluations of the results suggest that one of the machine learning methods, Gradient Boosting, outperformed the other methods used. However, the Artificial Neural Network, deep Convolutional Neural Network, and Decision Trees also provided productive and competitive predictions for sports games. Based on the results, the predictions for the NBA and NFL leagues are more satisfactory than the predictions of the MLB, which may be caused by the structure of the MLB. The results of the sensitivity analysis indicate that the performance of the home team is the most influential factor for all three leagues.
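A hedged sketch of the kind of gradient-boosting regressor that performed best in the comparison above; the regressor names, synthetic data, and hyperparameters are hypothetical and not drawn from the study:

```python
# Sketch only: gradient boosting for per-game attendance demand prediction.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_games = 2000
games = pd.DataFrame({
    "home_win_pct":  rng.random(n_games),        # assumed regressors inspired by the literature
    "away_win_pct":  rng.random(n_games),
    "weekend":       rng.integers(0, 2, n_games),
    "temperature_c": rng.normal(22, 7, n_games),
})
attendance = (20000 + 25000 * games["home_win_pct"] + 4000 * games["weekend"]
              + rng.normal(0, 2000, n_games))

X_tr, X_te, y_tr, y_te = train_test_split(games, attendance, test_size=0.2, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
model.fit(X_tr, y_tr)
print("test MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```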

