Text Mining Drug-Protein Interactions using an Ensemble of BERT, Sentence BERT and T5 models

Mapping Intimacies ◽

10.1101/2021.10.26.465944 ◽

2021 ◽

Author(s):

Xin Sui ◽

Wanjing Wang ◽

Jinfeng Zhang

Keyword(s):

Protein Interactions ◽

Clustering Algorithm ◽

Data Augmentation ◽

Majority Vote ◽

Classification Model ◽

Ensemble Model ◽

K Nearest Neighbors ◽

Test Dataset ◽

Improved Performance ◽

Using Data

In this work, we trained an ensemble model for predicting drug-protein interactions within a sentence based only on its semantics. Our ensemble model was built from three separate models: 1) a classification model using a fine-tuned BERT model; 2) a fine-tuned sentence BERT model that embeds every sentence into a vector; and 3) another classification model using a fine-tuned T5 model. In all models, we further improved performance using data augmentation. For model 2, we predicted the label of a sentence using k-nearest neighbors on its embedded vector. We also explored two ways to ensemble these three models: a) a majority vote over the three models; and b) another ensemble model, based on the HDBSCAN clustering algorithm, trained using features from all the models to make decisions. Our best model achieved an F1 score of 0.753 on the BioCreative VII Track 1 test dataset.
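
The k-nearest-neighbors step for model 2 and the majority-vote ensemble described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cosine similarity measure, the toy 2-D vectors, the label names, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_predict(query, labelled_embeddings, k=3):
    """Label a query vector by voting among its k most similar neighbours."""
    ranked = sorted(labelled_embeddings,
                    key=lambda pair: cosine_similarity(query, pair[0]),
                    reverse=True)
    return majority_vote([label for _, label in ranked[:k]])

# Toy 2-D "sentence embeddings" with hypothetical labels.
train = [([1.0, 0.0], "INTERACTION"),
         ([0.9, 0.1], "INTERACTION"),
         ([0.0, 1.0], "NONE")]
sbert_knn_label = knn_predict([1.0, 0.05], train, k=3)

# Hypothetical outputs of the BERT and T5 classifiers for the same sentence.
ensemble_label = majority_vote(["INTERACTION", sbert_knn_label, "NONE"])
```

With three voters and more than two classes a three-way tie is possible; this sketch resolves it in favour of the first model listed, which is an arbitrary choice.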

A New Hybrid Support Vector Machine Ensemble Classification Model for Credit Scoring

Journal of Information Technology Research ◽

10.4018/jitr.2019010106 ◽

2019 ◽

Vol 12 (1) ◽

pp. 77-88

Author(s):

Jian-Rong Yao ◽

Jia-Rui Chen

Keyword(s):

Credit Scoring ◽

Ensemble Methods ◽

Ensemble Classification ◽

Classification Model ◽

Support Vector ◽

Ensemble Model ◽

Financial Industry ◽

K Nearest Neighbors ◽

Regression Methods ◽

Vector Machines

Credit scoring plays an important role in the financial industry. Many methods are employed in the field of credit scoring: traditional methods such as logistic regression, discriminant analysis, and linear regression, and machine learning methods such as neural networks, k-nearest neighbors, genetic algorithms, support vector machines (SVM), and decision trees. SVM has demonstrated good classification performance. This paper proposes a new hybrid RF-SVM ensemble model, which uses random forest to select important variables and employs ensemble methods (bagging and boosting) to aggregate single base models (SVM) into a robust classifier. The experimental results suggest that this new model could achieve effective improvement and has promising potential in the field of credit scoring.

Support vector approach using data augmentation with copula function applied to load forecasting

10.26678/abcm.cobem2019.cob2019-1042 ◽

2019 ◽

Author(s):

Gabriel Ribeiro ◽

Marcos Yamasaki ◽

Helon Vicente Hultmann Ayala ◽

Leandro Coelho ◽

Viviana Mariani

Keyword(s):

Data Augmentation ◽

Load Forecasting ◽

Copula Function ◽

Support Vector ◽

Using Data ◽

Vector Approach

K-MEANS CLUSTERING ALGORITHM FOR SERVICE DATA ANALYSIS BASED ON CUSTOMERS COMBINATION

Unes journal of Information System ◽

10.31933/ujis.3.1.001-007.2018 ◽

2018 ◽

Vol 3 (1) ◽

pp. 001

Author(s):

Zulhendra Zulhendra ◽

Gunadi Widi Nurcahyo ◽

Julius Santony

Keyword(s):

Data Mining ◽

Data Analysis ◽

Clustering Algorithm ◽

Customer Complaints ◽

Using Data ◽

Clustering Data ◽

Service Data ◽

Selection Of

This study applies data mining, specifically K-Means clustering. Data mining can be used to analyze fairly large datasets; here the aim is to enable Indocomputer to understand and classify its service data based on customer complaints, using the Weka software. The K-Means clustering algorithm is used to predict or classify complaints about hardware damage at Indocomputer Payakumbuh. The analysis can also identify the laptop brands most often serviced at Indocomputer Payakumbuh, as one recommendation to consumers when selecting a laptop.
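
For readers unfamiliar with K-Means, the algorithm alternates between assigning each record to its nearest centroid and moving each centroid to the mean of its members. The sketch below is plain Python rather than Weka, which the study used; the two numeric features, the sample values, and the choice of seeding the centroids with the first k points are assumptions for illustration only.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical service records: (device age in years, number of complaints).
records = [(1.0, 1.0), (5.0, 6.0), (1.2, 0.8), (5.5, 6.2), (0.9, 1.1), (6.0, 5.8)]
assignments, centroids = kmeans(records, k=2)
```

In practice the result depends on the initial centroids, so tools such as Weka typically use randomized seeding rather than the fixed seeding shown here.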

An efficient technique for CT scan images classification of COVID-19

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201985 ◽

2020 ◽

pp. 1-14

Author(s):

Esraa Hassan ◽

Noha A. Hikal ◽

Samir Elmuogy

Keyword(s):

Neural Network ◽

Diagnostic Tool ◽

Deep Neural Network ◽

Data Augmentation ◽

Performance Metrics ◽

Classification Model ◽

Data Set ◽

Training Models ◽

The Earth ◽

Diagnostic Time

Coronavirus disease (COVID-19) is now considered one of the most critical pandemics on earth, owing to its ability to spread rapidly between humans as well as animals. COVID-19 is expected to break out around the world, and around 70% of the earth's population might be infected with COVID-19 in the coming years. Therefore, an accurate and efficient diagnostic tool is highly required, which is the main objective of our study. Manual classification has mainly been used to detect different diseases, but it takes too much time and is prone to human error. Automatic image classification reduces doctors' diagnostic time, which could save human lives. We propose an automatic classification architecture based on a deep neural network, called the Worried Deep Neural Network (WDNN) model, with transfer learning. Comparative analysis reveals that the proposed WDNN model, using three pre-trained models (InceptionV3, ResNet50, and VGG19), performs better in terms of various performance metrics. Due to the shortage of COVID-19 data, data augmentation was used to increase the number of images in the positive class, and normalization was then used to give all images the same size. Experiments were run on a COVID-19 dataset collected from different cases, with a total of 2623 (1573 training, 524 validation, 524 test). Our proposed model achieved 99.046, 98.684, 99.119, and 98.90 in terms of accuracy, precision, recall, and F-score, respectively. The results are compared with both traditional machine learning methods and those using convolutional neural networks (CNNs). The results demonstrate that our classification model can be used as an alternative to the current diagnostic tool.
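
The four reported metrics follow from the counts in a binary confusion matrix. The sketch below shows the standard definitions; the counts passed in are invented for illustration and are not taken from the paper, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)            # share of predicted positives that are correct
    recall = tp / (tp + fn)               # share of actual positives that are found
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Made-up counts, chosen only to exercise the formulas.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

The function assumes at least one predicted positive and one actual positive; otherwise precision or recall would divide by zero.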

AI-based localization and classification of skin disease with erythema

Scientific Reports ◽

10.1038/s41598-021-84593-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ha Min Son ◽

Wooho Jeon ◽

Jinhyun Kim ◽

Chan Yeong Heo ◽

Hye Jin Yoon ◽

...

Keyword(s):

Neural Network ◽

Skin Diseases ◽

Classification Model ◽

Screening Tests ◽

Sensitivity Score ◽

Common Skin ◽

Novel Method ◽

Improved Performance ◽

High Level

Although computer-aided diagnosis (CAD) is used to improve the quality of diagnosis in various medical fields such as mammography and colonography, it is not used in dermatology, where noninvasive screening tests are performed only with the naked eye, and avoidable inaccuracies may exist. This study shows that CAD may also be a viable option in dermatology by presenting a novel method to sequentially combine accurate segmentation and classification models. Given an image of the skin, we decompose the image to normalize and extract high-level features. Using a neural network-based segmentation model to create a segmented map of the image, we then cluster sections of abnormal skin and pass this information to a classification model. We classify each cluster into different common skin diseases using another neural network model. Our segmentation model achieves better performance compared to previous studies, and also achieves a near-perfect sensitivity score in unfavorable conditions. Our classification model is more accurate than a baseline model trained without segmentation, while also being able to classify multiple diseases within a single image. This improved performance may be sufficient to use CAD in the field of dermatology.

UAV Image Multi-Labeling with Data-Efficient Transformers

Applied Sciences ◽

10.3390/app11093974 ◽

2021 ◽

Vol 11 (9) ◽

pp. 3974

Author(s):

Laila Bashmal ◽

Yakoub Bazi ◽

Mohamad Mahmoud Al Rahhal ◽

Haikel Alhichri ◽

Naif Al Ajlan

Keyword(s):

Data Augmentation ◽

Feature Representation ◽

Aerial Image ◽

Remote Sensing Images ◽

Training Set ◽

Proposed Model ◽

Class Labels ◽

Using Data ◽

Uav Image

In this paper, we present an approach for the multi-label classification of remote sensing images based on data-efficient transformers. During the training phase, we generated a second view for each image from the training set using data augmentation. Then, both the image and its augmented version were reshaped into a sequence of flattened patches and fed to the transformer encoder. The latter extracts a compact feature representation from each image with the help of a self-attention mechanism, which can handle the global dependencies between different regions of the high-resolution aerial image. On top of the encoder, we mounted two classifiers, a token classifier and a distiller classifier. During training, we minimized a global loss consisting of two terms, each corresponding to one of the two classifiers. In the test phase, we took the average of the two classifiers as the final class labels. Experiments on two datasets acquired over the cities of Trento and Civezzano with a ground resolution of two centimeters demonstrated the effectiveness of the proposed model.
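
The test-phase rule, averaging the outputs of the token and distiller classifiers, can be illustrated with a few lines of Python. This is only a sketch: it assumes each classifier emits one probability per class, and the 0.5 decision threshold, the function name, and the sample values are assumptions not stated in the abstract.

```python
def average_heads(token_probs, distiller_probs, threshold=0.5):
    """Average two per-class probability vectors and threshold into labels."""
    averaged = [(t + d) / 2 for t, d in zip(token_probs, distiller_probs)]
    labels = [int(p >= threshold) for p in averaged]  # 1 = class present
    return labels, averaged

# Hypothetical per-class probabilities for one image with three classes.
token_head = [0.9, 0.4, 0.2]
distiller_head = [0.7, 0.7, 0.1]
labels, averaged = average_heads(token_head, distiller_head)
```

Here the second class is predicted present even though the token classifier alone would reject it, which shows how averaging lets one classifier compensate for the other.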

Optimal Randomness for Stochastic Configuration Network (SCN) with Heavy-Tailed Distributions

Entropy ◽

10.3390/e23010056 ◽

2020 ◽

Vol 23 (1) ◽

pp. 56

Author(s):

Haoyu Niu ◽

Jiamin Wei ◽

YangQuan Chen

Keyword(s):

Research Work ◽

Cauchy Distribution ◽

Classification Model ◽

Test Accuracy ◽

Functional Link ◽

Heavy Tailed Distributions ◽

Heavy Tailed ◽

Improved Performance ◽

Hidden Layer ◽

Hidden Nodes

The Stochastic Configuration Network (SCN) has a powerful capability for regression and classification analysis. Traditionally, it is quite challenging to determine an appropriate architecture for a neural network so that the trained model achieves excellent performance in both learning and generalization. Compared with the known randomized learning algorithms for single-hidden-layer feed-forward neural networks, such as Randomized Radial Basis Function (RBF) Networks and Random Vector Functional-link (RVFL), the SCN randomly assigns the input weights and biases of the hidden nodes under a supervisory mechanism. Since the parameters in the hidden layers are randomly generated from a uniform distribution, there is, hypothetically, an optimal randomness. Heavy-tailed distributions have shown optimal randomness for finding targets in an unknown environment. Therefore, in this research, the authors used heavy-tailed distributions to randomly initialize weights and biases to see whether the new SCN models can achieve better performance than the original SCN. Heavy-tailed distributions, such as the Lévy, Cauchy, and Weibull distributions, were used. Since some mixed distributions show heavy-tailed properties, the mixed Gaussian and Laplace distributions were also studied in this research work. Experimental results showed improved performance for SCN with heavy-tailed distributions. For the regression model, SCN-Lévy, SCN-Mixture, SCN-Cauchy, and SCN-Weibull used fewer hidden nodes to achieve performance similar to that of SCN. For the classification model, SCN-Mixture, SCN-Lévy, and SCN-Cauchy have higher test accuracies of 91.5%, 91.7% and 92.4%, respectively. All three are higher than the test accuracy of the original SCN.
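
The core change described above, drawing hidden-node weights and biases from a heavy-tailed distribution instead of a uniform one, can be sketched as follows. This is an illustration only: it samples a Cauchy distribution by inverting its CDF, and it does not reproduce the SCN supervisory mechanism that accepts or rejects candidate nodes. The scale value, the uniform range, and the function names are assumptions.

```python
import math
import random

def cauchy_sample(rng, location=0.0, scale=1.0):
    """Draw from a Cauchy distribution by inverting its CDF."""
    u = rng.random()
    return location + scale * math.tan(math.pi * (u - 0.5))

def init_hidden_node(n_inputs, sampler, rng):
    """Draw input weights and a bias for one candidate hidden node."""
    weights = [sampler(rng) for _ in range(n_inputs)]
    bias = sampler(rng)
    return weights, bias

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Baseline: uniform initialization on an assumed range of [-1, 1].
w_uniform, b_uniform = init_hidden_node(4, lambda r: r.uniform(-1.0, 1.0), rng)

# Heavy-tailed alternative: Cauchy with an assumed scale of 0.5.
w_cauchy, b_cauchy = init_hidden_node(4, lambda r: cauchy_sample(r, scale=0.5), rng)
```

Unlike the uniform draw, the Cauchy draw occasionally produces very large weights, which is the heavy-tailed behaviour the abstract refers to. The standard library also offers `random.weibullvariate` for a Weibull variant.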

Exploring the octanol–water partition coefficient dataset using deep learning techniques and data augmentation

Communications Chemistry ◽

10.1038/s42004-021-00528-9 ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Nadin Ulrich ◽

Kai-Uwe Goss ◽

Andrea Ebert

Keyword(s):

Partition Coefficient ◽

Data Augmentation ◽

Environmental Chemistry ◽

Chemical Properties ◽

Predictive Performance ◽

Chemical Structures ◽

Tautomeric Forms ◽

Learning Techniques ◽

Water Partition Coefficient ◽

Using Data

Today more and more data are freely available. Based on these big datasets, deep neural networks (DNNs) rapidly gain relevance in computational chemistry. Here, we explore the potential of DNNs to predict chemical properties from chemical structures. We have selected the octanol-water partition coefficient (log P) as an example, which plays an essential role in environmental chemistry and toxicology but also in chemical analysis. The predictive performance of the developed DNN is good, with an RMSE of 0.47 log units in the test dataset and an RMSE of 0.33 for an external dataset from the SAMPL6 challenge. To this end, we trained the DNN using data augmentation considering all potential tautomeric forms of the chemicals. We further demonstrate how DNN models can help in the curation of the log P dataset by identifying potential errors, and address limitations of the dataset itself.
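
The error measure quoted above, the root-mean-square error, is the square root of the mean squared difference between predicted and observed values, so it carries the same unit as log P. A minimal sketch, with invented log P values that are not from the paper:

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predicted and measured log P values for three compounds.
example_rmse = rmse([1.0, 2.5, 3.0], [1.5, 2.0, 3.0])
```

Because the errors are squared before averaging, a few large mispredictions raise the RMSE more than many small ones.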

Image Classification for the Automatic Feature Extraction in Human Worn Fashion Data

Mathematics ◽

10.3390/math9060624 ◽

2021 ◽

Vol 9 (6) ◽

pp. 624

Author(s):

Stefan Rohrmanstorfer ◽

Mikhail Komarov ◽

Felix Mödritscher

Keyword(s):

Neural Networks ◽

Feature Extraction ◽

Image Classification ◽

Convolutional Neural Networks ◽

Data Augmentation ◽

State Of The Art ◽

Image Data ◽

Classification Model ◽

Upper Body ◽

Automatic Feature Extraction

With the ever-increasing amount of image data, it has become necessary to automatically find and process the information in these images. As fashion is captured in images, the fashion sector provides the perfect foundation to be supported by the integration of a service or application that is built on an image classification model. In this article, the state of the art for image classification is analyzed and discussed. Based on the elaborated knowledge, four different approaches will be implemented to successfully extract features out of fashion data. For this purpose, a human-worn fashion dataset with 2567 images was created, and it was significantly enlarged by the performed image operations. The results show that convolutional neural networks are the undisputed standard for classifying images, and that TensorFlow is the best library to build them. Moreover, through the introduction of dropout layers, data augmentation and transfer learning, model overfitting was successfully prevented, and it was possible to incrementally improve the validation accuracy of the created dataset from an initial 69% to a final validation accuracy of 84%. More distinct apparel like trousers, shoes and hats were better classified than other upper body clothes.

Identifying of biomarkers associated with gastric cancer based on 11 topological analysis methods of CytoHubba

Scientific Reports ◽

10.1038/s41598-020-79235-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Hua Ma ◽

Zhihui He ◽

Jing Chen ◽

Xu Zhang ◽

Pingping Song

Keyword(s):

Gastric Cancer ◽

Protein Interactions ◽

Clustering Algorithm ◽

Cox Model ◽

Topological Analysis ◽

Lasso Regression ◽

Ppi Network ◽

Hub Genes ◽

P53 Signaling ◽

Analysis Methods

Gastric cancer (GC) is one of the most common types of malignancy. Its potential molecular mechanism has not been clarified. In this study, we aimed to explore potential biomarkers and prognosis-related hub genes associated with GC. The gene chip dataset GSE79973 was downloaded from the GEO datasets and the limma package was used to identify the differentially expressed genes (DEGs). A total of 1269 up-regulated and 330 down-regulated genes were identified. The protein-protein interaction (PPI) network of DEGs was constructed with the STRING V11 database, and 11 hub genes were selected through the intersection of 11 topological analysis methods of the CytoHubba plug-in in Cytoscape. All 11 selected hub genes were found in the module with the highest score from the PPI network of all DEGs by the molecular complex detection (MCODE) clustering algorithm. To explore the role of the 11 hub genes, we performed GO function and KEGG pathway analysis for them and found that the genes were enriched in a variety of functions and pathways, among which cellular senescence, cell cycle, viral carcinogenesis and the p53 signaling pathway were the most associated with GC. Kaplan-Meier analysis revealed that 10 out of the 11 hub genes were related to the overall survival of GC patients. Further, seven of the 11 selected hub genes, namely C3, CDK1, FN1, CCNB1, CDC20, BUB1B and MAD2L1, were verified to be significantly correlated with GC by uni- or multivariable Cox model and LASSO regression analysis. C3, CDK1, FN1, CCNB1, CDC20, BUB1B and MAD2L1 may serve as potential prognostic biomarkers and therapeutic targets for GC.
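
Selecting hub genes "through intersection" of several ranking methods amounts to taking the top-ranked genes under each method and keeping only those that appear in every list. The sketch below illustrates that idea with two hypothetical score tables and placeholder gene names; the cutoff `k`, the scores, and the function names are assumptions, and the study itself intersected 11 CytoHubba methods.

```python
def top_k(scores, k):
    """Return the set of the k highest-scoring genes in one score table."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {gene for gene, _ in ranked[:k]}

def consensus_hubs(score_tables, k):
    """Keep only genes ranked in the top k by every method."""
    return set.intersection(*(top_k(table, k) for table in score_tables))

# Placeholder genes and made-up scores from two hypothetical ranking methods.
method_a = {"G1": 9, "G2": 7, "G3": 5, "G4": 2}
method_b = {"G1": 0.8, "G3": 0.6, "G4": 0.5, "G2": 0.1}
hubs = consensus_hubs([method_a, method_b], k=2)
```

Requiring agreement across all methods is a conservative choice: it shrinks the candidate list as more methods are added, which favours genes that are central under every topological measure.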
