SCMTHP: A New Approach for Identifying and Characterizing of Tumor-Homing Peptides Using Estimated Propensity Scores of Amino Acids

Phasit Charoenkwan; Wararat Chiangjong; Chanin Nantasenamat; Mohammad Ali Moni; Pietro Lio’; Balachandran Manavalan; Watshara Shoombuatong

doi:10.3390/pharmaceutics14010122

SCMTHP: A New Approach for Identifying and Characterizing of Tumor-Homing Peptides Using Estimated Propensity Scores of Amino Acids

Pharmaceutics ◽

10.3390/pharmaceutics14010122 ◽

2022 ◽

Vol 14 (1) ◽

pp. 122

Author(s):

Phasit Charoenkwan ◽

Wararat Chiangjong ◽

Chanin Nantasenamat ◽

Mohammad Ali Moni ◽

Pietro Lio’ ◽

...

Keyword(s):

Amino Acids ◽

Propensity Scores ◽

Nearest Neighbor ◽

Biochemical Properties ◽

Least Squares Regression ◽

K Nearest Neighbor ◽

Accurate Identification ◽

Major Drawback ◽

Benchmark Datasets ◽

Tumor Homing

Tumor-homing peptides (THPs) are small peptides that can recognize and bind cancer cells specifically. To gain a better understanding of THPs’ functional mechanisms, the accurate identification and characterization of THPs is required. Although some computational methods for in silico THP identification have been proposed, a major drawback is their lack of model interpretability. In this study, we propose a new, simple and easily interpretable computational approach (called SCMTHP) for identifying and analyzing tumor-homing activities of peptides via the use of a scoring card method (SCM). To improve the predictability and interpretability of our predictor, we generated propensity scores of the 20 amino acids as THPs. In addition, informative physicochemical properties were used to provide insights into the characteristics giving rise to the bioactivity of THPs via the use of SCMTHP-derived propensity scores. Benchmarking experiments on the independent test indicated that SCMTHP could achieve performance comparable to the state-of-the-art method, with accuracies of 0.827 and 0.798, respectively, when evaluated on two benchmark datasets (Main and Small). Furthermore, SCMTHP was found to outperform several well-known machine learning-based classifiers (e.g., decision tree, k-nearest neighbor, multi-layer perceptron, naive Bayes and partial least squares regression) as indicated by both 10-fold cross-validation and independent tests. Finally, the SCMTHP web server was established and made freely available online. SCMTHP is expected to be a useful tool for rapid and accurate identification of THPs and for providing a better understanding of THP biophysical and biochemical properties.
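
The scoring idea in this abstract can be illustrated with a minimal sketch. It assumes, as a simplification, that a peptide's score is the mean of the propensity scores of its residues and that a peptide is called a THP when the score exceeds a threshold. The propensity values, the threshold and the function names below are hypothetical placeholders, not the values or procedure estimated in the paper.

```python
# Hypothetical propensity scores for the 20 amino acids (illustrative only;
# these are NOT the SCMTHP-derived values).
PROPENSITY = {
    "A": 450, "C": 620, "D": 380, "E": 360, "F": 500,
    "G": 560, "H": 480, "I": 430, "K": 470, "L": 440,
    "M": 460, "N": 520, "P": 540, "Q": 410, "R": 640,
    "S": 490, "T": 455, "V": 425, "W": 530, "Y": 475,
}


def peptide_score(sequence):
    """Mean propensity score over the residues of a peptide (assumed scoring rule)."""
    if not sequence:
        raise ValueError("empty peptide sequence")
    unknown = set(sequence) - set(PROPENSITY)
    if unknown:
        raise ValueError(f"unknown residues: {sorted(unknown)}")
    return sum(PROPENSITY[residue] for residue in sequence) / len(sequence)


def is_thp(sequence, threshold=500.0):
    """Classify a peptide as THP when its score exceeds a (hypothetical) threshold."""
    return peptide_score(sequence) > threshold
```

Because each prediction reduces to a table lookup and an average, the contribution of every residue to the decision can be read directly from the score table, which is the kind of interpretability the abstract emphasizes.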

Quantitative Detection of Chromium Pollution in Biochar Based on Matrix Effect Classification Regression Model

Molecules ◽

10.3390/molecules26072069 ◽

2021 ◽

Vol 26 (7) ◽

pp. 2069

Author(s):

Mei Guo ◽

Rongguang Zhu ◽

Lixin Zhang ◽

Ruoyu Zhang ◽

Guangqun Huang ◽

...

Keyword(s):

Regression Model ◽

Matrix Effect ◽

Nearest Neighbor ◽

Quantitative Detection ◽

Least Squares Regression ◽

K Nearest Neighbor ◽

Breakdown Spectroscopy ◽

Relative Standard ◽

Calibration Samples ◽

Standard Deviations

Returning biochar to farmland has become one of the nationally promoted technologies for soil remediation and improvement in China. Rapid detection of heavy metals in biochar derived from varied materials can provide a guarantee for contaminated soil, avoiding secondary pollution. This work aims first to apply laser-induced breakdown spectroscopy (LIBS) for the quantitative detection of Cr in biochar. Learning from the principles of traditional matrix effect correction methods, calibration samples were divided into 1–3 classifications by an unsupervised hierarchical clustering method based on the main elemental LIBS data in biochar. The prediction samples were then divided into diverse classifications of calibration samples by a supervised K-nearest neighbor (KNN) algorithm. By comparing the effects of multiple partial least squares regression (PLSR) models, the results show that larger numbered classifications have a lower averaged relative standard deviations of cross-validation (ARSDCV) value, signifying a better calibration performance. Therefore, the 3 classification regression model was employed in this study, which had a better prediction performance with a lower averaged relative standard deviations of prediction (ARSDP) value of 8.13%, in comparison with our previous research and related literature results. The LIBS technology combined with matrix effect classification regression model can weaken the influence of the complex matrix effect of biochar and achieve accurate quantification of contaminated metal Cr in biochar.

k-Nearest Neighbor Learning with Graph Neural Networks

Mathematics ◽

10.3390/math9080830 ◽

2021 ◽

Vol 9 (8) ◽

pp. 830

Author(s):

Seokho Kang

Keyword(s):

Neural Network ◽

Nearest Neighbor ◽

Learning Algorithm ◽

Weighting Function ◽

High Sensitivity ◽

Training Data ◽

K Nearest Neighbor ◽

Main Challenge ◽

Benchmark Datasets ◽

Graph Neural Networks

k-nearest neighbor (kNN) is a widely used learning algorithm for supervised learning tasks. In practice, the main challenge when using kNN is its high sensitivity to its hyperparameter setting, including the number of nearest neighbors k, the distance function, and the weighting function. To improve the robustness to hyperparameters, this study presents a novel kNN learning method based on a graph neural network, named kNNGNN. Given training data, the method learns a task-specific kNN rule in an end-to-end fashion by means of a graph neural network that takes the kNN graph of an instance to predict the label of the instance. The distance and weighting functions are implicitly embedded within the graph neural network. For a query instance, the prediction is obtained by performing a kNN search from the training data to create a kNN graph and passing it through the graph neural network. The effectiveness of the proposed method is demonstrated using various benchmark datasets for classification and regression tasks.
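
For context, the three hyperparameters named in this abstract (the number of neighbors k, the distance function and the weighting function) appear explicitly in a conventional kNN classifier. The sketch below is plain kNN, not the kNNGNN method; all function names are illustrative.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def uniform_weight(distance):
    """Every neighbor gets the same vote."""
    return 1.0


def inverse_distance_weight(distance, eps=1e-9):
    """Closer neighbors get larger votes."""
    return 1.0 / (distance + eps)


def knn_predict(train, query, k=3, distance=euclidean, weight=uniform_weight):
    """Predict a label for `query` from `train`, a list of (features, label) pairs."""
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        votes[label] += weight(distance(features, query))
    return max(votes, key=votes.get)


train = [
    ((0.0, 0.0), "a"), ((0.1, 0.0), "a"),
    ((1.0, 1.0), "b"), ((1.2, 1.0), "b"), ((1.1, 1.2), "b"),
]
query = (0.3, 0.3)
print(knn_predict(train, query, k=1))
print(knn_predict(train, query, k=5))
print(knn_predict(train, query, k=5, weight=inverse_distance_weight))
```

On this toy data the predicted label changes with k and with the weighting function, which illustrates the hyperparameter sensitivity the abstract describes.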

WINkNN: Windowed Intervals’ Number kNN Classifier for Efficient Time-Series Applications

Mathematics ◽

10.3390/math8030413 ◽

2020 ◽

Vol 8 (3) ◽

pp. 413 ◽

Cited By ~ 2

Author(s):

Chris Lytridis ◽

Anna Lekova ◽

Christos Bazinas ◽

Michail Manios ◽

Vassilis G. Kaburlasos

Keyword(s):

Time Series ◽

Ad Hoc ◽

Nearest Neighbor ◽

Classification Performance ◽

Human Robot Interaction ◽

Time Series Classification ◽

K Nearest Neighbor ◽

Time Dimension ◽

Knn Classifier ◽

Benchmark Datasets

Our interest is in time series classification regarding cyber–physical systems (CPSs) with emphasis on human-robot interaction. We propose an extension of the k nearest neighbor (kNN) classifier to time-series classification using intervals’ numbers (INs). More specifically, we partition a time-series into windows of equal length, and from the data in each window we induce a distribution which is represented by an IN. This preserves the time dimension in the representation. All-order data statistics, represented by an IN, are employed implicitly as features; moreover, parametric non-linearities are introduced in order to tune the geometrical relationship (i.e., the distance) between signals and consequently tune classification performance. In conclusion, we introduce the windowed IN kNN (WINkNN) classifier whose application is demonstrated comparatively in two benchmark datasets regarding, first, electroencephalography (EEG) signals and, second, audio signals. The results by WINkNN are superior in both problems; in addition, no ad-hoc data preprocessing is required. Potential future work is discussed.
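
The windowing step can be sketched as follows. This is only an approximation of the idea: each window's distribution is summarized here by its empirical quartiles, which is a stand-in chosen for illustration and not the intervals' number (IN) representation used by WINkNN. The function names and the choice to drop a trailing partial window are assumptions.

```python
import statistics


def split_windows(series, window_len):
    """Partition a series into consecutive, non-overlapping windows of equal length.

    Trailing samples that do not fill a whole window are dropped (an assumption).
    """
    n_windows = len(series) // window_len
    return [series[i * window_len:(i + 1) * window_len] for i in range(n_windows)]


def window_quantiles(series, window_len, n_quantiles=4):
    """Summarize each window by the cut points of its empirical distribution.

    The list order follows the window order, so the time dimension is preserved.
    """
    return [
        statistics.quantiles(window, n=n_quantiles, method="inclusive")
        for window in split_windows(series, window_len)
    ]


signal = [1, 2, 3, 4, 10, 20, 30, 40, 99]
features = window_quantiles(signal, window_len=4)
```

Each entry of `features` describes one window, so two signals can be compared window by window with any distance defined on these summaries.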

DIMENSIONALITY REDUCTION FOR SENSORY DATASETS BASED ON MASTER–SLAVE SYNCHRONIZATION OF LORENZ SYSTEM

International Journal of Bifurcation and Chaos ◽

10.1142/s0218127413300139 ◽

2013 ◽

Vol 23 (05) ◽

pp. 1330013 ◽

Cited By ~ 1

Author(s):

REZA GHAFFARI ◽

IOAN GROSU ◽

DACIANA ILIESCU ◽

EVOR HINES ◽

MARK LEESON

Keyword(s):

Breast Cancer ◽

Artificial Intelligence ◽

Electronic Nose ◽

Lorenz System ◽

Nearest Neighbor ◽

Performance Testing ◽

K Nearest Neighbor ◽

Classification Rate ◽

Benchmark Datasets ◽

Novel Method

In this study, we propose a novel method for reducing the attributes of sensory datasets using Master–Slave Synchronization of chaotic Lorenz Systems (DPSMS). As part of the performance testing, three benchmark datasets and one Electronic Nose (EN) sensory dataset with 3 to 13 attributes were presented to our algorithm to be projected into two attributes. The DPSMS-processed datasets were then used as input vectors to four artificial intelligence classifiers, namely Feed-Forward Artificial Neural Networks (FFANN), Multilayer Perceptron (MLP), Decision Tree (DT) and K-Nearest Neighbor (KNN). The performance of the classifiers was then evaluated using the original and reduced datasets. Classification rates of 94.5%, 89%, 94.5% and 82% were achieved when the reduced Fisher's iris, crab gender, breast cancer and electronic nose test datasets, respectively, were presented to the above classifiers.

Employing Divergent Machine Learning Classifiers to Upgrade the Preciseness of Image Retrieval Systems

Cybernetics and Information Technologies ◽

10.2478/cait-2020-0029 ◽

2020 ◽

Vol 20 (3) ◽

pp. 75-85

Author(s):

Shefali Dhingra ◽

Poonam Bansal

Keyword(s):

Machine Learning ◽

Image Retrieval ◽

Nearest Neighbor ◽

Machine Learning Algorithms ◽

Visual Features ◽

Support Vector ◽

K Nearest Neighbor ◽

Retrieval Systems ◽

Retrieval Efficiency ◽

Benchmark Datasets

A Content Based Image Retrieval (CBIR) system is an efficient search engine that can retrieve images from huge repositories by extracting visual features, which include color, texture and shape. Texture is the most prominent of these features. This investigation focuses on the classification complications that arise with big datasets. Texture techniques are explored together with machine learning algorithms in order to increase the retrieval efficiency. We have tested our system on three texture techniques using various classifiers: Support Vector Machine, K-Nearest Neighbor (KNN), Naïve Bayes and Decision Tree (DT). Various evaluation metrics (precision, recall, false alarm rate, accuracy, etc.) are computed to measure the competence of the designed CBIR system on two benchmark datasets, i.e., Wang and Brodatz. The results show that on both datasets the KNN and DT classifiers deliver superior results compared to the others.

Evaluation of Arabian Vascular Plant Barcodes (rbcL and matK): Precision of Unsupervised and Supervised Learning Methods towards Accurate Identification

Plants ◽

10.3390/plants10122741 ◽

2021 ◽

Vol 10 (12) ◽

pp. 2741

Author(s):

Rahul Jamdade ◽

Maulik Upadhyay ◽

Khawla Al Shaer ◽

Eman Al Harthi ◽

Mariam Al Sallani ◽

...

Keyword(s):

Supervised Learning ◽

Species Identification ◽

Vascular Plants ◽

Nearest Neighbor ◽

Vascular Plant ◽

K Nearest Neighbor ◽

Accurate Identification ◽

The Public ◽

Unsupervised Method ◽

Supervised Methods

Arabia is the largest peninsula in the world, with >3000 species of vascular plants. Not much effort has been made to generate a multi-locus marker barcode library to identify and discriminate the recorded plant species. This study aimed to determine the reliability of the available Arabian plant barcodes (>1500; rbcL and matK) at the public repository (NCBI GenBank) using the unsupervised and supervised methods. Comparative analysis was carried out with the standard dataset (FINBOL) to assess the methods and markers’ reliability. Our analysis suggests that from the unsupervised method, TaxonDNA’s All Species Barcode criterion (ASB) exhibits the highest accuracy for rbcL barcodes, followed by the matK barcodes using the aligned dataset (FINBOL). However, for the Arabian plant barcode dataset (GBMA), the supervised method performed better than the unsupervised method, where the Random Forest and K-Nearest Neighbor (gappy kernel) classifiers were robust enough. These classifiers successfully recognized true species from both barcode markers belonging to the aligned and alignment-free datasets, respectively. The multi-class classifier showed high species resolution following the two classifiers, though its performance declined when employed to recognize true species. Similar results were observed for the FINBOL dataset through the supervised learning approach; overall, matK marker showed higher accuracy than rbcL. However, the lower rate of species identification in matK in GBMA data could be due to the higher evolutionary rate or gaps and missing data, as observed for the ASB criterion in the FINBOL dataset. Further, a lower number of sequences and singletons could also affect the rate of species resolution, as observed in the GBMA dataset. The GBMA dataset lacks sufficient species membership. We would encourage the taxonomists from the Arabian Peninsula to join our campaign on the Arabian Barcode of Life at the Barcode of Life Data (BOLD) systems. Our efforts together could help improve the rate of species identification for the Arabian Vascular plants.

Prediction of Citrullination Sites on the Basis of mRMR Method and SNN

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666191129113508 ◽

2020 ◽

Vol 22 (10) ◽

pp. 705-715 ◽

Cited By ~ 2

Author(s):

Min Liu ◽

Guangzhong Liu

Keyword(s):

Prediction Model ◽

Evaluation Method ◽

Nearest Neighbor ◽

Protein Sequences ◽

Support Vector ◽

Post Translational Modification ◽

K Nearest Neighbor ◽

Accurate Identification ◽

Citrullinated Proteins ◽

K Nearest Neighbor Algorithm

Background: Citrullination, an important post-translational modification of proteins, alters the molecular weight and electrostatic charge of the protein side chains. Citrullination in protein sequences is catalyzed by a class of Peptidyl Arginine Deiminases (PADs). Dependent on Ca2+, PADs include five isozymes: PAD 1, 2, 3, 4/5, and 6. Citrullinated proteins have been identified in many biological and pathological processes. Among them, abnormal protein citrullination can lead to serious human diseases, including multiple sclerosis and rheumatoid arthritis. Objective: It is important to identify the citrullination sites in protein sequences. The accurate identification of citrullination sites may contribute to studies on the molecular functions and pathological mechanisms of related diseases. Methods and Results: In this study, after encoding a training set (containing 116 positive and 348 negative samples) into a feature matrix, the mRMR method was used to analyze the 941-dimensional features, which were sorted on the basis of their importance. Then, a predictive model based on a self-normalizing neural network (SNN) was proposed to predict the citrullination sites in protein sequences. Incremental Feature Selection (IFS) and 10-fold cross-validation were used as the model evaluation method. Three classical machine learning models, namely random forest, support vector machine, and k-nearest neighbor algorithm, were selected and compared with the SNN prediction model using the same evaluation methods. SNN may be the best tool for citrullination site prediction. The maximum value of the Matthews Correlation Coefficient (MCC) reached 0.672404 on the basis of the optimal classifier of SNN. Conclusion: The results showed that the SNN-based prediction methods performed better when evaluated by some common metrics, such as MCC, accuracy, and F1-Measure. The SNN prediction model also achieved a better balance in the classification and recognition of positive and negative samples from datasets compared with the other three models.
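
The Matthews Correlation Coefficient mentioned in this abstract is computed from the four confusion-matrix counts. The sketch below uses the standard definition; the function names are illustrative, and returning 0.0 when the denominator is zero is a common convention adopted here as an assumption. The example counts reuse only the class sizes from the abstract (116 positive, 348 negative), not any reported predictions.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# A classifier that labels every sample negative on a 116/348 split.
all_negative = dict(tp=0, tn=348, fp=0, fn=116)
print(accuracy(**all_negative), matthews_corrcoef(**all_negative))
```

On an imbalanced set like this one, predicting the majority class for every sample still yields an accuracy of 0.75 but an MCC of 0.0, which is why MCC is useful for judging the balance between positive and negative samples.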

POLYNOMIAL NETWORKS VERSUS OTHER TECHNIQUES IN TEXT CATEGORIZATION

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001408006247 ◽

2008 ◽

Vol 22 (02) ◽

pp. 295-322 ◽

Cited By ~ 5

Author(s):

MAYY M. AL-TAHRAWI ◽

RAED ABU ZITAR

Keyword(s):

High Performance ◽

Text Categorization ◽

Nearest Neighbor ◽

Classification Performance ◽

Support Vector ◽

K Nearest Neighbor ◽

New Techniques ◽

Vector Machines ◽

Benchmark Datasets ◽

Automatic Text

Many techniques and algorithms for automatic text categorization have been devised and proposed in the literature. However, there is still much space for researchers in this area to improve existing algorithms or come up with new techniques for text categorization (TC). Polynomial Networks (PNs) have never been used before in TC. This can be attributed to the huge datasets used in TC, as well as the technique itself, which has high computational demands. In this paper, we investigate and propose using PNs in TC. The proposed PN classifier has achieved a competitive classification performance in our experiments. More importantly, this high performance is achieved in one-shot training (noniteratively) and using just 0.25%–0.5% of the corpora features. Experiments are conducted on the two benchmark datasets in TC: Reuters-21578 and the 20 Newsgroups. Five well-known classifiers are evaluated on the same data and feature subsets: the state-of-the-art Support Vector Machines (SVM), Logistic Regression (LR), the k-nearest-neighbor (kNN), Naive Bayes (NB), and the Radial Basis Function (RBF) networks.

Detection of Microaneurysms Using Grey Wolf Optimization for Early Diagnosis of Diabetic Retinopathy

International Journal of Intelligent Engineering and Systems ◽

10.22266/ijies2020.1231.19 ◽

2020 ◽

Vol 13 (6) ◽

pp. 208-218

Author(s):

Manohar Pundikal ◽

Mallikarjun Holi

Keyword(s):

Diabetic Retinopathy ◽

Early Diagnosis ◽

Nearest Neighbor ◽

Fundus Images ◽

Grey Wolf ◽

K Nearest Neighbor ◽

Grey Wolf Optimization ◽

Low Light ◽

Accurate Identification ◽

Proposed Model

Diabetic retinopathy is the leading cause of blindness worldwide, so early detection of diabetic retinopathy is necessary to reduce eye-related diseases. The accurate identification of microaneurysms is crucial for the detection of diabetic retinopathy, because they appear as the first sign of the disease. In this study, a new model is proposed to detect microaneurysms from retinal images for early diagnosis of diabetic retinopathy. At first, the fundus images are collected from the e-ophtha microaneurysms and DiaretDB1 datasets. Next, image pre-processing is accomplished using image normalization, low light image enhancement, gradient weighting and shade correction. The pre-processing methods significantly improve the contrast of the fundus images for better visual quality and recover details hidden in dark regions. In addition, a Hessian based filter and Otsu threshold are used to extract foreground objects such as microaneurysms from the enhanced fundus images. At last, Grey Wolf Optimization (GWO) is used to predict the correctness of the segmented microaneurysm candidates. The experimental results revealed that the proposed model improved microaneurysm detection by 0.06-0.30 in f-score compared to existing models, namely local convergence index features and local features with k-nearest neighbor. In addition, the proposed model achieved accuracies of 85.72% and 86.16% on the e-ophtha microaneurysms and DiaretDB1 datasets, respectively.

VirionFinder: Identification of Complete and Partial Prokaryote Virus Virion Protein From Virome Data Using the Sequence and Biochemical Properties of Amino Acids

Frontiers in Microbiology ◽

10.3389/fmicb.2021.615711 ◽

2021 ◽

Vol 12 ◽

Author(s):

Zhencheng Fang ◽

Hongwei Zhou

Keyword(s):

Amino Acids ◽

Functional Annotation ◽

Recognition Rate ◽

Biochemical Properties ◽

Metagenomic Data ◽

Virion Protein ◽

Species Classification ◽

Viral Community ◽

Benchmark Datasets ◽

Virus Proteins

Viruses are some of the most abundant biological entities on Earth, and prokaryote viruses are the dominant members of the viral community. Because of the diversity of prokaryote viruses, functional annotation cannot be performed on a large number of genes from newly discovered prokaryote viruses by searching the current database; therefore, the development of an alignment-free algorithm for functional annotation of prokaryote virus proteins is important to understand the viral community. The identification of prokaryote virus virion proteins (PVVPs) is a critical step for many viral analyses, such as species classification, phylogenetic analysis and the exploration of how prokaryote viruses interact with their hosts. Although a series of PVVP prediction tools have been developed, the performance of these tools is still not satisfactory. Moreover, viral metagenomic data contains fragmented sequences, leading to the existence of some incomplete genes. Therefore, a tool that can identify partial prokaryote virus virion proteins is also needed. In this work, we present a novel algorithm, called VirionFinder, to distinguish complete and partial PVVPs from non-prokaryote virus virion proteins (non-PVVPs). VirionFinder uses the sequence and biochemical properties of 20 amino acids as the mathematical model to encode the protein sequences and uses a deep learning technique to identify whether a given protein is a PVVP. Compared with the state-of-the-art tools using artificial benchmark datasets, the results show that under the same specificity (Sp), the sensitivity (Sn) of VirionFinder is approximately 10–34% higher than the Sn of these tools on both complete and partial proteins. When evaluating related tools using real virome data, the recognition rate of PVVP-like sequences of VirionFinder is also much higher than that of the other tools. We expect that VirionFinder will be a powerful tool for identifying novel virion proteins from both complete prokaryote virus genomes and viral metagenomic data. VirionFinder is freely available at https://github.com/zhenchengfang/VirionFinder.