Separating Phages From Other Virus Families and Classifying the Different Phage Families By GI-Clusters

Author(s):  
Xingang Jia ◽  
Qiuhong Han ◽  
Zuhong Lu

Abstract Background: Phages are the most abundant biological entities, but commonly used clustering techniques struggle to separate them from other virus families and to classify the different phage families. Results: This work uses GI-clusters to separate phages from other virus families and to classify the different phage families. GI-clusters are constructed from GI-features; GI-features are constructed from F-features together with training data and the MG-Euclidean and Icc-cluster algorithms; and F-features are the frequencies of multiple-nucleotides generated from virus genomes. The MG-Euclidean algorithm places nearest neighbors in the same mini-groups, while the Icc-cluster algorithm assigns distant samples to different mini-clusters. Viruses whose GI-features have their maximum elements in the same locations are placed in the same GI-cluster: the families of viruses in the test data are identified by their GI-clusters, and the families of the GI-clusters are defined by the viruses of the training data. Conclusions: From the analysis of 4 data sets constructed from viruses of different families, we demonstrate that GI-clusters are able to separate phages from other virus families, correctly classify the different phage families, and also correctly predict the families of unknown phages.
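
A minimal sketch of the two ingredients the abstract defines explicitly: F-features (multiple-nucleotide frequencies, here k-mer frequencies with an assumed k = 3) and the GI-cluster rule that groups genomes whose feature vectors peak at the same position. The MG-Euclidean and Icc-cluster steps that produce the actual GI-features are omitted, so this is only an illustration of the assignment rule, not the full method.

```python
from collections import Counter
from itertools import product

import numpy as np

def f_features(genome: str, k: int = 3) -> np.ndarray:
    """Frequencies of all k-mers (multiple-nucleotides) in a genome."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = max(len(genome) - k + 1, 1)
    return np.array([counts[m] / total for m in kmers])

def gi_cluster_assign(features: list[np.ndarray]) -> dict[int, list[int]]:
    """Group samples whose feature vectors have their maximum in the same location."""
    clusters: dict[int, list[int]] = {}
    for idx, vec in enumerate(features):
        clusters.setdefault(int(np.argmax(vec)), []).append(idx)
    return clusters

genomes = ["ACGTACGTAGCT", "ACGTACGTACGA", "TTTTGGGGCCCC"]
print(gi_cluster_assign([f_features(g) for g in genomes]))
```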

2021 ◽  
Vol 10 (1) ◽  
pp. 105
Author(s):  
I Gusti Ayu Purnami Indryaswari ◽  
Ida Bagus Made Mahendra

Many Indonesian people, especially in Bali, raise pigs as livestock. Pigs are susceptible to various diseases, and there have been many cases of pig deaths from disease, causing losses to breeders. The authors therefore built an Android-based application that predicts the type of disease in pigs by applying the C4.5 algorithm, a decision-tree algorithm that classifies data to obtain rules that can be used for prediction. In this study, 50 training data sets covering 8 types of pig disease and 31 disease symptoms were input into the system, and the resulting Android application predicts the type of disease in pigs. In testing with 15 test data sets, the system achieved an accuracy of 86.7%. The application features, built with the Kotlin programming language and the SQLite database, ran as expected during testing.
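
A hedged sketch of the classification step. scikit-learn implements CART rather than C4.5, but with criterion="entropy" it uses the same information-gain idea; the symptom columns and disease labels below are hypothetical placeholders, not the study's actual data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Rows: training cases; columns: binary symptom indicators (31 in the study).
X_train = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]])
y_train = ["hog_cholera", "swine_flu", "hog_cholera", "erysipelas"]  # hypothetical

tree = DecisionTreeClassifier(criterion="entropy")  # entropy ~ C4.5's gain criterion
tree.fit(X_train, y_train)
print(tree.predict([[1, 0, 0]]))  # predicted disease for a new symptom profile
```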


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: Predicting whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects in therapy studies. Machine Learning (ML) is suitable for improving early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets, and additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Validation was executed on an independent ADNI test data set and on the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set; an improvement of 24.86 % (15.00 percentage points) was reached when the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoE ϵ4 alleles and with volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
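
A minimal Truncated-Monte-Carlo-style sketch of data Shapley valuation with a Logistic Regression utility, in the spirit of the LR data Shapley values used here; the ADNI feature extraction and the exact workflow details are not reproduced. Subjects with the lowest values would be candidates for exclusion from training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def utility(X_sub, y_sub, X_val, y_val):
    """Validation accuracy of an LR model trained on a subset; 0 if untrainable."""
    if len(np.unique(y_sub)) < 2:  # the model needs both classes present
        return 0.0
    model = LogisticRegression(max_iter=1000).fit(X_sub, y_sub)
    return model.score(X_val, y_val)

def data_shapley(X, y, X_val, y_val, n_perm=50):
    """Average marginal contribution of each sample over random permutations."""
    values = np.zeros(len(X))
    for _ in range(n_perm):
        perm = rng.permutation(len(X))
        prev = utility(X[:0], y[:0], X_val, y_val)  # empty-set utility
        for j, idx in enumerate(perm):
            cur = utility(X[perm[:j + 1]], y[perm[:j + 1]], X_val, y_val)
            values[idx] += cur - prev
            prev = cur
    return values / n_perm
```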


Author(s):  
Peter Gemmar

Abstract: The pandemic spread of coronavirus is placing an increased burden on healthcare services worldwide. Experience shows that the required medical treatment can reach its limits at local clinics, and fast, secure clinical assessment of disease severity becomes vital. In [1] a model is presented for predicting the mortality of COVID-19 patients from their biomarkers. Three biomarkers were selected by ranking with a supervised multi-tree XGBoost classifier. The prediction model is built as a binary decision tree of depth three and achieves AUC scores of up to 97.84 ± 0.37 and 95.06 ± 2.21 for the training and external test data sets, respectively. In human assessment and decision making, influencing parameters usually are not treated as sharp numbers but rather as Fuzzy terms [2], and inference primarily yields Fuzzy terms or continuous grades rather than binary decisions. I therefore examined a Sugeno-type Fuzzy classifier [3] for disease assessment and decision support, and used an artificial neural network (SOM, [4]) for selecting the biomarkers. Modelling and validation were done with the identical database provided by [1]. With the complete training and test data sets, the Fuzzy prediction model achieves improved AUC scores of up to 98.59 and 95.12, respectively. The advantage of the Fuzzy classifier becomes clear in that physicians can interpret output grades as belonging to the positive or negative class more or less strongly. An extension of the Fuzzy model that takes into account the trend in key features over time provides excellent results on the training data; these could not be finally verified, however, due to the lack of suitable test data. The generation and training of the Fuzzy models was fully automatic and required no additional adjustment, with the help of ANFIS from Matlab©.
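
A minimal zero-order Sugeno-type fuzzy classifier sketch with Gaussian membership functions, illustrating how rule firing strengths blend crisp consequents into a continuous output grade in [0, 1]. The actual model was generated automatically with MATLAB's ANFIS; the choice of biomarkers (LDH, lymphocyte percentage, hs-CRP, as commonly reported for [1]) and all centers, widths, and rule consequents below are assumptions for illustration.

```python
import numpy as np

def gauss(x, c, s):
    """Gaussian membership degree of x for a fuzzy set centered at c with width s."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def sugeno_predict(ldh, lymph_pct, crp):
    """Returns a continuous mortality grade in [0, 1], not a binary label."""
    # Rule 1: LDH high AND lymphocytes low -> positive class (consequent 1.0)
    w1 = gauss(ldh, 600.0, 150.0) * gauss(lymph_pct, 5.0, 4.0)
    # Rule 2: LDH low AND CRP low -> negative class (consequent 0.0)
    w2 = gauss(ldh, 200.0, 100.0) * gauss(crp, 10.0, 20.0)
    return (w1 * 1.0 + w2 * 0.0) / (w1 + w2 + 1e-12)  # weighted-average defuzzification

print(sugeno_predict(ldh=550.0, lymph_pct=8.0, crp=80.0))
```

A physician reading the output grade can judge how strongly a patient belongs to the positive or negative class, which is the interpretability advantage the abstract describes.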


2019 ◽  
Vol 5 ◽  
pp. e194 ◽  
Author(s):  
Hyukjun Gweon ◽  
Matthias Schonlau ◽  
Stefan H. Steiner

The k nearest neighbor (kNN) approach is a simple and effective nonparametric algorithm for classification. One of the drawbacks of kNN is that the method can give only coarse estimates of class probabilities, particularly for low values of k. To avoid this drawback, we propose a new nonparametric classification method based on nearest neighbors conditional on each class, the k conditional nearest neighbor (kCNN) approach: it calculates the distance between a new instance and the kth nearest neighbor from each class, estimates posterior probabilities of class membership using these distances, and assigns the instance to the class with the largest posterior. We prove that the proposed approach converges to the Bayes classifier as the size of the training data increases. Further, we extend the proposed approach to an ensemble method. Experiments on benchmark data sets show that both kCNN and its ensemble version on average outperform kNN, weighted kNN, probabilistic kNN, and two similar algorithms (LMkNN and MLM-kHNN) in terms of error rate. A simulation shows that kCNN may be useful for estimating posterior probabilities when the class distributions overlap.
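
A sketch of the class-conditional idea: measure the distance from a new instance to the kth nearest neighbor within each class and convert those distances into posterior estimates. The inverse-distance normalization used here is one plausible instantiation, not necessarily the authors' exact estimator.

```python
import numpy as np

def kcnn_posteriors(X_train, y_train, x_new, k=3):
    """Posterior estimates from per-class kth-nearest-neighbor distances."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        dists = np.sort(np.linalg.norm(Xc - x_new, axis=1))
        # distance to the kth nearest neighbor of class c (clamped to class size)
        scores[c] = 1.0 / (dists[min(k, len(dists)) - 1] + 1e-12)
    total = sum(scores.values())
    return {c: v / total for c, v in scores.items()}  # assign to the argmax class

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["a", "a", "b", "b"])
print(kcnn_posteriors(X, y, np.array([0.2, 0.1]), k=2))
```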


2021 ◽  
pp. 002203452110357
Author(s):  
T. Chen ◽  
P.D. Marsh ◽  
N.N. Al-Hebshi

An intuitive, clinically relevant index of microbial dysbiosis as a summary statistic of subgingival microbiome profiles is needed. Here, we describe a subgingival microbial dysbiosis index (SMDI) based on machine learning analysis of published periodontitis/health 16S microbiome data. The raw sequencing data, split into training and test sets, were quality filtered, taxonomically assigned to the species level, and centered log-ratio transformed. The training data set was subjected to random forest analysis to identify discriminating species (DS) between periodontitis and health. DS lists, compiled at various "Gini" importance score cutoffs, were used to compute the SMDI for samples in the training and test data sets as the mean centered log-ratio abundance of periodontitis-associated species minus that of health-associated ones. Diagnostic accuracy was assessed with receiver operating characteristic analysis. An SMDI based on 49 DS provided the highest accuracy, with areas under the curve of 0.96 and 0.92 in the training and test data sets, respectively; it ranged from −6 (most normobiotic) to 5 (most dysbiotic), with a value around zero discriminating most of the periodontitis and healthy samples. The top periodontitis-associated DS were Treponema denticola, Mogibacterium timidum, Fretibacterium spp., and Tannerella forsythia, while Actinomyces naeslundii and Streptococcus sanguinis were the top health-associated DS. The index was highly reproducible by hypervariable region. Applying the index to additional test data sets in which nitrate had been used to modulate the microbiome demonstrated that nitrate has dysbiosis-lowering properties in vitro and in vivo. Finally, 3 genera (Treponema, Fretibacterium, and Actinomyces) were identified that could be used to calculate a simplified SMDI with comparable accuracy. In conclusion, we have developed a nonbiased, reproducible, and easy-to-interpret index that can be used to identify patients/sites at risk of periodontitis, to assess the microbial response to treatment, and, importantly, as a quantitative tool in microbiome modulation studies.
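
A sketch of the SMDI computation as described: centered log-ratio (CLR) transform the species counts, then take the mean CLR abundance of the periodontitis-associated discriminating species minus that of the health-associated ones. The species lists are illustrative subsets of the published DS, and the pseudocount is an assumption.

```python
import numpy as np

def clr(counts, pseudo=0.5):
    """Centered log-ratio transform of one sample's species counts."""
    x = np.asarray(counts, dtype=float) + pseudo  # pseudocount avoids log(0)
    logx = np.log(x)
    return logx - logx.mean()

species = ["T_denticola", "M_timidum", "T_forsythia", "A_naeslundii", "S_sanguinis"]
perio_ds = {"T_denticola", "M_timidum", "T_forsythia"}
health_ds = {"A_naeslundii", "S_sanguinis"}

def smdi(counts):
    """Mean CLR abundance of periodontitis DS minus that of health DS."""
    z = clr(counts)
    perio = np.mean([z[i] for i, s in enumerate(species) if s in perio_ds])
    health = np.mean([z[i] for i, s in enumerate(species) if s in health_ds])
    return perio - health  # > 0 leans dysbiotic, < 0 leans normobiotic

print(smdi([120, 40, 80, 5, 3]))
```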


Geophysics ◽  
2010 ◽  
Vol 75 (3) ◽  
pp. P11-P22 ◽  
Author(s):  
Hendrik Paasche ◽  
Jens Tronicke ◽  
Peter Dietrich

Partitioning cluster analyses are powerful tools for rapidly and objectively exploring and characterizing disparate geophysical databases with unknown interrelations between individual data sets or models. Despite their high potential to objectively extract the dominant structural information from suites of disparate geophysical data sets or models, cluster-analysis techniques are underused when analyzing geophysical data or models. This is due to the following limitations on the applicability of standard partitioning cluster algorithms to geophysical databases: the considered survey or model area must be fully covered by all data sets; cluster algorithms classify data in a multidimensional parameter space while ignoring spatial information present in the databases and are therefore sensitive to high-frequency spatial noise (outliers); and standard cluster algorithms such as fuzzy c-means (FCM) or crisp c-means classify data in an unsupervised manner, potentially ignoring expert knowledge additionally available to the experienced human interpreter. We address all of these issues by considering recent modifications to the standard FCM cluster algorithm that tolerate incomplete databases, i.e., survey or model areas not covered by all available data sets, and that consider spatial information present in the database. We evaluated the regularized missing-value FCM cluster algorithm in a synthetic study and applied it to a database comprising partially colocated crosshole tomographic P- and S-wave-velocity models. Additionally, we demonstrate how further expert knowledge can be incorporated in the cluster analysis to obtain a multiparameter geophysical model that objectively outlines the dominant subsurface units and explains all available geoscientific information.
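
For reference, a compact implementation of the standard FCM update equations the article builds on; the missing-value tolerance and spatial regularization are the article's contributions and are not reproduced here.

```python
import numpy as np

def fcm(X, n_clusters=3, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy c-means: alternate center and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)          # each sample's memberships sum to 1
    for _ in range(n_iter):
        W = U ** m                             # fuzzified membership weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = dist ** (-2.0 / (m - 1))           # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U
```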


2006 ◽  
Vol 26 ◽  
pp. 101-126 ◽  
Author(s):  
H. Daume III ◽  
D. Marcu

The most basic assumption used in statistical learning theory is that training data and test data are drawn from the same underlying distribution. Unfortunately, in many applications, the "in-domain" test data is drawn from a distribution that is related, but not identical, to the "out-of-domain" distribution of the training data. We consider the common case in which labeled out-of-domain data is plentiful, but labeled in-domain data is scarce. We introduce a statistical formulation of this problem in terms of a simple mixture model and present an instantiation of this framework to maximum entropy classifiers and their linear chain counterparts. We present efficient inference algorithms for this special case based on the technique of conditional expectation maximization. Our experimental results show that our approach leads to improved performance on three real world tasks on four different data sets from the natural language processing domain.
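
A heavily simplified sketch of the mixture idea: each out-of-domain example is treated as either compatible with the in-domain distribution or not, and EM-style responsibilities reweight it when retraining an in-domain classifier. This illustrates the conditional-EM flavor only; the authors' maximum entropy instantiation and its linear-chain extension are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adapt(X_out, y_out, X_in, y_in, n_rounds=5):
    """Assumes integer class labels 0..K-1 shared across both domains."""
    X = np.vstack([X_out, X_in])
    y = np.concatenate([y_out, y_in])
    w = np.ones(len(X))                        # in-domain examples keep weight 1
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        model.fit(X, y, sample_weight=w)
        # E-step proxy: weight each out-of-domain example by the model's
        # confidence in its own label, so conflicting examples fade out.
        proba = model.predict_proba(X_out)
        w[: len(X_out)] = proba[np.arange(len(X_out)), np.asarray(y_out)]
    return model
```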


1991 ◽  
Vol 9 (5) ◽  
pp. 871-876 ◽  
Author(s):  
M J Ratain ◽  
J Robert ◽  
W J van der Vijgh

Although doxorubicin is one of the most commonly used antineoplastics, no studies to date have clearly related the area under the concentration-time curve (AUC) to toxicity or response. The limited sampling model has recently been shown to be a feasible method for estimating the AUC to facilitate pharmacodynamic studies. Data from two previous studies of doxorubicin pharmacokinetics were used, including 26 patients with sarcoma and five patients with breast cancer or unknown primary. The former were divided into a training data set of 15 patients and a test data set of 11 patients, and the latter patients formed a second test data set. The model was developed by stepwise multiple regression on the training data set: AUC (ng·h/mL) = 17.39 C2 + 163 C48 - 111.0 [dose/(50 mg/m2)], where C2 and C48 are the concentrations at 2 and 48 hours after the bolus dose. The model was subsequently validated on both test data sets: first test data set, mean predictive error (MPE) 4.7% and root mean square error (RMSE) 12.4%; second test data set, MPE 4.5% and RMSE 9.2%. An additional model was also generated using a simulated time point to estimate the total AUC for a daily x 3-day schedule: AUC (ng·h/mL) = 44.79 C2 + 175.65 C48 + 47.25 [dose/(25 mg/m2/d)], where C48 is obtained just prior to the third dose. We conclude that the AUC of doxorubicin after bolus administration can be adequately estimated from two timed plasma concentrations.
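
The two published limited sampling models, transcribed directly from the abstract. Two reading assumptions are made here: the bracketed dose term multiplies the final coefficient, and C2/C48 are plasma concentrations in ng/mL.

```python
def auc_bolus(c2, c48, dose_mg_m2):
    """Estimated doxorubicin AUC (ng·h/mL) after a bolus dose."""
    return 17.39 * c2 + 163.0 * c48 - 111.0 * (dose_mg_m2 / 50.0)

def auc_daily_x3(c2, c48, daily_dose_mg_m2):
    """Estimated total AUC (ng·h/mL) for a daily-x-3 schedule;
    c48 is drawn just prior to the third dose."""
    return 44.79 * c2 + 175.65 * c48 + 47.25 * (daily_dose_mg_m2 / 25.0)

print(auc_bolus(c2=30.0, c48=8.0, dose_mg_m2=50.0))  # illustrative values only
```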


2020 ◽  
Vol 10 (1) ◽  
pp. 55-60
Author(s):  
Owais Mujtaba Khanday ◽  
Samad Dadvandipour

Deep Neural Networks (DNN) have in the past few years revolutionized computer vision by providing the best results on a large number of problems such as image classification, pattern recognition, and speech recognition. One of the essential models in deep learning used for image classification is the convolutional neural network. These networks integrate different numbers of features, or so-called filters, in a multi-layer fashion in convolutional layers; they use convolutional and pooling layers for feature abstraction and have neurons arranged in three dimensions: height, width, and depth. Filters of three different sizes were used: 3×3, 5×5, and 7×7. Training accuracy decreased from 100% to 97.8% as the filter size increased, and test accuracy likewise decreased: 98.7% for 3×3, 98.5% for 5×5, and 97.8% for 7×7. The loss after 10 epochs on the training and test data increased drastically with filter size, from 3.4% to 27.6% and from 12.5% to 23.02%, respectively. It is thus clear that filters with smaller dimensions give lower loss than those with larger dimensions. However, smaller filter sizes come at the cost of greater computational complexity, which is crucial for larger data sets.
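
A hedged Keras sketch of the experimental variable: an otherwise identical small CNN whose convolutional kernel size is swapped among 3×3, 5×5, and 7×7. The data set, network depth, and training settings of the study are not given in the abstract, so this only mirrors the structure of the comparison.

```python
import tensorflow as tf

def build_cnn(kernel_size: int, num_classes: int = 10) -> tf.keras.Model:
    """A small CNN whose only experimental variable is the kernel size."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, kernel_size, activation="relu",
                               padding="same", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, kernel_size, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

models = {k: build_cnn(k) for k in (3, 5, 7)}  # train identically, compare test accuracy
```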


2021 ◽  
Author(s):  
Hye-Won Hwang ◽  
Jun-Ho Moon ◽  
Min-Gyu Kim ◽  
Richard E. Donatelli ◽  
Shin-Jae Lee

ABSTRACT Objectives: To compare an automated cephalometric analysis based on the latest deep learning method for automatically identifying cephalometric landmarks (AI) with previously published AIs, following the test style of the worldwide AI challenges at the International Symposium on Biomedical Imaging conferences held by the Institute of Electrical and Electronics Engineers (IEEE ISBI). Materials and Methods: This latest AI was developed using a total of 1983 cephalograms as training data. In the training procedure, a modification of a contemporary deep learning method, the YOLO version 3 algorithm, was applied. The test data consisted of 200 cephalograms. To follow the same test style as the AI challenges at IEEE ISBI, a human examiner manually identified the 19 IEEE ISBI-designated cephalometric landmarks in both the training and test data sets; these served as references for comparison. The latest AI and another human examiner then independently detected the same landmarks in the test data set. The test results were compared using the measures reported at IEEE ISBI: the success detection rate (SDR) and the success classification rate (SCR). Results: The SDR of the latest AI in the 2-mm range was 75.5% and its SCR was 81.5%, greater than those of any previous AI. Compared with the human examiners, the AI showed a superior success classification rate in some cephalometric analysis measures. Conclusions: This latest AI appears to have superior performance compared with previous AI methods and to provide cephalometric analysis comparable to human examiners.
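
A sketch of the 2-mm success detection rate (SDR) used at the IEEE ISBI challenges: the fraction of predicted landmarks falling within 2 mm of the reference landmarks, assuming coordinates already scaled to millimeters.

```python
import numpy as np

def sdr(pred_mm: np.ndarray, ref_mm: np.ndarray, radius_mm: float = 2.0) -> float:
    """pred_mm, ref_mm: arrays of shape (n_images, 19, 2) in millimeters."""
    errors = np.linalg.norm(pred_mm - ref_mm, axis=-1)  # per-landmark error
    return float((errors <= radius_mm).mean())
```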

