Machine learning to predict the source of campylobacteriosis using whole genome data

PLoS Genetics ◽  
2021 ◽  
Vol 17 (10) ◽  
pp. e1009436
Author(s):  
Nicolas Arning ◽  
Samuel K. Sheppard ◽  
Sion Bayliss ◽  
David A. Clifton ◽  
Daniel J. Wilson

Campylobacteriosis is among the world’s most common foodborne illnesses, caused predominantly by the bacterium Campylobacter jejuni. Effective interventions require determination of the infection source, which is challenging as transmission occurs via multiple sources such as contaminated meat, poultry, and drinking water. Strain variation has allowed source tracking based upon allelic variation in multi-locus sequence typing (MLST) genes, allowing isolates from infected individuals to be attributed to specific animal or environmental reservoirs. However, the accuracy of probabilistic attribution models has been limited by the ability to differentiate isolates based upon just 7 MLST genes. Here, we broaden the input data spectrum to include core genome MLST (cgMLST) and whole genome sequences (WGS), and implement multiple machine learning algorithms, allowing more accurate source attribution. We increase attribution accuracy from 64% using the standard iSource population genetic approach to 71% for MLST, 85% for cgMLST and 78% for kmerized WGS data using the classifier we named aiSource. To gain insight beyond the source model prediction, we use Bayesian inference to analyse the relative affinity of C. jejuni strains to infect humans and identify potential differences in source-human transmission ability among clonally related isolates in the most common disease-causing lineage (ST-21 clonal complex). Providing generalizable computationally efficient methods, based upon machine learning and population genetics, we provide a scalable approach to global disease surveillance that can continuously incorporate novel samples for source attribution and identify fine-scale variation in transmission potential.
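The abstract mentions "kmerized WGS data" as classifier input but does not describe the encoding. As a rough illustration only, one common way to kmerize a genome is to record which overlapping substrings of length k occur in it and to use their presence or absence as features. The sketch below assumes this presence/absence encoding; the function names, the toy sequences and the choice k = 3 are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def kmerize(sequence: str, k: int) -> Set[str]:
    """Return the set of distinct overlapping k-mers in a DNA sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_matrix(genomes: Dict[str, str], k: int) -> Tuple[List[str], Dict[str, List[int]]]:
    """Encode each genome as a 0/1 vector over the k-mers seen in any genome."""
    kmer_sets = {name: kmerize(seq, k) for name, seq in genomes.items()}
    # Shared, sorted vocabulary so every genome gets the same feature order.
    vocabulary = sorted(set().union(*kmer_sets.values()))
    matrix = {
        name: [1 if kmer in kmers else 0 for kmer in vocabulary]
        for name, kmers in kmer_sets.items()
    }
    return vocabulary, matrix


# Toy sequences (hypothetical, far shorter than a real genome).
toy_genomes = {"isolate_a": "ATGCGA", "isolate_b": "ATGCTA"}
vocabulary, matrix = presence_matrix(toy_genomes, k=3)
print(vocabulary)
print(matrix)
```

Rows of such a matrix, labelled with the reservoir each isolate was sampled from, could then be passed to any standard classifier. Real analyses would use much larger k and many thousands of features.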

2021 ◽  
Author(s):  
Nicolas Arning ◽  
Samuel K. Sheppard ◽  
David A. Clifton ◽  
Daniel J. Wilson

Abstract
Campylobacteriosis is among the world’s most common foodborne illnesses, caused predominantly by the bacterium Campylobacter jejuni. Effective interventions require determination of the infection source, which is challenging as transmission occurs via multiple sources such as contaminated meat, poultry, and drinking water. Strain variation has allowed source tracking based upon allelic variation in multi-locus sequence typing (MLST) genes, allowing isolates from infected individuals to be attributed to specific animal or environmental reservoirs. However, the accuracy of probabilistic attribution models has been limited by the ability to differentiate isolates based upon just 7 MLST genes. Here, we broaden the input data spectrum to include core genome MLST (cgMLST) and whole genome sequences (WGS), and implement multiple machine learning algorithms, allowing more accurate source attribution. We increase attribution accuracy from 64% using the standard iSource population genetic approach to 71% for MLST, 85% for cgMLST and 78% for kmerized WGS data using machine learning. To gain insight beyond the source model prediction, we use Bayesian inference to analyse the relative affinity of C. jejuni strains to infect humans and identify potential differences in source-human transmission ability among clonally related isolates in the most common disease-causing lineage (ST-21 clonal complex). Providing generalizable computationally efficient methods, based upon machine learning and population genetics, we provide a scalable approach to global disease surveillance that can continuously incorporate novel samples for source attribution and identify fine-scale variation in transmission potential.

Author summary
C. jejuni are the most common cause of food-borne bacterial gastroenteritis but the relative contributions of different sources are incompletely understood. We traced the origin of human C. jejuni infections using machine learning algorithms that compare the DNA sequences of bacteria sampled from infected people, contaminated chickens, cattle, sheep, wild birds and the environment. This approach achieved improvement in accuracy of source attribution by 33% over existing methods that use only a subset of genes within the genome and provided evidence for the relative contribution of different infection sources. Sometimes even very similar bacteria showed differences, demonstrating the value of basing analyses on the entire genome when developing this algorithm that can be used for understanding the global epidemiology and other important bacterial infections.


Diabetes is now one of the most common diseases in humans. This work proposes predicting the disease with machine learning techniques, so that its risk factors can be identified and its progression prevented. Early prediction allows the disease to be controlled and can save lives. For early prediction, we collected a data set with 8 attributes from 200 diabetic patients. Each patient's blood sugar level is assessed from features such as glucose content and age. The main machine learning algorithms considered are support vector machine (SVM), naive Bayes (NB), K-nearest neighbor (KNN) and decision tree (DT). In the existing system, naive Bayes reaches an accuracy of 66% and the decision tree 70 to 71%, which is not an adequate range. In the proposed system, which uses XGBoost classifiers, the reported accuracy is 74% for naive Bayes and 89 to 90% for the decision tree. The proposed system therefore gives accuracy in an acceptable range and is the preferred choice. A dataset of 729 patients is stored in MongoDB; the reports of 129 of these patients are held out for prediction and the remainder are used for training. The model trained on the training set is then used to make the predictions.
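The abstract describes holding out 129 of 729 patient records for prediction and training on the rest, but it does not say how the 129 were chosen. A minimal sketch of such a hold-out split follows, assuming a seeded random shuffle; the function name, the integer patient identifiers and the seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def holdout_split(records: Sequence[int], n_test: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the records and split them into (training, test) lists."""
    if not 0 < n_test < len(records):
        raise ValueError("n_test must be between 1 and len(records) - 1")
    shuffled = list(records)
    # A local Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in identifiers for the 729 patient records (assumption: the
# real records would be loaded from the database instead).
patient_ids = list(range(729))
train_ids, test_ids = holdout_split(patient_ids, n_test=129)
print(len(train_ids), len(test_ids))
```

Keeping the two lists disjoint matters: accuracy measured on records the model was trained on would overstate how well it predicts new patients.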


Author(s):  
Kandala Srujana Kumari Et.al

Diabetes is a common disease caused by a set of metabolic disorders in which blood sugar levels remain high over a long period. It affects various organs and damages many body systems, especially the kidneys. Early detection can save lives. To achieve this goal, this study focuses on applying machine learning techniques to the many risk factors associated with the disease. Machine learning methods achieve effective results by building predictive models from Indian medical diagnostic data on diabetes. Learning from such data can help in predicting diabetes. In this study, we used four popular machine learning algorithms, namely Support Vector Machine (SVM), Naive Bayes (NB), K-Nearest Neighbor (KNN), and the C4.5 Decision Tree (DT), on adult population data to predict diabetes. The results of our experiments show that the C4.5 decision tree has greater accuracy than the other machine learning methods.


2020 ◽  
Author(s):  
Liam Brierley ◽  
Anna Fowler

Abstract
The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown, a notoriously challenging task for emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 225 and 187 viruses belonging to the family Coronaviridae, respectively. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases in order to predict animal host (of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating evolutionary signal in spike proteins to be just as informative as whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origins of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.


2021 ◽  
Vol 12 ◽  
Author(s):  
Ying Peng ◽  
Cheng Peng ◽  
Zheng Fang ◽  
Gang Chen

Endometriosis, a common disease that presents as polymorphism, invasiveness, and extensiveness, with clinical manifestations including dysmenorrhea, infertility, and menstrual abnormalities, seriously affects quality of life in women. To date, its underlying etiological mechanism of action and the associated regulatory genes remain unclear. This study aimed to identify molecular markers and elucidate mechanisms underlying the development and progression of endometriosis. Specifically, we downloaded five microarray expression datasets, namely, GSE11691, GSE23339, GSE25628, GSE7305, and GSE105764, from the Gene Expression Omnibus (GEO) database. These datasets, obtained from endometriosis tissues, alongside normal controls, were subjected to in-depth bioinformatics analysis for identification of differentially expressed genes (DEGs), followed by analysis of their function and pathways via gene ontology (GO) and KEGG pathway enrichment analyses. Moreover, we constructed a protein–protein interaction (PPI) network to explore the hub genes and modules, and then applied two machine learning algorithms, support vector machine-recursive feature elimination and least absolute shrinkage and selection operator (LASSO) analysis, to identify key genes. Furthermore, we adopted the CIBERSORTx algorithm to estimate levels of immune cell infiltration, while the connective map (CMAP) database was used to identify potential therapeutic drugs in endometriosis. As a result, a total of 423 DEGs, namely, 233 upregulated and 190 downregulated, were identified. In addition, a total of 1,733 PPIs were obtained from the PPI network. The DEGs were mainly enriched in immune-related mechanisms. Furthermore, machine learning and LASSO algorithms identified three key genes, namely, apelin receptor (APLNR), C–C motif chemokine ligand 21 (CCL21), and Fc fragment of IgG receptor IIa (FCGR2A). Furthermore, 16 small molecular compounds associated with endometriosis treatment were identified, and their mechanism of action was also revealed. Taken together, the findings of this study provide new insights into the molecular factors regulating occurrence and progression of endometriosis and its underlying mechanism of action. The identified therapeutic drugs and molecular markers may have clinical significance in early diagnosis of endometriosis.


2021 ◽  
Vol 17 (4) ◽  
pp. e1009149
Author(s):  
Liam Brierley ◽  
Anna Fowler

The COVID-19 pandemic has demonstrated the serious potential for novel zoonotic coronaviruses to emerge and cause major outbreaks. The immediate animal origin of the causative virus, SARS-CoV-2, remains unknown, a notoriously challenging task for emerging disease investigations. Coevolution with hosts leads to specific evolutionary signatures within viral genomes that can inform likely animal origins. We obtained a set of 650 spike protein and 511 whole genome nucleotide sequences from 222 and 185 viruses belonging to the family Coronaviridae, respectively. We then trained random forest models independently on genome composition biases of spike protein and whole genome sequences, including dinucleotide and codon usage biases in order to predict animal host (of nine possible categories, including human). In hold-one-out cross-validation, predictive accuracy on unseen coronaviruses consistently reached ~73%, indicating evolutionary signal in spike proteins to be just as informative as whole genome sequences. However, different composition biases were informative in each case. Applying optimised random forest models to classify human sequences of MERS-CoV and SARS-CoV revealed evolutionary signatures consistent with their recognised intermediate hosts (camelids, carnivores), while human sequences of SARS-CoV-2 were predicted as having bat hosts (suborder Yinpterochiroptera), supporting bats as the suspected origins of the current pandemic. In addition to phylogeny, variation in genome composition can act as an informative approach to predict emerging virus traits as soon as sequences are available. More widely, this work demonstrates the potential in combining genetic resources with machine learning algorithms to address long-standing challenges in emerging infectious diseases.
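The abstract names dinucleotide bias as one of the genome composition features but gives no formula. A widely used definition is the ratio of a dinucleotide's observed frequency to the frequency expected from its two bases occurring independently. The sketch below assumes that definition; the function name and the toy sequence are hypothetical, and the paper's exact feature calculation may differ.

```python
from collections import Counter
from itertools import product
from typing import Dict


def dinucleotide_bias(sequence: str) -> Dict[str, float]:
    """Observed/expected frequency ratio for each of the 16 dinucleotides.

    Values above 1 mean the pair is over-represented relative to its
    base composition; values below 1 mean it is under-represented.
    """
    seq = sequence.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    base_counts = Counter(seq)
    # Count overlapping pairs: positions (0,1), (1,2), ...
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1
    bias = {}
    for x, y in product("ACGT", repeat=2):
        expected = (base_counts[x] / n) * (base_counts[y] / n)
        observed = pair_counts[x + y] / total_pairs
        # If either base is absent the ratio is undefined.
        bias[x + y] = observed / expected if expected > 0 else float("nan")
    return bias


# Toy sequence (hypothetical); real inputs would be spike or whole
# genome nucleotide sequences.
toy_bias = dinucleotide_bias("ACGTACGT")
print(toy_bias["AC"], toy_bias["AA"])
```

One such 16-value vector per sequence, alongside codon usage measures, is the kind of input a random forest could be trained on to predict host category.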


2019 ◽  
Vol 20 (1) ◽  
pp. 113-160 ◽  
Author(s):  
Asif Iqbal Hajamydeen ◽  
Nur Izura Udzir

Observing network traffic flow for anomalies is a common method in intrusion detection. Considerable effort has gone into using data mining and machine learning algorithms to construct anomaly-based intrusion detection systems, but the dependency on learned models built from earlier network behaviour still exists, which restricts those methods in detecting new or unknown intrusions. Consequently, this investigation proposes a framework to identify a wide variety of anomalies by analysing heterogeneous logs, without using either a trained model of system transactions or the attributes of anomalies. To accomplish this, an existing component (clustering) has been used and several new components (filtering, aggregating and feature analysis) have been introduced. Several logs from multiple sources are used as input and these data are processed by all the modules of the framework. As each component is designed for a particular task towards the overall objective, the contribution of each component to anomaly detection is measured with various performance metrics. Ultimately, the framework is able to detect a broad range of intrusions that exist in the logs without using either attack knowledge or traffic behavioural models. The results show a pathway to designing anomaly detectors that can use raw traffic logs collected from heterogeneous sources on the monitored network and correlate events across the logs to detect intrusions.


2020 ◽  
Vol 34 (04) ◽  
pp. 3866-3873
Author(s):  
Peter Fenner ◽  
Edward Pyzer-Knapp

Much of machine learning relies on the use of large amounts of data to train models to make predictions. When this data comes from multiple sources, for example when evaluation of data against a machine learning model is offered as a service, there can be privacy issues and legal concerns over the sharing of data. Fully homomorphic encryption (FHE) allows data to be computed on whilst encrypted, which can provide a solution to the problem of data privacy. However, FHE is both slow and restrictive, so existing algorithms must be manipulated to make them work efficiently under the FHE paradigm. Some commonly used machine learning algorithms, such as Gaussian process regression, are poorly suited to FHE and cannot be manipulated to work both efficiently and accurately. In this paper, we show that a modular approach, which applies FHE to only the sensitive steps of a workflow that need protection, allows one party to make predictions on their data using a Gaussian process regression model built from another party's data, without either party gaining access to the other's data, in a way which is both accurate and efficient. This construction is, to our knowledge, the first example of an effectively encrypted Gaussian process.


Diabetes is one of the most common diseases in both adults and children. Machine learning techniques help to identify the disease at an early stage so that it can be prevented. This work presents the effectiveness of the Gradient Boosted Classifier, which has received little attention in earlier works. It is compared with two other machine learning algorithms, Neural Networks and Random Forest, on the benchmark UCI Pima Indian dataset. The resulting models are evaluated by standard measures such as AUC, recall and accuracy. As expected, the Gradient Boosted Classifier outperforms the other two classifiers in all performance aspects.
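For readers unfamiliar with the three evaluation measures named above, the sketch below computes them from scratch for a binary classifier. AUC is computed here as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counting half. The labels, scores and the 0.5 decision threshold are made-up illustration values, not results from the paper.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true positives (label 1) that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = diabetic) and classifier scores.
y_true: List[int] = [1, 0, 1, 1, 0, 0]
scores: List[float] = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred), recall(y_true, y_pred), auc(y_true, scores))
```

Accuracy and recall depend on the chosen threshold, whereas AUC summarises ranking quality across all thresholds, which is why the three are usually reported together.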


2021 ◽  
Author(s):  
Mohammed Almanei ◽  
Omogbai Oleghe ◽  
Sandeep Jagtap ◽  
Konstantinos Salonitis

With the vast amount of data available, and its increasing complexity in manufacturing processes, traditional statistical approaches have started to fall short. This is where machine learning plays a key role, addressing the challenges by bringing the ability to analyse large and complex datasets from multiple sources and to find non-linear and intricate patterns in data, relationships between several factors, and their influence on the manufacturing process outputs. This paper demonstrates the advantages and applications of using supervised machine learning techniques in the manufacturing industry. It focuses on binary classification and compares the performance of three different machine learning algorithms: logistic regression, support vector machine, and neural networks. A case study has been conducted on a manufacturing company, using the techniques and algorithms mentioned. The case study focuses on analysing the relationship between different manufacturing process variables and their impact on one key output variable of a product, which in this case is the result of a quality test that measures product performance. The modelling problem has been oriented towards a Boolean goal to predict whether the parts will pass this test.
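To make the pass/fail framing concrete, the sketch below fits the simplest of the three algorithms named above, logistic regression, by plain batch gradient descent. It is an illustration only: the single process variable, the toy data, the learning rate and the iteration count are all hypothetical, and the case study's actual variables and training procedure are not described in the abstract.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this part: probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_pass(w: Sequence[float], b: float, x: Sequence[float]) -> bool:
    """Boolean goal: True if the part is predicted to pass the test."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


# Toy data: one hypothetical process variable per part, 1 = passed.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
print(predict_pass(w, b, [0.0]), predict_pass(w, b, [5.0]))
```

A support vector machine or neural network would replace `fit_logistic` while keeping the same Boolean output, which is what makes a like-for-like comparison of the three possible.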

