The use of machine learning in rare diseases: a scoping review

2020 ◽  
Vol 15 (1) ◽  
Author(s):  
Julia Schaefer ◽  
Moritz Lehne ◽  
Josef Schepers ◽  
Fabian Prasser ◽  
Sylvia Thun

Abstract
Background: Emerging machine learning technologies are beginning to transform medicine and healthcare and could also improve the diagnosis and treatment of rare diseases. Currently, there are no systematic reviews that investigate, from a general perspective, how machine learning is used in a rare disease context. This scoping review aims to address this gap and explores the use of machine learning in rare diseases, investigating, for example, in which rare diseases machine learning is applied, which types of algorithms and input data are used, and which medical applications (e.g., diagnosis, prognosis or treatment) are studied.
Methods: Using a complex search string including generic search terms and 381 individual disease names, studies from the past 10 years (2010–2019) that applied machine learning in a rare disease context were identified on PubMed. To systematically map the research activity, eligible studies were categorized along different dimensions (e.g., rare disease group, type of algorithm, input data), and the number of studies within these categories was analyzed.
Results: Two hundred eleven studies from 32 countries investigating 74 different rare diseases were identified. Diseases with a higher prevalence appeared more often in the studies than diseases with a lower prevalence. Moreover, some rare disease groups were investigated more frequently than would be expected (e.g., rare neurologic diseases and rare systemic or rheumatologic diseases), others less frequently (e.g., rare inborn errors of metabolism and rare skin diseases). Ensemble methods (36.0%), support vector machines (32.2%) and artificial neural networks (31.8%) were the algorithms most commonly applied in the studies. Only a small proportion of studies evaluated their algorithms on an external data set (11.8%) or against a human expert (2.4%). As input data, images (32.2%), demographic data (27.0%) and “omics” data (26.5%) were used most frequently. Most studies used machine learning for diagnosis (40.8%) or prognosis (38.4%), whereas studies aiming to improve treatment were relatively scarce (4.7%). Patient numbers in the studies were small, typically ranging from 20 to 99 (35.5%).
Conclusion: Our review provides an overview of the use of machine learning in rare diseases. Mapping the current research activity, it can guide future work and help to facilitate the successful application of machine learning in rare diseases.
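As a rough illustration of the category counting described in the Methods, the sketch below tallies studies along the review's dimensions with pandas; the DataFrame, its column names, and the example rows are hypothetical stand-ins, not the review's actual extraction sheet.

```python
import pandas as pd

# Hypothetical study-level annotations; in the review each eligible study was
# categorized along dimensions such as rare disease group, algorithm type and
# medical application.
studies = pd.DataFrame([
    {"disease_group": "rare neurologic", "algorithm": "SVM", "application": "diagnosis"},
    {"disease_group": "rare neurologic", "algorithm": "ensemble", "application": "prognosis"},
    {"disease_group": "rare skin", "algorithm": "ANN", "application": "diagnosis"},
])

# Number and share of studies per category, analogous to the percentages
# reported in the Results section.
for dim in ["disease_group", "algorithm", "application"]:
    counts = studies[dim].value_counts()
    print(counts)
    print((counts / len(studies) * 100).round(1))
```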

2020 ◽  
Author(s):  
Aaron Cohen ◽  
Steven Chamberlin ◽  
Thomas Deloughery ◽  
Michelle Nguyen ◽  
Steven Bedrick ◽  
...  

Abstract
Background: With the growing adoption of the electronic health record (EHR) worldwide over the last decade, new opportunities exist for leveraging EHR data for the detection of rare diseases. Rare diseases often go undiagnosed, or their diagnosis is delayed, because clinicians encounter them infrequently. One such rare disease that may be amenable to EHR-based detection is acute hepatic porphyria (AHP). AHP consists of a family of rare, metabolic diseases characterized by potentially life-threatening acute attacks and, for some patients, chronic debilitating symptoms that negatively impact daily functioning and quality of life. The goal of this study was to apply machine learning and knowledge engineering to a large extract of EHR data to determine whether they could be effective in identifying patients not previously tested for AHP who should receive a proper diagnostic workup for AHP.
Methods and Findings: We used an extract of the complete EHR data of 200,000 patients from an academic medical center spanning up to 10 years longitudinally and enriched it with records from an additional 5,571 patients from the center containing any mention of porphyria in notes, laboratory tests, diagnosis codes, and other parts of the record. After manually reviewing all patients with the ICD-10-CM code E80.21 (acute intermittent [hepatic] porphyria), we identified 30 patients who served as positive cases for our machine learning models, with the rest of the patients used as negative cases. We parsed the records into features, which were scored by frequency of appearance and labeled by the EHR source document. We then carried out a univariate feature analysis, manually choosing features not directly tied to provider attributes or suspicion of the patient having AHP. We next trained on the full dataset, with the best cross-validation performance coming from a support vector machine (SVM) algorithm using a radial basis function (RBF) kernel. The trained model was applied back to the full data set and patients were ranked by margin distance. The top 100 ranked negative cases were manually reviewed for symptom complexes similar to AHP, finding four patients where AHP diagnostic testing was likely indicated and 18 patients where AHP diagnostic testing was possibly indicated. From the top 100 ranked cases of patients with mention of porphyria in their record, we identified four patients for whom AHP diagnostic testing was possibly indicated and had not previously been performed. Based solely on the reported prevalence of AHP, we would have expected only 0.002 cases out of the 200 patients manually reviewed.
Conclusions: The application of machine learning and knowledge engineering to EHR data may facilitate the diagnosis of rare diseases such as AHP. The only manual modifications to this work were the removal of disease-specific or medical-center-specific features that might undermine our ability to find new cases. Further work will recommend clinical investigation to identified patients’ clinicians, evaluate more patients, assess additional feature selection and machine learning algorithms, and apply this methodology to other rare diseases.
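The ranking step described above can be sketched roughly as follows with scikit-learn, assuming an SVC with an RBF kernel whose decision_function serves as the margin distance; the feature matrix, labels, and class weighting are placeholders rather than the study's actual EHR-derived features or configuration.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: rows are patients, columns are frequency-scored EHR
# features; y marks the manually confirmed AHP cases as positives.
rng = np.random.default_rng(0)
X = rng.random((1000, 50))
y = np.zeros(1000, dtype=int)
y[:30] = 1

# RBF-kernel SVM, as in the study's best cross-validated configuration;
# class_weight is an assumption to handle the extreme class imbalance.
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X, y)

# Apply the trained model back to the full matrix and rank patients by the
# signed distance from the separating hyperplane (margin distance).
margins = clf.decision_function(X)
ranking = np.argsort(-margins)                            # most AHP-like first
top_negative = [i for i in ranking if y[i] == 0][:100]    # candidates for manual review
```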


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors.
Method: This study was carried out to predict the Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because machine learning can uncover non-intuitive regularities in high-dimensional data sets, it can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors.
Results: The SVM model was the best one among these methods, with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
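A minimal sketch of the modeling pipeline described above, assuming scikit-learn; the random descriptor matrix, the univariate filter standing in for the paper's unspecified descriptor selection method, and the number of retained descriptors are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error

# Placeholder data: 6554 descriptors per compound, target is the inhibitory constant Ki.
rng = np.random.default_rng(42)
X = rng.random((500, 6554))
y = rng.random(500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Stand-in descriptor selection (the paper's own selection method is not detailed
# in the abstract): keep the k descriptors most correlated with the target.
selector = SelectKBest(f_regression, k=50).fit(X_train, y_train)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

# Support vector regression, the best-performing model type in the study.
model = SVR(kernel="rbf").fit(X_train_sel, y_train)
pred = model.predict(X_test_sel)
print("R2:", r2_score(y_test, pred), "MSE:", mean_squared_error(y_test, pred))
```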


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Tomoaki Mameno ◽  
Masahiro Wada ◽  
Kazunori Nozaki ◽  
Toshihito Takahashi ◽  
Yoshitaka Tsujioka ◽  
...  

Abstract
The purpose of this retrospective cohort study was to create a model for predicting the onset of peri-implantitis by using machine learning methods and to clarify interactions between risk indicators. This study evaluated 254 implants, 127 with and 127 without peri-implantitis, from among 1408 implants with at least 4 years in function. Demographic data and parameters known to be risk factors for the development of peri-implantitis were analyzed with three models: logistic regression, support vector machines, and random forests (RF). Of these, RF had the highest performance in predicting the onset of peri-implantitis (AUC: 0.71, accuracy: 0.70, precision: 0.72, recall: 0.66, and f1-score: 0.69). The factor with the greatest influence on the prediction was implant functional time, followed by oral hygiene. In addition, a PCR of more than 50% to 60%, smoking more than 3 cigarettes/day, KMW less than 2 mm, and the presence of fewer than two occlusal supports tended to be associated with an increased risk of peri-implantitis. Moreover, these risk indicators were not independent and had complex effects on each other. The results of this study suggest that peri-implantitis onset was predicted in 70% of cases by RF, which allows consideration of nonlinear relational data with complex interactions.
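A hedged sketch of the random forest evaluation reported above, using scikit-learn and a synthetic stand-in for the 254-implant case-control data set; the feature columns, hyperparameters, and train/test split are assumptions for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             precision_score, recall_score, f1_score)
from sklearn.datasets import make_classification

# Placeholder for the 254-implant case-control data set (127 with and 127
# without peri-implantitis); features stand in for the risk indicators.
X, y = make_classification(n_samples=254, n_features=12, weights=[0.5, 0.5], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X_train, y_train)
prob = rf.predict_proba(X_test)[:, 1]
pred = rf.predict(X_test)

print("AUC:", roc_auc_score(y_test, prob))
print("accuracy:", accuracy_score(y_test, pred),
      "precision:", precision_score(y_test, pred),
      "recall:", recall_score(y_test, pred),
      "f1:", f1_score(y_test, pred))

# Feature importances give the kind of ranking discussed above (e.g. functional
# time, oral hygiene) when real risk-indicator columns are used.
print(rf.feature_importances_)
```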


2018 ◽  
Vol 34 (3) ◽  
pp. 569-581 ◽  
Author(s):  
Sujata Rani ◽  
Parteek Kumar

Abstract
In this article, an innovative approach to sentiment analysis (SA) is presented. The proposed system handles the issues of Romanized or abbreviated text and spelling variations in the text to perform the sentiment analysis. The training data set of 3,000 movie reviews and tweets was manually labeled by native speakers of Hindi in three classes, i.e. positive, negative, and neutral. The system uses the WEKA (Waikato Environment for Knowledge Analysis) tool to convert these string data into numerical matrices and applies three machine learning techniques, i.e. Naive Bayes (NB), J48, and support vector machine (SVM). The proposed system has been tested on 100 movie reviews and tweets, and it has been observed that SVM performed best in comparison to the other classifiers, with an accuracy of 68% for movie reviews and 82% for tweets. The results of the proposed system are very promising and can be used in emerging applications like SA of product reviews and social media analysis. Additionally, the proposed system can be used for other cultural/social benefits, such as predicting or fighting human riots.
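The WEKA pipeline described above could look roughly like the following if re-expressed in scikit-learn, with a TF-IDF vectorizer standing in for WEKA's string-to-vector conversion and a decision tree standing in for J48; the example texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

# Tiny hypothetical sample of labeled reviews/tweets; the study used 3,000
# manually labeled Hindi movie reviews and tweets in three classes.
texts = ["movie was awesome", "bahut kharab film", "theek thaak hai"]
labels = ["positive", "negative", "neutral"]

# The string-to-numeric conversion step is stood in for here by a TF-IDF vectorizer.
classifiers = {
    "NB": MultinomialNB(),
    "J48-like decision tree": DecisionTreeClassifier(),
    "SVM": LinearSVC(),
}
for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf).fit(texts, labels)
    print(name, model.predict(["kya zabardast movie hai"]))
```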


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Hai-Bang Ly ◽  
Thuy-Anh Nguyen ◽  
Binh Thai Pham

Soil cohesion (C) is one of the critical soil properties and is closely related to basic soil properties such as particle size distribution, pore size, and shear strength. Hence, it is mainly determined by experimental methods. However, the experimental methods are often time-consuming and costly. Therefore, developing an alternative approach based on machine learning (ML) techniques to solve this problem is highly recommended. In this study, machine learning models, namely support vector machine (SVM), Gaussian process regression (GPR), and random forest (RF), were built based on a data set of 145 soil samples collected from the Da Nang-Quang Ngai expressway project, Vietnam. The data set includes six input parameters: clay content, moisture content, liquid limit, plastic limit, specific gravity, and void ratio. The performance of the models was assessed by three statistical criteria, namely the correlation coefficient (R), mean absolute error (MAE), and root mean square error (RMSE). The results demonstrated that the proposed RF model could predict soil cohesion with high accuracy (R = 0.891) and low error (RMSE = 3.323 and MAE = 2.511), and that its predictive capability is better than that of SVM and GPR. Therefore, the RF model can be used as a cost-effective approach for predicting the soil cohesion used in the design and inspection of construction projects.
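A minimal sketch of the three-model comparison described above, assuming scikit-learn; the synthetic 145-sample matrix, the hypothetical cohesion values, and the default hyperparameters are placeholders for the study's real data and tuning.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder for the 145-sample data set with six inputs (clay content,
# moisture content, liquid limit, plastic limit, specific gravity, void ratio).
rng = np.random.default_rng(7)
X = rng.random((145, 6))
y = rng.random(145) * 40          # hypothetical cohesion values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

models = {"SVM": SVR(), "GPR": GaussianProcessRegressor(), "RF": RandomForestRegressor(random_state=7)}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    r = np.corrcoef(y_te, pred)[0, 1]               # correlation coefficient R
    rmse = np.sqrt(mean_squared_error(y_te, pred))  # root mean square error
    mae = mean_absolute_error(y_te, pred)           # mean absolute error
    print(f"{name}: R={r:.3f} RMSE={rmse:.3f} MAE={mae:.3f}")
```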


Large volumes of data are generated and stored across many fields; such collections are referred to as big data. In healthcare, big data includes huge amounts of clinical data for every patient, maintained in electronic health records (EHRs). More than 80% of clinical data is in unstructured format and stored in hundreds of forms. The challenge for data storage and analysis is to handle such large data sets efficiently and scalably. The Hadoop MapReduce framework can store and process any kind of data quickly; it is not only a storage system but also a platform for data processing, and it is scalable and fault tolerant. Prediction on these data sets is handled by machine learning algorithms. This work focuses on the extreme learning machine (ELM) and predicts disease risk by combining ELM with a Cuckoo Search optimization-based Support Vector Machine (CS-SVM). The proposed work also considers the scalability and accuracy of big data models; the proposed algorithm handles the computing workload well and achieves good performance in terms of both accuracy and efficiency.
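As a rough illustration of the ELM component mentioned above, the sketch below implements a minimal extreme learning machine in NumPy (random hidden layer, least-squares output weights); the cuckoo-search-optimized SVM part of the proposed hybrid is not reproduced, and the feature matrix and labels are hypothetical.

```python
import numpy as np

# Minimal extreme learning machine (ELM): a single hidden layer with random,
# untrained weights and an output layer solved by least squares. This sketches
# only the ELM component of the proposed CS-SVM hybrid.
rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=64):
    W = rng.normal(size=(X.shape[1], n_hidden))      # random input weights
    b = rng.normal(size=n_hidden)                    # random biases
    H = np.tanh(X @ W + b)                           # hidden-layer activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)     # output weights by least squares
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Hypothetical patient-record features and binary disease-risk labels.
X = rng.random((200, 10))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)
W, b, beta = elm_fit(X, y)
risk_scores = elm_predict(X, W, b, beta)             # threshold at 0.5 for a label
```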


10.2196/15347 ◽  
2020 ◽  
Vol 22 (11) ◽  
pp. e15347
Author(s):  
Christopher Michael Homan ◽  
J Nicolas Schrading ◽  
Raymond W Ptucha ◽  
Catherine Cerulli ◽  
Cecilia Ovesdotter Alm

Background: Social media is a rich, virtually untapped source of data on the dynamics of intimate partner violence, one that is both global in scale and intimate in detail.
Objective: The aim of this study is to use machine learning and other computational methods to analyze social media data for the reasons victims give for staying in or leaving abusive relationships.
Methods: Human annotation, part-of-speech tagging, and machine learning predictive models, including support vector machines, were used on a Twitter data set of 8,767 tweets each from the hashtags #WhyIStayed and #WhyILeft.
Results: Our methods explored whether we can analyze micronarratives that include details about victims, abusers, and other stakeholders, the actions that constitute abuse, and how the stakeholders respond.
Conclusions: Our findings are consistent across various machine learning methods, correspond to observations in the clinical literature, and affirm the relevance of natural language processing and machine learning for exploring issues of societal importance in social media.
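One way the predictive-model component might be sketched is a linear SVM over TF-IDF features whose coefficients indicate which terms push a tweet toward each hashtag; the example tweets are invented, and this stand-in omits the study's human annotation and part-of-speech tagging steps.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical tweets; the study used 8,767 tweets from each hashtag.
tweets = ["he said he would change", "i finally felt strong enough",
          "i was afraid to leave", "my kids deserved better"]
labels = ["stayed", "left", "stayed", "left"]

vec = TfidfVectorizer()
X = vec.fit_transform(tweets)
clf = LinearSVC().fit(X, labels)

# Inspect which terms pull the decision toward each class; for binary LinearSVC
# positive coefficients favor the second class in clf.classes_ ("stayed").
terms = np.array(vec.get_feature_names_out())
order = np.argsort(clf.coef_[0])
print("most 'left'-like terms:", terms[order[:3]])
print("most 'stayed'-like terms:", terms[order[-3:]])
```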


2021 ◽  
Author(s):  
S. H. Al Gharbi ◽  
A. A. Al-Majed ◽  
A. Abdulraheem ◽  
S. Patil ◽  
S. M. Elkatatny

Abstract
Due to the high demand for energy, oil and gas companies have started to drill wells in remote areas and unconventional environments. This has raised the complexity of drilling operations, which were already challenging and complex. To adapt, drilling companies expanded their use of the real-time operation center (RTOC) concept, in which real-time drilling data are transmitted from remote sites to companies’ headquarters. In the RTOC, groups of subject matter experts monitor the drilling live and provide real-time advice to improve operations. With the increase in drilling operations, processing the volume of generated data is beyond a human's capability, limiting the RTOC's impact on certain components of drilling operations. To overcome this limitation, artificial intelligence and machine learning (AI/ML) technologies were introduced to monitor and analyze the real-time drilling data, discover hidden patterns, and provide fast decision-support responses. AI/ML technologies are data-driven, and their quality relies on the quality of the input data: if the quality of the input data is good, the generated output will be good; if not, the generated output will be bad. Unfortunately, due to the harsh environments of drilling sites and the transmission setups, not all of the drilling data is good, which negatively affects the AI/ML results. The objective of this paper is to utilize AI/ML technologies to improve the quality of real-time drilling data. The paper fed a large real-time drilling dataset, consisting of over 150,000 raw data points, into Artificial Neural Network (ANN), Support Vector Machine (SVM) and Decision Tree (DT) models. The models were trained on the valid and not-valid data points. A confusion matrix was used to evaluate the different AI/ML models, including different internal architectures. Despite its slowness, the ANN achieved the best result, with an accuracy of 78%, compared to 73% and 41% for DT and SVM, respectively. The paper concludes by presenting a process for using AI technology to improve real-time drilling data quality. To the authors' knowledge, based on the literature in the public domain, this paper is one of the first to compare the use of multiple AI/ML techniques for the quality improvement of real-time drilling data. The paper provides a guide for improving the quality of real-time drilling data.
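A hedged sketch of the model comparison described above, using scikit-learn classifiers and a confusion matrix on a synthetic stand-in for the valid/not-valid drilling data points; the network architecture, hyperparameters, and class balance are assumptions, not the paper's settings.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.datasets import make_classification

# Placeholder for the >150,000 real-time drilling data points labeled
# valid / not-valid; features stand in for the transmitted sensor channels.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.7, 0.3], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=3),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=3),
}
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    print(name, "accuracy:", round(accuracy_score(y_te, pred), 3))
    print(confusion_matrix(y_te, pred))   # rows: true class, columns: predicted class
```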


Author(s):  
Hesham M. Al-Ammal

Detection of anomalies in a given data set is a vital step in several applications in cybersecurity, including intrusion detection, fraud detection, and social network analysis. Many of these techniques detect anomalies by examining graph-based data. Analyzing graphs makes it possible to capture relationships and communities, as well as anomalies. The advantage of using graphs is that many real-life situations can be easily modeled by a graph that captures their structure and inter-dependencies. Although anomaly detection in graphs dates back to the 1990s, recent research has applied machine learning methods to anomaly detection over graphs. This chapter concentrates on static graphs (both labeled and unlabeled) and summarizes some of these recent studies on machine learning for anomaly detection in graphs, including methods such as support vector machines, neural networks, generative neural networks, and deep learning methods. The chapter also reflects on the successes and challenges of using these methods in the context of graph-based anomaly detection.


2020 ◽  
pp. 865-874
Author(s):  
Enrico Santus ◽  
Tal Schuster ◽  
Amir M. Tahmasebi ◽  
Clara Li ◽  
Adam Yala ◽  
...  

PURPOSE: Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations.
METHODS: We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institution to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB).
RESULTS: When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains.
CONCLUSION: We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
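The Rule as Feature idea can be sketched roughly as below: a handcrafted rule's binary output is appended to the text features so the classifier learns how much weight to give it. The example segments, the keyword rule, and the attribute are invented for illustration and do not reproduce the study's 20 attributes or rule set.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical report segments and a binary attribute (e.g. "tumor present").
segments = ["adenocarcinoma gleason score 7", "benign prostatic tissue",
            "carcinoma involving 20% of core", "no tumor identified"]
labels = [1, 0, 1, 0]

def rule(text):
    # Stand-in handcrafted rule: fire when a carcinoma keyword appears.
    return int("carcinoma" in text)

# Rule as Feature: the rule's output is appended to the text features,
# so the classifier can learn how much to trust it.
vec = TfidfVectorizer()
X_text = vec.fit_transform(segments)
X_rule = np.array([[rule(s)] for s in segments])
X = hstack([X_text, X_rule])

clf = LogisticRegression().fit(X, labels)
new = "small focus of carcinoma"
print(clf.predict(hstack([vec.transform([new]), np.array([[rule(new)]])])))
```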

