MRI-Based Machine Learning in Differentiation Between Benign and Malignant Breast Lesions

BackgroundDifferential diagnosis between benign and malignant breast lesions is of crucial importance relating to follow-up treatment. Recent development in texture analysis and machine learning may lead to a new solution to this problem.MethodThis current study enrolled a total number of 265 patients (benign breast lesions:malignant breast lesions = 71:194) diagnosed in our hospital and received magnetic resonance imaging between January 2014 and August 2017. Patients were randomly divided into the training group and validation group (4:1), and two radiologists extracted their texture features from the contrast-enhanced T1-weighted images. We performed five different feature selection methods including Distance correlation, Gradient Boosting Decision Tree (GBDT), least absolute shrinkage and selection operator (LASSO), random forest (RF), eXtreme gradient boosting (Xgboost) and five independent classification models were built based on Linear discriminant analysis (LDA) algorithm.ResultsAll five models showed promising results to discriminate malignant breast lesions from benign breast lesions, and the areas under the curve (AUCs) of receiver operating characteristic (ROC) were all above 0.830 in both training and validation groups. The model with a better discriminating ability was the combination of LDA + gradient boosting decision tree (GBDT). The sensitivity, specificity, AUC, and accuracy in the training group were 0.814, 0.883, 0.922, and 0.868, respectively; LDA + random forest (RF) also suggests promising results with the AUC of 0.906 in the training group.ConclusionThe evidence of this study, while preliminary, suggested that a combination of MRI texture analysis and LDA algorithm could discriminate benign breast lesions from malignant breast lesions. Further multicenter researches in this field would be of great help in the validation of the result.

Download Full-text

Modelos de machine learning para predição do sucesso de startups

Revista de Gestão e Projetos ◽

10.5585/gep.v12i2.18942 ◽

2021 ◽

Vol 12 (2) ◽

pp. 28-55

Author(s):

Fabiano Rodrigues ◽

Francisco Aparecido Rodrigues ◽

Thelma Valéria Rocha Rodrigues

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Decision Tree ◽

Initial Public Offering ◽

Gradient Boosting ◽

Support Vector ◽

Trade Offs ◽

Extreme Gradient Boosting ◽

Public Offering

Este estudo analisa resultados obtidos com modelos de machine learning para predição do sucesso de startups. Como proxy de sucesso considera-se a perspectiva do investidor, na qual a aquisição da startup ou realização de IPO (Initial Public Offering) são formas de recuperação do investimento. A revisão da literatura aborda startups e veículos de financiamento, estudos anteriores sobre predição do sucesso de startups via modelos de machine learning, e trade-offs entre técnicas de machine learning. Na parte empírica, foi realizada uma pesquisa quantitativa baseada em dados secundários oriundos da plataforma americana Crunchbase, com startups de 171 países. O design de pesquisa estabeleceu como filtro startups fundadas entre junho/2010 e junho/2015, e uma janela de predição entre junho/2015 e junho/2020 para prever o sucesso das startups. A amostra utilizada, após etapa de pré-processamento dos dados, foi de 18.571 startups. Foram utilizados seis modelos de classificação binária para a predição: Regressão Logística, Decision Tree, Random Forest, Extreme Gradiente Boosting, Support Vector Machine e Rede Neural. Ao final, os modelos Random Forest e Extreme Gradient Boosting apresentaram os melhores desempenhos na tarefa de classificação. Este artigo, envolvendo machine learning e startups, contribui para áreas de pesquisa híbridas ao mesclar os campos da Administração e Ciência de Dados. Além disso, contribui para investidores com uma ferramenta de mapeamento inicial de startups na busca de targets com maior probabilidade de sucesso.

Download Full-text

Evaluation of Three Different Machine Learning Methods for Object-Based Artificial Terrace Mapping—A Case Study of the Loess Plateau, China

Remote Sensing ◽

10.3390/rs13051021 ◽

2021 ◽

Vol 13 (5) ◽

pp. 1021

Author(s):

Hu Ding ◽

Jiaming Na ◽

Shangjing Jiang ◽

Jie Zhu ◽

Kai Liu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Loess Plateau ◽

Water Conservation ◽

Nearest Neighbor ◽

Gradient Boosting ◽

K Nearest Neighbor ◽

The Loess Plateau ◽

Object Based ◽

Extreme Gradient Boosting

Artificial terraces are of great importance for agricultural production and soil and water conservation. Automatic high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. As a result of the importance of the contextual information for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies are widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three different commonly-used ML classifiers, namely, extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. The comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.

Download Full-text

Development and validation of a difficult laryngoscopy prediction model using machine learning of neck circumference and thyromental height

BMC Anesthesiology ◽

10.1186/s12871-021-01343-4 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Jong Ho Kim ◽

Haewon Kim ◽

Ji Su Jang ◽

Sung Mi Hwang ◽

So Young Lim ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Confidence Interval ◽

Neck Circumference ◽

Difficult Laryngoscopy ◽

Gradient Boosting ◽

Test Set ◽

Equal Distribution ◽

Light Gradient ◽

Extreme Gradient Boosting

Abstract Background Predicting difficult airway is challengeable in patients with limited airway evaluation. The aim of this study is to develop and validate a model that predicts difficult laryngoscopy by machine learning of neck circumference and thyromental height as predictors that can be used even for patients with limited airway evaluation. Methods Variables for prediction of difficulty laryngoscopy included age, sex, height, weight, body mass index, neck circumference, and thyromental distance. Difficult laryngoscopy was defined as Grade 3 and 4 by the Cormack-Lehane classification. The preanesthesia and anesthesia data of 1677 patients who had undergone general anesthesia at a single center were collected. The data set was randomly stratified into a training set (80%) and a test set (20%), with equal distribution of difficulty laryngoscopy. The training data sets were trained with five algorithms (logistic regression, multilayer perceptron, random forest, extreme gradient boosting, and light gradient boosting machine). The prediction models were validated through a test set. Results The model’s performance using random forest was best (area under receiver operating characteristic curve = 0.79 [95% confidence interval: 0.72–0.86], area under precision-recall curve = 0.32 [95% confidence interval: 0.27–0.37]). Conclusions Machine learning can predict difficult laryngoscopy through a combination of several predictors including neck circumference and thyromental height. The performance of the model can be improved with more data, a new variable and combination of models.

Download Full-text

Machine learning as a successful approach for predicting complex spatio–temporal patterns in animal species abundance

Animal Biodiversity and Conservation ◽

10.32800/abc.2021.44.0289 ◽

2021 ◽

pp. 289-301

Author(s):

B. Martín ◽

J. González–Arias ◽

J. A. Vicente–Vírseda

Keyword(s):

Machine Learning ◽

Random Forest ◽

Animal Species ◽

Temporal Patterns ◽

Additive Models ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Spatio Temporal

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest –enhanced form of bootstrap. We also performed extreme gradient boosting –an enhanced form of radiant boosting– to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R2). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.

Download Full-text

Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detect malware like packet content analysis are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features like packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers in order to classify malicious and benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyper parameters of the algorithms, in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.

Download Full-text

Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework

Journal of Diabetes Research ◽

10.1155/2020/6873891 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Mingyue Xue ◽

Yinxia Su ◽

Chen Li ◽

Shuxia Wang ◽

Hua Yao

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Decision Tree ◽

Type Ii Diabetes ◽

Large Scale ◽

Systolic Pressure ◽

Gradient Boosting ◽

Significant Feature ◽

Type Ii ◽

Extreme Gradient Boosting

Background. An estimated 425 million people globally have diabetes, accounting for 12% of the world’s health expenditures, and the number continues to grow, placing a huge burden on the healthcare system, especially in those remote, underserved areas. Methods. A total of 584,168 adult subjects who have participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratio, using logistic regression (LR) based on variables of physical measurement and a questionnaire. Combined with the risk factors selected by LR, we used a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output the degree of variables’ importance scores of T2DM. Results. The results indicated that XGBoost had the best performance (accuracy=0.906, precision=0.910, recall=0.902, F‐1=0.906, and AUC=0.968). The degree of variables’ importance scores in XGBoost showed that BMI was the most significant feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). Conclusions. We proposed a classifier based on LR-XGBoost which used fourteen variables of patients which are easily obtained and noninvasive as predictor variables to identify potential incidents of T2DM. The classifier can accurately screen the risk of diabetes in the early phrase, and the degree of variables’ importance scores gives a clue to prevent diabetes occurrence.

Download Full-text

Random forest and extreme gradient boosting algorithms for streamflow modeling using vessel features, and tree-rings

10.21203/rs.3.rs-303081/v1 ◽

2021 ◽

Author(s):

Hossein Sahour ◽

Vahid Gholami ◽

Javad Torkman ◽

Mehdi Vazifedan ◽

Sirwe Saeedi

Keyword(s):

Machine Learning ◽

Random Forest ◽

Tree Rings ◽

Test Site ◽

Machine Learning Algorithms ◽

Coefficient Of Determination ◽

Gradient Boosting ◽

Growing Seasons ◽

Extreme Gradient Boosting ◽

Streamflow Modeling

Abstract Monitoring temporal variation of streamflow is necessary for many water resources management plans, yet, such practices are constrained by the absence or paucity of data in many rivers around the world. Using a permanent river in the north of Iran as a test site, a machine learning framework was proposed to model the streamflow data in the three periods of growing seasons based on tree-rings and vessel features of the Zelkova carpinifolia species. First, full-disc samples were taken from 30 trees near the river, and the samples went through preprocessing, cross-dating, standardization, and time series analysis. Two machine learning algorithms, namely random forest (RF) and extreme gradient boosting (XGB), were used to model the relationships between dendrochronology variables (tree-rings and vessel features in the three periods of growing seasons) and the corresponding streamflow rates. The performance of each model was evaluated using statistical coefficients (coefficient of determination (R-squared), Nash-Sutcliffe efficiency (NSE), and root-mean-square error (NRMSE)). Findings demonstrate that consideration should be given to the XGB model in streamflow modeling given its apparent enhanced performance (R-squared: 0.87; NSE: 0.81; and NRMSE: 0.43) over the RF model (R-squared: 0.82; NSE: 0.71; and NRMSE: 0.52). Further, the results showed that the models perform better in modeling the normal and low flows compared to extremely high flows. Finally, the tested models were used to reconstruct the temporal streamflow during the past decades (1970–1981).

Download Full-text

Machine Learning Framework to Predict Chronic Kidney Disease using Ensemble Algorithm

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.d9107.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 1-6

Keyword(s):

Machine Learning ◽

Chronic Kidney Disease ◽

Random Forest ◽

Kidney Disease ◽

Decision Tree ◽

Performance Metrics ◽

Weighted Average ◽

Gradient Boosting ◽

Support Vector ◽

The Individual

Chronic Kidney Disease (CKD) is a worldwide concern that influences roughly 10% of the grown-up population on the world. For most of the people the early diagnosis of CKD is often not possible. Therefore, the utilization of present-day Computer aided supported strategies is important to help the conventional CKD finding framework to be progressively effective and precise. In this project, six modern machine learning techniques namely Multilayer Perceptron Neural Network, Support Vector Machine, Naïve Bayes, K-Nearest Neighbor, Decision Tree, Logistic regression were used and then to enhance the performance of the model Ensemble Algorithms such as ADABoost, Gradient Boosting, Random Forest, Majority Voting, Bagging and Weighted Average were used on the Chronic Kidney Disease dataset from the UCI Repository. The model was tuned finely to get the best hyper parameters to train the model. The performance metrics used to evaluate the model was measured using Accuracy, Precision, Recall, F1-score, Mathew`s Correlation Coefficient and ROC-AUC curve. The experiment was first performed on the individual classifiers and then on the Ensemble classifiers. The ensemble classifier like Random Forest and ADABoost performed better with 100% Accuracy, Precision and Recall when compared to the individual classifiers with 99.16% accuracy, 98.8% Precision and 100% Recall obtained from Decision Tree Algorithm

Download Full-text

Application of Bayesian Hyperparameter Optimized Random Forest and XGBoost Model for Landslide Susceptibility Mapping

Frontiers in Earth Science ◽

10.3389/feart.2021.712240 ◽

2021 ◽

Vol 9 ◽

Author(s):

Shibao Wang ◽

Jianqi Zhuang ◽

Jia Zheng ◽

Hongyu Fan ◽

Jiaxu Kong ◽

...

Keyword(s):

Random Forest ◽

Decision Tree ◽

Landslide Susceptibility ◽

Susceptibility Mapping ◽

Landslide Susceptibility Mapping ◽

Gradient Boosting ◽

The Loess Plateau ◽

Tree Model ◽

Validation Data ◽

Extreme Gradient Boosting

Landslides are widely distributed worldwide and often result in tremendous casualties and economic losses, especially in the Loess Plateau of China. Taking Wuqi County in the hinterland of the Loess Plateau as the research area, using Bayesian hyperparameters to optimize random forest and extreme gradient boosting decision trees model for landslide susceptibility mapping, and the two optimized models are compared. In addition, 14 landslide influencing factors are selected, and 734 landslides are obtained according to field investigation and reports from literals. The landslides were randomly divided into training data (70%) and validation data (30%). The hyperparameters of the random forest and extreme gradient boosting decision tree models were optimized using a Bayesian algorithm, and then the optimal hyperparameters are selected for landslide susceptibility mapping. Both models were evaluated and compared using the receiver operating characteristic curve and confusion matrix. The results show that the AUC validation data of the Bayesian optimized random forest and extreme gradient boosting decision tree model are 0.88 and 0.86, respectively, which showed an improvement of 4 and 3%, indicating that the prediction performance of the two models has been improved. However, the random forest model has a higher predictive ability than the extreme gradient boosting decision tree model. Thus, hyperparameter optimization is of great significance in the improvement of the prediction accuracy of the model. Therefore, the optimized model can generate a high-quality landslide susceptibility map.

Download Full-text

Research on dairy products detection based on machine learning algorithm

MATEC Web of Conferences ◽

10.1051/matecconf/202235503008 ◽

2022 ◽

Vol 355 ◽

pp. 03008

Author(s):

Yang Zhang ◽

Lei Zhang ◽

Yabin Ma ◽

Jinsen Guan ◽

Zhaoxia Liu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Electronic Nose ◽

Milk Fat ◽

Dairy Products ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting

In this study, an electronic nose model composed of seven kinds of metal oxide semiconductor sensors was developed to distinguish the milk source (the dairy farm to which milk belongs), estimate the content of milk fat and protein in milk, to identify the authenticity and evaluate the quality of milk. The developed electronic nose is a low-cost and non-destructive testing equipment. (1) For the identification of milk sources, this paper uses the method of combining the electronic nose odor characteristics of milk and the component characteristics to distinguish different milk sources, and uses Principal Component Analysis (PCA) and Linear Discriminant Analysis , LDA) for dimensionality reduction analysis, and finally use three machine learning algorithms such as Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF) to build a milk source (cow farm) Identify the model and evaluate and compare the classification effects. The experimental results prove that the classification effect of the SVM-LDA model based on the electronic nose odor characteristics is better than other single feature models, and the accuracy of the test set reaches 91.5%. The RF-LDA and SVM-LDA models based on the fusion feature of the two have the best effect Set accuracy rate is as high as 96%. (2) The three algorithms, Gradient Boosting Decision Tree (GBDT), Extreme Gradient Boosting (XGBoost) and Random Forest (RF), are used to construct the electronic nose odor data for milk fat rate and protein rate. The method of estimating the model, the results show that the RF model has the best estimation performance( R2 =0.9399 for milk fat; R2=0.9301for milk protein). And it prove that the method proposed in this study can improve the estimation accuracy of milk fat and protein, which provides a technical basis for predicting the quality of dairy products.

Download Full-text