Machine Learning Models
Recently Published Documents

2022 ◽  
Vol 9 (3) ◽  
pp. 1-22
Mohammad Daradkeh

This study presents a data analytics framework for analyzing the topics and sentiments associated with COVID-19 vaccine misinformation in social media. A total of 40,359 tweets related to COVID-19 vaccination were collected between January 2021 and March 2021. Misinformation was detected using multiple predictive machine learning models. A Latent Dirichlet Allocation (LDA) topic model was used to identify dominant topics in COVID-19 vaccine misinformation. The sentiment orientation of misinformation was analyzed using a lexicon-based approach. An independent-samples t-test was performed to compare the number of replies, retweets, and likes of misinformation with different sentiment orientations. Based on the data sample, the results show that COVID-19 vaccine misinformation included 21 major topics. Across all misinformation topics, the average number of replies, retweets, and likes of tweets with negative sentiment was 2.26, 2.68, and 3.29 times higher, respectively, than those with positive sentiment.
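The engagement comparison above rests on an independent-samples t-test. The following is a minimal sketch of the pooled-variance (Student) form using only the Python standard library. The retweet counts are invented for illustration and are not the study's data, and the abstract does not state whether the pooled or the Welch variant was used, so the pooled form here is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(a, b):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t-test."""
    na, nb = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical retweet counts per tweet (illustrative only).
negative = [12, 9, 15, 11, 13]
positive = [4, 6, 5, 3, 7]

t_stat, dof = independent_t(negative, positive)
# Ratio of group means, the same kind of figure as the "times higher" values above.
ratio = mean(negative) / mean(positive)
```

The t statistic and its degrees of freedom would then be compared against a t distribution to obtain a p-value; that lookup is left out to keep the sketch free of external dependencies.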

2021 ◽  
Vol 12 ◽  
Junzhao Cui ◽  
Jingyi Yang ◽  
Kun Zhang ◽  
Guodong Xu ◽  
Ruijie Zhao ◽  

Objectives: Patients with anterior circulation large vessel occlusion (AC-LVO) are at high risk of acute ischemic stroke, which can be disabling or fatal. In this study, we applied machine learning to develop and validate two prediction models, one for acute ischemic stroke (Model 1) and one for severity of neurological impairment (Model 2), both caused by AC-LVO, based on the medical history and neuroimaging data of patients on admission.
Methods: A total of 1,100 patients with AC-LVO from the Second Hospital of Hebei Medical University in North China were enrolled, of whom 713 presented with acute ischemic stroke (AIS) related to AC-LVO and 387 presented with non-acute ischemic cerebrovascular events. Among the patients with non-acute ischemic cerebrovascular events, 173 with prior stroke or TIA were excluded. Finally, 927 patients with AC-LVO were entered into the derivation cohort. In the external validation cohort, 150 patients with AC-LVO from the Hebei Province People's Hospital, including 99 patients with AIS related to AC-LVO and 51 asymptomatic AC-LVO patients, were retrospectively reviewed. We developed four machine learning models [logistic regression (LR), regularized LR (RLR), support vector machine (SVM), and random forest (RF)], whose performance was internally validated using 5-fold cross-validation. The models were compared by area under the receiver operating characteristic curve (ROC-AUC), and the variables of each algorithm were ranked.
Results: In Model 1, among the included patients with AC-LVO, 713 (76.9%) and 99 (66%) suffered an acute ischemic stroke in the derivation and external validation cohorts, respectively. The ROC-AUCs of LR, RLR, and SVM were significantly higher than that of RF in the external validation cohort [0.66 (95% CI 0.57–0.74) for LR, 0.66 (95% CI 0.57–0.74) for RLR, 0.55 (95% CI 0.45–0.64) for RF, and 0.67 (95% CI 0.58–0.76) for SVM]. In Model 2, 254 (53.9%) and 31 (37.8%) patients suffered disabling ischemic stroke in the derivation and external validation cohorts, respectively. There was no difference in AUC among the four machine learning algorithms in the external validation cohort.
Conclusions: Machine learning methods with multiple clinical variables can predict acute ischemic stroke and the severity of neurological impairment in patients with AC-LVO.
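Two building blocks named in this abstract, k-fold cross-validation and ROC-AUC, can be sketched without any library. The code below is an illustrative sketch, not the authors' pipeline: the function names, labels, and scores are hypothetical, and the folds are contiguous and unshuffled, whereas a clinical study would normally shuffle and stratify by outcome.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties contribute one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k contiguous folds."""
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


# Hypothetical outcomes (1 = stroke) and model scores.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)

folds = list(k_fold_indices(12, k=5))
```

In a full workflow, each model would be fitted on the `train` indices of a fold and scored with `roc_auc` on the matching `val` indices, and the five scores would be averaged.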

2021 ◽  
Vol 14 (12) ◽  
pp. 7411-7424
Moritz Lange ◽  
Henri Suominen ◽  
Mona Kurppa ◽  
Leena Järvi ◽  
Emilia Oikarinen ◽  

Running large-eddy simulations (LESs) can be burdensome and computationally too expensive from the application point of view, for example, to support urban planning. In this study, regression models are used to replicate modelled air pollutant concentrations from LES in urban boulevards. We study the performance of regression models and discuss how to detect situations where the models are applied outside their training domain and their outputs cannot be trusted. Regression models from 10 different model families are trained, and a cross-validation methodology is used to evaluate their performance and to find the best set of features needed to reproduce the LES outputs. We also test the regression models on an independent testing dataset. Our results suggest that, in general, log-linear regression gives the best and most robust performance on new independent data. It clearly outperforms the dummy model, which would predict constant concentrations for all locations (multiplicative minimum RMSE (mRMSE) of 0.76 vs. 1.78 for the dummy model). Furthermore, we demonstrate that it is possible to detect concept drift, i.e. situations where the model is applied outside its training domain and a new LES run may be necessary to obtain reliable results. Regression models can be used to replace LES in estimating air pollutant concentrations, unless higher accuracy is needed. To obtain reliable results, however, it is important to carry out model and feature selection carefully to avoid overfitting and to use methods that detect concept drift.
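The contrast between a log-linear regression and a constant "dummy" predictor can be illustrated with a toy example. This sketch makes several assumptions: a single hypothetical feature (distance), noise-free synthetic concentrations, and a plain RMSE on the log scale rather than the paper's mRMSE, whose exact definition is not given in the abstract. The range check at the end is only a crude stand-in for drift detection, not the authors' method.

```python
from math import exp, log, sqrt


def fit_log_linear(xs, ys):
    """Least-squares fit of log(y) = a + b * x; returns (a, b)."""
    logs = [log(y) for y in ys]
    n = len(xs)
    x_bar = sum(xs) / n
    l_bar = sum(logs) / n
    sxy = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, logs))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return l_bar - b * x_bar, b


def log_rmse(pred, true):
    """Root-mean-square error computed on log-transformed values."""
    return sqrt(sum((log(p) - log(t)) ** 2 for p, t in zip(pred, true)) / len(true))


def in_training_range(x, train_xs):
    """Crude out-of-domain check: is x inside the range seen during training?"""
    return min(train_xs) <= x <= max(train_xs)


# Hypothetical data: concentration decaying exponentially with distance.
distance = [0.0, 1.0, 2.0, 3.0, 4.0]
conc = [exp(3.0 - 0.5 * d) for d in distance]

a, b = fit_log_linear(distance, conc)
model_pred = [exp(a + b * d) for d in distance]

# Dummy model: one constant (the geometric mean) for every location.
dummy_value = exp(sum(log(c) for c in conc) / len(conc))
dummy_pred = [dummy_value] * len(conc)

model_error = log_rmse(model_pred, conc)
dummy_error = log_rmse(dummy_pred, conc)
```

Because the synthetic data are exactly log-linear, the fitted model reproduces them with essentially zero error, while the dummy model cannot; real LES output would of course leave a nonzero residual.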

2021 ◽  
Vol 10 (1-2) ◽  
pp. 30-42
Guan-Yuan Wang

Since the smartphone market is an oligopoly, consumer purchase intention is usually driven by brand preference. This research analyses the customer-to-customer market for second-hand smartphones and shows how the brand factor affects consumers’ purchasing behaviour. It is found that Apple smartphones have a higher recovery value and a longer life cycle than those of other brands. Moreover, the recovery value of other brands' smartphones is significantly driven by the debut date of Apple smartphones, implicitly forming a consumption cycle. In addition, with machine learning models, the predictability of the recovery value reaches 93.55%.

2021 ◽  
Koseki J. Kobayashi-Kirschvink ◽  
Shreya Gaddam ◽  
Taylor James-Sorenson ◽  
Emanuelle Grody ◽  
Johain R. Ounadjela ◽  

Single cell RNA-Seq (scRNA-seq) and other profiling assays have opened new windows into understanding the properties, regulation, dynamics, and function of cells at unprecedented resolution and scale. However, these assays are inherently destructive, precluding us from tracking the temporal dynamics of live cells, in cell culture or whole organisms. Raman microscopy offers a unique opportunity to comprehensively report on the vibrational energy levels of molecules in a label-free and non-destructive manner at a subcellular spatial resolution, but it lacks genetic and molecular interpretability. Here, we developed Raman2RNA (R2R), an experimental and computational framework to infer single-cell expression profiles in live cells through label-free hyperspectral Raman microscopy images and multi-modal data integration and domain translation. We used spatially resolved single-molecule RNA-FISH (smFISH) data as anchors to link scRNA-seq profiles to the paired spatial hyperspectral Raman images, and trained machine learning models to infer expression profiles from Raman spectra at the single-cell level. In reprogramming of mouse fibroblasts into induced pluripotent stem cells (iPSCs), R2R accurately (r>0.96) inferred from Raman images the expression profiles of various cell states and fates, including iPSCs, mesenchymal-epithelial transition (MET) cells, stromal cells, epithelial cells, and fibroblasts. R2R outperformed inference from brightfield images, showing the importance of spectroscopic content afforded by Raman microscopy. Raman2RNA lays a foundation for future investigations into exploring single-cell genome-wide molecular dynamics through imaging data, in vitro and in vivo.

Cancers ◽  
2021 ◽  
Vol 13 (23) ◽  
pp. 6065
Ana Rodrigues ◽  
João Santinha ◽  
Bernardo Galvão ◽  
Celso Matos ◽  
Francisco M. Couto ◽  

Prostate cancer is one of the most prevalent cancers in the male population. Its diagnosis and classification rely on unspecific measures such as PSA levels and DRE, followed by biopsy, where an aggressiveness level is assigned in the form of Gleason Score. Efforts have been made in the past to use radiomics coupled with machine learning to predict prostate cancer aggressiveness from clinical images, showing promising results. Thus, the main goal of this work was to develop supervised machine learning models exploiting radiomic features extracted from bpMRI examinations, to predict biological aggressiveness; 288 classifiers were developed, corresponding to different combinations of pipeline aspects, namely, type of input data, sampling strategy, feature selection method and machine learning algorithm. On a cohort of 281 lesions from 183 patients, it was found that (1) radiomic features extracted from the lesion volume of interest were less stable to segmentation than the equivalent extraction from the whole gland volume of interest; and (2) radiomic features extracted from the whole gland volume of interest produced higher performance and less overfitted classifiers than radiomic features extracted from the lesions volumes of interest. This result suggests that the areas surrounding the tumour lesions offer relevant information regarding the Gleason Score that is ultimately attributed to that lesion.

2021 ◽  
Vol 7 (1) ◽  
Elisabeth J. Schiessler ◽  
Tim Würger ◽  
Sviatlana V. Lamaka ◽  
Robert H. Meißner ◽  
Christian J. Cyron ◽  

The degradation behaviour of magnesium and its alloys can be tuned by small organic molecules. However, automatic identification of effective organic additives within the vast chemical space of potential compounds requires sophisticated tools. Herein, we propose two systematic approaches to sparse feature selection for identifying the molecular descriptors that are most relevant for the corrosion inhibition efficiency of chemical compounds. One is based on the classical statistical tool of analysis of variance, the other on random forests. We demonstrate how both can, when combined with deep neural networks, help to predict the corrosion inhibition efficiencies of chemical compounds for the magnesium alloy ZE41. In particular, we demonstrate that this framework outperforms predictions relying on a random selection of molecular descriptors. Finally, we point out how autoencoders could be used in the future to enable even more accurate automated predictions of corrosion inhibition efficiencies.

2021 ◽  
Vol 14 (23) ◽  
Hany Gamal ◽  
Salaheldin Elkatatny ◽  
Ahmed Abdulhamid Mahmoud

Andrew McDonald ◽  

Decades of subsurface exploration and characterization have led to the collation and storage of large volumes of well-related data. The amount of data gathered daily continues to grow rapidly as technology and recording methods improve. With the increasing adoption of machine-learning techniques in the subsurface domain, it is essential that the quality of the input data is carefully considered when working with these tools. If the input data are of poor quality, the impact on the precision and accuracy of the prediction can be significant. Consequently, this can impact key decisions about the future of a well or a field. This study focuses on well-log data, which can be highly multidimensional, diverse, and stored in a variety of file formats. Well-log data exhibit key characteristics of big data: volume, variety, velocity, veracity, and value. Well data can include numeric values, text values, waveform data, image arrays, maps, and volumes, all of which can be indexed by time or depth in a regular or irregular way. A significant portion of time can be spent gathering data and quality checking it prior to carrying out petrophysical interpretations and applying machine-learning models. Well-log data can be affected by numerous issues that degrade data quality. These include missing data ranging from single data points to entire curves, noisy data from tool-related issues, borehole washout, processing issues, incorrect environmental corrections, and mislabeled data. Having vast quantities of data does not mean it can all be passed into a machine-learning algorithm with the expectation that the resultant prediction is fit for purpose. It is essential that the most important and relevant data are passed into the model through appropriate feature selection techniques. Not only does this improve the quality of the prediction, but it also reduces computational time and can provide a better understanding of how the models reach their conclusion.
This paper reviews data quality issues typically faced by petrophysicists when working with well-log data and deploying machine-learning models. This is achieved by first providing an overview of machine learning and big data within the petrophysical domain, followed by a review of the common well-log data issues, their impact on machine-learning algorithms, and methods for mitigating their influence.
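One of the data-quality issues listed above, missing data ranging from single points to entire curves, lends itself to a simple screening step. The sketch below is illustrative only and is not taken from the paper: the curve names and values are invented, the null marker `-999.25` is an assumed convention that should be confirmed against the actual file header, and the 25% threshold is arbitrary.

```python
from math import isnan

# Assumed null marker; confirm the real value in the file header before use.
NULL_VALUE = -999.25


def missing_fraction(curve, null_value=NULL_VALUE):
    """Share of samples in a curve that are NaN or equal to the null marker."""
    missing = sum(1 for v in curve if isnan(v) or v == null_value)
    return missing / len(curve)


def usable_curves(curves, max_missing=0.25):
    """Keep only curves whose missing fraction does not exceed max_missing."""
    return {
        name: values
        for name, values in curves.items()
        if missing_fraction(values) <= max_missing
    }


# Hypothetical well-log curves sampled at five depths.
logs = {
    "GR": [85.2, 90.1, NULL_VALUE, 88.7, 91.3],
    "RHOB": [2.45, NULL_VALUE, NULL_VALUE, float("nan"), 2.51],
    "DT": [70.0, 71.5, 69.8, 72.2, 70.9],
}

kept = usable_curves(logs)
```

A screen like this only addresses missing values; noisy readings, washout intervals, and mislabeled curves need separate checks.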

2021 ◽  
Igor Soares ◽  
Fernando Camargo ◽  
Adriano Marques ◽  
Oliver Crook

Genome engineering is undergoing unprecedented development and is now becoming widely available. To ensure responsible biotechnology innovation and to reduce misuse of engineered DNA sequences, it is vital to develop tools to identify the lab-of-origin of engineered plasmids. Genetic engineering attribution (GEA), the ability to make sequence-lab associations, would support forensic experts in this process. Here, we propose a method, based on metric learning, that ranks the most likely labs-of-origin whilst simultaneously generating embeddings for plasmid sequences and labs. These embeddings can be used to perform various downstream tasks, such as clustering DNA sequences and labs, as well as using them as features in machine learning models. Our approach employs a circular shift augmentation approach and is able to correctly rank the lab-of-origin 90% of the time within its top 10 predictions, outperforming all current state-of-the-art approaches. We also demonstrate that we can perform few-shot-learning and obtain 76% top-10 accuracy using only 10% of the sequences. This means, we outperform the previous CNN approach using only one-tenth of the data. We also demonstrate that we are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model’s outputs.
CCS Concepts: Information systems → Similarity measures; Learning to rank.
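Two ideas in this abstract are easy to show in miniature: circular shift augmentation, which uses the fact that a plasmid is circular and has no natural start position, and top-k accuracy, the metric behind the "top 10" figures. The sketch below is a hedged illustration, not the authors' implementation; the sequence, lab names, and function names are all hypothetical.

```python
import random


def circular_shift(seq, offset):
    """Rotate a circular sequence so that it starts at position offset."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def augment(seq, n_copies, rng):
    """Return n_copies random rotations of the same plasmid sequence."""
    return [circular_shift(seq, rng.randrange(len(seq))) for _ in range(n_copies)]


def top_k_accuracy(ranked_labs, true_labs, k=10):
    """Fraction of sequences whose true lab appears among the first k ranked labs."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_labs, true_labs))
    return hits / len(true_labs)


# Hypothetical plasmid fragment and rotations of it.
plasmid = "ATGCGTAC"
shifted = circular_shift(plasmid, 3)
copies = augment(plasmid, 4, random.Random(0))

# Hypothetical ranked predictions for two sequences, both truly from lab_b.
rankings = [["lab_a", "lab_b", "lab_c"], ["lab_c", "lab_a", "lab_b"]]
acc = top_k_accuracy(rankings, ["lab_b", "lab_b"], k=2)
```

Every rotation contains the same bases in the same cyclic order, so the augmented copies teach a model that the starting position of the recorded sequence carries no information about the lab.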
