Bio, psycho, or social: supervised machine learning to classify discursive framing of depression in online health communities

Author(s):  
Renáta Németh ◽  
Fanni Máté ◽  
Eszter Katona ◽  
Márton Rakovics ◽  
Domonkos Sik

Abstract: Supervised machine learning on textual data has successful industrial and business applications, but it is an open question whether it can be utilized in social knowledge building beyond hermeneutically more trivial cases. Combining sociology and data science raises several methodological and epistemological questions. In our study the discursive framing of depression is explored in online health communities. Three discursive frameworks are introduced: the bio-medical, psychological, and social framings of depression. ~80,000 posts were collected, and a sample of them was manually classified. Conventional bag-of-words models, a Gradient Boosting Machine, word-embedding-based models, and a state-of-the-art Transformer-based model with transfer learning, DistilBERT, were applied to extend this classification to the whole database. In our experience, 'discursive framing' proves to be a complex and hermeneutically difficult concept, which affects the degree of both inter-annotator agreement and predictive performance. Our findings confirm that the level of inter-annotator disagreement provides a good estimate of the objective difficulty of the classification. By identifying the most important terms, we also interpreted the classification algorithms, which is of great importance in the social sciences. We are convinced that machine learning techniques can extend the horizon of qualitative text analysis. Our paper supports a smooth fit of the new techniques into the traditional toolbox of the social sciences.
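The bag-of-words baseline named above can be illustrated with a short sketch: TF-IDF features feed a Gradient Boosting classifier, and the boosted trees' feature importances surface the "most important terms" used for interpretation. The toy posts, labels, and parameter choices below are illustrative assumptions, not the authors' actual pipeline or data.

```python
# A minimal sketch (not the authors' pipeline) of a bag-of-words baseline:
# TF-IDF features + Gradient Boosting, with feature importances mapped back
# to vocabulary terms for interpretation. Posts and labels are invented.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

posts = [
    "my serotonin levels are off and the medication finally helps",
    "the psychiatrist adjusted my antidepressant dosage again",
    "I keep ruminating on old memories and blaming myself",
    "therapy is teaching me to notice my negative thought patterns",
    "losing my job and the debt is what pushed me into this",
    "nobody around here can afford help, the whole town is struggling",
]
framings = ["biomedical", "biomedical", "psychological",
            "psychological", "social", "social"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(posts)
clf = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(X, framings)

# Rank vocabulary terms by importance to see what drives the classifier.
terms = np.array(vectorizer.get_feature_names_out())
top = terms[np.argsort(clf.feature_importances_)[::-1][:5]]
print("most important terms:", top)
```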

2020 ◽  
Vol 28 (2) ◽  
pp. 253-265 ◽  
Author(s):  
Gabriela Bitencourt-Ferreira ◽  
Amauri Duarte da Silva ◽  
Walter Filgueira de Azevedo

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors of this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.
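As a rough illustration of what a "targeted scoring function" looks like in practice, the sketch below trains a supervised regressor on per-complex energy terms to predict experimental affinity and reports a cross-validated score. The features and data are synthetic placeholders, not the descriptors or dataset used in the review.

```python
# A minimal sketch of a targeted scoring function: a supervised regressor
# trained on per-complex energy terms (e.g. van der Waals, hydrogen-bond,
# torsional contributions from a docking program) to predict experimental
# affinity such as log Ki. Features and values here are synthetic assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_complexes = 200
X = rng.normal(size=(n_complexes, 3))                 # assumed energy terms
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.2 * X[:, 2] \
    + rng.normal(scale=0.3, size=n_complexes)         # assumed affinity values

model = RandomForestRegressor(n_estimators=300, random_state=0)
# Agreement with experiment across folds is what distinguishes targeted
# scoring functions from the classical ones in the comparison above.
r2_scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", r2_scores.mean())
```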


2020 ◽  
pp. 1-26
Author(s):  
Joshua Eykens ◽  
Raf Guns ◽  
Tim C.E. Engels

We compare two supervised machine learning algorithms—Multinomial Naïve Bayes and Gradient Boosting—to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigned to a document make this task challenging. To collect the training data, we query three discipline-specific thesauri to retrieve articles corresponding to specialties in the classification. The resulting dataset consists of 113,909 records and covers 245 specialties, aggregated into 31 subdisciplines from three disciplines. Experts were consulted to validate the thesauri-based classification. The resulting multi-label dataset is used to train the machine learning algorithms in different configurations. We deploy a multi-label classifier chaining model, allowing an arbitrary number of categories to be assigned to each document. The best results are obtained with Gradient Boosting. The approach does not rely on citation data, so it can be applied in settings where such information is not available. We conclude that fine-grained text-based classification of social science publications at a subdisciplinary level is a hard task, for humans and machines alike. A combination of human expertise and machine learning is suggested as a way forward to improve the classification of social science documents.
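A minimal sketch of the classifier-chaining setup described above, assuming scikit-learn: TF-IDF text features feed a chain of Gradient Boosting classifiers so that any number of subdiscipline labels can be assigned per article. The example titles and labels are invented placeholders.

```python
# Sketch of multi-label classification with a classifier chain over
# Gradient Boosting base learners. Titles and labels are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.preprocessing import MultiLabelBinarizer

titles = [
    "Electoral behaviour and party identification in local elections",
    "Classroom interaction and teacher training outcomes",
    "Party systems and civic education in secondary schools",
]
labels = [["political science"], ["education"], ["political science", "education"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                 # binary indicator matrix
X = TfidfVectorizer().fit_transform(titles)   # bag-of-words features

# Each classifier in the chain also sees the predictions of earlier labels,
# which lets the model exploit label co-occurrence.
chain = ClassifierChain(GradientBoostingClassifier(random_state=0), random_state=0)
chain.fit(X, Y)
print(mlb.inverse_transform(chain.predict(X)))
```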


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 403
Author(s):  
Muhammad Waleed ◽  
Tai-Won Um ◽  
Tariq Kamal ◽  
Syed Muhammad Usman

In this paper, we apply multi-class supervised machine learning techniques to classify agricultural farm machinery. The classification of farm machinery is important when performing automatic authentication of field activity in a remote setup; in the absence of a sound machine recognition system, there is every possibility of fraudulent activity taking place. To address this need, we classify the machinery using five machine learning techniques: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Gradient Boosting (GB). For training the models, we use the vibration and tilt of the machinery, recorded with accelerometer and gyroscope sensors, respectively. The machinery included a leveler, rotavator, and cultivator. Preliminary analysis of the collected data revealed that the farm machinery (when in operation) showed large variations in vibration and tilt but similar means. Vibration-based and tilt-based classifications of farm machinery each perform well when used alone (with vibration yielding slightly better numbers than tilt), but accuracy improves further when both are used together. All five machine learning algorithms achieve an accuracy of more than 82%, with Random Forest performing best. Gradient Boosting and Random Forest show slight over-fitting (about 9%), but both produce high testing accuracy. In terms of execution time, the Decision Tree takes the least time to train, while Gradient Boosting takes the most.
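The five-way comparison can be sketched as follows, with a synthetic feature matrix standing in for the recorded vibration (accelerometer) and tilt (gyroscope) statistics; the feature layout and class labels are assumptions for illustration only.

```python
# Sketch of comparing the five classifiers on combined vibration + tilt
# features. The synthetic data stands in for the authors' sensor recordings.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 300
# Assumed columns: mean vibration, vibration variance, mean tilt, tilt variance.
X = rng.normal(size=(n, 4))
y = rng.integers(0, 3, size=n)   # 0 = leveler, 1 = rotavator, 2 = cultivator

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.2f}")
```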


Data science in healthcare is an innovative and promising field, and the industry is increasingly implementing data science applications. Data analytics is applied to medical datasets to explore and detect disease. This work is an initial attempt to identify disease with the help of large amounts of medical data; using this methodology, users can assess their condition without the help of healthcare centres. Healthcare and data science are often linked through finances, as the industry attempts to reduce its expenses with the help of large amounts of data. Data science and medicine are developing rapidly, and it is important that they advance together. Health care information is highly valuable to society. Heart disease has been increasing in day-to-day life, so different factors in the human body are monitored to analyse and prevent it. Classifying these factors with machine learning algorithms and predicting the disease is the major part of this work, which involves supervised learning algorithms such as SVM, Naïve Bayes, Decision Trees, and Random Forest.
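A hedged sketch of the kind of comparison described, assuming a tabular heart-disease dataset in the common UCI-style layout with a binary target column; the file name and column names are hypothetical.

```python
# Sketch of comparing SVM, Naive Bayes, Decision Tree and Random Forest on a
# tabular heart-disease dataset. "heart.csv" and the "target" column are
# hypothetical placeholders; any CSV with numeric risk factors and a 0/1
# outcome column would work the same way.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("heart.csv")                 # hypothetical file
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("SVM", SVC()), ("NaiveBayes", GaussianNB()),
                    ("DecisionTree", DecisionTreeClassifier(random_state=0)),
                    ("RandomForest", RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```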


2021 ◽  
Author(s):  
Massimiliano Greco ◽  
Giovanni Angelotti ◽  
Pier Francesco Caruso ◽  
Alberto Zanella ◽  
Niccolò Stomeo ◽  
...  

Abstract Introduction: SARS-CoV-2 infection was first identified at the end of 2019 in China and subsequently spread globally. COVID-19 disease frequently affects the lungs, leading to bilateral viral pneumonia and progressing in some cases to severe respiratory failure requiring ICU admission and mechanical ventilation. Risk stratification at ICU admission is fundamental for resource allocation and decision making, considering that baseline comorbidities, age, and patient conditions at admission have been associated with poorer outcomes. Supervised machine learning techniques are increasingly widespread in clinical medicine and can predict mortality and test associations with high predictive performance. We assessed the performance of a machine learning approach to predict mortality in COVID-19 patients admitted to the ICU, using data from the Lombardy ICU Network. Methods: This is a secondary analysis of prospectively collected data from the Lombardy ICU Network. To predict survival at 7, 14, and 28 days we built two different models; model A included patient demographics, medications before admission, and comorbidities, while model B also included data from the first day of ICU admission. 10-fold cross-validation was repeated 2500 times to ensure optimal hyperparameter choice. The only constraint imposed on model optimization was the choice of logistic regression as the final layer, to increase clinical interpretability. Different imputation and over-sampling techniques were employed in model training. Results: 1503 patients were included, with 766 deaths (51%). Exploratory analysis and Kaplan-Meier curves demonstrated an association of mortality with age and gender. Models A and B reached the greatest predictive performance at 28 days (AUC 0.77 and 0.79), with lower performance at 14 days (AUC 0.72 and 0.74) and 7 days (AUC 0.68 and 0.71). Male gender, age, and number of comorbidities were strongly associated with mortality in both models. Among comorbidities, chronic kidney disease and chronic obstructive pulmonary disease demonstrated an association with mortality. Mode of ventilatory assistance at ICU admission and fraction of inspired oxygen were associated with mortality in model B. Conclusions: Supervised machine learning models demonstrated good performance in predicting 28-day mortality. 7-day and 14-day predictions demonstrated lower performance. Machine learning techniques may be useful in emergency phases to reach high predictive performance with reduced human supervision using complex data.
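A minimal, assumption-laden sketch of a "model A"-style pipeline: baseline demographics and comorbidity counts are imputed and scaled, a logistic regression final layer keeps the model interpretable, and repeated stratified 10-fold cross-validation estimates the AUC (the paper repeats it 2500 times; five repeats are used here to keep the example quick). Feature names and the synthetic data are illustrative; over-sampling (e.g. SMOTE from imbalanced-learn) would be added as an extra pipeline step.

```python
# Sketch of an interpretable mortality model: imputation + scaling + logistic
# regression, evaluated with repeated stratified 10-fold cross-validation.
# Data and feature names are synthetic placeholders.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 1500
# Assumed columns: age, male sex (0/1), number of comorbidities.
X = np.column_stack([rng.normal(65, 12, n),
                     rng.integers(0, 2, n),
                     rng.poisson(1.5, n)]).astype(float)
X[rng.random(X.shape) < 0.05] = np.nan        # simulate missing values
y = rng.integers(0, 2, n)                     # stand-in 28-day mortality label

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),   # interpretable final layer
])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()
print("mean AUC:", round(auc, 3))
```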


2021 ◽  
Vol 309 ◽  
pp. 01218
Author(s):  
P. Lakshmi Sruthi ◽  
K. Butchi Raju

COVID-19 is a global epidemic that has spread to over 170 nations, and in practically all of the affected countries the number of infected and death cases has been rising rapidly. Forecasting approaches can be implemented to develop more effective strategies and make more informed decisions. These approaches examine historical data in order to make more accurate predictions about the future, and such forecasts could aid in preparing for potential risks and consequences. Sound forecasting techniques are therefore crucial for producing accurate findings. This study classifies forecasting strategies based on big data analytics acquired from national databases or the World Health Organization, together with machine learning and data science techniques, and demonstrates the ability to predict the number of COVID-19 cases as a potential risk to mankind.


2021 ◽  
Vol 28 ◽  
Author(s):  
Martina Veit-Acosta ◽  
Walter Filgueira de Azevedo Junior

Background: CDK2 participates in the control of eukaryotic cell-cycle progression. Due to the great interest in CDK2 for drug development and the relative ease of crystallizing this enzyme, we have over 400 structural studies focused on this protein target. These structural data are the basis for the development of computational models to estimate CDK2-ligand binding affinity. Objective: This work focuses on recent developments in the application of supervised machine learning modeling to develop scoring functions to predict the binding affinity of CDK2. Method: We employed the structures available at the Protein Data Bank and ligand information accessed from BindingDB, Binding MOAD, and PDBbind to evaluate the predictive performance of machine learning techniques combined with physical modeling used to calculate binding affinity. We compared this hybrid methodology with classical scoring functions available in docking programs. Results: Our comparative analysis of previously published models indicated that a model created using a combination of a mass-spring system and cross-validated Elastic Net to predict the binding affinity of CDK2-inhibitor complexes outperformed classical scoring functions available in AutoDock4 and AutoDock Vina. Conclusion: All studies reviewed here suggest that targeted machine learning models are superior to classical scoring functions for calculating binding affinities. Specifically for CDK2, we see that the combination of physical modeling with supervised machine learning techniques exhibits improved predictive performance in calculating the protein-ligand binding affinity. These results find theoretical support in the application of the concept of scoring function space.
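The cross-validated Elastic Net component mentioned above can be sketched as follows, with synthetic stand-ins for the mass-spring derived interaction terms; the feature set and dimensions are assumptions, not the published model.

```python
# Sketch of a cross-validated Elastic Net regressing affinity on interaction
# terms. The features are synthetic placeholders for mass-spring derived terms.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
n_complexes, n_terms = 150, 6
X = rng.normal(size=(n_complexes, n_terms))            # assumed energy/spring terms
y = X @ rng.normal(size=n_terms) + rng.normal(scale=0.5, size=n_complexes)

model = make_pipeline(StandardScaler(),
                      ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8], cv=5))
model.fit(X, y)
# The regularization strength is chosen by internal cross-validation.
print("selected alpha:", model[-1].alpha_)
```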


2021 ◽  
Vol 20 (1) ◽  
Author(s):  
Domingos S. M. Andrade ◽  
Luigi Maciel Ribeiro ◽  
Agnaldo J. Lopes ◽  
Jorge L. M. Amaral ◽  
Pedro L. Melo

Abstract Introduction: The use of machine learning (ML) methods could improve the diagnosis of respiratory changes in systemic sclerosis (SSc). This paper evaluates the performance of several ML algorithms associated with respiratory oscillometry analysis to aid in the diagnosis of respiratory changes in SSc. We also determine the best configuration for this task. Methods: Oscillometric and spirometric exams were performed in 82 individuals, including controls (n = 30) and patients with systemic sclerosis with normal (n = 22) and abnormal (n = 30) spirometry. Multiple instance classifiers and different supervised machine learning techniques were investigated, including k-Nearest Neighbors (KNN), Random Forests (RF), AdaBoost with decision trees (ADAB), and Extreme Gradient Boosting (XGB). Results and discussion: The first experiment of this study showed that the best oscillometric parameter (BOP) was dynamic compliance, which provided moderate accuracy (AUC = 0.77) in the scenario control group versus patients with sclerosis and normal spirometry (CGvsPSNS). In the scenario control group versus patients with sclerosis and altered spirometry (CGvsPSAS), the BOP obtained high accuracy (AUC = 0.94). In the second experiment, the ML techniques were used. In CGvsPSNS, KNN achieved the best result (AUC = 0.90), significantly improving the accuracy in comparison with the BOP (p < 0.01), while in CGvsPSAS, RF obtained the best results (AUC = 0.97), also significantly improving the diagnostic accuracy (p < 0.05). In the third, fourth, fifth, and sixth experiments, different feature selection techniques allowed us to identify the best oscillometric parameters. They resulted in a small increase in diagnostic accuracy in CGvsPSNS (respectively, 0.87, 0.86, 0.82, and 0.84), while in CGvsPSAS, the best classifier's performance remained the same (AUC = 0.97). Conclusions: Oscillometric principles combined with machine learning algorithms provide a new method for diagnosing respiratory changes in patients with systemic sclerosis. The present study's findings provide evidence that this combination may help in the early diagnosis of respiratory changes in these patients.
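A rough sketch of the classifier comparison, assuming scikit-learn: oscillometric parameters as tabular features, with cross-validated AUC computed for KNN, Random Forest, and AdaBoost (XGBoost would require the separate xgboost package and is omitted here). The synthetic feature matrix is a placeholder for the real oscillometry exams.

```python
# Sketch of comparing classifiers by cross-validated AUC on oscillometric
# features. Data is synthetic; real exams would replace X and y.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 60                                      # e.g. controls vs. SSc patients
X = rng.normal(size=(n, 5))                 # assumed oscillometric parameters
y = rng.integers(0, 2, size=n)

for name, clf in [("KNN", KNeighborsClassifier()),
                  ("RF", RandomForestClassifier(random_state=0)),
                  ("ADAB", AdaBoostClassifier(random_state=0))]:
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.2f}")
```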


Author(s):  
Tales Lima Fonseca ◽  
Yulia Gorodetskaya ◽  
Gisele Goulart Tavares ◽  
Celso Bandeira de Melo Ribeiro ◽  
Leonardo Goliatt da Fonseca

The short-term streamflow forecast is an important parameter in studies related to energy generation and the prediction of possible floods. Flowing through three Brazilian states, the Paraíba do Sul river is responsible for the water supply and energy generation of several municipalities. Machine learning techniques have been studied with the aim of improving these predictions through the use of hydrological and hydrometeorological parameters. Furthermore, the predictive performance of machine learning techniques is directly related to the quality of the training base and to the set of hyperparameters used. The present study explores the Gradient Boosting technique coupled with a Genetic Algorithm to find the best set of hyperparameters and maximize the performance of predicting the Paraíba do Sul river streamflow.
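One way to couple Gradient Boosting with a genetic algorithm, sketched under assumptions about the search space and with a synthetic stand-in for the lagged streamflow features; this is a minimal GA (truncation selection plus single-gene mutation), not the authors' implementation.

```python
# Sketch of genetic-algorithm hyperparameter search for Gradient Boosting.
# Data, search space, and GA settings are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                           # assumed lagged flow/rain features
y = X @ np.array([0.6, -0.3, 0.2, 0.1]) + rng.normal(scale=0.2, size=300)

def fitness(ind):
    """Cross-validated R^2 of a GBM with the individual's hyperparameters."""
    model = GradientBoostingRegressor(n_estimators=int(ind[0]),
                                      learning_rate=ind[1],
                                      max_depth=int(ind[2]),
                                      random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring="r2").mean()

def random_individual():
    # Genes: n_estimators, learning_rate, max_depth.
    return [rng.integers(50, 400), rng.uniform(0.01, 0.3), rng.integers(2, 6)]

def mutate(ind):
    child = list(ind)
    i = rng.integers(0, 3)
    child[i] = random_individual()[i]                   # resample one gene
    return child

population = [random_individual() for _ in range(8)]
for generation in range(5):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:4]                                # truncation selection
    population = parents + [mutate(p) for p in parents]

best = max(population, key=fitness)
print("best (n_estimators, learning_rate, max_depth):", best)
```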


2020 ◽  
Author(s):  
Ghazal Farhani ◽  
Robert J. Sica ◽  
Mark Joseph Daley

Abstract. While it is relatively straightforward to automate the processing of lidar signals, it is more difficult to choose periods of "good" measurements to process. Groups use various ad hoc procedures involving either very simple criteria (e.g. signal-to-noise ratio) or more complex procedures (e.g. Wing et al., 2018) to perform a task which is easy to train humans to perform but is time consuming. Here, we use machine learning techniques to train the machine to sort the measurements before processing. The presented method is generic and can be applied to most lidars. We test the techniques using measurements from the Purple Crow Lidar (PCL) system located in London, Canada. The PCL has over 200,000 raw scans in Rayleigh and Raman channels available for classification. We classify raw (level-0) lidar measurements as "clear" sky scans with strong lidar returns, "bad" scans, and scans which are significantly influenced by clouds or aerosol loads. We examined different supervised machine learning algorithms, including random forest, support vector machine, and gradient boosting trees, all of which can successfully classify scans. The algorithms were trained using about 1500 scans for each PCL channel, selected randomly from different nights of measurements in different years. The success rate of identification for all channels is above 95%. We also used the t-distributed Stochastic Neighbor Embedding (t-SNE) method, which is an unsupervised algorithm, to cluster our lidar scans. Because t-SNE is a data-driven method in which no labelling of the training set is needed, it is an attractive algorithm for finding anomalies in lidar scans. The method has been tested on several nights of measurements from the PCL. t-SNE can successfully cluster the PCL data scans into meaningful categories. To demonstrate the use of the technique, we have used the algorithm to identify stratospheric aerosol layers due to wildfires.
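Both approaches described above can be sketched briefly: a supervised Random Forest classifier for labelled scans and an unsupervised t-SNE embedding for clustering, using synthetic per-scan features in place of the real PCL profiles.

```python
# Sketch of supervised scan classification plus an unsupervised t-SNE
# embedding. Features (e.g. binned photon counts per scan) are synthetic
# placeholders for real lidar profiles.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
n_scans, n_bins = 300, 50
X = rng.normal(size=(n_scans, n_bins))          # assumed per-scan features
y = rng.integers(0, 3, size=n_scans)            # 0 = clear, 1 = bad, 2 = cloud/aerosol

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("classification accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Unsupervised view: embed scans in 2-D and inspect the clusters for anomalies.
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print("t-SNE embedding shape:", embedding.shape)
```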

