Introduction to Supervised Machine Learning for Data Science

2020 ◽  
Vol 26 (1) ◽  
pp. 87-121
Author(s):  
Mohammad Samy BALADRAM ◽  
Atsushi KOIKE ◽  
Kazunori D YAMADA
2021 ◽  
Author(s):  
Yuxiang Chen ◽  
Chuanlei Liu ◽  
Yang An ◽  
Yue Lou ◽  
Yang Zhao ◽  
...  

Machine learning and computer-aided approaches significantly accelerate molecular design and discovery in scientific and industrial fields that increasingly rely on data science for efficiency. The typical method is supervised learning, which needs huge datasets. Semi-supervised machine learning approaches can exploit unlabeled data to improve modeling performance, but they are limited by the accumulation of prediction errors. Here, to screen solvents for the removal of methyl mercaptan, a type of organosulfur impurity in natural gas, we constructed a computational framework that integrates molecular similarity search and active learning methods, namely molecular active selection machine learning (MASML). This new framework identifies the optimal set of molecules through molecular similarity search and iterative addition to the training dataset. Among all 126,068 compounds in the initial dataset, 3 molecules were identified as promising for methyl mercaptan (MeSH) capture, including benzylamine (BZA), p-methoxybenzylamine (PZM), and N,N-diethyltrimethylenediamine (DEAPA). Further experiments confirmed the effectiveness of our modeling framework in efficient molecular design and identification for capturing methyl mercaptan; DEAPA presents a Henry's law constant 89.4% lower than that of methyl diethanolamine (MDEA).
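
The idea of combining similarity search with iterative addition to the training set can be sketched roughly as follows. This is not the authors' MASML implementation: it assumes Tanimoto similarity over sets of hypothetical feature ids, uses made-up molecule names, and omits the model training and labeling steps that a real active learning loop would include. The function names `tanimoto`, `select_similar`, and `active_selection` are illustrative only.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature ids."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_similar(pool, references, k):
    """Return the names of the k pool molecules most similar to any reference."""
    scored = []
    for name, features in pool.items():
        best = max(tanimoto(features, ref) for ref in references)
        scored.append((best, name))
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


def active_selection(pool, seeds, rounds, k):
    """Iteratively move the most similar pool molecules into the training set."""
    if not seeds:
        raise ValueError("at least one seed molecule is required")
    training = dict(seeds)
    remaining = {n: f for n, f in pool.items() if n not in training}
    for _ in range(rounds):
        if not remaining:
            break
        chosen = select_similar(remaining, list(training.values()), k)
        for name in chosen:
            training[name] = remaining.pop(name)
    return training


# Hypothetical toy data: each molecule is a set of made-up feature ids.
pool = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3, 5},
    "m3": {7, 8, 9},
    "m4": {1, 2, 6},
    "m5": {8, 9, 10},
}
seeds = {"seed": {1, 2, 3}}

selected = active_selection(pool, seeds, rounds=2, k=1)
print(list(selected))
```

In this toy run the two candidates that share the most features with the seed are added one per round, while the unrelated molecules stay in the pool.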


Data science in healthcare is an innovative and capable field for industry applications. Data analytics is a recent science used to explore medical datasets and discover disease. This work is an early attempt to identify disease with the help of large medical datasets. Using this data science methodology, users can identify their disease without the help of health care centres. Healthcare and data science are often linked through finances, as the industry attempts to reduce its expenses with the help of large amounts of data. Data science and medicine are rapidly developing, and it is important that they advance together. Health care information is very valuable to society. Heart disease has become increasingly common in daily life. Different factors in the human body are monitored in order to analyse and prevent heart disease. Classifying these factors with machine learning algorithms and predicting the disease is the major part of the work, which involves supervised learning algorithms such as SVM, Naive Bayes, Decision Trees and Random Forest.


Author(s):  
Renáta Németh ◽  
Fanni Máté ◽  
Eszter Katona ◽  
Márton Rakovics ◽  
Domonkos Sik

Abstract: Supervised machine learning on textual data has successful industrial/business applications, but it is an open question whether it can be utilized in social knowledge building outside the scope of hermeneutically more trivial cases. Combining sociology and data science raises several methodological and epistemological questions. In our study, the discursive framing of depression is explored in online health communities. Three discursive frameworks are introduced: the bio-medical, psychological, and social framings of depression. About 80,000 posts were collected, and a sample of them was manually classified. Conventional bag-of-words models, Gradient Boosting Machine, word-embedding-based models, and a state-of-the-art Transformer-based model with transfer learning, called DistilBERT, were applied to extend this classification to the whole database. In our experience, 'discursive framing' proves to be a complex and hermeneutically difficult concept, which affects the degree of both inter-annotator agreement and predictive performance. Our finding confirms that the level of inter-annotator disagreement provides a good estimate of the objective difficulty of the classification. By identifying the most important terms, we also interpreted the classification algorithms, which is of great importance in social sciences. We are convinced that machine learning techniques can extend the horizon of qualitative text analysis. Our paper supports a smooth fit of the new techniques into the traditional toolbox of social sciences.
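
To make the notion of inter-annotator agreement concrete, the sketch below computes Cohen's kappa for two annotators. The abstract does not say which agreement statistic the authors used, so kappa is an assumption here, and the labels (`bio`, `psy`, `soc`) and annotations are hypothetical toy data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten posts into three framings.
annotator_a = ["bio", "bio", "psy", "psy", "soc", "soc", "bio", "psy", "soc", "bio"]
annotator_b = ["bio", "psy", "psy", "psy", "soc", "bio", "bio", "psy", "soc", "soc"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

A kappa well below 1 on a task like this would be one signal that the categories are hard to separate even for human coders, which is the kind of difficulty the abstract relates to predictive performance.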


2021 ◽  
Vol 309 ◽  
pp. 01218
Author(s):  
P. Lakshmi Sruthi ◽  
K. Butchi Raju

COVID-19 is a global epidemic that has spread to over 170 nations. In practically all of the affected countries, the number of infections and deaths has been rising rapidly. Forecasting approaches can support the development of more effective strategies and more informed decisions. These approaches examine historical data in order to make more accurate predictions about what will happen in the future, and such forecasts could aid in preparing for potential risks and consequences. Forecasting techniques are therefore crucial to producing accurate findings. This study classifies forecasting strategies based on big data analytics of data acquired from national databases or the World Health Organization, as well as on machine learning or data science techniques. It shows the ability to predict the number of COVID-19 cases as a potential risk to mankind.


2021 ◽  
Author(s):  
Leonardo Deiss ◽  
Shameema Oottikkal ◽  
Karen Tomko ◽  
Wanyu Huang ◽  
Steve Culman ◽  
...  

Soil infrared spectroscopy has great potential for estimating soil properties, but reference soil measurements are typically required in combination with multivariate statistical models to estimate soil properties. The lack of user-friendly predictive tools based on open-source statistical environments remains one of the main limitations to technology diffusion to non-specialist users. Our aim is to build capacity for an automated machine learning routine for rapid and robust prediction of soil health indicators using lab-acquired soil infrared spectra. This intelligent system runs in the R statistical environment and includes (1) a diverse soil spectral library comprising the main physiographic regions of the USA Midwest under diverse land uses and various sampling depths, (2) a classification process to detect potential outliers in newly acquired spectra using supervised machine learning techniques, and (3) a multi-model optimized prediction process based on linear and non-linear statistical procedures (partial least squares, support vector machines, and neural networks). This prediction system works at the intersection of soil and data science and high-performance computing to enable efficient parallel processing of spectral data on multi-core coprocessors. Using artificial intelligence to automate soil infrared spectroscopy is a fundamental demand that will make this technique an effective routine in soil laboratories to estimate soil health.


Author(s):  
Xiaoling Xiang ◽  
Xuan Lu ◽  
Alex Halavanau ◽  
Jia Xue ◽  
Yihang Sun ◽  
...  

Abstract. Objectives: This study examined public discourse and sentiment regarding older adults and COVID-19 on social media and assessed the extent of ageism in public discourse. Methods: Twitter data (N = 82,893) related to both older adults and COVID-19 and dated from January 23 to May 20, 2020, were analyzed. We used a combination of data science methods (including supervised machine learning, topic modeling, and sentiment analysis), qualitative thematic analysis, and conventional statistics. Results: The most common category in the coded tweets was “personal opinions” (66.2%), followed by “informative” (24.7%), “jokes/ridicule” (4.8%), and “personal experiences” (4.3%). The daily average of ageist content was 18%, with the highest of 52.8% on March 11, 2020. Specifically, more than 1 in 10 (11.5%) tweets implied that the life of older adults is less valuable or downplayed the pandemic because it mostly harms older adults. A small proportion (4.6%) explicitly supported the idea of just isolating older adults. Almost three-quarters (72.9%) within “jokes/ridicule” targeted older adults, half of which were “death jokes.” Also, 14 themes were extracted, such as perceptions of lockdown and risk. A bivariate Granger causality test suggested that informative tweets regarding at-risk populations increased the prevalence of tweets that downplayed the pandemic. Discussion: Ageist content in the context of COVID-19 was prevalent on Twitter. Information about COVID-19 on Twitter influenced public perceptions of risk and acceptable ways of controlling the pandemic. Public education on the risk of severe illness is needed to correct misperceptions.


2020 ◽  
Vol 14 (2) ◽  
pp. 140-159
Author(s):  
Anthony-Paul Cooper ◽  
Emmanuel Awuni Kolog ◽  
Erkki Sutinen

This article builds on previous research exploring the content of church-related tweets. It does so by examining whether the qualitative thematic coding of such tweets can, in part, be automated by the use of machine learning. It compares three supervised machine learning algorithms to understand how useful each algorithm is at a classification task, based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values which each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time when the high volume of social media data, in this case Twitter data, means that the resource intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.
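
For readers unfamiliar with the three metrics named above, the following minimal sketch shows how Precision, Recall and F-measure are computed for one category by comparing model predictions against human codes. The category names and the label lists are hypothetical toy data, not taken from the study, and the F-measure shown is the balanced F1 variant.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical human codes and classifier predictions for eight tweets.
human_codes = ["event", "event", "prayer", "event", "prayer", "prayer", "event", "prayer"]
predictions = ["event", "event", "event", "event", "prayer", "prayer", "prayer", "prayer"]

precision, recall, f1 = precision_recall_f1(human_codes, predictions, positive="event")
THRESHOLD = 0.70  # the acceptability threshold mentioned in the abstract
print(precision, recall, f1, all(m > THRESHOLD for m in (precision, recall, f1)))
```

In this toy case all three values come out at 0.75, so the classifier would clear a 70% threshold for the "event" category.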


2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques, we here introduce Mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly to the Word2vec models, where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that point in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up the vectors of the individual substructures and, for instance, fed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations, and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as the reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can thus also be easily used for proteins with low sequence similarities.
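
The compound-encoding step described above, summing the vectors of a compound's substructures, can be illustrated with a small sketch. It does not use the Mol2vec package or its API: the substructure identifiers and the 3-dimensional vectors are hypothetical stand-ins for learned embeddings, and skipping unknown substructures is a simplifying assumption made here.

```python
import math

# Hypothetical substructure embeddings; real ones are learned from a compound corpus.
SUBSTRUCTURE_VECTORS = {
    "sub_a": [1.0, 0.0, 2.0],
    "sub_b": [0.5, 1.0, -1.0],
    "sub_c": [0.0, 2.0, 0.0],
}


def embed_compound(substructures, vectors):
    """Encode a compound as the sum of its substructure vectors."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for name in substructures:
        vec = vectors.get(name)
        if vec is None:
            continue  # assumption: substructures without an embedding are skipped
        for i, value in enumerate(vec):
            total[i] += value
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# A substructure that occurs twice in a compound contributes twice to the sum.
compound_1 = embed_compound(["sub_a", "sub_b"], SUBSTRUCTURE_VECTORS)
compound_2 = embed_compound(["sub_a", "sub_b", "sub_b"], SUBSTRUCTURE_VECTORS)
similarity = cosine_similarity(compound_1, compound_2)
print(compound_1, compound_2, round(similarity, 3))
```

The resulting fixed-length vectors are the kind of input that could then be passed to a supervised model for property prediction.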

