Machine Learning Modeling: A New Way to do Quantitative Research in Social Sciences in the Era of AI

Author(s):  
Jiaxing Zhang ◽  
Shuaishuai Feng

Improvements in big data and machine learning algorithms have pushed AI technologies to a new breakthrough and have provided a new opportunity for quantitative research in the social sciences. Traditional quantitative models rely heavily on theoretical hypotheses and statistics but often fail to address the problem of overfitting, which makes research results less generalizable and, in turn, causes societal predictions that should have been meaningful to be ignored. Machine learning models that use cross validation and regularization can effectively solve the problem of overfitting, lending support to the societal predictions based on these models. This paper first discusses the sources and internal mechanisms of overfitting, and then introduces machine learning modeling by discussing its high-level ideas, goals, and concrete methods. Finally, we discuss the shortcomings and limiting factors of machine learning models. We believe that using machine learning in social sciences research is an opportunity, not a threat. Researchers should adopt an objective attitude and learn to combine traditional and new methods according to the needs of their research.
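As a hedged illustration of the two safeguards the abstract names (not the authors' own code), a minimal Python sketch using scikit-learn: L2 regularization shrinks coefficients, and cross validation scores the model on held-out folds rather than on the data it was fit to. The synthetic dataset and hyperparameters are assumptions for illustration only.

```python
# Minimal sketch: regularization + cross validation to curb overfitting.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # many predictors relative to n
y = X[:, 0] * 2.0 + rng.normal(size=200)   # only one truly informative feature

# The L2 penalty (alpha) shrinks coefficients; 5-fold cross validation
# estimates out-of-sample performance instead of rewarding in-sample fit.
for alpha in (0.01, 1.0, 100.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    print(f"alpha={alpha:>6}: mean CV R^2 = {scores.mean():.3f}")
```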

Author(s):  
Felix Bensmann ◽  
Andrea Papenmeier ◽  
Dagmar Kern ◽  
Benjamin Zapilko ◽  
Stefan Dietze

Abstract Semantic technologies offer significant potential for improving data search applications. Ongoing work strives to equip data catalogs with new semantic search features to supplement existing keyword search and browsing capabilities. Within the social sciences in particular, searching and reusing data is essential to foster efficient research. In this paper, we introduce an approach and experimental results aimed at improving the interoperability and findability of social sciences survey items. Our contributions include a conceptual model for semantically representing survey items and questions, detailing meaningful dimensions of items, as well as experimental results geared towards the automated prediction of such item features using state-of-the-art machine learning models. Dimensions of interest include, for instance, references to geolocation and time periods, or the scope and style of particular questions. We define classification tasks using neural and traditional machine learning models combined with sentence structure features. Applications of our work include semantic and faceted search for questions as part of our GESIS Search. We also provide the lifted data as a knowledge graph via a SPARQL endpoint for further reuse and sharing.
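As a hedged sketch of what one such item-feature classification task could look like (the questions and labels below are invented for illustration; the paper's models and features are more elaborate), a TF-IDF baseline predicting whether a survey question references a time period:

```python
# Toy classifier for one hypothetical item dimension: "mentions a time period".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

questions = [
    "How often did you vote in the last five years?",
    "Do you trust the national parliament?",
    "In the past 12 months, how often did you attend religious services?",
    "What is your highest level of education?",
]
mentions_time = [1, 0, 1, 0]  # toy labels, illustration only

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(questions, mentions_time)
print(clf.predict(["How many hours per week do you usually work?"]))
```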


Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2533 ◽  
Author(s):  
Massimo Merenda ◽  
Carlo Porcaro ◽  
Demetrio Iero

In a few years, the world will be populated by billions of connected devices placed in our homes, cities, vehicles, and industries. Devices with limited resources will interact with the surrounding environment and users. Many of these devices will rely on machine learning models to decode the meaning and behavior behind sensors’ data, implement accurate predictions, and make decisions. The bottleneck will be the sheer number of connected things that could congest the network; hence the need to incorporate intelligence on end devices using machine learning algorithms. Deploying machine learning on such edge devices reduces network congestion by allowing computations to be performed close to the data sources. The aim of this work is to review the main techniques that guarantee the execution of machine learning models on low-performance hardware in the Internet of Things paradigm, paving the way to the Internet of Conscious Things. We present a detailed review of the models, architectures, and requirements of solutions that implement edge machine learning on Internet of Things devices, with the main goals of defining the state of the art and envisioning development requirements. Furthermore, an example of edge machine learning implementation on a microcontroller is provided, commonly regarded as the machine learning “Hello World”.
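The microcontroller “Hello World” commonly referred to is the TinyML sine-wave example: train a tiny network to approximate sin(x), then shrink it into a quantized TensorFlow Lite flatbuffer small enough for MCU flash (e.g., via TensorFlow Lite for Microcontrollers). A hedged sketch, with model size and hyperparameters as assumptions:

```python
# TinyML "Hello World" sketch: tiny sine approximator -> quantized .tflite
import numpy as np
import tensorflow as tf

x = np.random.uniform(0, 2 * np.pi, 1000).astype(np.float32).reshape(-1, 1)
y = np.sin(x)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=50, verbose=0)

# Convert to a quantized TensorFlow Lite flatbuffer for deployment on-device.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
open("sine_model.tflite", "wb").write(converter.convert())
```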


2021 ◽  
pp. 1-15
Author(s):  
O. Basturk ◽  
C. Cetek

ABSTRACT In this study, prediction of aircraft Estimated Time of Arrival (ETA) is proposed using machine learning algorithms. Accurate prediction of ETA is important for the management of delay and air traffic flow, runway assignment, gate assignment, collaborative decision making (CDM), coordination of ground personnel and equipment, and optimisation of the arrival sequence, among others. Machine learning is able to learn from experience and make predictions with weak assumptions or no assumptions at all. In the proposed approach, general flight information, trajectory data and weather data were obtained from different sources in various formats. Raw data were converted to tidy data and inserted into a relational database. To obtain the features for training the machine learning models, the data were explored, cleaned and transformed into convenient features. New features were also derived from the available data. Random forests and deep neural networks were used to train the machine learning models. Both models can predict the ETA with a mean absolute error (MAE) of less than 6 min after departure, and less than 3 min after terminal manoeuvring area (TMA) entrance. Additionally, a web application was developed to dynamically predict the ETA using the proposed models.
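As a hedged sketch of this kind of regression task (synthetic data; the feature names are assumptions, not the study's actual features), a random-forest ETA model evaluated with MAE:

```python
# Illustrative ETA regressor: random forest scored by mean absolute error.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(100, 900, n),   # e.g., remaining track distance (km) -- assumed feature
    rng.uniform(-20, 20, n),    # e.g., headwind component (kt) -- assumed feature
    rng.integers(0, 24, n),     # e.g., hour of day -- assumed feature
])
y = X[:, 0] / 7.5 + X[:, 1] * 0.3 + rng.normal(0, 2, n)  # minutes to arrival

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"MAE: {mean_absolute_error(y_te, model.predict(X_te)):.2f} min")
```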


Viruses ◽  
2021 ◽  
Vol 13 (2) ◽  
pp. 252
Author(s):  
Laura M. Bergner ◽  
Nardus Mollentze ◽  
Richard J. Orton ◽  
Carlos Tello ◽  
Alice Broos ◽  
...  

The contemporary surge in metagenomic sequencing has transformed knowledge of viral diversity in wildlife. However, evaluating which newly discovered viruses pose sufficient risk of infecting humans to merit detailed laboratory characterization and surveillance remains largely speculative. Machine learning algorithms have been developed to address this imbalance by ranking the relative likelihood of human infection based on viral genome sequences, but are not yet routinely applied to viruses at the time of their discovery. Here, we characterized viral genomes detected through metagenomic sequencing of feces and saliva from common vampire bats (Desmodus rotundus) and used these data as a case study in evaluating zoonotic potential from molecular sequence data. Of 58 detected viral families, including 17 which infect mammals, the only known zoonosis detected was rabies virus; however, additional genomes were detected from the families Hepeviridae, Coronaviridae, Reoviridae, Astroviridae and Picornaviridae, all of which contain human-infecting species. In phylogenetic analyses, novel vampire bat viruses most frequently grouped with other bat viruses that are not currently known to infect humans. In agreement, machine learning models built from only phylogenetic information ranked all novel viruses similarly, yielding little insight into zoonotic potential. In contrast, genome composition-based machine learning models estimated different levels of zoonotic potential, even for closely related viruses, categorizing one out of four detected hepeviruses and two out of three picornaviruses as having high priority for further research. We highlight the value of evaluating zoonotic potential beyond ad hoc consideration of phylogeny and provide surveillance recommendations for novel viruses in a wildlife host which has frequent contact with humans and domestic animals.
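A hedged sketch of the kind of genome-composition feature that distinguishes such models from purely phylogenetic ones (the published models use richer compositional biases): dinucleotide frequencies computed directly from a sequence, which can then feed a downstream classifier.

```python
# Toy genome-composition feature: dinucleotide frequencies from a raw sequence.
from itertools import product

def dinucleotide_freqs(seq: str) -> dict:
    """Return the frequency of each of the 16 dinucleotides in seq."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:        # skip ambiguous bases such as N
            counts[pair] += 1
            total += 1
    return {k: v / total for k, v in counts.items()} if total else counts

print(dinucleotide_freqs("ATGCGCGATATCCG"))
# Feature vectors like this, rather than phylogeny, feed the classifier.
```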


Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 18
Author(s):  
Pantelis Linardatos ◽  
Vasilis Papastefanopoulos ◽  
Sotiris Kotsiantis

Recent advances in artificial intelligence (AI) have led to its widespread industrial adoption, with machine learning systems demonstrating superhuman performance in a significant number of tasks. However, this surge in performance has often been achieved through increased model complexity, turning such systems into “black box” approaches and causing uncertainty regarding the way they operate and, ultimately, the way they come to decisions. This ambiguity has made it problematic for machine learning systems to be adopted in sensitive yet critical domains where their value could be immense, such as healthcare. As a result, scientific interest in the field of Explainable Artificial Intelligence (XAI), which is concerned with the development of new methods that explain and interpret machine learning models, has been reignited over recent years. This study focuses on machine learning interpretability methods; more specifically, a literature review and taxonomy of these methods are presented, as well as links to their programming implementations, in the hope that this survey will serve as a reference point for both theorists and practitioners.
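As one concrete instance of the model-agnostic methods such surveys cover, a hedged scikit-learn sketch of permutation feature importance: shuffle one feature at a time and measure how much held-out accuracy drops, revealing which inputs the “black box” actually relies on.

```python
# Permutation feature importance: a model-agnostic interpretability method.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the test set and record the accuracy drop.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[idx]:<25} {result.importances_mean[idx]:.4f}")
```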


2021 ◽  
Author(s):  
Alejandro Celemín ◽  
Diego A. Estupiñan ◽  
Ricardo Nieto

Abstract The reliability and run-life of Electrical Submersible Pumps (ESPs) have been studied extensively since their development. Current machine learning algorithms make it possible to correlate operational conditions with ESP run-life in order to generate predictions for active and new wells. Four machine learning models are compared to a linear proportional hazards model, used as a baseline for comparison purposes. Accuracy metrics appropriate for survival analysis problems are calculated on run-life predictions vs. actual values over the training and validation data subsets. Results demonstrate that, for small datasets, the baseline model produces more consistent predictions than current machine learning models, with only a slight reduction in accuracy. This study demonstrates that the quality of the data and its pre-processing support the current shift from a model-centric to a data-centric approach to machine and deep learning problems.
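A hedged sketch of the proportional-hazards baseline using the lifelines library; the column names and readings below are invented, not the study's dataset. Pumps still running are censored observations (event flag 0), which is exactly what survival models handle and plain regressors do not.

```python
# Cox proportional hazards baseline for ESP run-life (toy data).
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "run_life_days": [120, 400, 260, 90, 530, 310],  # observed run life
    "failed":        [1,   0,   1,   1,   0,   1],   # 0 = still running (censored)
    "intake_temp_c": [95,  70,  88,  102, 65,  90],  # assumed operating condition
    "vibration_g":   [0.8, 0.2, 0.5, 1.1, 0.1, 0.6], # assumed operating condition
})

cph = CoxPHFitter()
cph.fit(df, duration_col="run_life_days", event_col="failed")
cph.print_summary()  # hazard ratios per operating condition
```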


Author(s):  
Pratyush Kaware

In this paper, a cost-effective sensor attached to a finger has been implemented to read finger bend signals and classify them based on the degree of bend as well as the joint about which the finger was bent. This was done by testing various machine learning algorithms to find the most accurate and consistent classifier. We found that the Support Vector Machine was the algorithm best suited to classifying our data, using which we were able to predict the live state of a finger, i.e., the degree of bend and the joints involved. The live voltage values from the sensor were transmitted by a NodeMCU microcontroller, converted to digital values, and uploaded to a database for analysis.
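A hedged sketch of such an SVM classifier; the voltage features and class labels below are made up for illustration, not the paper's recorded data:

```python
# Toy SVM classifying finger state from flex-sensor voltage features.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each row: [mean voltage, voltage range] over one window of readings.
X = [[0.40, 0.05], [0.50, 0.04], [1.20, 0.10], [1.30, 0.12],
     [2.10, 0.20], [2.20, 0.18], [0.45, 0.06], [2.00, 0.21]]
y = ["straight", "straight", "half_bend", "half_bend",
     "full_bend", "full_bend", "straight", "full_bend"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
print(clf.predict([[1.25, 0.11]]))  # expected: half_bend
```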


2021 ◽  
Vol 10 (1) ◽  
pp. 99
Author(s):  
Sajad Yousefi

Introduction: Heart disease is often associated with conditions such as arteries clogged by sediment accumulation, which causes chest pain and heart attacks. Many people die of heart disease annually. Most countries have a shortage of cardiovascular specialists, and thus a significant percentage of misdiagnoses occur. Hence, predicting this disease is a serious issue. Using machine learning models applied to a multidimensional dataset, this article aims to find the most efficient and accurate machine learning models for disease prediction.
Material and Methods: Several algorithms were utilized to predict heart disease, among which the Decision Tree, Random Forest and KNN supervised machine learning algorithms are most notable. The algorithms were applied to a dataset of 294 samples taken from the UCI repository. The dataset includes heart disease features. To enhance algorithm performance, these features were analyzed, and feature importance scores and cross validation were considered.
Results: The algorithms' performance was compared using the ROC curve and criteria such as accuracy, precision, sensitivity and F1 score for each model. The Decision Tree algorithm achieved an accuracy of 83% and an AUC ROC of 99%. The Logistic Regression algorithm, with an accuracy of 88% and an AUC ROC of 91%, performed better than the other algorithms. Therefore, these techniques can be useful for physicians to predict heart disease in patients and prescribe treatment correctly.
Conclusion: Machine learning techniques can be used in medicine to analyze data collections related to a disease and to predict it. The area under the ROC curve and related evaluation criteria were compared across a number of machine learning classification algorithms to determine the most appropriate classifier for heart disease prediction. As a result of the evaluation, better performance was observed in both the Decision Tree and Logistic Regression models.
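A hedged sketch of the comparison described above (Decision Tree vs. Logistic Regression on accuracy and ROC AUC), using a dataset bundled with scikit-learn rather than the UCI heart disease data:

```python
# Compare two classifiers on accuracy and ROC AUC (illustrative dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=0)),
                    ("Logistic Regression", LogisticRegression(max_iter=5000))]:
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]   # scores for the ROC curve
    print(f"{name}: accuracy={accuracy_score(y_te, model.predict(X_te)):.2f}, "
          f"AUC={roc_auc_score(y_te, proba):.2f}")
```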


2019 ◽  
pp. 29-43
Author(s):  
Anastasiya A. Korepanova ◽  
Valerii D. Oliseenko ◽  
Maxim V. Abramov ◽  
Alexander L. Tulupyev ◽  
...  

The article describes an approach to the problem of comparing user profiles from different social networks and identifying those that belong to one person. A method is proposed based on comparing the social environment and the values of account profile attributes across two different social networks. The results of applying various machine learning models to this problem are compared. The novelty of the approach lies in the proposed new combination of various methods and its application to new social networks. The practical significance of the study is the automation of determining whether profiles in various social networks belong to one user. These results can be applied to the task of constructing a meta-profile of an information system user for the subsequent construction of a profile of their vulnerabilities, as well as in other studies devoted to social networks.
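A hedged sketch of the kind of features such a comparison might combine (profiles invented for illustration): exact attribute matches plus the Jaccard overlap of the two accounts' social environments, assembled into a vector for a same-person classifier.

```python
# Toy feature vector for matching two profiles across social networks.
def jaccard(a: set, b: set) -> float:
    """Overlap of two friend sets as a fraction of their union."""
    return len(a & b) / len(a | b) if a | b else 0.0

profile_a = {"name": "Ivan Petrov", "city": "Saint Petersburg",
             "friends": {"anna", "oleg", "maria", "pavel"}}
profile_b = {"name": "Ivan Petrov", "city": "St. Petersburg",
             "friends": {"anna", "maria", "pavel", "dmitry"}}

features = [
    1.0 if profile_a["name"] == profile_b["name"] else 0.0,   # attribute match
    1.0 if profile_a["city"] == profile_b["city"] else 0.0,   # attribute match
    jaccard(profile_a["friends"], profile_b["friends"]),      # social environment
]
print(features)  # input for a same-person classifier
```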


2019 ◽  
Author(s):  
Edward W Huang ◽  
Ameya Bhope ◽  
Jing Lim ◽  
Saurabh Sinha ◽  
Amin Emad

ABSTRACT Prediction of the clinical drug response (CDR) of cancer patients, based on their clinical and molecular profiles obtained prior to administration of the drug, can play a significant role in individualized medicine. Machine learning models have the potential to address this issue, but training them requires data from a large number of patients treated with each drug, limiting their feasibility. While large databases of drug response and molecular profiles of preclinical in-vitro cancer cell lines (CCLs) exist for many drugs, it is unclear whether preclinical samples can be used to predict the CDR of real patients. We designed a systematic approach to evaluate how well different algorithms, trained on the gene expression and drug response of CCLs, can predict the CDR of patients. Using data from two large databases, we evaluated various linear and non-linear algorithms, some of which utilized information on gene interactions. We then developed a new algorithm called TG-LASSO that explicitly integrates information on samples’ tissue of origin with gene expression profiles to improve prediction performance. Our results showed that regularized regression methods provide accurate predictions. However, including the network information or common methods of including information on the tissue of origin did not improve the results. On the other hand, TG-LASSO improved the predictions and distinguished resistant and sensitive patients for 7 out of 13 drugs. Additionally, TG-LASSO identified genes associated with drug response, including known targets and pathways involved in the drugs’ mechanism of action. Moreover, genes identified by TG-LASSO for multiple drugs in a tissue were associated with patient survival. In summary, our analysis suggests that preclinical samples can be used to predict the CDR of patients and to identify biomarkers of drug sensitivity and survival.
AUTHOR SUMMARY Cancer is among the leading causes of death globally, and prediction of patients’ drug response to different treatments based on their clinical and molecular profiles can enable individualized cancer medicine. Machine learning algorithms have the potential to play a significant role in this task; however, these algorithms are designed on the premise that a large number of labeled training samples are available and that these samples accurately represent the profiles of real tumors. Due to ethical and technical reasons, it is not possible to screen humans for many drugs, significantly limiting the size of the training data. To overcome this data scarcity problem, machine learning models can be trained using large databases of preclinical samples (e.g. cancer cell line cultures). However, due to the major differences between preclinical samples and real tumors, it is unclear how accurately such preclinical-to-clinical computational models can predict the clinical drug response of cancer patients. Here, we first systematically evaluate a variety of linear and nonlinear machine learning algorithms for this particular task using two large databases of preclinical (GDSC) and tumor samples (TCGA). We then present a novel method called TG-LASSO that utilizes a new approach for explicitly incorporating the tissue of origin of samples in the prediction task. Our results show that TG-LASSO outperforms all other algorithms and can accurately distinguish resistant and sensitive patients for the majority of the tested drugs. Follow-up analyses reveal that this method can also identify biomarkers of drug sensitivity in each cancer type.
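A hedged sketch of the regularized-regression idea at the core of this setup: an ordinary LASSO (not the authors' TG-LASSO, which additionally encodes tissue of origin) trained on synthetic cell-line expression data and then applied to unseen patient profiles. The L1 penalty drives most gene coefficients to zero, yielding a small candidate biomarker set.

```python
# Preclinical-to-clinical sketch: LASSO trained on cell lines, applied to patients.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_cell_lines, n_genes = 300, 1000
X_ccl = rng.normal(size=(n_cell_lines, n_genes))       # cell-line gene expression
drug_response = X_ccl[:, :5].sum(axis=1) + rng.normal(0, 0.5, n_cell_lines)

# The L1 penalty selects a sparse set of predictive genes.
model = Lasso(alpha=0.1, max_iter=10000).fit(X_ccl, drug_response)

X_patients = rng.normal(size=(20, n_genes))            # unseen patient profiles
predicted_cdr = model.predict(X_patients)
print("non-zero genes:", (model.coef_ != 0).sum())
print("predicted response (first 5 patients):", predicted_cdr[:5].round(2))
```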

