Introduction to Machine Learning for Chemists: An Undergraduate Course Using Python Notebooks for Visualization, Data Processing, Data Analysis, and Data Modeling

Machine learning, a subdomain of artificial intelligence, is a pervasive technology that will shape how chemists interact with data. It is therefore a relevant skill to add to the toolbox of any chemistry student. This work presents a course that introduces machine learning to chemistry students through a set of Python notebooks and assignments. Python, one of the most popular programming languages, is supported by free software and resources, which ensures availability. The course is designed for students without previous programming experience and progresses incrementally in depth and complexity, covering both programming and machine learning concepts. The examples use real data from physicochemical characterizations of wines, which makes the material attractive and captures students' interest. Topics include Introduction to Python, Basic Statistics, Data Visualization and Dimension Reduction, Classification, and Regression.

10.26434/chemrxiv.13749199.v1 ◽ 2021

Author(s): Deborah Lafuente ◽ Brenda Cohen ◽ Guillermo Fiorini ◽ Agustín García ◽ Mauro Bringas ◽ ...

Keyword(s): Machine Learning ◽ Programming Languages ◽ Real Data ◽ Free Software ◽ Chemistry Student ◽ Chemistry Students ◽ Pervasive Technology ◽ Python Language ◽ Processing Data ◽ Depth And Complexity


A Review on Machine-Learning Based Code Smell Detection Techniques in Object-Oriented Software System(s)

Recent Advances in Electrical & Electronic Engineering (Formerly Recent Patents on Electrical & Electronic Engineering) ◽ 10.2174/2352096513999200922125839 ◽ 2020 ◽ Vol 13

Author(s): Amandeep Kaur ◽ Sushma Jain ◽ Shivani Goel ◽ Gaurav Dhiman

Keyword(s): Machine Learning ◽ Programming Languages ◽ Software Quality ◽ Empirical Studies ◽ Statistical Testing ◽ Machine Learning Techniques ◽ Support Vector ◽ Code Smells ◽ Detection Techniques ◽ Code Smell

Context: Code smells are symptoms that something may be wrong in a software system and can complicate the maintenance of software quality. The literature describes many code smells, and identifying them is far from trivial. Several techniques have therefore been proposed to automate code smell detection in order to improve software quality. Objective: This paper presents an up-to-date review of simple and hybrid machine-learning based code smell detection techniques and tools. Methods: We collected all the relevant research published in this field up to 2020. We extracted the data from those articles and classified them into two major categories. In addition, we compared the selected studies on several aspects, such as code smells, machine learning techniques, datasets, programming languages used by the datasets, dataset size, evaluation approach, and statistical testing. Results: The majority of empirical studies have proposed machine-learning based code smell detection tools. Support vector machine and decision tree algorithms are frequently used by researchers. In addition, a major proportion of the research is conducted on open source software (OSS) such as Xerces, Gantt Project and ArgoUML. Furthermore, researchers have paid more attention to the Feature Envy and Long Method code smells. Conclusion: We identified several areas of open research, such as the need for code smell detection techniques that use hybrid approaches and the need for validation on industrial datasets.


Machine Learning for the Dynamic Positioning of UAVs for Extended Connectivity

Sensors ◽ 10.3390/s21134618 ◽ 2021 ◽ Vol 21 (13) ◽ pp. 4618

Author(s): Francisco Oliveira ◽ Miguel Luís ◽ Susana Sargento

Keyword(s): Machine Learning ◽ Cellular Networks ◽ Real Data ◽ Emerging Technology ◽ Machine Learning Algorithms ◽ Base Stations ◽ Aerial Vehicle ◽ Positioning Algorithm ◽ The Military ◽ Better Than

Unmanned Aerial Vehicle (UAV) networks are an emerging technology, useful not only for the military, but also for public and civil purposes. Their versatility provides advantages in situations where an existing network cannot support all requirements of its users, either because of an exceptionally large number of users, or because of the failure of one or more ground base stations. Networks of UAVs can reinforce these cellular networks where needed, redirecting the traffic to available ground stations. Using machine learning algorithms to predict overloaded traffic areas, we propose a UAV positioning algorithm responsible for determining suitable positions for the UAVs, with the objective of a more balanced redistribution of traffic, to avoid saturated base stations and decrease the number of users without a connection. The tests performed with real data of user connections through base stations show that, in less restrictive network conditions, the algorithm to dynamically place the UAVs performs significantly better than in more restrictive conditions, significantly reducing the number of users without a connection. We also conclude that the accuracy of the prediction is a very important factor, not only in the reduction of users without a connection, but also in the number of UAVs deployed.


Application of a Rough Set-Based Inductive Learning System

Fundamenta Informaticae ◽ 10.3233/fi-1993-182-409 ◽ 1993 ◽ Vol 18 (2-4) ◽ pp. 209-220

Author(s): Michael Hadjimichael ◽ Anita Wasilewska

Keyword(s): Machine Learning ◽ Rough Set ◽ Presidential Election ◽ Predictive Accuracy ◽ Learning Algorithm ◽ Inductive Learning ◽ Real Data ◽ Semantic Content ◽ Learning System ◽ Voter Preferences

We present here an application of Rough Set formalism to Machine Learning. The resulting Inductive Learning algorithm is described, and its application to a set of real data is examined. The data consists of a survey of voter preferences taken during the 1988 presidential election in the U.S.A. Results include an analysis of the predictive accuracy of the generated rules, and an analysis of the semantic content of the rules.


Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Mathematics ◽ 10.3390/math9090936 ◽ 2021 ◽ Vol 9 (9) ◽ pp. 936

Author(s): Jianli Shao ◽ Xin Liu ◽ Wenqing He

Keyword(s): Machine Learning ◽ Spatial Association ◽ Class Imbalance ◽ Imbalanced Data ◽ Real Data ◽ Kernel Functions ◽ Support Vector ◽ Classification Problems ◽ Rare Class ◽ Data Adaptive

Imbalanced data exist in many classification problems, and classifying them poses considerable challenges in machine learning. The support vector machine (SVM) and its variants are widely used among classifiers in machine learning thanks to their flexibility and interpretability. However, the performance of SVMs suffers when the data are imbalanced, which is a typical data structure in the multi-category classification problem. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM that considers class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of commonly used accuracy measures such as the F-score and G-means, but it also successfully detects more than 60% of instances from the rare class in the real data, while the competitors detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection.
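The accuracy measures named in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: it assumes the binary case (the paper is multi-class), takes the F-score as the harmonic mean of precision and recall, and takes the G-mean as the geometric mean of recall and specificity, which are the common textbook definitions. The function name `imbalance_metrics` and the confusion-matrix counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Compute accuracy, F-score and G-mean from binary confusion-matrix counts.

    Assumed definitions (not taken from the paper):
      F-score = harmonic mean of precision and recall
      G-mean  = sqrt(recall * specificity)
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "f_score": f_score, "g_mean": g_mean}


# Hypothetical counts: 5 rare-class instances among 100, only 1 of them detected.
metrics = imbalance_metrics(tp=1, fp=0, tn=95, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the plain accuracy is 0.96 even though four of the five rare-class instances are missed, while the F-score and G-mean stay well below 0.5. This is why such measures are preferred over accuracy when classes are imbalanced.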


Impact of programming languages on machine learning bugs

Proceedings of the 1st ACM International Workshop on AI and Software Testing/Analysis ◽ 10.1145/3464968.3468408 ◽ 2021

Author(s): Sebastian Sztwiertnia ◽ Maximilian Grübel ◽ Amine Chouchane ◽ Daniel Sokolowski ◽ Krishna Narasimhan ◽ ...

Keyword(s): Machine Learning ◽ Programming Languages


Machine learning phases and criticalities without using real data for training

Physical Review B ◽ 10.1103/physrevb.102.224434 ◽ 2020 ◽ Vol 102 (22)

Author(s): D.-R. Tan ◽ F.-J. Jiang

Keyword(s): Machine Learning ◽ Real Data


Modeling of Psychomotor Reactions of a Person Based on Modification of the Tapping Test

International Journal of Computing ◽ 10.47839/ijc.20.2.2166 ◽ 2021 ◽ pp. 190-200

Author(s): Lesia Mochurad ◽ Yaroslav Hladun

Keyword(s): Neural Network ◽ Machine Learning ◽ Time Series ◽ Real Data ◽ Finger Tapping ◽ Similar Distribution ◽ Model Learning ◽ Machine Learning Model ◽ Finger Tapping Test

The paper considers a method for analyzing a person's psychophysical state from psychomotor indicators, namely the finger tapping test. A mobile phone app that generalizes the classic tapping test was developed for the experiments. The tool allows samples to be collected and analyzed both as individual experiments and as a whole dataset. The data are examined for anomalies using statistical methods and hyperparameter optimization, and an algorithm for reducing the number of anomalies is developed. A machine learning model is used to predict different features of the dataset. These experiments demonstrate the structure of the data obtained with the finger tapping test. As a result, we learned how to conduct experiments so that the model generalizes better in the future. The anomaly removal method can be used in further research to increase the accuracy of the model. The developed model is a multilayer recurrent neural network that works well for the classification of time series. The model's learning error is 1.5% on a synthetic dataset and 5% on real data from a similar distribution.


Implementation of Selection Sort Algorithm in Various Programming Languages

International Journal of Advanced Trends in Computer Science and Engineering ◽ 10.30534/ijatcse/2021/1071032021 ◽ 2021 ◽ Vol 10 (4) ◽ pp. 2249-2255

Keyword(s): Programming Languages ◽ Data Science ◽ Efficient Algorithms ◽ Huge Number ◽ Running Time ◽ Fast Running ◽ Python Language ◽ Sort Algorithm

A sorting algorithm deals with the arrangement of alphanumeric data in a fixed order. It plays an important role in the field of data science. Selection sort is one of the simplest and efficient algorithms, and it can be applied to a huge number of elements. It works by taking a list of unsorted data and dividing it into two partitions: one section holds all the sorted data and the other holds all the remaining unsorted data. The algorithm repeats itself, finding the smallest element in the unsorted section and swapping it with the leftmost unsorted element, until all the data are in order. This research presents implementations of selection sort in C/C++, Python, and Rust and measures their time complexity. After the experiments, we collected the results in terms of running time and analyzed the outcomes. It was observed that the Python implementation has very few lines of code, and that it also consumes less storage and has a faster running time than the other two languages.
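The procedure described in this abstract can be sketched in Python as follows. This is an illustrative implementation of textbook selection sort, not the authors' code, and the function name `selection_sort` is an assumption made for the example.

```python
def selection_sort(items):
    """Return a new list with the elements of `items` in ascending order."""
    data = list(items)  # copy so the caller's list is left untouched
    n = len(data)
    for boundary in range(n - 1):
        # data[:boundary] is the sorted partition; data[boundary:] is unsorted.
        smallest = boundary
        for j in range(boundary + 1, n):
            if data[j] < data[smallest]:
                smallest = j
        # Swap the smallest unsorted element into the leftmost unsorted slot.
        if smallest != boundary:
            data[boundary], data[smallest] = data[smallest], data[boundary]
    return data


print(selection_sort([29, 10, 14, 37, 13]))  # [10, 13, 14, 29, 37]
```

The nested loops perform n(n-1)/2 comparisons whatever the input order, so the running time grows quadratically with the number of elements.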


An Experimental Study of Spammer Detection on Chinese Microblogs

International Journal of Software Engineering and Knowledge Engineering ◽ 10.1142/s021819402040029x ◽ 2020 ◽ Vol 30 (11n12) ◽ pp. 1759-1777

Author(s): Jialing Liang ◽ Peiquan Jin ◽ Lin Mu ◽ Jie Zhao

Keyword(s): Machine Learning ◽ Social Media ◽ User Behavior ◽ Real Data ◽ User Profile ◽ Data Set ◽ Sina Weibo ◽ Factors Affecting ◽ The Government ◽ Hot Event

With the development of Web 2.0, social media such as Twitter and Sina Weibo have become an essential platform for disseminating hot events. Simultaneously, due to the free policy of microblogging services, users can post user-generated content freely on microblogging platforms. Accordingly, more and more hot events on microblogging platforms have been labeled as spammers. Spammers will not only hurt the healthy development of social media but also introduce many economic and social problems. Therefore, the government and enterprises must distinguish whether a hot event on microblogging platforms is a spammer or a naturally developing event. In this paper, we focus on the hot event list on Sina Weibo and collect the relevant microblogs of each hot event to study methods for detecting spammers. Notably, we develop an integral feature set consisting of user profile, user behavior, and user relationships to reflect various factors affecting the detection of spammers. Then, we employ typical machine learning methods to conduct extensive experiments on detecting spammers. We use a real data set crawled from the most prominent Chinese microblogging platform, Sina Weibo, and evaluate the performance of 10 machine learning models with five sampling methods. The results in terms of various metrics show that the Random Forest model and the over-sampling method achieve the best accuracy in detecting spammers and non-spammers.
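The abstract does not say which over-sampling variant was used. As a rough illustration of the general idea, the sketch below implements plain random over-sampling, which duplicates minority-class samples until every class is as large as the biggest one. The function name `random_oversample`, the toy feature values, and the class labels are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Balance classes by duplicating randomly chosen minority-class samples.

    Assumption: plain random over-sampling with replacement; the paper may
    have used a different over-sampling method.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(group) for group in by_class.values())

    out_samples, out_labels = [], []
    for y, group in by_class.items():
        # Draw extra copies until this class reaches the majority-class size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for x in group + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Toy data: four "normal" samples and one "spammer" sample.
features = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["normal", "normal", "normal", "normal", "spammer"]
X_bal, y_bal = random_oversample(features, labels)
print(Counter(y_bal))
```

Over-sampling is normally applied to the training split only, so that duplicated samples do not leak into the data used for evaluation.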
