Workforce Analytics: A Data-Driven Machine Learning Approach to Predict Job Change of Data Scientists

2021 ◽  
Vol 9 (2) ◽  
pp. 335-349
Author(s):  
Sohini Sengupta ◽  
Sareeta Mugde ◽  
Renuka Deshpande ◽  
Kimaya Potdar

Today the total amount of data created, captured, and consumed in the world is increasing at a rapid rate, as digitally driven organizations continue to contribute to the ever-growing global data sphere (Holst, Statista Report 2020). This data brings with it a plethora of opportunities for organizations across different sectors. Hence, their hiring outlook is shifting towards candidates who possess the abilities to decode data and generate actionable insights to gain a competitive advantage. A career in data science offers great scope and the demand for such candidates is expected to rise steeply. When companies hire for big data and data science roles, they often provide training. From an HR perspective, it is important to understand how many of these candidates will stay with the company and how many view the training as upskilling for a new job. HR aims to reduce the cost and time of training by designing courses aligned to candidates’ interests and needs. In this paper, we explored the data based on features including demographics, education and prior experience of the candidates. We used machine learning algorithms, viz. Logistic Regression, Naive Bayes, K-Nearest Neighbours Classifier, Decision Trees, Random Forest, Support Vector Machine, Gradient Descent Boosting, and XGBoost, to predict whether a candidate will look for a new job or will stay and work for the company.

PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0241239
Author(s):  
Kai On Wong ◽  
Osmar R. Zaïane ◽  
Faith G. Davis ◽  
Yutaka Yasui

Background Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features. Methods Using the 1901 census, multiclass and binary classification machine learning pipelines were developed. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, Area Under the Curve for Receiver Operating Characteristic curve, and accuracy. Results The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classifications for Chinese, French, Italian, Japanese, Russian, and others, the F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged from 63% to 67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved the prediction in Aboriginal classification (F1 increased from 50% to 84%).
Conclusions The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians with varying performance by specific ethnic categories.
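
Illustrative sketch (not taken from the paper): the binary-classification metrics listed in the Methods above can all be derived from the four confusion-matrix counts. The function name `binary_metrics` and the toy labels are assumptions made for this example; the study's own pipeline and data are not reproduced here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics for labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(numerator: int, denominator: int) -> float:
        # Report 0.0 instead of failing when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),           # recall / true positive rate
        "specificity": ratio(tn, tn + fp),           # true negative rate
        "ppv": ratio(tp, tp + fp),                   # positive predictive value
        "npv": ratio(tn, tn + fn),                   # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),       # harmonic mean of PPV and sensitivity
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Toy labels (hypothetical): 3 positives and 5 negatives.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(toy_true, toy_pred)
```

With imbalanced classes, accuracy can stay high while F1 is much lower, which may be one reason the abstract reports both figures.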


2021 ◽  
Author(s):  
Urmi Ghosh ◽  
Tuhin Chakraborty

Rapid technological improvements made in in-situ analysis techniques, including LA-ICPMS, have transformed the field of analytical geochemistry. This has a far-reaching impact on different petrogenetic and ore-genetic studies, where minute major and trace element compositional changes between different mineral zones within a single crystal can now be demarcated. Minerals such as garnet, although robust, are quite sensitive to the changing P-T and fluid conditions during their formation. These minerals have become powerful tools to characterize mineralization types. Previously, Meinert (1992) used in-situ major element EPMA analysis results to classify different skarn deposits based on the end-member composition of hydrothermal garnets. Alternatively, Tian et al. (2019) used the garnet trace element composition for a similar purpose. However, these discrimination plots/classification schemes show major overlap between different skarn deposits, such as Fe, Cu, Zn, and Au. The present study is an attempt to use a machine learning approach on available garnet data to find a more potent classification scheme for skarn deposits, thus reaffirming garnet as a faithful indicator for hydrothermal ore deposits. We have meticulously collected major and trace element data of Ca-rich garnets associated with different skarn deposits worldwide from 40 publications. This collected data is then used to train a model for fingerprinting the skarn deposits. A stratified random sampling method has been used on the dataset, with 80% of the samples as test set and the remaining 20% as training dataset. We have used K-nearest neighbour (KNN), Support Vector Machine (SVM) and Random Forest algorithms on the data, using Python as a platform. These ML classification algorithms perform better than the earlier existing models available for classification of ore types based on garnet composition in skarn systems. Factor importance is calculated, showing which elements play a pivotal role in classification of the ore type. Our results show that multiple garnet-forming elements taken together can reliably be used to discriminate between different ore formation settings.
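
Illustrative sketch (not taken from the abstract): stratified random sampling splits each class separately so that both subsets keep roughly the same class proportions. The function name `stratified_split`, the seed, the toy deposit labels and the 0.2 fraction are assumptions for this example; the split fraction is left as a parameter.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], test_fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), sampling every class separately."""
    rng = random.Random(seed)  # local generator keeps the split reproducible

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Each class contributes the same fraction of its samples to the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels (hypothetical): 10 Fe, 5 Cu and 5 Au skarn samples.
toy_labels = ["Fe"] * 10 + ["Cu"] * 5 + ["Au"] * 5
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
```

Here each deposit type sends 20% of its samples to the held-out subset, so a rare class cannot vanish from either subset by chance.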


Author(s):  
Marco A. Alvarez ◽  
SeungJin Lim

Current search engines impose an overhead on motivated students and Internet users who employ the Web as a valuable resource for education. The user, searching for good educational materials for a technical subject, often spends extra time filtering irrelevant pages or ends up with commercial advertisements. It would be ideal if, given a technical subject by a user who is educationally motivated, suitable materials with respect to the given subject were automatically identified by affordable machine processing of the recommendation set returned by a search engine for the subject. In this scenario, the user can save a significant amount of time in filtering out less useful Web pages, and subsequently the user’s learning goal on the subject can be achieved more efficiently without clicking through numerous pages. This type of convenient learning is called One-Stop Learning (OSL). In this paper, the contributions made by Lim and Ko (Lim and Ko, 2006) for OSL are redefined and modeled using machine learning algorithms. Four selected supervised learning algorithms, Support Vector Machine (SVM), AdaBoost, Naive Bayes and Neural Networks, are evaluated using the same data used in (Lim and Ko, 2006). The results presented in this paper are promising: the highest precision (98.9%) and overall accuracy (96.7%), obtained using SVM, are superior to the results presented by Lim and Ko. Furthermore, the machine learning approach presented here demonstrates that the small set of features used to represent each Web page yields a good solution for the OSL problem.


Author(s):  
Rachaell Nihalaani

Abstract: As Plato once rightfully said, ‘Music gives a soul to the universe, wings to the mind, flight to the imagination and life to everything.’ Music has always been an important art form, and more so in today’s science-driven world. Music genre classification paves the way for other applications such as music recommender models. Several approaches could be used to classify music genres. In this paper, we aimed to build a machine learning model to classify the genre of an input audio file using 8 machine learning algorithms and to determine which algorithm is the most suitable for genre classification. We obtained an accuracy of 91% using the XGBoost algorithm. Keywords: Machine Learning, Music Genre Classification, Decision Trees, K Nearest Neighbours, Logistic Regression, Naïve Bayes, Neural Networks, Random Forest, Support Vector Machine, XGBoost


Author(s):  
Ganesh K. Shinde

Abstract: Sentiment analysis has brought improvements to online shopping platforms, scientific surveys from political polls, business intelligence, etc. In this paper, we analyse Twitter posts about hashtags such as #MakeinIndia using a machine learning approach. By doing opinion mining in a specific area, it is possible to identify the effect of area information in sentiment analysis. We put forth a feature vector for classifying the tweets as positive, negative or neutral. We then applied two machine learning algorithms, MaxEnt and SVM. We used unigram, bigram and trigram features to generate a set of features to train linear MaxEnt and SVM classifiers. Finally, we measured the performance of the classifiers in terms of overall accuracy. Keywords: Sentiment analysis, support vector machine, maximum entropy, N-gram, Machine Learning
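
Illustrative sketch (not taken from the abstract): unigram, bigram and trigram features can be produced by sliding a window of one, two and three tokens over a tweet and counting each resulting phrase. The function names, the whitespace tokenizer and the sample text are assumptions for this example; the paper's actual preprocessing is not described in the abstract.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text: str, max_n: int = 3) -> Counter:
    """Count all 1-grams up to max_n-grams in a lower-cased, whitespace-split text."""
    tokens = text.lower().split()  # simplistic tokenizer, assumed for illustration
    features: Counter = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features


# Hypothetical tweet text: 5 unigrams + 4 bigrams + 3 trigrams = 12 features.
features = ngram_features("Make in India is great")
```

The resulting counts form a sparse feature vector that a linear classifier can consume once every distinct n-gram is mapped to a column index.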


2020 ◽  
Vol 8 (6) ◽  
pp. 4684-4688

According to statistics from the BBC, the toll varies for every earthquake recorded to date. Approximately, up to thousands are killed, about 50,000 are injured, and around 1-3 million are displaced, while a significant number go missing or are left homeless. Almost 100% structural damage is experienced. Economic losses also occur, varying from 10 to 16 million dollars. A magnitude of 5 and above is classified as deadliest. The most life-threatening earthquake to date took place in Indonesia, where about 3 million died, 1-2 million were injured and structural damage amounted to 100%. Hence, the consequences of an earthquake are devastating and are not limited to loss and damage of the living as well as the nonliving; they also cause significant change, from surroundings and lifestyle to the economy. Every such parameter motivates earthquake forecasting. With a couple of minutes’ notice, individuals can act to shield themselves from injury and death, harm and monetary losses can be reduced, and property and natural resources can be protected. In the current work, an accurate forecaster is designed and developed: a system that will forecast the catastrophe. It focuses on detecting early signs of an earthquake by using machine learning algorithms. The system follows the basic steps of developing learning systems along with the data science life cycle. Datasets for the Indian subcontinent along with the rest of the world are collected from government sources. Pre-processing of data is followed by construction of a stacking model that combines Random Forest and Support Vector Machine algorithms. The algorithms build this mathematical model from a “training dataset”. The model looks for patterns that lead to catastrophe and adapts to them, so as to make decisions and forecasts without being explicitly programmed to perform the task. After a forecast, we broadcast the message to government officials and across various platforms. The information sought is represented by three factors: Time, Locality and Magnitude.


Author(s):  
Erick Omuya ◽  
George Okeyo ◽  
Michael Kimwele

Social media has been embraced by different people as a convenient and official medium of communication. People write messages and attach images and videos on Twitter, Facebook and other social media, which they share. Social media therefore generates a lot of data that is rich in sentiments from these updates. Sentiment analysis has been used to determine the opinions of clients, for instance, relating to a particular product or company. Knowledge-based and machine learning approaches are among the strategies that have been used to analyze these sentiments. The performance of sentiment analysis is, however, distorted by noise, the curse of dimensionality, the data domains and the size of the data used for training and testing. This research aims at developing a model for sentiment analysis in which dimensionality reduction and the use of different parts of speech improve sentiment analysis performance. It uses natural language processing for filtering, storing and performing sentiment analysis on the data from social media. The model is tested using Naïve Bayes, Support Vector Machines and K-Nearest Neighbor machine learning algorithms, and its performance is compared with that of two other sentiment analysis models. Experimental results show that the model improves sentiment analysis performance using machine learning techniques.


2021 ◽  
Vol 10 (6) ◽  
pp. 3178-3190
Author(s):  
Ahmad Yahya Dawod ◽  
Mohammed Ali Sharafuddin

Mangrove is one of the most productive global forest ecosystems and is unique in linking terrestrial and marine environments. This study aims to clarify and understand artificial intelligence (AI) adoption in remote sensing of mangrove forests. Machine learning algorithms such as random forest (RF), support vector machine (SVM), decision tree (DT), and object-based nearest neighbors (NN) were used in this study to automatically classify mangrove forests using orthophotography, applying an object-based approach to examine three features (tree cover loss, above-ground carbon dioxide (CO2) emissions, and above-ground biomass loss). SVM with a radial basis function was used to classify the remainder of the images, resulting in an overall accuracy of 96.83%. Precision and recall reached 93.33% and 96%, respectively. RF performed better than the other algorithms where there is no orthophotography.


2021 ◽  
Vol 13 (2) ◽  
pp. 1199-1208
Author(s):  
N. Ajaypradeep ◽  
Dr.R. Sasikala

Autism is a developmental disorder which affects the cognition, social and behavioural functionalities of a person. A person affected by autism spectrum disorder exhibits peculiar behaviours, and those symptoms begin in the patient’s childhood. Early diagnosis of autism is an important and challenging task. Behavioural analysis, a well-known therapeutic practice, can be adopted for earlier diagnosis of autism. Machine learning is a computational methodology which can be applied to a wide range of applications in order to obtain efficient outputs. At present machine learning is especially applied in medical applications such as disease prediction. In our study we evaluated various machine learning algorithms (Naive Bayes (NB), Support Vector Machines (SVM) and k-Nearest Neighbours (KNN)) with “k-fold” cross validation on 3 datasets retrieved from the UCI repository. Additionally, we validated the effective accuracy of the estimated results using a clustered cross validation strategy. Employing clustered cross validation scrutinises the parameters which contribute most importance in the dataset. The strategy induces hyperparameter tuning, which yields trusted results as it involves double validation. On application of clustered cross validation to an SVM-based model, we obtained an accuracy of 99.6% for the autism child dataset.
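
Illustrative sketch (not taken from the abstract): plain k-fold cross validation partitions the samples into k folds and uses each fold once for validation while training on the remaining k - 1. The function name `k_fold_indices`, the seed and the toy sizes are assumptions for this example; the clustered cross validation strategy mentioned above is not specified in the abstract and is not shown.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]

    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Toy setup (hypothetical): 10 samples, 5 folds of 2 samples each.
splits = list(k_fold_indices(n_samples=10, k=5))
```

Averaging a model's score over the k validation folds gives a less optimistic estimate than scoring on the data it was trained on.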


2021 ◽  
Vol 309 ◽  
pp. 01042
Author(s):  
L. Chandrika ◽  
K. Madhavi ◽  
B. Sindhuja ◽  
M. Arshi

Prediction of cardiovascular diseases has always been a tedious challenge for doctors and medical practitioners. Most practitioners and hospital staff offer expensive medication, care and surgeries to treat cardiovascular patients. Early-stage prediction of heart-related problems gives a chance of survival by allowing necessary precautions to be taken. Over the years, different types of methodologies have been proposed to predict cardiovascular diseases; one of the best is the machine learning approach. In recent years many scientific advancements have taken place in Artificial Intelligence, Machine Learning, and Deep Learning, which give an extra push to the field of medical image processing and medical data analysis. Enormous datasets from various medical experts help researchers to predict coronary problems before they happen. Many researchers have tried and implemented different machine learning algorithms to automate the prediction analysis using the enormous number of datasets. There are numerous algorithms and procedures available to predict cardiovascular diseases, specifically classification methods including Artificial Neural Networks (ANN), Decision Tree (DT), Support Vector Machine (SVM), Genetic Algorithm (GA), Neural Network (NN), Naive Bayes (NB) and clustering algorithms like K-NN. A few studies have created prediction models using individual techniques and also by combining two or more techniques. This paper gives a quick and simple survey of accessible prediction models drawn from different researchers’ work from 2004 to 2019. The survey reports the precision of individual experiments done by various researchers.

