Machine Learning for Identifying Relevance to Biosurveillance in Multilingual Text

Objective: The objective is to develop an ensemble of machine learning algorithms to identify multilingual online articles that are relevant to biosurveillance. Morphology varies widely across languages and must be accounted for when designing algorithms. Here, we compare the performance of a word embedding-based approach and a topic modeling approach with machine learning algorithms to determine the best method for Chinese, Arabic, and French.

Introduction: Global biosurveillance is an extremely important yet challenging task. One form of global biosurveillance comes from harvesting open source online data (e.g., news, blogs, reports, RSS feeds). The information derived from these data can be used for timely detection and identification of biological threats all over the world. However, the more inclusive the harvesting procedure is, to ensure that all potentially relevant articles are collected, the more irrelevant data is harvested as well. This issue becomes even more complex when the online data is in a non-native language. Foreign language articles not only create language-specific issues for Natural Language Processing (NLP), but also add significant translation costs. Previous work shows success in the use of combinatory monolingual classifiers in specific applications, e.g., the legal domain [1]. A critical component of a comprehensive online harvesting biosurveillance system is the capability to distinguish relevant foreign language articles from irrelevant ones based on the initial article information collected, without the additional cost of full text retrieval and translation.

Methods: The analysis dataset contains the title and brief description of 3506 online articles in Chinese, Arabic, and French, dated from August 17, 2016 to July 5, 2017. The NLP pre-processing steps are language-specific tokenization and stop-word removal. We compare two approaches: word embeddings and topic modeling (Fig. 1). For word embeddings, we first generate word vectors for the data using a pretrained Word2Vec (W2V) model [2]. The word vectors within a document are then averaged to produce a single feature vector for the document. Next, we fit a machine learning algorithm (random forest classifier or Support Vector Machine (SVM)) to the training vectors and obtain predictions for the test documents. For topic modeling, we used a Latent Dirichlet Allocation (LDA) model to generate five topics for all relevant documents [3]. For each new document, the output is the probability of the document belonging to each of these five topics. We classify the new document by comparing this probability measure with a relevancy threshold.

Results: The Word2Vec model combined with a random forest classifier outperformed the other approaches across the three languages (Fig. 2): the Chinese model has an F1-score of 89%, the Arabic model 86%, and the French model 94%. To decrease the chance of calling a potentially relevant article irrelevant, high recall was more important than high precision. In the Chinese model, the Word2Vec with random forest approach had the highest recall, at 98% (Table 1).

Conclusions: We present research findings on different approaches to identifying relevance to biosurveillance in non-English texts and identify the best performing methods for implementation in a biosurveillance online article harvesting system. Our initial results suggest that the word embeddings model has an advantage over topic modeling, and that the random forest classifier outperforms the SVM. Future work will expand the list of languages and methods to be compared, e.g., n-grams and non-negative matrix factorization. In addition, we will fine-tune the Arabic and French models for better accuracy.
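The averaging step described in the Methods (word vectors within a document are averaged into one feature vector) can be sketched in plain Python as follows. This is only an illustration: the function name `document_vector`, the three-dimensional toy embeddings, and the choice to skip out-of-vocabulary tokens are assumptions, not details given in the abstract. A real pipeline would look the vectors up in the pretrained Word2Vec model.

```python
from typing import Dict, List, Optional


def document_vector(tokens: List[str],
                    embeddings: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of all tokens that have an embedding.

    Tokens missing from `embeddings` are skipped (an assumption).
    Returns None when no token in the document is in the vocabulary.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical toy embeddings standing in for a pretrained model.
toy_embeddings = {
    "grippe": [1.0, 0.0, 2.0],
    "aviaire": [3.0, 2.0, 0.0],
}

# "inconnu" has no embedding and is ignored.
print(document_vector(["grippe", "aviaire", "inconnu"], toy_embeddings))
# [2.0, 1.0, 1.0]
```

The resulting fixed-length vector is what a classifier such as a random forest or SVM would be trained on, one vector per document.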

A Comparative Study using Feature Selection to Predict the Behaviour of Bank Customers

E3S Web of Conferences ◽

10.1051/e3sconf/202018401011 ◽

2020 ◽

Vol 184 ◽

pp. 01011

Author(s):

Sreethi Musunuru ◽

Mahaalakshmi Mukkamala ◽

Latha Kunaparaju ◽

N V Ganapathi Raju

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Random Forest Classifier ◽

Customer Behavior ◽

Machine Learning Algorithms ◽

The Status ◽

Personal Level ◽

Near Future ◽

Structure Communication

Although banks hold an abundance of data on their customers, it is not unusual for them to track the actions of their creditors regularly, both to improve the services they offer and to understand why many customers choose to leave for other banks. Analyzing customer behavior can be highly beneficial to banks, as they can reach out to their customers on a personal level and develop a business model that improves the pricing structure, communication, advertising, and benefits for their customers and themselves. Features such as the amount a customer credits every month, annual salary, and gender are used to classify customers with machine learning algorithms such as the K Neighbors Classifier and the Random Forest Classifier. Once customers are classified, banks can get an idea of who will stay with them and who will leave in the near future. Our study aims to remove features that are independent but have little influence on a customer's future status, without loss of accuracy, and to check whether this also improves the accuracy of the results.

Predicting Fitness Centre Dropout

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph181910465 ◽

2021 ◽

Vol 18 (19) ◽

pp. 10465

Author(s):

Pedro Sobreiro ◽

Pedro Guedes-Carvalho ◽

Abel Santos ◽

Paulo Pinheiro ◽

Celina Gonçalves

Keyword(s):

Machine Learning ◽

Random Forest ◽

Length Of Stay ◽

Random Forest Classifier ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Registration Data ◽

Relevant Variables ◽

Overall Performance ◽

Number Of Visits

Dropout is common among customers of sports services. In this study we evaluate the performance of machine learning algorithms in predicting dropout using available data about members' historical use of facilities. The data, relating to a sample of 5209 members, were taken from a Portuguese fitness centre and included the variables registration data, payments and frequency, age, sex, non-attendance days, amount billed, average weekly visits, total number of visits, visits hired per week, number of registration renewals, number of member referrals, total monthly registrations, and total member enrolment time, which may be indicative of members’ commitment. Whilst the Gradient Boosting Classifier had the best performance in predicting dropout (sensitivity = 0.986), the Random Forest Classifier was the best at predicting non-dropout (specificity = 0.790); the overall performance of the Gradient Boosting Classifier was superior to that of the Random Forest Classifier (accuracy 0.955 against 0.920). The most relevant variables for predicting dropout were “non-attendance days”, “total length of stay”, and “total amount billed”. The use of decision trees provides information that can be readily acted upon to identify the profiles of members at risk of dropout, and also offers guidelines for measures and policies to reduce it.
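The three figures quoted above (sensitivity, specificity, accuracy) all derive from the same confusion-matrix counts. The sketch below shows the arithmetic in plain Python, assuming label 1 means "dropout" and label 0 means "non-dropout". The function name and the eight toy labels are invented for illustration and are not the study's data.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # dropouts caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # dropouts missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # stayers recognised
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy labels: 4 dropouts followed by 4 non-dropouts.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
print(binary_metrics(y_true, y_pred))
# {'sensitivity': 0.75, 'specificity': 0.5, 'accuracy': 0.625}
```

A model can lead on sensitivity while trailing on specificity, as reported for the two classifiers here, because the two rates are computed over disjoint groups of members.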

Heart Failure Detection Using Quantum-Enhanced Machine Learning and Traditional Machine Learning Techniques for Internet of Artificially Intelligent Medical Things

Wireless Communications and Mobile Computing ◽

10.1155/2021/1616725 ◽

2021 ◽

Vol 2021 ◽

pp. 1-16

Author(s):

Yogesh Kumar ◽

Apeksha Koul ◽

Pushpendra Singh Sisodia ◽

Jana Shafi ◽

Verma Kavita ◽

...

Keyword(s):

Machine Learning ◽

Heart Failure ◽

Random Forest ◽

Learning Algorithms ◽

Failure Detection ◽

Random Forest Classifier ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Research Progress ◽

Record Management

Quantum-enhanced machine learning plays a vital role in healthcare because of its robust application to current research scenarios, the growth of novel medical trials, patient information and record management, chronic disease detection, and more. For this reason, the healthcare industry is applying quantum computing to sustain patient-oriented attention to healthcare patrons. The present work summarizes recent research progress in quantum-enhanced machine learning and its significance in heart failure detection on a dataset of 14 attributes. In this paper, the number of qubits in terms of the features of the heart failure data is normalized using min-max scaling, PCA, and the standard scaler, and is further optimized using the pipelining technique. The current work verifies that quantum-enhanced machine learning algorithms such as quantum random forest (QRF), quantum K nearest neighbour (QKNN), quantum decision tree (QDT), and quantum Gaussian Naïve Bayes (QGNB) are better than traditional machine learning algorithms in heart failure detection. The best accuracy, 0.89, was attained by the quantum random forest classifier. In addition, the quantum random forest classifier achieved the best F1 score, recall, and precision, at 0.88, 0.93, and 0.89, respectively. The computation time taken by traditional and quantum-enhanced machine learning algorithms has also been compared, where the quantum random forest has the least execution time by 150 microseconds. Hence, the work provides a way to quantify the differences between standard and quantum-enhanced machine learning algorithms in order to select the optimal method for detecting heart failure.
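As a rough illustration of the min-max and standard scaling named in the abstract, the sketch below rescales a single feature column in plain Python. It does not reproduce the authors' pipeline or any quantum component. The function names, the sample values, and the use of the population standard deviation are assumptions made for this example.

```python
import math
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Map values linearly onto [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values: List[float]) -> List[float]:
    """Centre on the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0]  # hypothetical feature values
print(min_max_scale(column))  # [0.0, 0.5, 1.0]
print([round(z, 3) for z in standardize(column)])  # [-1.225, 0.0, 1.225]
```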

A Study of Machine Learning Algorithms for DDoS Detection

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.34922 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 174-178

Author(s):

Sheikh Shehzad Ahmed

Keyword(s):

Machine Learning ◽

Random Forest ◽

Learning Algorithms ◽

Random Forest Classifier ◽

Attack Detection ◽

Machine Learning Algorithms ◽

The Internet ◽

Ddos Attacks ◽

Decision Tree Classifier ◽

Tree Classifier

The Internet is used practically everywhere in today's digital environment. With the increased use of the Internet comes an increase in the number of threats. DDoS attacks are one of the most popular types of cyber-attacks nowadays. With the fast advancement of technology, the harm caused by DDoS attacks has grown increasingly severe. Because DDoS attacks may readily modify the ports/protocols utilized or how they function, the basic features of these attacks must be examined. Machine learning approaches have also been used extensively in intrusion detection research. Still, it is unclear what features are applicable and which approach would be better suited for detection. With this in mind, the research presents a machine learning-based DDoS attack detection approach. To train the attack detection model, we employ four Machine Learning algorithms: Decision Tree classifier (ID3), k-Nearest Neighbors (k-NN), Logistic Regression, and Random Forest classifier. The results of our experiments show that the Random Forest classifier is more accurate in recognizing attacks.

Churn Prediction and Fraud Detection in Dairy Sector Using Machine Learning

Advances in Library and Information Science - Handbook of Research on Records and Information Management Strategies for Enhanced Knowledge Coordination ◽

10.4018/978-1-7998-6618-3.ch023 ◽

2021 ◽

pp. 391-406

Author(s):

Hitarth Deepak Shah ◽

Chintan M. Bhatt ◽

Shubham Mitul Patel ◽

Jayshil Bhavin Khajanchi ◽

Jaimin Narendrakumar Makwana

Keyword(s):

Machine Learning ◽

Random Forest ◽

Research Work ◽

Dairy Industry ◽

Fraud Detection ◽

Random Forest Classifier ◽

Machine Learning Algorithms ◽

Financial Capital ◽

Churn Prediction ◽

The World

India has been the world's largest milk-producing country for two decades. About 400 million litres of milk are produced every day. It is the responsibility of the dairy sector to look after farmers by providing them with various services for their livelihood. The growing financial capital of the dairy industry has attracted various kinds of fraudulent behaviour. Most suspicious activity is seen during collection at local collection centres: fake farmer entries, manually tampered quantity and fat entries, and adulteration are the main malpractices exercised by farmers. In this research work, the authors present an in-depth study of the most popular machine learning methods applied to the problems of farmer churn prediction and fraud detection in dairies. They applied a range of machine learning algorithms to get accurate results for churn and fraud detection. The XGBoost Classifier was the best for churn prediction, with 93% accuracy, while the random forest classifier turned out to be effective for fraud detection, with 94% accuracy.

Predicting stroke risk by Migraine using AI

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit217656 ◽

2021 ◽

pp. 285-290

Author(s):

Anchal Singh ◽

Dr. Surabhi Thorat

Keyword(s):

Machine Learning ◽

Random Forest ◽

Early Stage ◽

Blood Clot ◽

Random Forest Classifier ◽

Machine Learning Algorithms ◽

Machine Learning Classification ◽

Life Threatening ◽

Confusion Matrices ◽

The Brain

Stroke is a blood clot or bleed in the brain that can cause permanent damage affecting mobility, cognition, sight, or communication. It is the second leading cause of death worldwide and one of the most life-threatening diseases for persons above 65 years. It damages the brain much as a heart attack damages the heart. Every 4 minutes someone dies of stroke, but up to 80% of strokes can be prevented if the occurrence of stroke can be identified or predicted at an early stage. In this paper, I used different types of machine learning algorithms for stroke prediction on the Healthcare Dataset Stroke data. Four types of machine learning classification algorithms were applied: Linear Regression, Confusion matrices, Random Forest Classifier, and Logistic Regression were used to build the stroke prediction model. Support, Precision, Recall, and F1-score were used as performance measures for the machine learning models. The results showed that the Random Forest Classifier achieved the best accuracy, at 94% [1].

Development of Prediction Models Using Machine Learning Algorithms for Girls with Suspected Central Precocious Puberty: Retrospective Study (Preprint)

10.2196/preprints.11728 ◽

2018 ◽

Author(s):

Liyan Pan ◽

Guangjian Liu ◽

Xiaojian Mao ◽

Huixian Li ◽

Jiexin Zhang ◽

...

Keyword(s):

Machine Learning ◽

Retrospective Study ◽

Random Forest ◽

Precocious Puberty ◽

Prediction Models ◽

Central Precocious Puberty ◽

Machine Learning Algorithms ◽

Stimulation Test ◽

Gnrh Analogue ◽

Prediction Probability

BACKGROUND: Central precocious puberty (CPP) in girls seriously affects their physical and mental development in childhood. The method of diagnosis, the gonadotropin-releasing hormone (GnRH)-stimulation test or the GnRH analogue (GnRHa)-stimulation test, is expensive and makes patients uncomfortable due to the need for repeated blood sampling.

OBJECTIVE: We aimed to combine multiple CPP-related features and construct machine learning models to predict response to the GnRHa-stimulation test.

METHODS: In this retrospective study, we analyzed clinical and laboratory data of 1757 girls who underwent a GnRHa test in order to develop XGBoost and random forest classifiers for prediction of response to the GnRHa test. The local interpretable model-agnostic explanations (LIME) algorithm was used with the black-box classifiers to increase their interpretability. We measured the sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of the models.

RESULTS: Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with the AUC ranging from 0.88 to 0.90, sensitivity ranging from 77.91% to 77.94%, and specificity ranging from 84.32% to 87.66%. Basal serum luteinizing hormone, follicle-stimulating hormone, and insulin-like growth factor-I levels were found to be the three most important factors. In the interpretable models of LIME, the abovementioned variables made high contributions to the prediction probability.

CONCLUSIONS: The prediction models we developed can help diagnose CPP and may be used as a prescreening tool before the GnRHa-stimulation test.
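The AUC reported above has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison in plain Python. The function name, labels, and scores are invented for illustration and are unrelated to the study's data.

```python
from typing import List


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy predictions: 1 = positive response, 0 = negative response.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
print(round(roc_auc(labels, scores), 3))  # 0.889
```

The pairwise loop is quadratic in the number of cases, which is fine for a small example. Library implementations usually sort the scores instead.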

Exploratory Analysis of Driving Force of Wildfires in Australia: An Application of Machine Learning within Google Earth Engine

Remote Sensing ◽

10.3390/rs13010010 ◽

2020 ◽

Vol 13 (1) ◽

pp. 10

Author(s):

Andrea Sulova ◽

Jamal Jokar Arsanjani

Keyword(s):

Climate Change ◽

Machine Learning ◽

Random Forest ◽

Google Earth ◽

Summer Season ◽

Driving Factors ◽

Machine Learning Algorithms ◽

Classification And Regression Tree ◽

Training Dataset ◽

Google Earth Engine

Recent studies have suggested that, due to climate change, the number of wildfires across the globe has been increasing and will continue to grow. The recent massive wildfires that hit Australia during the 2019–2020 summer season raised questions about the extent to which the risk of wildfires can be linked to various climate, environmental, topographical, and social factors, and about how to predict fire occurrences in order to take preventive measures. Hence, the main objective of this study was to develop an automated, cloud-based workflow for generating a training dataset of fire events at a continental level, using freely available remote sensing data at a reasonable computational expense, for input into machine learning models. As a result, a data-driven model was set up on the Google Earth Engine platform, which is publicly accessible and open for further adjustments. The training dataset was applied to different machine learning algorithms, i.e., Random Forest, Naïve Bayes, and Classification and Regression Tree. The findings show that Random Forest outperformed the other algorithms, and hence it was used further to explore the driving factors using variable importance analysis. The study indicates the probability of fire occurrences across Australia and identifies the potential driving factors of Australian wildfires for the 2019–2020 summer season. The methodical approach, the results achieved, and the conclusions drawn can be of great importance to policymakers, environmentalists, and climate change researchers, among others.

Amide proton transfer weighted (APTw) imaging based radiomics allows for the differentiation of gliomas from metastases

Scientific Reports ◽

10.1038/s41598-021-85168-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Elisabeth Sartoretti ◽

Thomas Sartoretti ◽

Michael Wyss ◽

Carolin Reischauer ◽

Luuk van Smoorenburg ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Brain Tumors ◽

Proton Transfer ◽

Multilayer Perceptron ◽

Random Forest Classifier ◽

Amide Proton ◽

Low Grade ◽

Who Grade ◽

Amide Proton Transfer

We sought to evaluate the utility of radiomics for Amide Proton Transfer weighted (APTw) imaging by assessing its value in differentiating brain metastases from high- and low-grade glial brain tumors. We retrospectively identified 48 treatment-naïve patients (10 WHO grade 2, 1 WHO grade 3, 10 WHO grade 4 primary glial brain tumors and 27 metastases) with either primary glial brain tumors or metastases who had undergone APTw MR imaging. After image analysis with radiomics feature extraction and post-processing, machine learning algorithms (multilayer perceptron; random forest classifier) were trained on the features with stratified tenfold cross-validation and used to differentiate the brain neoplasms. The multilayer perceptron achieved an AUC of 0.836 (receiver operating characteristic curve) in differentiating primary glial brain tumors from metastases. The random forest classifier achieved an AUC of 0.868 in differentiating WHO grade 4 from WHO grade 2/3 primary glial brain tumors. For the differentiation of WHO grade 4 tumors from grade 2/3 tumors and metastases, an average AUC of 0.797 was achieved. Our results indicate that the use of radiomics for APTw imaging is feasible and that the differentiation of primary glial brain tumors from metastases is achievable with a high degree of accuracy.
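Stratified tenfold cross-validation, as mentioned above, splits the patients into ten folds while keeping the class proportions roughly equal in each fold. The sketch below is one simple way to build such folds in plain Python and is not the authors' implementation. The class counts (21 primary glial tumors, 27 metastases) come from the abstract. The function name, label strings, ordering, and round-robin assignment are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def stratified_folds(labels: Sequence[Hashable], k: int,
                     seed: int = 0) -> List[List[int]]:
    """Assign sample indices to k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


# 21 primary glial tumors and 27 metastases, as in the abstract.
labels = ["glioma"] * 21 + ["metastasis"] * 27
folds = stratified_folds(labels, k=10)
for fold in folds:
    n_glioma = sum(1 for i in fold if labels[i] == "glioma")
    print(len(fold), n_glioma, len(fold) - n_glioma)
```

In each round, one fold serves as the test set and the other nine as the training set, so every patient is used for testing exactly once.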

A novel framework for designing a multi-DoF prosthetic wrist control using machine learning

Scientific Reports ◽

10.1038/s41598-021-94449-1 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Chinmay P. Swami ◽

Nicholas Lenhard ◽

Jiyeon Kang

Keyword(s):

Machine Learning ◽

Random Forest ◽

Upper Limb ◽

Daily Living ◽

Machine Learning Algorithms ◽

Data Sets ◽

Random Forest Regression ◽

Prosthetic Devices ◽

Upper Limb Function ◽

The Neural Network

Prosthetic arms can significantly increase the upper limb function of individuals with upper limb loss; however, despite the development of various multi-DoF prosthetic arms, the rate of prosthesis abandonment is still high. One of the major challenges is to design a multi-DoF controller that has high precision, robustness, and intuitiveness for daily use. The present study demonstrates a novel framework for developing a controller that leverages machine learning algorithms and movement synergies to implement natural control of a 2-DoF prosthetic wrist for activities of daily living (ADL). The data were collected during ADL tasks performed by ten individuals wearing a wrist brace that emulated the absence of wrist function. Using these data, the neural network classifies the movement and random forest regression then computes the desired velocity of the prosthetic wrist. The models were trained and tested on ADLs, and their robustness was tested using cross-validation and holdout data sets. The proposed framework demonstrated high accuracy (F1 score of 99% for the classifier and Pearson’s correlation of 0.98 for the regression). Additionally, the interpretable nature of random forest regression was used to verify the targeted movement synergies. The present work provides a novel and effective framework for developing intuitive control of multi-DoF prosthetic devices.
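The regression result above is reported as a Pearson correlation between predicted and target values. A minimal sketch of that statistic in plain Python is shown below. The function name and the sample velocities are hypothetical and unrelated to the study's data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Hypothetical target and predicted wrist velocities.
target = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
print(round(pearson_r(target, predicted), 6))  # 1.0
```

Note that a correlation near 1 indicates that predictions track the targets linearly. It does not by itself show that they match in scale, as the toy data (predictions exactly twice the targets) illustrates.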
