A cluster-genetic programming approach for detecting pulmonary tuberculosis

Tuberculosis (TB) remains a global health concern. It commonly spreads through the air and affects people with weakened immune systems. TB is among the most common and best-known health problems in low- and middle-income countries. Genetic programming (GP) is a machine learning method for discovering useful relationships among the variables in complex clinical data. It is particularly appropriate when the form of the solution model is unknown a priori. The main objective of this study is to develop a model that can detect positive cases among suspected TB patients using a genetic programming approach. In this paper, GP is used to identify positive cases of tuberculosis in a real data set of TB suspects and hospitalized patients. First, the dataset is pre-processed, and target variables are identified using cluster analysis. This data-driven cluster analysis identifies two distinct clusters of patients, representing TB positive and TB negative. Then, GP is trained on the training datasets to construct a prediction model and tested on a separate new dataset. Over 30 runs, the median performance of GP on test data was good (sensitivity=0.78, specificity=0.95, accuracy=0.89, AUC=0.91). We find that GP performs better in predicting TB than other machine learning models. The study demonstrates that the GP model might be used to support clinicians in screening TB patients.
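The sensitivity, specificity, and accuracy quoted above are standard ratios over a binary confusion matrix. A minimal sketch of how they are computed follows; the function name and the counts are hypothetical and chosen only for illustration, and they are not the study's data.

```python
def screening_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # share of true positives detected
    specificity = tn / (tn + fp)                 # share of true negatives detected
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all cases classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts, NOT taken from the paper
sens, spec, acc = screening_metrics(tp=8, fp=2, tn=18, fn=2)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

AUC is not covered here because it requires ranked prediction scores rather than a single confusion matrix.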

An Experimental Study of Spammer Detection on Chinese Microblogs

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s021819402040029x ◽

2020 ◽

Vol 30 (11n12) ◽

pp. 1759-1777

Author(s):

Jialing Liang ◽

Peiquan Jin ◽

Lin Mu ◽

Jie Zhao

Keyword(s):

Machine Learning ◽

Social Media ◽

User Behavior ◽

Real Data ◽

User Profile ◽

Data Set ◽

Sina Weibo ◽

Factors Affecting ◽

The Government ◽

Hot Event

With the development of Web 2.0, social media such as Twitter and Sina Weibo have become essential platforms for disseminating hot events. At the same time, because of the free policy of microblogging services, users can freely post user-generated content on microblogging platforms. As a result, more and more hot events on microblogging platforms have been labeled as spammers. Spammers not only hurt the healthy development of social media but also introduce many economic and social problems. Therefore, the government and enterprises must distinguish whether a hot event on microblogging platforms is a spammer or a naturally developing event. In this paper, we focus on the hot event list on Sina Weibo and collect the relevant microblogs of each hot event to study methods for detecting spammers. Notably, we develop an integral feature set consisting of user profile, user behavior, and user relationships to reflect the various factors affecting the detection of spammers. Then, we employ typical machine learning methods to conduct extensive experiments on detecting spammers. We use a real data set crawled from the most prominent Chinese microblogging platform, Sina Weibo, and evaluate the performance of 10 machine learning models with five sampling methods. The results in terms of various metrics show that the Random Forest model combined with the over-sampling method achieves the best accuracy in detecting spammers and non-spammers.
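The abstract does not say which over-sampling variant was used. As one simple possibility, the sketch below shows random over-sampling, which duplicates minority-class samples until every class is as large as the largest one. The function name, feature values, and labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes match the largest."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):    # no iterations for the majority class
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical one-feature data: four normal accounts, one spammer
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["normal", "normal", "normal", "normal", "spammer"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Over-sampling should be applied to the training split only, so that duplicated samples do not leak into the evaluation data.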

IMAGE BASED RECOGNITION OF DYNAMIC TRAFFIC SITUATIONS BY EVALUATING THE EXTERIOR SURROUNDING AND INTERIOR SPACE OF VEHICLES

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprsarchives-xl-3-w3-161-2015 ◽

2015 ◽

Vol XL-3/W3 ◽

pp. 161-168

Author(s):

A. Hanel ◽

H. Klöden ◽

L. Hoegner ◽

U. Stilla

Keyword(s):

Machine Learning ◽

Real Data ◽

Traffic Situation ◽

Dynamic Traffic ◽

Interior Space ◽

Data Set ◽

Road Users ◽

Vehicle Fleet ◽

New Strategies

Today, cameras mounted in vehicles are used to observe the driver as well as the objects around a vehicle. This article outlines a concept for image-based recognition of dynamic traffic situations. A dynamic traffic situation is described by road users and their intentions. Images are taken by a vehicle fleet and aggregated on a server. New machine learning strategies are applied to these images iteratively whenever new data arrives on the server. The results of the learning process are models describing the traffic situation, which are transmitted back to the recording vehicles. The recognition is performed as a standalone function in the vehicles and uses the received models. This method can be expected to make the detection and classification of objects around the vehicles more reliable. In addition, predicting their actions for the next few seconds should be possible. As one example of how this concept is used, a method to recognize the illumination situation of a traffic scene is described. This makes it possible to handle the different appearances of objects depending on the illumination of the scene. Different illumination classes are defined to distinguish different illumination situations. Intensity-based features are extracted from the images and used by a classifier to assign an image to an illumination class. This method is tested on a real data set of daytime and nighttime images. The illumination class is classified correctly for more than 80% of the images.
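To make the illumination example concrete, the sketch below computes one intensity-based feature (mean gray value) and assigns a class with a fixed threshold. The threshold stands in for the classifier, which the abstract does not specify; the function names, class names, threshold value, and pixel data are all assumptions.

```python
def mean_intensity(image):
    """Average gray value of an image given as rows of 0-255 pixel values."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def illumination_class(image, threshold=80.0):
    """Assign 'day' or 'night' from mean intensity; the threshold is an assumed value."""
    return "day" if mean_intensity(image) >= threshold else "night"


# Tiny hypothetical 2x2 grayscale images
bright = [[200, 180], [190, 210]]
dark = [[10, 20], [15, 5]]
print(illumination_class(bright), illumination_class(dark))
```

A practical system would likely use several features, such as intensity histograms, and a trained classifier rather than a single hand-set threshold.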

Graph-Based Semi-Supervised Learning With Big Data

Cognitive Analytics ◽

10.4018/978-1-7998-2460-2.ch012 ◽

2020 ◽

pp. 214-244

Author(s):

Prithish Banerjee ◽

Mark Vere Culp ◽

Kenneth Jospeh Ryan ◽

George Michailidis

Keyword(s):

Machine Learning ◽

Big Data ◽

Supervised Learning ◽

Prior Knowledge ◽

Linear Algebra ◽

Real Data ◽

Data Set ◽

Regression Problems ◽

Classification And Regression ◽

Empirical Demonstration

This chapter presents some popular graph-based semi-supervised approaches. These techniques apply to classification and regression problems and can be extended to big data problems using recently developed anchor graph enhancements. The background necessary for understanding this chapter includes linear algebra and optimization. No prior knowledge of machine learning methods is necessary. An empirical demonstration of these techniques is also provided on real data set benchmarks.
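One widely used graph-based semi-supervised technique is label propagation, in which each unlabeled node repeatedly takes the weighted average of its neighbors' scores while labeled nodes stay fixed. The sketch below is a generic illustration of that idea and is not necessarily one of the methods covered in the chapter; the function name and the toy graph are hypothetical.

```python
def propagate_labels(weights, labeled, iterations=200):
    """Iteratively set each unlabeled node's score to the weighted mean of its neighbors.

    weights: symmetric adjacency matrix (list of lists) with a zero diagonal.
    labeled: dict mapping node index to a fixed score in [0, 1].
    """
    n = len(weights)
    scores = [labeled.get(i, 0.5) for i in range(n)]  # neutral start for unlabeled nodes
    for _ in range(iterations):
        for i in range(n):
            if i in labeled:
                continue                              # labeled nodes are clamped
            total = sum(weights[i])
            if total > 0:
                scores[i] = sum(w * s for w, s in zip(weights[i], scores)) / total
    return scores


# Chain graph 0 - 1 - 2 - 3 with the two end nodes labeled
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = propagate_labels(chain, labeled={0: 1.0, 3: 0.0})
print([round(s, 3) for s in scores])
```

On this chain the unlabeled scores converge to a linear interpolation between the two labeled ends, which is the expected harmonic solution.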

Ensemble modeling approach for rainfall/groundwater balancing

Journal of Hydroinformatics ◽

10.2166/hydro.2007.102 ◽

2007 ◽

Vol 9 (2) ◽

pp. 95-106 ◽

Cited By ~ 6

Author(s):

D. Laucelli ◽

O. Giustolisi ◽

V. Babovic ◽

M. Keijzer

Keyword(s):

Machine Learning ◽

Genetic Programming ◽

Empirical Evidence ◽

Averaging Method ◽

Total Error ◽

Ensemble Methods ◽

Real Data ◽

Symbolic Regression ◽

Ensemble Modeling ◽

Physical Phenomena

This paper introduces an application of machine learning to real data. It deals with Ensemble Modeling, a simple averaging method for obtaining more reliable approximations using symbolic regression. The contributions of bias and variance to the total error, and ensemble methods for reducing errors due to variance, are addressed together with a specific application of ensemble modeling to hydrological forecasts. This work provides empirical evidence that genetic programming can greatly benefit from this approach in forecasting and simulating physical phenomena. Further considerations are also taken into account, such as the influence of Genetic Programming parameter settings on the model's performance.
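The averaging step can be sketched in a few lines: predictions from several independently trained models are averaged point by point, which tends to cancel the variance of the individual models. The function name and the numbers below are hypothetical stand-ins for the outputs of separate symbolic regression runs.

```python
from statistics import mean


def ensemble_average(predictions):
    """Average the predictions of several models point by point.

    predictions: list of equal-length prediction lists, one per model.
    """
    return [mean(step) for step in zip(*predictions)]


# Hypothetical outputs of three models at three time steps
model_outputs = [
    [1.0, 2.0, 3.0],
    [1.2, 1.8, 3.3],
    [0.8, 2.2, 2.7],
]
combined = ensemble_average(model_outputs)
print(combined)
```

Averaging reduces the variance part of the error only; a bias shared by all ensemble members is left unchanged.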

Migration moveout analysis and depth focusing

Geophysics ◽

10.1190/1.1443354 ◽

1993 ◽

Vol 58 (1) ◽

pp. 91-100 ◽

Cited By ~ 52

Author(s):

Claude F. Lafond ◽

Alan R. Levander

Keyword(s):

Heterogeneous Media ◽

A Priori ◽

Complex Structure ◽

Tomographic Reconstruction ◽

Synthetic Data ◽

Real Data ◽

Velocity Model ◽

Velocity Analysis ◽

Data Set ◽

Velocity Models

Prestack depth migration still suffers from the problems associated with building appropriate velocity models. The two main after‐migration, before‐stack velocity analysis techniques currently used, depth focusing and residual moveout correction, have found good use in many applications but have also shown their limitations in the case of very complex structures. To address this issue, we have extended the residual moveout analysis technique to the general case of heterogeneous velocity fields and steep dips, while keeping the algorithm robust enough to be of practical use on real data. Our method is not based on analytic expressions for the moveouts and requires no a priori knowledge of the model, but instead uses geometrical ray tracing in heterogeneous media, layer‐stripping migration, and local wavefront analysis to compute residual velocity corrections. These corrections are back projected into the velocity model along raypaths in a way that is similar to tomographic reconstruction. While this approach is more general than existing migration velocity analysis implementations, it is also much more computer intensive and is best used locally around a particularly complex structure. We demonstrate the technique using synthetic data from a model with strong velocity gradients and then apply it to a marine data set to improve the positioning of a major fault.

MOOC Video Personalized Classification Based on Cluster Analysis and Process Mining

Sustainability ◽

10.3390/su12073066 ◽

2020 ◽

Vol 12 (7) ◽

pp. 3066

Author(s):

Feng Zhang ◽

Di Liu ◽

Cong Liu

Keyword(s):

Cluster Analysis ◽

Flipped Classroom ◽

Process Model ◽

Question Answering ◽

Process Mining ◽

Subjective Evaluation ◽

Real Data ◽

Data Set ◽

Teacher Needs ◽

Knowledge Levels

In teaching based on MOOCs (Massive Open Online Courses) and the flipped classroom, a teacher needs to understand, in real time, the difficulty and importance of MOOC videos for students at different knowledge levels. In this way, a teacher can focus on the specific difficulties and key points contained in the videos for students in a flipped classroom, and personalized teaching can be implemented. We propose an approach to personalized MOOC video classification based on cluster analysis and process mining to help a teacher understand the difficulty and importance of MOOC videos for students at different knowledge levels. Specifically, students are first clustered by knowledge level using question answering data. Then, we propose a process model for a group of students that reflects the overall video watching behavior of these students. Next, we use process mining to mine the process model of each student cluster from the video watching data of the students involved. Finally, we propose an approach to measure the difficulty and importance of a video based on a process model. With this approach, MOOC videos can be classified by difficulty and importance for students at different knowledge levels, so a teacher can run a flipped classroom more efficiently. Experiments on a real data set show that the difficulty and importance of videos obtained by the proposed approach reflect students’ subjective evaluation of the videos.

GPdotNET Open Source Software for Running Genetic Programming

Optimized Genetic Programming Applications - Advances in Medical Technologies and Clinical Practice ◽

10.4018/978-1-5225-6005-0.ch005 ◽

2018 ◽

pp. 183-242

Keyword(s):

Machine Learning ◽

Genetic Programming ◽

Search Algorithm ◽

Supervised Machine Learning ◽

Learning Problems ◽

Classification Models ◽

R Language ◽

Programming Tool ◽

Wolfram Mathematica ◽

Gp Model

In this chapter, the GPdotNET v5 genetic programming tool is presented from the user's perspective. GPdotNET is a computer program for running tree-based genetic programming, and its application is modelling supervised machine learning problems. The chapter contains detailed information on how to use GPdotNET to prepare data, set up GP parameters, and run the GP search algorithm. Since GPdotNET supports all three kinds of supervised machine learning problems, the chapter contains three use cases which demonstrate how to successfully build high quality regression, binary, and classification models. GPdotNET contains an export module, with which the user can export a GP model to Excel, R language, and Wolfram Mathematica.

STACKING OF THE SGTM NEURAL-LIKE STRUCTURE WITH RBF LAYER BASED ON GENERATION OF A RANDOM CURTAIN OF ITS HYPERPARAMETERS FOR PREDICTION TASKS

Ukrainian Journal of Information Technology ◽

10.23939/ujit2021.03.049 ◽

2021 ◽

Vol 3 (1) ◽

pp. 49-55

Author(s):

R. O. Tkachenko ◽

I. V. Izonіn ◽

V. M. Danylyk ◽

V. Yu. Mykhalevych ◽

...

Keyword(s):

Machine Learning ◽

Prediction Accuracy ◽

Experimental Studies ◽

Real Data ◽

Optimal Number ◽

Individual Member ◽

Optimal Parameters ◽

Learning Methods ◽

Data Set ◽

Machine Learning Methods

Improving prediction accuracy with artificial intelligence tools is an important task in industry, economics, and medicine. Ensemble learning is one possible way to solve this task. In particular, stacking models built from different machine learning methods, or from different parts of the existing data set, demonstrate high prediction accuracy. However, the need for proper selection of ensemble members, their optimal parameters, etc., makes the construction of such models very time-consuming. This paper proposes a slightly different approach to building a simple but effective ensemble method. The authors developed a new model for stacking nonlinear SGTM neural-like structures, which uses only one type of ANN as the element base of the ensemble and the same training sample for all members of the ensemble. This approach provides a number of advantages over procedures for building ensembles from different machine learning methods, at least with respect to selecting the optimal parameters for each of them. In our case, a tuple of random hyperparameters for each individual member of the ensemble was used as the basis of the ensemble. That is, each combined SGTM neural-like structure with an additional RBF layer, as a separate member of the ensemble, is trained using different, randomly selected values of RBF centers and centers of mass. This provides the necessary variety of ensemble elements. Experimental studies on the effectiveness of the developed ensemble were conducted using a real data set. The task is to predict the amount of health insurance costs based on a number of independent attributes. The optimal number of ensemble members, which provides the highest prediction accuracy, is determined experimentally. The results of the developed ensemble are compared with existing methods of this class. The developed ensemble is shown to achieve the highest prediction accuracy with a satisfactory training time.

Selecting the Best Forecasting-Implied Volatility Model Using Genetic Programming

Journal of Applied Mathematics and Decision Sciences ◽

10.1155/2009/179230 ◽

2009 ◽

Vol 2009 ◽

pp. 1-19 ◽

Cited By ~ 5

Author(s):

Wafa Abdelmalek ◽

Sana Ben Hamida ◽

Fathi Abid

Keyword(s):

Time Series ◽

Genetic Programming ◽

Implied Volatility ◽

Real Data ◽

Programming Approach ◽

Sample Mean ◽

Mean Squared Errors ◽

Hedging Strategies ◽

Volatility Model ◽

Out Of Sample

Volatility is a crucial variable in option pricing and hedging strategies. The aim of this paper is to provide some initial evidence of the empirical relevance of genetic programming to volatility forecasting. Using real data from S&P500 index options, the ability of genetic programming to forecast Black and Scholes-implied volatility is compared between time series samples and moneyness-time to maturity classes. Total and out-of-sample mean squared errors are used as measures of forecasting performance. Comparisons reveal that the time series model seems to be more accurate in forecasting implied volatility than the moneyness-time to maturity models. Overall, the results are strongly encouraging and suggest that the genetic programming approach works well in solving financial problems.
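The two performance measures can be illustrated with a short sketch: mean squared error over the whole series, and over only the observations held out from training. The function name, the volatility values, and the split point are hypothetical and are not results from the paper.

```python
def mean_squared_error(actual, forecast):
    """Mean of squared differences between observed and forecast values."""
    if len(actual) != len(forecast):
        raise ValueError("series must have the same length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical implied volatilities and forecasts, NOT data from the paper
implied_vol = [0.20, 0.22, 0.25, 0.21, 0.19]
forecast_vol = [0.21, 0.22, 0.23, 0.22, 0.20]
split = 3  # assumed boundary: the first 3 points were used for training

mse_total = mean_squared_error(implied_vol, forecast_vol)
mse_out_of_sample = mean_squared_error(implied_vol[split:], forecast_vol[split:])
print(mse_total, mse_out_of_sample)
```

Reporting the out-of-sample error separately matters because an evolved model can fit its training sample closely while forecasting unseen data poorly.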

A credibility method for profitable cross-selling of insurance products

Annals of Actuarial Science ◽

10.1017/s1748499511000327 ◽

2011 ◽

Vol 6 (1) ◽

pp. 65-75 ◽

Cited By ~ 8

Author(s):

Fredrik Thuring

Keyword(s):

A Priori ◽

Real Data ◽

Risk Profile ◽

Insurance Company ◽

Data Set ◽

Credibility Estimator ◽

Insurance Product ◽

Additional Product

A method is presented for identifying a set of customers expected to be profitable when offered an additional insurance product. The method estimates a customer-specific latent risk profile for the additional product from the data available for that customer's existing insurance product. For this purpose, a multivariate credibility estimator is considered, and we investigate the effect of assuming that one (of two) insurance products is inactive (without available claims information) when estimating the latent risk profile. Instead, the available customer-specific claims information from the active existing insurance product is used to estimate the risk profile and then to assess whether or not to include a specific customer in the expected profitable set. The method is tested on a large real data set from a Danish insurance company, and it is shown to produce sets of customers with up to 36% fewer claims than expected a priori. It is therefore argued that an insurance company could consider the proposed method when cross-selling insurance products to existing customers.
