Are missing values important for earnings forecasts? A machine learning perspective

2022 ◽  
pp. 1-20
Author(s):  
Ajim Uddin ◽  
Xinyuan Tao ◽  
Chia-Ching Chou ◽  
Dantong Yu
2020 ◽  
Vol 21 ◽  
Author(s):  
Sukanya Panja ◽  
Sarra Rahem ◽  
Cassandra J. Chu ◽  
Antonina Mitrofanova

Background: In recent years, the availability of high-throughput technologies, the establishment of large molecular patient data repositories, and advances in computing power and storage have allowed the elucidation of complex mechanisms implicated in therapeutic response in cancer patients. The breadth and depth of such data, alongside experimental noise and missing values, require a sophisticated human-machine interaction that allows effective learning from complex data and accurate forecasting of future outcomes, ideally embedded in the core of machine learning design. Objective: In this review, we will discuss machine learning techniques utilized for modeling treatment response in cancer, including Random Forests, support vector machines, neural networks, and linear and logistic regression. We will review their mathematical foundations and discuss their limitations and alternative approaches, all in light of their application to therapeutic response modeling in cancer. Conclusion: We hypothesize that the increase in the number of patient profiles and the potential temporal monitoring of patient data will establish even more complex techniques, such as deep learning and causal analysis, as central players in therapeutic response modeling.


Author(s):  
Martínez‐Plumed Fernando ◽  
Ferri Cèsar ◽  
Nieves David ◽  
Hernández‐Orallo José

2021 ◽  
Author(s):  
Markus Deppner ◽  
Bedartha Goswami

The impact of the El Niño Southern Oscillation (ENSO) on rivers is well known, but most existing studies involving streamflow data are severely limited by data coverage. Time series of gauging stations fade in and out over time, which makes large-scale, long-term hydrological analyses, or studies of rarely occurring extreme events, challenging. Here, we use a machine learning approach to infer missing streamflow data based on temporal correlations of stations with missing values to others with data. By using 346 stations from the “Global Streamflow Indices and Metadata archive” (GSIM) that already cover the full 40-year timespan, in conjunction with Gaussian processes, we were able to extend our data by estimating missing values for an additional 646 stations, allowing us to include a total of 992 stations. We then investigate the impact of the six strongest El Niño (EN) events on rivers in South America between 1960 and 2000. Our analysis shows a strong correlation between ENSO events and extreme river dynamics in the southeast of Brazil, Caribbean South America, and parts of the Amazon basin. Furthermore, we see a peak in the number of stations showing maximum river discharge all over Brazil during the EN of 1982/83, which has been linked to severe floods in the east of Brazil and parts of Uruguay and Paraguay. However, EN events of similar intensity in other years did not evoke floods of such magnitude, so the additional drivers of the 1982/83 floods need further investigation. By using machine learning methods to infer data for gauging stations with missing values, we were able to extend our data almost three-fold, revealing a possibly heavier and spatially larger impact of the 1982/83 EN on South America's hydrology than indicated in the literature.
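
As a rough illustration of the gap-filling idea described above, the sketch below regresses a target station's streamflow on a correlated neighbouring station with a Gaussian process and fills the gaps from the posterior mean. It uses scikit-learn on synthetic series; the GSIM data, station pairing, and kernel choice of the actual study are not reproduced here.

```python
# Minimal sketch: fill gaps in one station's streamflow using a Gaussian
# process regressed on a correlated donor station. Synthetic data and an
# RBF + white-noise kernel are assumptions, not the study's actual setup.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
n = 500
donor = np.sin(np.linspace(0, 20, n)) + 0.1 * rng.normal(size=n)   # complete station
target = 0.8 * donor + 0.1 * rng.normal(size=n)                    # station with gaps
target[rng.choice(n, size=150, replace=False)] = np.nan            # simulate missing values

observed = ~np.isnan(target)
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(donor[observed].reshape(-1, 1), target[observed])

filled = target.copy()
mean, std = gp.predict(donor[~observed].reshape(-1, 1), return_std=True)
filled[~observed] = mean                       # posterior mean as the imputed value
print(f"imputed {np.sum(~observed)} values, mean predictive std = {std.mean():.3f}")
```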


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 183-195
Author(s):  
Thingbaijam Lenin ◽  
N. Chandrasekaran

A student’s academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of a course. Machine learning techniques may be used to develop a model for predicting a student’s performance as early as at the time of admission. The task, however, is challenging because the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms, adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting the performance of students at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are highly imbalanced and also contain missing values. We employ the k-nearest neighbour (knn) data imputation technique to tackle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the models and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best results are provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
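
A minimal sketch of the pipeline this abstract describes, assuming scikit-learn stand-ins: kNN imputation of missing values followed by ensemble classifiers evaluated with balanced accuracy and F-score under 10-fold cross-validation. The synthetic, imbalanced data below merely stand in for the demographic, academic, and personality records used in the study.

```python
# Minimal sketch: kNN imputation, then ensemble classifiers scored with
# balanced accuracy and F-score under 10-fold cross-validation.
# Data are synthetic and imbalanced; not the study's actual records.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=600, n_features=15, weights=[0.9, 0.1],
                           random_state=42)          # imbalanced, like the study data
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan               # inject ~5% missing values

for name, clf in [("random forest", RandomForestClassifier(random_state=42)),
                  ("adaboost", AdaBoostClassifier(random_state=42))]:
    pipe = make_pipeline(KNNImputer(n_neighbors=5), clf)
    scores = cross_validate(pipe, X, y, cv=10,
                            scoring=["balanced_accuracy", "f1"])
    print(name,
          "balanced acc = %.3f" % scores["test_balanced_accuracy"].mean(),
          "F1 = %.3f" % scores["test_f1"].mean())
```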


2021 ◽  
Vol 102 ◽  
pp. 04004
Author(s):  
Jesse Jeremiah Tanimu ◽  
Mohamed Hamada ◽  
Mohammed Hassan ◽  
Saratu Yusuf Ilu

With the advent of new technologies in the medical field, huge amounts of cancer data have been collected and are readily accessible to the medical research community. Over the years, researchers have employed advanced data mining and machine learning techniques to develop better models that can analyze datasets to extract patterns, ideas, and hidden knowledge. The mined information can be used to support decision making in diagnostic processes. These techniques can effectively predict future outcomes of certain diseases and can discover patterns and relationships within complex datasets. In this research, a predictive model for patients’ cervical cancer outcomes has been developed, given risk patterns from individual medical records and preliminary screening tests. This work presents a decision tree (DT) classification algorithm and shows the advantage of feature selection approaches in the prediction of cervical cancer, using the recursive feature elimination technique for dimensionality reduction to improve the accuracy, sensitivity, and specificity of the model. The dataset employed here suffers from missing values and is highly imbalanced. Therefore, a combined under- and oversampling technique, SMOTETomek, was employed. A comparative analysis of the proposed model has been performed to show the effectiveness of feature selection and class-imbalance handling based on the classifier’s accuracy, sensitivity, and specificity. The DT with the selected features and SMOTETomek gives better results, with an accuracy of 98%, sensitivity of 100%, and specificity of 97%. The decision tree classifier is shown to handle the classification task very well when the features are reduced and the class imbalance problem is addressed.
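
A minimal sketch of the described workflow under stated assumptions: recursive feature elimination for dimensionality reduction, SMOTETomek (from imbalanced-learn) to rebalance the training set, and a decision tree scored by sensitivity and specificity. Synthetic data replace the cervical cancer records, and the feature counts are illustrative.

```python
# Minimal sketch: recursive feature elimination, SMOTETomek resampling of the
# imbalanced training set, then a decision tree. Synthetic data stand in for
# the cervical cancer records used in the paper.
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=30, n_informative=8,
                           weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 1) keep the most informative features
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=10)
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

# 2) rebalance the training set only (combined over- and under-sampling)
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X_tr_sel, y_tr)

# 3) train and report sensitivity / specificity
clf = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te_sel)).ravel()
print("sensitivity = %.3f, specificity = %.3f" % (tp / (tp + fn), tn / (tn + fp)))
```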


2021 ◽  
Author(s):  
Cao Truong Tran

Classification is a major task in machine learning and data mining. Many real-world datasets suffer from the unavoidable issue of missing values. Classification with incomplete data has to be handled carefully, because inadequate treatment of missing values causes large classification errors. Most existing research on classification with incomplete data has focused on improving effectiveness, but has not adequately addressed the efficiency of applying classifiers to classify unseen instances, which is much more important than the act of creating classifiers. A common approach to classification with incomplete data is to use imputation methods to replace missing values with plausible values before building classifiers and classifying unseen instances. This approach provides complete data that can then be used by any classification algorithm, but sophisticated imputation methods are usually computationally intensive, especially during the application phase of classification. Another approach is to build a classifier that can work directly with missing values. This approach requires no time for estimating missing values, but it often generates inaccurate and complex classifiers when faced with numerous missing values. A more recent approach, which also avoids estimating missing values, is to build a set of classifiers and then select applicable classifiers for classifying unseen instances. However, this approach is also often inaccurate and takes a long time to find applicable classifiers when many values are missing.
The overall goal of the thesis is to simultaneously improve the effectiveness and efficiency of classification with incomplete data by using evolutionary machine learning techniques for feature selection, clustering, ensemble learning, feature construction, and constructing classifiers.
The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data.
The thesis develops wrapper-based feature selection methods to improve the input space for classification algorithms that can work directly with incomplete data. The methods not only improve classification accuracy but also reduce the complexity of such classifiers.
The thesis develops a feature construction method to improve the input space for classification with incomplete data by proposing interval genetic programming: genetic programming with a set of interval functions. The method improves classification accuracy and reduces the complexity of classifiers.
The thesis develops an ensemble approach to classification with incomplete data by integrating imputation, feature selection, and ensemble learning. The results show that the approach is more accurate and faster than previous common methods for classification with incomplete data.
The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms that can work directly with incomplete data.
In summary, the thesis develops a range of approaches for simultaneously improving the effectiveness and efficiency of classification with incomplete data by using a range of evolutionary machine learning techniques.
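
For context, the sketch below shows only the baseline "common approach" the thesis discusses: impute missing values first, then train an off-the-shelf classifier on the completed data. The thesis's evolutionary and interval genetic programming methods are not reproduced; the dataset and mean imputation here are illustrative assumptions.

```python
# Minimal sketch of the "common approach": impute missing values, then train
# a standard classifier on the completed data. Not the thesis's own methods.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.10] = np.nan            # simulate 10% missing entries

pipe = make_pipeline(SimpleImputer(strategy="mean"),   # imputation step
                     DecisionTreeClassifier(random_state=1))
print("accuracy: %.3f" % cross_val_score(pipe, X, y, cv=10).mean())
```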


Proceedings ◽  
2019 ◽  
Vol 33 (1) ◽  
pp. 16
Author(s):  
Ali Mohammad-Djafari

Signal and image processing have always been the main tools in many areas, in particular in medical and biomedical applications. Nowadays, there is a great number of toolboxes, general purpose and very specialized, in which classical techniques are implemented and can be used: all the transformation-based methods (Fourier, wavelets, ...) as well as model-based and iterative regularization methods. Statistical methods have also shown their success in some areas where parametric models are available. Bayesian inference based methods have had great success, in particular when the data are noisy, uncertain, incomplete (missing values) or contain outliers, and where there is a need to quantify uncertainties. In some applications, nowadays, we have more and more data. To use these “Big Data” to extract more knowledge, machine learning and artificial intelligence tools have shown success and become mandatory. However, even if these methods have shown success in many domains of machine learning such as classification and clustering, their use in real scientific problems is limited. The main reasons are twofold: first, the users of these tools cannot explain the reasons when they are successful and when they are not; second, in general, these tools cannot quantify the remaining uncertainties. Model-based and Bayesian inference approaches have been very successful in linear inverse problems. However, adjusting the hyperparameters is complex and the cost of the computation is high. Convolutional neural network (CNN) and deep learning (DL) tools can be useful for pushing these limits further. On the other side, model-based methods can be helpful for selecting the structure of CNN and DL models, which is crucial to ML success. In this work, I first provide an overview and then a survey of the aforementioned methods and explore the possible interactions between them.


Atmosphere ◽  
2021 ◽  
Vol 12 (9) ◽  
pp. 1158
Author(s):  
Juan Antonio Bellido-Jiménez ◽  
Javier Estévez Gualda ◽  
Amanda Penélope García-Marín

The presence of missing data in hydrometeorological datasets is a common problem, usually due to sensor malfunction, deficiencies in record storage and transmission, or other issues in recovery procedures. These missing values are the primary source of problems when analyzing and modeling their spatial and temporal variability. Thus, accurate gap-filling techniques for rainfall time series are necessary to obtain complete datasets, which are crucial for studying climate change evolution. In this work, several machine learning models have been assessed to gap-fill rainfall data, using different approaches and locations in the semiarid region of Andalusia (Southern Spain). Based on the obtained results, the use of neighbor data, located within a 50 km radius, highly outperformed the rest of the assessed approaches, with RMSE (root mean squared error) values up to 1.246 mm/day, MBE (mean bias error) values up to −0.001 mm/day, and R2 values up to 0.898. Besides, results in the inland area outperformed those in the coastal area in most locations, highlighting efficiency effects related to the distance to the sea (up to a 63.89% improvement in terms of RMSE). Finally, machine learning (ML) models (especially MLP, the multilayer perceptron) notably outperformed simple linear regression estimations at the coastal sites, whereas at inland locations the improvements were not as significant.
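
The sketch below illustrates the neighbour-based gap-filling setup in a hedged form: rainfall at a few nearby stations serves as input to an MLP and to a linear-regression baseline, compared with RMSE, MBE, and R2. The synthetic series and network size are assumptions; the Andalusian station data and the paper's exact configurations are not reproduced.

```python
# Minimal sketch: rainfall at nearby stations (within some radius) feeds an
# MLP and a linear-regression baseline; both are compared with RMSE, MBE, R2.
# Station series here are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
n = 2000
neighbours = rng.gamma(shape=0.3, scale=5.0, size=(n, 4))      # 4 nearby stations
target = neighbours @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(neighbours, target, random_state=7)

for name, model in [("MLP", MLPRegressor(hidden_layer_sizes=(32, 16),
                                          max_iter=2000, random_state=7)),
                    ("linear", LinearRegression())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    mbe = np.mean(pred - y_te)                                  # mean bias error
    print(f"{name}: RMSE={rmse:.3f}  MBE={mbe:.3f}  R2={r2_score(y_te, pred):.3f}")
```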


Author(s):  
Alexander Mackenzie Rivero ◽  
Alberto Rodríguez Rodríguez ◽  
Edwin Joao Merchán Carreño ◽  
Rodrigo Martínez Béjar

The use of machine learning allows the creation of a predictive data model, as a result of the analysis of a dataset with 286 instances and nine attributes belonging to the Institute of Oncology of the University Medical Center, Ljubljana. Based on this situation, the data are preprocessed by applying intelligent data analysis techniques to eliminate missing values, as well as an evaluation of each attribute that allows the optimization of results. We used several classification algorithms, including J48 trees, random forest, Bayes net, naive Bayes, and decision table, in order to obtain the one that, given the characteristics of the data, would allow the best classification percentage and therefore a better confusion matrix, using 66% of the data for learning and 33% for validating the model. Using this model, a predictor with 71.134% effectiveness is obtained for estimating whether or not breast cancer will recur.
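
A minimal sketch of the evaluation protocol, assuming scikit-learn analogues for the classifiers mentioned (J48 as a decision tree, Bayes net / naive Bayes as GaussianNB): a 66%/33% train/validation split and a confusion matrix per model. The Ljubljana breast cancer records are not included; sklearn's bundled breast cancer dataset stands in.

```python
# Minimal sketch: 66%/33% split, several classifiers, confusion matrix each.
# sklearn analogues stand in for the Weka classifiers used in the study.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=3)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=3)),
                  ("random forest", RandomForestClassifier(random_state=3)),
                  ("naive bayes", GaussianNB())]:
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, "accuracy = %.3f" % accuracy_score(y_te, pred))
    print(confusion_matrix(y_te, pred))
```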


2019 ◽  
Author(s):  
Ananya Bhattacharjee ◽  
Md. Shamsuzzoha Bayzid

Background: Due to recent advances in sequencing technologies and in species tree estimation methods capable of taking gene tree discordance into account, notable progress has been achieved in constructing large-scale phylogenetic trees from genome-wide data. However, substantial challenges remain in leveraging this huge amount of molecular data. One of the foremost among these challenges is the need for efficient tools that can handle missing data. Popular distance-based methods such as neighbor joining and UPGMA require that the input distance matrix contain no missing values. Results: We introduce two highly accurate machine learning based distance imputation techniques. One of our approaches is based on matrix factorization, and the other is an autoencoder based deep learning technique. We evaluate these two techniques on a collection of simulated and biological datasets, and show that our techniques match or improve upon the best alternate techniques for distance imputation. Moreover, our proposed techniques can handle substantial amounts of missing data, to the extent where the best alternate methods fail. Conclusions: This study shows for the first time the power and feasibility of applying deep learning techniques to imputing distance matrices. The autoencoder based deep learning technique is highly accurate and scalable to large datasets. We have made these techniques freely available as cross-platform software (available at https://github.com/Ananya-Bhattacharjee/ImputeDistances).
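
As a hedged illustration of the matrix-factorization idea (not the authors' implementation, which is available at the GitHub link above), the sketch below completes a partially observed, low-rank matrix by iterating a truncated-SVD reconstruction over the missing cells. A squared Euclidean distance matrix of 3-D points is used as a stand-in for a phylogenetic distance matrix because its rank is provably small.

```python
# Minimal sketch: low-rank completion of a partially observed distance-like
# matrix via iterated truncated-SVD reconstruction (hard-impute style).
# Illustration only; not the authors' method.
import numpy as np

rng = np.random.default_rng(0)
n = 30
points = rng.normal(size=(n, 3))
# squared Euclidean distance matrix of 3-D points has rank at most 5,
# so a rank-5 reconstruction can recover it
D = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)

miss = rng.random((n, n)) < 0.3            # drop ~30% of off-diagonal entries
miss = np.triu(miss, 1)
miss = miss | miss.T                       # keep the missing pattern symmetric

# initialise missing cells with column means of the observed entries, then
# alternate: truncated-SVD (low-rank) reconstruction -> overwrite missing cells
col_means = np.nanmean(np.where(miss, np.nan, D), axis=0)
filled = np.where(miss, col_means, D)
for _ in range(200):
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = (U[:, :5] * s[:5]) @ Vt[:5]
    filled = np.where(miss, low_rank, D)

print("mean abs. error on missing entries: %.4f" % np.abs(filled - D)[miss].mean())
```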

