PV Forecasting Using Support Vector Machine Learning in a Big Data Analytics Context

Symmetry ◽  
2018 ◽  
Vol 10 (12) ◽  
pp. 748 ◽  
Author(s):  
Stefan Preda ◽  
Simona-Vasilica Oprea ◽  
Adela Bâra ◽  
Anda Belciu (Velicanu)

Renewable energy systems (RES) are reliable by nature; the sun and wind are theoretically endless resources. Since the beginnings of power systems, the main concern has been to know how much energy will be generated. Initially, there were voltmeters and power meters; nowadays, there are far more advanced solar controllers, with small displays and built-in modules that handle big data. Usually, large photovoltaic (PV)-battery systems have sophisticated energy management strategies that let them operate unattended. By adding the information collected by sensors and managed with powerful technologies such as big data and analytics, the system can react efficiently to environmental factors and respond to consumers’ requirements in real time. Depending on the weather parameters, the PV output can be symmetric while supplying an asymmetric electricity demand. Thus, a smart adaptive switching module that includes a forecasting component is proposed to improve the symmetry between the PV output and the daily load curve. A scaling approach for smaller off-grid systems is developed that provides an accurate forecast of the PV output based on data collected from sensors. The proposed methodology relies on sensor implementation in RES operation, with big data technologies used for data processing and analytics. In this respect, we analyze data captured from loggers and forecast the PV output with Support Vector Machine (SVM) and linear regression, finding that the Root Mean Square Error (RMSE) of the prediction improves considerably when more parameters are used in the machine learning process.
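The RMSE metric the study uses to compare forecasts is straightforward to compute. A minimal pure-Python sketch (the forecast values below are invented for illustration, not taken from the paper):

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical hourly PV output (kW) and two forecasts: one trained on
# irradiance only, one on irradiance + temperature + humidity.
actual = [0.0, 1.2, 3.5, 4.8, 4.1, 2.0]
forecast_few_params = [0.3, 1.8, 2.9, 4.0, 4.9, 1.2]
forecast_more_params = [0.1, 1.3, 3.3, 4.6, 4.3, 1.8]

print(rmse(actual, forecast_few_params))   # larger error
print(rmse(actual, forecast_more_params))  # smaller error: extra parameters help
```

The same comparison applies regardless of the underlying learner (SVM or linear regression): the model trained with more informative parameters should produce the lower RMSE.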

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yao Huimin

With the development of cloud computing and distributed cluster technology, the concept of big data has been expanded and extended in terms of capacity and value, and machine learning technology has also received unprecedented attention in recent years. Traditional machine learning algorithms cannot be effectively parallelized, so a parallelized support vector machine based on the Spark big data platform is proposed. Firstly, the big data platform is designed with the Lambda architecture, which is divided into three layers: the Batch Layer, the Serving Layer, and the Speed Layer. Secondly, to improve the training efficiency of support vector machines on large-scale data, when merging two support vector machines the “special points” other than the support vectors are considered, that is, the points where the non-support vectors of one subset violate the training results of the other subset, and a cross-validation merging algorithm is proposed. Then, a parallelized support vector machine based on cross-validation is proposed, and the parallelization process of the support vector machine is realized on the Spark platform. Finally, experiments on different datasets verify the effectiveness and stability of the proposed method. Experimental results show that the proposed parallelized support vector machine has outstanding performance in speed-up ratio, training time, and prediction accuracy.
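The merge rule described above can be illustrated without Spark. The sketch below uses trivial stand-in linear classifiers in place of trained SVMs (all names and data are invented; the paper's implementation runs on the Spark platform):

```python
# Sketch of the "special points" merge rule: keep each subset's support
# vectors plus the non-support vectors that the *other* subset's model
# misclassifies. Stand-in 1-D linear classifiers replace real SVMs.

def predict(w, b, x):
    """Sign of a linear decision function w*x + b."""
    return 1 if w * x + b >= 0 else -1

def merge_training_sets(subset_a, model_a, subset_b, model_b):
    merged = []
    for (x, y, is_sv) in subset_a:
        # keep support vectors, plus non-SVs that violate model_b
        if is_sv or predict(*model_b, x) != y:
            merged.append((x, y))
    for (x, y, is_sv) in subset_b:
        if is_sv or predict(*model_a, x) != y:
            merged.append((x, y))
    return merged

# Each point: (feature, label, trained-as-support-vector?)
subset_a = [(-2.0, -1, True), (-0.5, -1, False), (1.5, 1, True)]
subset_b = [(2.0, 1, True), (0.2, 1, False), (-1.5, -1, True)]
model_a = (1.0, 0.0)   # decision boundary at x = 0
model_b = (1.0, -0.5)  # decision boundary at x = 0.5

merged = merge_training_sets(subset_a, model_a, subset_b, model_b)
print(merged)  # support vectors plus any cross-violating points
```

The merged set is then used to retrain a single model, so the cascade never has to revisit points both partial models already classify correctly.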


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 04) ◽  
pp. 591-606
Author(s):  
R. Brindha ◽  
Dr.M. Thillaikarasi

Big data analytics (BDA) is a systematic method that aims to recognize and examine patterns, designs, and trends in large datasets. In this paper, BDA is used to visualize trends and make predictions, with exploratory data analysis applied to crime data. Facts and patterns were collected for cities in California, Washington, and Florida using statistical analysis and visualization. Predictive performance is reported for the Keras Prophet model, LSTM, and neural network models, the existing methods used to model crime data with BDA techniques. Crime, however, increases day by day, making it an ever greater challenge for people to overcome, and many earlier studies ignored important influential factors or were limited to one or two features. To overcome these problems, this paper introduces a big data framework to analyze the influential aspects of crime incidents, examined on New York City data. The proposed structure combines machine learning algorithms and a geographic information system (GIS) to consider the contiguous causes of crime. Recursive feature elimination (RFE) is used to select the optimal features. Gradient boosted decision trees (GBDT), logistic regression (LR), support vector machines (SVM), and artificial neural networks (ANN) are applied to develop the optimal data model, and the most significant features are then reviewed by applying GBDT and GIS. The experimental results illustrate that the combined GBDT and GIS model can identify the crime ranking with higher performance and accuracy than the existing methods.
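The RFE step mentioned above follows a simple greedy loop: repeatedly drop the feature whose removal hurts the validation score the least. A minimal sketch with an invented scoring function and invented feature names (real RFE would retrain the GBDT model at every step):

```python
# Toy recursive feature elimination. The scorer below is a stand-in for
# "train a model on these features and return its validation accuracy";
# feature names and their contributions are invented for illustration.

def toy_score(features):
    weights = {"population": 0.30, "time_of_day": 0.25,
               "weather": 0.05, "day_of_week": 0.20}
    return sum(weights[f] for f in features)

def rfe(features, keep, score):
    features = list(features)
    while len(features) > keep:
        # drop the feature whose removal costs the least score
        worst = min(features,
                    key=lambda f: score(features) - score([g for g in features if g != f]))
        features.remove(worst)
    return features

selected = rfe(["population", "time_of_day", "weather", "day_of_week"],
               keep=2, score=toy_score)
print(selected)  # the two most informative toy features survive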


2021 ◽  
Author(s):  
Steven F. Lehrer ◽  
Tian Xie

There exists significant hype regarding how much machine learning and incorporating social media data can improve forecast accuracy in commercial applications. To assess if the hype is warranted, we use data from the film industry in simulation experiments that contrast econometric approaches with tools from the predictive analytics literature. Further, we propose new strategies that combine elements from each literature in a bid to capture richer patterns of heterogeneity in the underlying relationship governing revenue. Our results demonstrate the importance of social media data and value from hybrid strategies that combine econometrics and machine learning when conducting forecasts with new big data sources. Specifically, although both least squares support vector regression and recursive partitioning strategies greatly outperform dimension reduction strategies and traditional econometrics approaches in forecast accuracy, there are further significant gains from using hybrid approaches. Further, Monte Carlo experiments demonstrate that these benefits arise from the significant heterogeneity in how social media measures and other film characteristics influence box office outcomes. This paper was accepted by J. George Shanthikumar, big data analytics.
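One common way to hybridize econometrics and machine learning, fitting a regression first and letting a nonparametric learner model the residuals, can be sketched in pure Python. The closed-form 1-D least squares plus a one-split "tree" below is only a schematic stand-in for the paper's richer strategies, and the data are invented:

```python
# Hybrid sketch: OLS captures the linear trend, then a one-split regression
# "stump" (a minimal stand-in for recursive partitioning) corrects the
# residuals. Data and the buzz-revenue relationship are invented.

def ols_fit(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def stump_fit(xs, residuals, split):
    """Mean residual on each side of a single split point."""
    left = [r for x, r in zip(xs, residuals) if x <= split]
    right = [r for x, r in zip(xs, residuals) if x > split]
    return sum(left) / len(left), sum(right) / len(right)

# Hypothetical: social-media buzz (x) vs. opening revenue (y), with a jump past x = 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.0, 2.9, 6.1, 7.0]

slope, intercept = ols_fit(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
left_mean, right_mean = stump_fit(xs, residuals, split=3.0)

def hybrid_predict(x):
    correction = left_mean if x <= 3.0 else right_mean
    return slope * x + intercept + correction

print(hybrid_predict(2.0))  # linear trend plus residual correction
```

The point of the hybrid is visible even in this toy: the stump absorbs the nonlinearity the linear model misses, so the combined in-sample error is lower than OLS alone.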


2021 ◽  
Vol 62 (03) ◽  
pp. e180-e192
Author(s):  
Claudio Díaz-Ledezma ◽  
David Díaz-Solís ◽  
Raúl Muñoz-Reyes ◽  
Jonathan Torres Castro

Abstract Introduction Predicting the hospital length of stay after elective total hip arthroplasty (THA) is crucial in the perioperative assessment of patients, with a decisive operational and economic role. Internationally, big data and artificial intelligence have been used to carry out prognostic evaluations of this type. The aim of the present study is to develop and validate, using machine learning, a tool capable of predicting the hospital stay of Chilean patients over 65 years of age undergoing THA for osteoarthritis. Material and Methods Using the anonymized electronic hospital discharge records of the Department of Health Statistics and Information (DEIS), data were obtained on 8,970 hospital discharges of patients who underwent THA for osteoarthritis between 2016 and 2018. In total, 15 variables available in the DEIS, plus the poverty rate of the patient's municipality of origin, were included to predict the probability of a patient having a shortened (< 3 days) or prolonged (> 3 days) stay after surgery. Using machine learning techniques, 8 prediction algorithms were trained on 80% of the sample. The remaining 20% was used to validate the predictive capabilities of the models created from the algorithms. The optimization metric was evaluated and ranked using the area under the receiver operating characteristic curve (AUC-ROC), which measures how well a model can distinguish between two groups. Results The XGBoost algorithm achieved the best performance, with an average AUC-ROC of 0.86 (standard deviation [SD]: 0.0087).
In second place, the linear support vector machine (SVM) algorithm obtained an AUC-ROC of 0.85 (SD: 0.0086). The relative importance of the explanatory variables showed that the region of residence, the health service, the health facility where the patient was operated on, and the care modality are the variables that most determine a patient's length of stay. Discussion The present study developed machine learning algorithms based on freely available Chilean big data, and succeeded in developing and validating a tool with adequate discriminatory ability to predict the probability of a shortened versus prolonged hospital stay in older adults undergoing THA for osteoarthritis. Conclusion The algorithms created through machine learning make it possible to predict the hospital stay of Chilean patients undergoing elective total hip arthroplasty.
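The AUC-ROC metric used to rank the eight algorithms has a simple rank interpretation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. A minimal pure-Python sketch (the scores below are invented, not the study's predictions):

```python
def auc_roc(labels, scores):
    """Probability that a random positive outranks a random negative
    (ties count as half). Equivalent to the area under the ROC curve."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of a prolonged stay (label 1 = prolonged).
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auc_roc(labels, scores))  # 0.75
```

An AUC-ROC of 0.86, as reported for XGBoost, therefore means the model ranks a prolonged-stay patient above a shortened-stay patient 86% of the time.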


Author(s):  
Farshid Bagheri Saravi ◽  
Shadi Moghanian ◽  
Giti Javidi ◽  
Ehsan O Sheybani

Disease-related data and information collected by physicians, patients, and researchers seem insignificant at first glance. Still, this unorganized data contains valuable information that is often hidden, and the task of data mining techniques is to extract patterns that classify the data accurately. Data mining and its methods have often been used to diagnose various diseases. In this study, a machine learning (ML) technique based on distributed computing in the Apache Spark environment is used to diagnose diabetes and uncover hidden patterns of the illness in a large dataset in real time. Implementation results of three ML techniques, Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM), in the Apache Spark computing environment using the Scala programming language and WEKA show that RF is more efficient and faster at diagnosing diabetes on big data.
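One intuition behind Random Forest's edge, majority voting over many weak, partly independent trees suppresses individual errors, can be illustrated without Spark. This is a toy simulation with invented error rates, not the paper's experiment:

```python
import random

random.seed(42)  # deterministic toy simulation

def noisy_vote(true_label, error_rate):
    """Stand-in for one decision tree: right with probability 1 - error_rate."""
    return true_label if random.random() > error_rate else 1 - true_label

def forest_predict(true_label, n_trees, error_rate):
    """Majority vote of n_trees independent noisy 'trees' (odd n avoids ties)."""
    votes = sum(noisy_vote(true_label, error_rate) for _ in range(n_trees))
    return 1 if votes > n_trees / 2 else 0

trials = 2000
single_acc = sum(noisy_vote(1, 0.3) == 1 for _ in range(trials)) / trials
forest_acc = sum(forest_predict(1, 25, 0.3) == 1 for _ in range(trials)) / trials
print(single_acc, forest_acc)  # the majority vote is markedly more accurate
```

Real forests also decorrelate their trees via bootstrapping and random feature subsets; the simulation assumes full independence, which is the best case for voting.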


2021 ◽  
Vol 2020 (1) ◽  
pp. 989-999
Author(s):  
Epan Mareza Primahendra ◽  
Budi Yuniarto

The Rupiah exchange rate and the stock price index (IHS) affect the Indonesian economy. Movements of the Rupiah exchange rate and the IHS are influenced by public information and by social and political conditions. Political events generate a great deal of public sentiment, much of it expressed through social media, especially Twitter. Twitter is a big data source whose data, if left unused, becomes waste. Data were collected over the period September 26, 2019 to October 27, 2019. The pattern of daily tweet counts matching the movements of the Rupiah exchange rate and the IHS indicates a relationship between Twitter sentiment about the political situation and the Rupiah exchange rate and the IHS. This study uses a machine learning approach with Neural Network and Least Squares Support Vector Machine algorithms, aiming to determine the effect of sentiment on the Rupiah exchange rate and the IHS while also assessing the two algorithms. The results show that the best model for estimating the IHS is an NN with 1 hidden layer and 2 hidden neurons. This model indicates an effect of sentiment on the IHS, since the volatility of the IHS estimates follows the movement pattern of the actual IHS fairly closely. The best model for estimating the Rupiah exchange rate is the LSSVM. The estimated exchange rate's movement pattern tends to remain stagnant above the actual value, indicating that the model is still unsatisfactory in estimating the effect of public sentiment on the Rupiah exchange rate.
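The best IHS model above is tiny, one hidden layer with two neurons, so its forward pass can be written out directly. The weights below are invented placeholders, not the fitted model:

```python
import math

def tanh_layer(inputs, weights, biases):
    """One dense layer with tanh activation."""
    return [math.tanh(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def forward(sentiment_score, tweet_count):
    """1 hidden layer, 2 hidden neurons, linear output: the NN shape
    reported as the best IHS estimator (weights are placeholders)."""
    hidden = tanh_layer([sentiment_score, tweet_count],
                        weights=[[0.8, -0.2], [0.5, 0.3]],
                        biases=[0.1, -0.1])
    w_out, b_out = [1.5, -0.7], 0.2
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

print(forward(0.6, 0.4))  # a point estimate of the (scaled) index
```

Training would fit the six hidden weights and biases plus the three output parameters, typically by backpropagation on the scaled daily sentiment and index series.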


2018 ◽  
Vol 19 (11) ◽  
pp. 3387 ◽  
Author(s):  
Han Cao ◽  
Andreas Meyer-Lindenberg ◽  
Emanuel Schwarz

The requirement of innovative big data analytics has become a critical success factor for research in biological psychiatry. Integrative analyses across distributed data resources are considered essential for untangling the biological complexity of mental illnesses. However, little is known about algorithm properties for such integrative machine learning. Here, we performed a comparative analysis of eight machine learning algorithms for identification of reproducible biological fingerprints across data sources, using five transcriptome-wide expression datasets of schizophrenia patients and controls as a use case. We found that multi-task learning (MTL) with network structure (MTL_NET) showed superior accuracy compared to other MTL formulations as well as single task learning, and tied performance with support vector machines (SVM). Compared to SVM, MTL_NET showed significant benefits regarding the variability of accuracy estimates, as well as its robustness to cross-dataset and sampling variability. These results support the utility of this algorithm as a flexible tool for integrative machine learning in psychiatry.


Author(s):  
R. Nirmalan ◽  
M. Javith Hussain Khan ◽  
V. Sounder ◽  
A. Manikkaraja

The evolution of modern computer technology produces a huge amount of data through constant invention. The algorithms traditionally used in machine learning might not support the concept of big data. Here we discuss and implement a solution to this problem for predicting breast cancer using big data. DNA methylation (DM) and gene expression (GE) are the two types of data used for breast cancer prediction, and the main objective is to classify each dataset separately. To achieve this objective, we use the Apache Spark platform. Three classification algorithms are applied: decision tree, random forest, and support vector machine (SVM). These three algorithms produce the models used for breast cancer prediction, and an analysis determines which algorithm produces the better result, with good accuracy and a low error rate. Additionally, the Weka and Spark platforms are compared to find which performs better when dealing with huge data. The obtained outcomes prove that the scalable Support Vector Machine classifier gives better performance than all other classifiers, achieving the lowest error range with the highest accuracy on the GE dataset.
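The comparison the study performs, several classifiers scored on the same held-out data and ranked by accuracy and error rate, follows a generic pattern that can be sketched with stub classifiers (all thresholds and data are invented; the paper runs real Spark models):

```python
# Generic model-comparison harness: every classifier is scored on the same
# held-out set. The "models" are trivial threshold stubs standing in for
# Spark's decision tree, random forest, and SVM.

def accuracy(model, test_set):
    correct = sum(model(x) == y for x, y in test_set)
    return correct / len(test_set)

# Toy held-out set: (gene-expression summary value, label).
test_set = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.3, 0), (0.7, 1)]

models = {
    "decision_tree": lambda x: 1 if x > 0.65 else 0,  # misses x = 0.6
    "random_forest": lambda x: 1 if x > 0.55 else 0,
    "svm":           lambda x: 1 if x > 0.5 else 0,
}

ranking = sorted(((accuracy(m, test_set), name) for name, m in models.items()),
                 reverse=True)
for acc, name in ranking:
    print(f"{name}: accuracy={acc:.2f}, error={1 - acc:.2f}")
```

Keeping the split fixed across classifiers is what makes the accuracy and error-rate numbers directly comparable, on toy stubs and on Spark models alike.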

