Dynamic Distributed and Parallel Machine Learning algorithms for big data mining processing

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Laouni Djafri

Purpose: This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other, and DDPML can be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies.
Design/methodology/approach: In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can later be used for prediction; this knowledge thus becomes a great asset in companies' hands, which is precisely the objective of data mining. With data and knowledge now produced at a faster pace, we speak of Big Data mining. The authors' proposed work therefore aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem raised in this work is how to make machine learning algorithms work in a distributed and parallel way at the same time without losing classification accuracy. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML). The work is divided into two parts. In the first, the authors propose a distributed architecture controlled by a Map-Reduce algorithm that in turn depends on a random sampling technique; this architecture is specifically designed to handle big data processing coherently and efficiently together with the sampling strategy proposed in this work, and it also helps verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at two levels using the stratified random sampling method. This sampling method is also applied to extract the shared learning base (SLB) and the partial learning bases for the first and second levels (PLBL1 and PLBL2). The experimental results show the efficiency of the proposed solution without significant loss in classification results. In practical terms, the DDPML system is generally dedicated to big data mining processing, and it works effectively in distributed systems with a simple structure, such as client-server networks.
Findings: The authors obtained very satisfactory classification results.
Originality/value: The DDPML system is specially designed to smoothly handle big data mining classification.
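The two-level stratified sampling idea described above can be sketched in a few lines of Python. This is only an illustrative sketch of stratified random sampling applied twice in succession, not the DDPML implementation; the function name, the class labels and the 50% sampling fractions are invented for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Stratified random sampling: within each stratum (class label),
    keep roughly `fraction` of the records, preserving class proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for label, group in strata.items():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Two-level sampling: sample the first-level sample again,
# as in the construction of a smaller representative base.
data = [{"label": "a"}] * 80 + [{"label": "b"}] * 20
level1 = stratified_sample(data, key=lambda r: r["label"], fraction=0.5)
level2 = stratified_sample(level1, key=lambda r: r["label"], fraction=0.5)
```

Because sampling is done per stratum, the 80/20 class ratio of the original data is preserved at both levels.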

Author(s):  
M. Govindarajan

Big data mining involves knowledge discovery from large data sets. The purpose of this chapter is to provide an analysis of the different machine learning algorithms available for performing big data analytics. Machine learning algorithms fall into three key categories: supervised, unsupervised, and semi-supervised. Supervised learning algorithms are trained with a complete set of labeled data and are therefore used for prediction and forecasting; example algorithms include logistic regression and the back-propagation neural network. Unsupervised learning algorithms start learning from scratch and are therefore used for clustering; example algorithms include the Apriori algorithm and k-means. Semi-supervised learning combines supervised and unsupervised learning: the algorithms are trained on a mix of labeled and unlabeled data.
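As a concrete illustration of the "learning from scratch" point, here is a minimal one-dimensional k-means sketch: it receives no labels and discovers the cluster structure on its own. This is a toy version of the k-means algorithm mentioned above, not code from the chapter; the function name and data are invented.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: assign each point to its nearest center,
    then recompute each center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups: the algorithm recovers their means without labels.
points = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]
centers = kmeans_1d(points, k=2)
```

A supervised algorithm would instead need each point paired with a target label before it could be trained.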


Author(s):  
Kağan Okatan

All these types of analytics have long been answering business questions as the principal methods of investigating data warehouses. Data mining and business intelligence systems in particular support decision makers in reaching the information they want. Many existing systems are trying to keep up with a phenomenon that has changed the rules of the game in recent years: the undeniable attraction of 'big data'. In particular, evaluating the big data generated by social media is among the most current issues in business analytics, and it demonstrates the importance of integrating machine learning into business analytics. This section introduces the prominent machine learning algorithms that are increasingly used for business analytics and emphasizes their application areas.


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Roberto Salazar-Reyna ◽  
Fernando Gonzalez-Aleu ◽  
Edgar M.A. Granda-Gutierrez ◽  
Jenny Diaz-Ramirez ◽  
Jose Arturo Garza-Reyes ◽  
...  

Purpose: The objective of this paper is to assess and synthesize the published literature related to the application of data analytics, big data, data mining and machine learning to healthcare engineering systems.
Design/methodology/approach: A systematic literature review (SLR) was conducted to obtain the most relevant papers related to the research study from three different platforms: EBSCOhost, ProQuest and Scopus. The literature was assessed and synthesized through analyses of the publications, authors and content.
Findings: From the SLR, 576 publications were identified and analyzed. The research area shows the characteristics of a growing field, with new research areas evolving and applications being explored. In addition, the main authors and collaboration groups publishing in this research area were identified through a social network analysis, which could help new and current authors find researchers with common interests in the field.
Research limitations/implications: The use of the SLR methodology does not guarantee that all relevant publications related to the research are covered and analyzed. However, the authors' previous knowledge and the nature of the publications guided the selection of the platforms.
Originality/value: To the best of the authors' knowledge, this paper represents the most comprehensive literature-based study on the fields of data analytics, big data, data mining and machine learning applied to healthcare engineering systems.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Babacar Gaye ◽  
Dezheng Zhang ◽  
Aziguli Wulamu

With the rapid development of the Internet and of big data analysis technology, data mining has played a positive role in promoting industry and academia. Classification is an important problem in data mining. This paper explores the background and theory of support vector machines (SVM) in data mining classification algorithms and analyzes and summarizes the research status of various improved SVM methods. According to the scale and characteristics of the data, different solution spaces are selected, and the solution of the dual problem is transformed into the classification surface of the original space to improve algorithm speed.
Research process: Incorporating fuzzy membership into multiple-kernel learning, it is found that the time complexity of the primal problem is determined by the feature dimension, while the time complexity of the dual problem is determined by the number of samples; since dimension and sample count together constitute the scale of the data, the solution space can be chosen according to the data's scale and features. The algorithm speed can be improved by transforming the solution of the dual problem into the classification surface of the original space.
Conclusion: By improving the computation rate of traditional machine learning algorithms, the accuracy of the fit between predicted and actual values reaches 98%, allowing traditional machine learning algorithms to meet the requirements of the big data era and to be widely used in big data contexts.
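The central observation of the abstract, that primal complexity scales with the feature dimension while dual complexity scales with the number of samples, suggests a simple selection rule. The sketch below is a generic rule of thumb, similar to the guidance given for linear SVM solvers such as scikit-learn's LinearSVC (prefer the primal when samples outnumber features); it is not the paper's exact criterion, and the function name is invented.

```python
def choose_svm_formulation(n_samples, n_features):
    """Pick the SVM formulation whose cost is governed by the smaller
    quantity: the dual problem scales with the number of samples,
    the primal problem with the feature dimension."""
    return "dual" if n_samples <= n_features else "primal"

# Wide data (e.g. text with many sparse features): solve the dual.
wide = choose_svm_formulation(n_samples=2_000, n_features=100_000)
# Tall data (many samples, few features): solve the primal.
tall = choose_svm_formulation(n_samples=1_000_000, n_features=50)
```

Either formulation yields the same classifier; the choice only affects how expensive the optimization is at the data's particular scale.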


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Mirpouya Mirmozaffari ◽  
Elham Shadkam ◽  
Seyyed Mohammad Khalili ◽  
Kamyar Kabirifar ◽  
Reza Yazdani ◽  
...  

Purpose: Cement, as one of the major components of construction activities, releases a tremendous amount of carbon dioxide (CO2) into the atmosphere, resulting in adverse environmental impacts and high energy consumption. Increasing demand for CO2 consumption has urged construction companies and decision-makers to consider the ecological efficiency affected by CO2 consumption. Therefore, this paper aims to develop a method capable of analyzing and assessing the factors determining eco-efficiency in Iran's 22 local cement companies over 2015–2019.
Design/methodology/approach: This research uses two well-known artificial intelligence approaches, namely optimization-based data envelopment analysis (DEA) and machine learning algorithms, at the first and second steps, respectively, to fulfill the research aim. To find the superior model, the CCR, BCC and additive DEA models are used to measure the efficiency of decision processes. A proportional decrease or increase of inputs/outputs is the main concern in measuring efficiency, which neglects slacks and hence is a critical limitation of radial models. Thus the additive model, a well-known non-proportional and non-radial DEA model that considers both desirable and undesirable outputs, is used to solve the problem. Additive models measure efficiency via slack variables, and covering both input and output orientations is one of their main advantages.
Findings: After applying the proposed model, the Malmquist productivity index is computed to evaluate the productivity of the companies over 2015–2019. Although DEA is an appreciated evaluation method, it fails to extract unknown information, so machine learning algorithms play an important role at this step. Association rules are used to extract hidden rules and to introduce the three strongest ones. Finally, three data mining classification algorithms in three different tools are applied to identify the superior algorithm and tool. A new model that converts the two-stage process to a single stage is proposed to obtain the eco-efficiency of the whole system; it fixes the efficiency of the two-stage process and prevents dependency on various weights. Undesirable outputs and desirable inputs are converted to final desirable inputs in the single-stage model to minimize inputs, and desirable outputs are turned into final desirable outputs to maximize outputs, which has a positive effect on the efficiency of the whole process.
Originality/value: The proposed approach provides a chance to recognize patterns in the whole system by combining DEA and data mining techniques over the selected period (five years, 2015 to 2019). The cement industry is one of the foremost manufacturers of naturally harmful material with an undesirable by-product, so specific attention is given to pollution-control investment, or undesirable output, while evaluating energy-use efficiency. The main focus of the study is to answer five preliminary questions.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Feifei Sun ◽  
Guohong Shi

Purpose: This paper aims to effectively explore the application effect of big data techniques based on an α-support vector machine-stochastic gradient descent (α-SVMSGD) algorithm in third-party logistics, obtain the valuable information hidden in logistics big data and help logistics enterprises make more reasonable planning schemes.
Design/methodology/approach: A forgetting factor is introduced without changing the algorithm's complexity, yielding an algorithm called α-SVMSGD. The algorithm selectively deletes or retains historical data, which improves the adaptability of the classifier to new real-time logistics data. Simulation results verify the application effect of the algorithm.
Findings: As training proceeds, the test error percentages of the gradient descent (GD) algorithm, the stochastic gradient descent (SGD) algorithm and the α-SVMSGD algorithm decrease gradually. In processing logistics big data, the α-SVMSGD algorithm has the efficiency of the SGD algorithm while ensuring that the descent direction approaches the optimal GD direction; it can use a small amount of data to obtain more accurate results and enhance convergence accuracy.
Research limitations/implications: The threshold setting of the forgetting factor still needs improvement; setting thresholds for different data types in self-learning has become a research direction. The number of forgotten data points can be effectively controlled through big data processing technology to improve data support for the normal operation of third-party logistics.
Practical implications: The approach can effectively reduce the time spent on data mining, achieve rapid and accurate convergence on sample data without increasing sample complexity, improve the efficiency of logistics big data mining and reduce the redundancy of historical data; it has reference value for promoting the development of the logistics industry.
Originality/value: The classification algorithm proposed in this paper is feasible and highly convergent in third-party logistics big data mining. The α-SVMSGD algorithm has application value in real-time logistics data mining, but the design of the forgetting-factor threshold needs improvement; in future work, the authors will continue to study how to set thresholds for different data types in self-learning.
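The "selectively deletes or retains historical data" mechanism can be illustrated with an exponential forgetting factor: each record's weight decays geometrically with its age, and records whose weight falls below a threshold are dropped. This is a generic sketch of the forgetting-factor idea, not the paper's α-SVMSGD retention rule; all names, the decay rate and the threshold are invented for the example.

```python
def forgetting_weights(timestamps, now, alpha):
    """Exponential forgetting factor: a sample observed at time t
    gets weight alpha**(now - t), so older samples fade geometrically."""
    return [alpha ** (now - t) for t in timestamps]

def retain(samples, timestamps, now, alpha, threshold):
    """Keep only samples whose forgetting weight is still above threshold,
    so the classifier adapts to new data without unbounded history."""
    ws = forgetting_weights(timestamps, now, alpha)
    return [s for s, w in zip(samples, ws) if w >= threshold]

# At time 10, the oldest record's weight (0.7**10 ~ 0.028) falls below
# the threshold and the record is forgotten.
kept = retain(["old", "mid", "new"], [0, 5, 9], now=10, alpha=0.7, threshold=0.05)
```

Raising the threshold (or lowering alpha) forgets faster, which is exactly the tuning question the authors flag as open.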


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Ashwin A. Phatak ◽  
Franz-Georg Wieland ◽  
Kartik Vempala ◽  
Frederik Volkmar ◽  
Daniel Memmert

Abstract: With the rising amount of data in the sports and health sectors, a plethora of applications using big data mining have become possible. Multiple frameworks have been proposed to mine, store, preprocess, and analyze physiological vitals data using artificial intelligence and machine learning algorithms. Comparatively less research has been done on collecting potentially high-volume, high-quality 'big data' in an organized, time-synchronized, and holistic manner to solve similar problems in multiple fields. Although a large number of data collection devices exist in the form of sensors, they are either highly specialized, univariate and fragmented in nature, or confined to a lab setting. The current study proposes the artificial intelligence-based body sensor network framework (AIBSNF), a framework for the strategic use of body sensor networks (BSN) that combines a real-time location system (RTLS) and wearable biosensors to collect multivariate, low-noise, and high-fidelity data. This facilitates gathering time-synchronized location and physiological vitals data, which enables artificial intelligence and machine learning (AI/ML)-based time series analysis. The study gives a brief overview of wearable sensor technology and RTLS, and provides use cases of AI/ML algorithms in the field of sensor fusion. It also elaborates sample scenarios using a specific sensor network consisting of pressure sensors (insoles), accelerometers, gyroscopes, ECG, EMG, and RTLS position detectors for particular applications in health care and sports. The AIBSNF may provide a solid blueprint for research and development, forming a smooth end-to-end pipeline from data collection using BSN and RTLS to final-stage analytics based on AI/ML algorithms.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiake Fu ◽  
Huijing Tian ◽  
Lingguang Song ◽  
Mingchao Li ◽  
Shuo Bai ◽  
...  

Purpose: This paper presents a new approach to productivity estimation of cutter suction dredger operation through data mining and learning from real-time big data.
Design/methodology/approach: The paper used big data, data mining and machine learning techniques to extract features of cutter suction dredgers (CSD) for predicting their productivity. An ElasticNet-SVR (Elastic Net-Support Vector Machine) method was used to filter the original monitoring data, and, together with the actual working conditions of the CSD, 15 features were selected. A box plot was then used to clean the corresponding data by filtering out outliers. Finally, four algorithms, namely SVR (support vector regression), XGBoost (extreme gradient boosting), LSTM (long short-term memory network) and BP (back propagation) neural network, were used for modeling and testing.
Findings: The paper provides a comprehensive forecasting framework for productivity estimation, including feature selection, data processing and model evaluation. The optimal coefficients of determination (R2) of the four algorithms were all above 80.0%, indicating that the selected features were representative. Finally, the BP neural network model coupled with the SVR model was selected as the final model.
Originality/value: A machine-learning algorithm incorporating domain expert judgment was used to select predictive features. The final optimal coefficient of determination (R2) of the coupled BP neural network and SVR model is 87.6%, indicating that the method proposed in this paper is effective for CSD productivity estimation.
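The box-plot cleaning step mentioned in the approach corresponds to the standard interquartile-range rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers, the same points a box plot draws as whisker outliers. Below is a minimal sketch of that generic rule, not the paper's code; the function name, the quantile interpolation and the sample data are invented.

```python
def iqr_clean(values, k=1.5):
    """Box-plot style cleaning: drop points outside
    [Q1 - k*IQR, Q3 + k*IQR], where IQR = Q3 - Q1."""
    xs = sorted(values)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]

# A spurious sensor spike (95) is removed; normal readings survive.
data = [10, 11, 12, 11, 10, 12, 11, 95]
cleaned = iqr_clean(data)
```

In a monitoring pipeline this kind of rule would be applied per feature before the cleaned records are fed to the regression models.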

