Design and Comparative Analysis of New Personalized Recommender Algorithms with Specific Features for Large Scale Datasets

Nowadays, because of the tremendous amount of information that humans and machines produce every day, it has become increasingly hard to choose the more relevant content across a broad range of choices. This research focuses on the design of two different intelligent optimization methods using Artificial Intelligence and Machine Learning for real-life applications that are used to improve the process of generation of recommenders. In the first method, the modified cluster based intelligent collaborative filtering is applied with the sequential clustering that operates on the values of dataset, user′s neighborhood set, and the size of the recommendation list. This strategy splits the given data set into different subsets or clusters and the recommendation list is extracted from each group for constructing the better recommendation list. In the second method, the specific features-based customized recommender that works in the training and recommendation steps by applying the split and conquer strategy on the problem datasets, which are clustered into a minimum number of clusters and the better recommendation list, is created among all the clusters. This strategy automatically tunes the tuning parameter λ that serves the role of supervised learning in generating the better recommendation list for the large datasets. The quality of the proposed recommenders for some of the large scale datasets is improved compared to some of the well-known existing methods. The proposed methods work well when λ = 0.5 with the size of the recommendation list, |L| = 30 and the size of the neighborhood, |S| < 30. For a large value of |S|, the significant difference of the root mean square error becomes smaller in the proposed methods. For large scale datasets, simulation of the proposed methods when varying the user sizes and when the user size exceeds 500, the experimental results show that better values of the metrics are obtained and the proposed method 2 performs better than proposed method 1. The significant differences are obtained in these methods because the structure of computation of the methods depends on the number of user attributes, λ, the number of bipartite graph edges, and |L|. The better values of the (Precision, Recall) metrics obtained with size as 3000 for the large scale Book-Crossing dataset in the proposed methods are (0.0004, 0.0042) and (0.0004, 0.0046) respectively. The average computational time of the proposed methods takes <10 seconds for the large scale datasets and yields better performance compared to the well-known existing methods.

Download Full-text

Using Deficit Function to Determine the Minimum Fleet Size of an Autonomous Modular Public Transit System

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198120945981 ◽

2020 ◽

Vol 2674 (11) ◽

pp. 532-541 ◽

Cited By ~ 1

Author(s):

Tao Liu ◽

Avishai (Avi) Ceder ◽

Andreas Rau

Keyword(s):

Autonomous Vehicles ◽

Large Scale ◽

Public Transit ◽

Programming Model ◽

Real Life ◽

Single Line ◽

Continuum Approximation ◽

Transit System ◽

Fleet Size ◽

Minimum Number

Emerging technologies, such as connected and autonomous vehicles, electric vehicles, and information and communication, are surrounding us at an ever-increasing pace, which, together with the concept of shared mobility, have great potential to transform existing public transit (PT) systems into far more user-oriented, system-optimal, smart, and sustainable new PT systems with increased service connectivity, synchronization, and better, more satisfactory user experiences. This work analyses such a new PT system comprised of autonomous modular PT (AMPT) vehicles. In this analysis, one of the most challenging tasks is to accurately estimate the minimum number of vehicle modules, that is, its minimum fleet size (MFS), required to perform a set of scheduled services. The solution of the MFS problem of a single-line AMPT system is based on a graphical method, adapted from the deficit function (DF) theory. The traditional DF model has been extended to accommodate the definitions of an AMPT system. Some numerical examples are provided to illustrate the mathematical formulations. The limitations of traditional continuum approximation models and the equivalence between the extended DF model and an integer programming model are also provided. The extended DF model was applied, as a case study, to a single line of an AMPT system, the dynamic autonomous road transit (DART) system in Singapore. The results show that the extended DF model is effective in solving the MFS problem and has the potential to be applied to solving real-life MFS problems of large-scale, multi-line and multi-terminal AMPT systems.

Download Full-text

An AI-Based Automated Continuous Compliance Awareness Framework (CoCAF) for Procurement Auditing

Big Data and Cognitive Computing ◽

10.3390/bdcc4030023 ◽

2020 ◽

Vol 4 (3) ◽

pp. 23

Author(s):

Ke Wang ◽

Michael Zipperle ◽

Marius Becherer ◽

Florian Gottwalt ◽

Yu Zhang

Keyword(s):

Large Scale ◽

High Efficiency ◽

Real Life ◽

Internal Audit ◽

Rating System ◽

Internal Auditing ◽

Audit Process ◽

Data Set ◽

Compliance Level ◽

Potential Risks

Compliance management for procurement internal auditing has been a major challenge for public sectors due to its tedious period of manual audit history and large-scale paper-based repositories. Many practical issues and potential risks arise during the manual audit process, including a low level of efficiency, accuracy, accountability, high expense and its laborious and time consuming nature. To alleviate these problems, this paper proposes a continuous compliance awareness framework (CoCAF). It is defined as an AI-based automated approach to conduct procurement compliance auditing. CoCAF is used to automatically and timely audit an organisation’s purchases by intelligently understanding compliance policies and extracting the required information from purchasing evidence using text extraction technologies, automatic processing methods and a report rating system. Based on the auditing results, the CoCAF can provide a continuously updated report demonstrating the compliance level of the procurement with statistics and diagrams. The CoCAF is evaluated on a real-life procurement data set, and results show that it can process 500 purchasing pieces of evidence within five minutes and provide 95.6% auditing accuracy, demonstrating its high efficiency, quality and assurance level in procurement internal audit.

Download Full-text

Solving the Large-Scale TSP Problem in 1 h: Santa Claus Challenge 2020

Frontiers in Robotics and AI ◽

10.3389/frobt.2021.689908 ◽

2021 ◽

Vol 8 ◽

Author(s):

Radu Mariescu-Istodor ◽

Pasi Fränti

Keyword(s):

Large Scale ◽

Computing Time ◽

Real Life ◽

Problem Instance ◽

Santa Claus ◽

Design Choice ◽

Long Time ◽

Problem Instances ◽

The Given ◽

Important Design

The scalability of traveling salesperson problem (TSP) algorithms for handling large-scale problem instances has been an open problem for a long time. We arranged a so-called Santa Claus challenge and invited people to submit their algorithms to solve a TSP problem instance that is larger than 1 M nodes given only 1 h of computing time. In this article, we analyze the results and show which design choices are decisive in providing the best solution to the problem with the given constraints. There were three valid submissions, all based on local search, including k-opt up to k = 5. The most important design choice turned out to be the localization of the operator using a neighborhood graph. The divide-and-merge strategy suffers a 2% loss of quality. However, via parallelization, the result can be obtained within less than 2 min, which can make a key difference in real-life applications.

Download Full-text

Multiple-Membership Survival Analysis and Its Applications in Organizational Behavior and Management Research

Organizational Research Methods ◽

10.1177/1094428119877452 ◽

2019 ◽

pp. 109442811987745

Author(s):

Hans Tierens ◽

Nicky Dries ◽

Mike Smet ◽

Luc Sels

Keyword(s):

Organizational Behavior ◽

Large Scale ◽

Model Building ◽

Real Life ◽

Hierarchical Structures ◽

Survival Model ◽

Survival Models ◽

Data Set ◽

Potential Applications ◽

Multiple Membership

Multilevel paradigms have permeated organizational research in recent years, greatly advancing our understanding of organizational behavior and management decisions. Despite the advancements made in multilevel modeling, taking into account complex hierarchical structures in data remains challenging. This is particularly the case for models used for predicting the occurrence and timing of events and decisions—often referred to as survival models. In this study, the authors construct a multilevel survival model that takes into account subjects being nested in multiple environments—known as a multiple-membership structure. Through this article, the authors provide a step-by-step guide to building a multiple-membership survival model, illustrating each step with an application on a real-life, large-scale, archival data set. Easy-to-use R code is provided for each model-building step. The article concludes with an illustration of potential applications of the model to answer alternative research questions in the organizational behavior and management fields.

Download Full-text

PV Forecast for the Optimal Operation of the Medium Voltage Distribution Network: A Real-Life Implementation on a Large Scale Pilot

Energies ◽

10.3390/en13205330 ◽

2020 ◽

Vol 13 (20) ◽

pp. 5330

Author(s):

Aleksandar Dimovski ◽

Matteo Moncecchi ◽

Davide Falabretti ◽

Marco Merlo

Keyword(s):

Data Exchange ◽

Large Scale ◽

Weather Forecasting ◽

Distribution Network ◽

Real Life ◽

Parameter Tuning ◽

Optimal Operation ◽

Computational Effort ◽

Solar Irradiation ◽

Data Set

The goal of the paper is to develop an online forecasting procedure to be adopted within the H2020 InteGRIDy project, where the main objective is to use the photovoltaic (PV) forecast for optimizing the configuration of a distribution network (DN). Real-time measurements are obtained and saved for nine photovoltaic plants in a database, together with numerical weather predictions supplied from a commercial weather forecasting service. Adopting several error metrics as a performance index, as well as a historical data set for one of the plants on the DN, a preliminary analysis is performed investigating multiple statistical methods, with the objective of finding the most suitable one in terms of accuracy and computational effort. Hourly forecasts are performed each 6 h, for a horizon of 72 h. Having found the random forest method as the most suitable one, further hyper-parameter tuning of the algorithm was performed to improve performance. Optimal results with respect to normalized root mean square error (NRMSE) were found when training the algorithm using solar irradiation and a time vector, with a dataset consisting of 21 days. It was concluded that adding more features does not improve the accuracy when adopting relatively small training sets. Furthermore, the error was not significantly affected by the horizon of the forecast, where the 72-h horizon forecast showed an error increment of slightly above 2% when compared to the 6-h forecast. Thanks to the InteGRIDy project, the proposed algorithms were tested in a large scale real-life pilot, allowing the validation of the mathematical approach, but taking also into account both, problems related to faults in the telecommunication grids, as well as errors in the data exchange and storage procedures. Such an approach is capable of providing a proper quantification of the performances in a real-life scenario.

Download Full-text

High-Dimensional Image Data Sets Retrieval: Improving Accuracy Using a Weighted Relevance Feedback

International Journal of Semantic Computing ◽

10.1142/s1793351x1540005x ◽

2015 ◽

Vol 09 (02) ◽

pp. 239-259

Author(s):

Abir Gallas ◽

Walid Barhoumi ◽

Ezzeddine Zagrouba

Keyword(s):

Relevance Feedback ◽

Large Scale ◽

Nearest Neighbor ◽

Image Data ◽

High Dimensional ◽

General Context ◽

Query Image ◽

Data Set ◽

Dimensional Image ◽

Minimum Number

The user's interaction with the retrieval engines, while seeking a particular image (or set of images) in large-scale databases, defines better his request. This interaction is essentially provided by a relevance feedback step. In fact, the semantic gap is increasing in a remarkable way due to the application of approximate nearest neighbor (ANN) algorithms aiming at resolving the curse of dimensionality. Therefore, an additional step of relevance feedback is necessary in order to get closer to the user's expectations in the next few retrieval iterations. In this context, this paper details a classification of the different relevance feedback techniques related to region-based image retrieval applications. Moreover, a technique of relevance feedback based on re-weighting regions of the query-image by selecting a set of negative examples is elaborated. Furthermore, the general context to carry out this technique which is the large-scale heterogeneous image collections indexing and retrieval is presented. In fact, the main contribution of the proposed work is affording efficient results with the minimum number of relevance feedback iterations for high dimensional image databases. Experiments and assessments are carried out within an RBIR system for "Wang" data set in order to prove the effectiveness of the proposed approaches.

Download Full-text

Recursive Genetic Micro-Aggregation Technique: Information Loss, Disclosure Risk and Scoring Index

Data ◽

10.3390/data6050053 ◽

2021 ◽

Vol 6 (5) ◽

pp. 53

Author(s):

Ebaa Fayyoumi ◽

Omar Alhuniti

Keyword(s):

Real Life ◽

Convergence Condition ◽

Information Loss ◽

General Information ◽

Divide And Conquer ◽

Computational Time ◽

Data Set ◽

Disclosure Risk ◽

Aggregation Technique ◽

Scoring Index

This research investigates the micro-aggregation problem in secure statistical databases by integrating the divide and conquer concept with a genetic algorithm. This is achieved by recursively dividing a micro-data set into two subsets based on the proximity distance similarity. On each subset the genetic operation “crossover” is performed until the convergence condition is satisfied. The recursion will be terminated if the size of the generated subset is satisfied. Eventually, the genetic operation “mutation” will be performed over all generated subsets that satisfied the variable group size constraint in order to maximize the objective function. Experimentally, the proposed micro-aggregation technique was applied to recommended real-life data sets. Results demonstrated a remarkable reduction in the computational time, which sometimes exceeded 70% compared to the state-of-the-art. Furthermore, a good equilibrium value of the Scoring Index (SI) was achieved by involving a linear combination of the General Information Loss (GIL) and the General Disclosure Risk (GDR).

Download Full-text

Business Model of Sustainable Robo-Advisors: Empirical Insights for Practical Implementation

Sustainability ◽

10.3390/su132313009 ◽

2021 ◽

Vol 13 (23) ◽

pp. 13009

Author(s):

Cam-Duc Au ◽

Lars Klingenberger ◽

Martin Svoboda ◽

Eric Frère

Keyword(s):

Research Question ◽

Practical Implementation ◽

Data Set ◽

Policy Makers ◽

Significant Difference ◽

Research Findings ◽

Private Investors ◽

Logit Regression Model ◽

The Given ◽

Asset Managers

The given research paper examines the characteristics of German private investors regarding the probability of using robo-advisory-services. The used data set was gathered for this purpose (N = 305) to address the research question by using a logistic regression approach. The presented logit regression model results indicate that the awareness of sustainable aspects make a significant difference in the probability of using a sustainable robo-service. Additionally, our findings show that being male and cost-aware are positively associated with the use of a sustainable robo-advisor. Furthermore, the probability of use is 1.53 times higher among young and experienced investors. The findings in this paper provide relevant research findings for banks, asset managers, FinTechs, policy makers and financial practitioners to increase the adoption rate of robo-advice by introducing a sustainable offering.

Download Full-text

ON THE COMPLEXITY OF MAPPING LINEAR CHAIN APPLICATIONS ONTO HETEROGENEOUS PLATFORMS

Parallel Processing Letters ◽

10.1142/s0129626409000298 ◽

2009 ◽

Vol 19 (03) ◽

pp. 383-397 ◽

Cited By ~ 4

Author(s):

ANNE BENOIT ◽

YVES ROBERT ◽

ERIC THIERRY

Keyword(s):

Large Scale ◽

Broad Class ◽

Linear Chain ◽

Real Life ◽

Interval Mapping ◽

Constant Factor ◽

Data Set ◽

Heterogeneous Platforms ◽

Mapping Problem ◽

One To One

In this paper, we explore the problem of mapping linear chain applications onto large-scale heterogeneous platforms. A series of data sets enter the input stage and progress from stage to stage until the final result is computed. An important optimization criterion that should be considered in such a framework is the latency, or makespan, which measures the response time of the system in order to process one single data set entirely. For such applications, which are representative of a broad class of real-life applications, we can consider one-to-one mappings, in which each stage is mapped onto a single processor. However, in order to reduce the communication cost, it seems natural to group stages into intervals. The interval mapping problem can be solved in a straightforward way if the platform has homogeneous communications: the whole chain is grouped into a single interval, which in turn is mapped onto the fastest processor. But the problem becomes harder when considering a fully heterogeneous platform. Indeed, we prove the NP-completeness of this problem. Furthermore, we prove that neither the interval mapping problem nor the similar one-to-one mapping problem can be approximated in polynomial time by any constant factor (unless P=NP).

Download Full-text

Grey Wolf Algorithm-Based Clustering Technique

Journal of Intelligent Systems ◽

10.1515/jisys-2014-0137 ◽

2017 ◽

Vol 26 (1) ◽

pp. 153-168 ◽

Cited By ~ 14

Author(s):

Vijay Kumar ◽

Jitender Kumar Chhabra ◽

Dinesh Kumar

Keyword(s):

Real Life ◽

Feature Space ◽

Data Sets ◽

Grey Wolf ◽

Data Set ◽

Local Optima ◽

Clustering Technique ◽

Real Life Data ◽

The Given ◽

Grey Wolf Algorithm

AbstractThe main problem of classical clustering technique is that it is easily trapped in the local optima. An attempt has been made to solve this problem by proposing the grey wolf algorithm (GWA)-based clustering technique, called GWA clustering (GWAC), through this paper. The search capability of GWA is used to search the optimal cluster centers in the given feature space. The agent representation is used to encode the centers of clusters. The proposed GWAC technique is tested on both artificial and real-life data sets and compared to six well-known metaheuristic-based clustering techniques. The computational results are encouraging and demonstrate that GWAC provides better values in terms of precision, recall, G-measure, and intracluster distances. GWAC is further applied for gene expression data set and its performance is compared to other techniques. Experimental results reveal the efficiency of the GWAC over other techniques.

Download Full-text