Data Science and Engineering
Latest Publications


TOTAL DOCUMENTS: 174 (five years: 91)
H-INDEX: 13 (five years: 6)
Published by: Springer-Verlag
ISSN: 2364-1541, 2364-1185

Author(s): Sonal Tuteja, Rajeev Kumar

Abstract: The incorporation of heterogeneous data models into large-scale e-commerce applications incurs various complexities and overheads, such as redundancy of data, maintenance of different data models, and communication among different models for query processing. Graphs have emerged as data modelling techniques for large-scale applications with heterogeneous, schemaless, and relationship-centric data. Models exist for mapping different types of data to a graph; however, the unification of data from heterogeneous source models into a graph model has not received much attention. To address this, we propose a new framework in this study. The proposed framework first transforms data from various source models into graph models individually and then unifies them into a single graph. To justify the applicability of the proposed framework in e-commerce applications, we analyse and compare query performance, scalability, and database size of the unified graph with the heterogeneous source data models for a predefined set of queries. We also assess qualitative measures, such as flexibility, completeness, consistency, and maturity, for the proposed unified graph. Based on the experimental results, the unified graph outperforms the heterogeneous source models in query performance and scalability; however, it falls behind in database size.
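The unification pipeline is described only in prose above; below is a minimal sketch of the transform-then-unify idea using networkx, with hypothetical source models (a relational table and a key-value store) and illustrative function names, not the authors' actual framework.

```python
# Illustrative sketch: transform two heterogeneous sources into graphs,
# then unify them into a single property graph (names are hypothetical).
import networkx as nx

def relational_to_graph(rows, entity, key):
    """Map each relational row to an entity node with its columns as properties."""
    g = nx.MultiDiGraph()
    for row in rows:
        g.add_node(f"{entity}:{row[key]}", label=entity, **row)
    return g

def keyvalue_to_graph(records, entity, key):
    """Map each key-value record to a node; reference fields become edges later."""
    g = nx.MultiDiGraph()
    for rec in records:
        g.add_node(f"{entity}:{rec[key]}", label=entity, **rec)
    return g

def unify(graphs, link_rules):
    """Union the individual graphs, then add edges for cross-model references."""
    unified = nx.compose_all(graphs)
    for src_label, fk_field, dst_label in link_rules:
        for node, data in list(unified.nodes(data=True)):
            if data.get("label") == src_label and fk_field in data:
                target = f"{dst_label}:{data[fk_field]}"
                if target in unified:
                    unified.add_edge(node, target, type=fk_field)
    return unified

customers = [{"id": 1, "name": "Ada"}]
orders = [{"oid": 10, "customer_id": 1, "total": 42.0}]
g = unify(
    [relational_to_graph(customers, "Customer", "id"),
     keyvalue_to_graph(orders, "Order", "oid")],
    link_rules=[("Order", "customer_id", "Customer")],
)
print(g.number_of_nodes(), g.number_of_edges())  # 2 nodes, 1 edge
```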


Author(s): Ada Bagozi, Devis Bianchini, Valeria De Antonellis

Abstract: Cyber-physical systems (CPS) are hybrid networked cyber and engineered physical elements that record data (e.g. using sensors), analyse them using connected services, influence physical processes and interact with human actors using multi-channel interfaces. Examples of CPS interacting with humans in industrial production environments are the so-called cyber-physical production systems (CPPS), where operators supervise the industrial machines according to the human-in-the-loop paradigm. In this scenario, the research challenges for implementing CPPS resilience, i.e., promptly reacting to faults, concern: (i) the complex structure of CPPS, which cannot be addressed as a monolithic system but as a dynamic ecosystem of single CPS interacting with and influencing each other; (ii) the volume, velocity and variety of data (Big Data) on which resilience is based, which call for novel methods and techniques to ensure recovery procedures; (iii) the involvement of human factors in these systems. In this paper, we address the design of resilient cyber-physical production systems (R-CPPS) in digital factories by facing these challenges. Specifically, each component of the R-CPPS is modelled as a smart machine, that is, a cyber-physical system equipped with a set of recovery services, a Sensor Data API used to collect sensor data acquired from the physical side for monitoring the component's behaviour, and an operator interface for displaying detected anomalous conditions and notifying necessary recovery actions to on-field operators. A context-based mediator, at the shop-floor level, is in charge of ensuring resilience by gathering data from the CPPS, selecting the proper recovery actions and invoking the corresponding recovery services on the target CPS. Finally, data summarisation and relevance evaluation techniques are used to support the identification of anomalous conditions in the presence of the high volume and velocity of data collected through the Sensor Data API. The approach is validated in a real case study from the food industry.
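As a rough illustration of the mediator architecture described above (not the authors' implementation), the sketch below assumes hypothetical SmartMachine and ContextMediator classes, a toy anomaly test, and placeholder recovery services.

```python
# Minimal sketch of the shop-floor mediator pattern; class and method names
# are illustrative, not the paper's actual Sensor Data API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class SmartMachine:
    name: str
    recovery_services: Dict[str, Callable[[], None]]
    readings: List[float] = field(default_factory=list)

    def sensor_data(self) -> List[float]:
        """Stand-in for the Sensor Data API: return recent readings."""
        return self.readings

class ContextMediator:
    """Gathers data from the machines, detects anomalies, triggers recovery."""
    def __init__(self, machines: List[SmartMachine], threshold: float):
        self.machines = machines
        self.threshold = threshold

    def monitor(self) -> None:
        for m in self.machines:
            data = m.sensor_data()
            if data and max(data) > self.threshold:   # naive anomaly test
                action = self._select_recovery(m)
                print(f"[{m.name}] anomaly detected -> {action}")
                m.recovery_services[action]()          # invoke recovery service

    def _select_recovery(self, machine: SmartMachine) -> str:
        # Placeholder for context-based selection; here simply the first service.
        return next(iter(machine.recovery_services))

oven = SmartMachine("oven-1", {"cool_down": lambda: print("cooling down")},
                    readings=[180.0, 260.5])
ContextMediator([oven], threshold=250.0).monitor()
```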


Author(s): Anirban Mondal, Ayaan Kakkar, Nilesh Padhariya, Mukesh Mohania

Abstract: Next-generation enterprise management systems are beginning to be developed based on the Systems of Engagement (SOE) model. We visualize an SOE as a set of entities. Each entity is modeled by a single parent document with dynamic embedded links (i.e., child documents) that contain multi-modal information about the entity from various networks. Since entities in an SOE are generally queried using keywords, our goal is to efficiently retrieve the top-k entities related to a given keyword-based query by considering the relevance scores of both their parent and child documents. Furthermore, we extend the aforementioned problem to incorporate the case where the entities are geo-tagged. The main contributions of this work are four-fold. First, it proposes an efficient bitmap-based approach for quickly identifying the candidate set of entities whose parent documents contain all queried keywords. A variant of this approach is also proposed to reduce memory consumption by exploiting skews in keyword popularity. Second, it proposes the two-tier HI-tree index, which uses both hashing and inverted indexes, for efficient document relevance score lookups. Third, it proposes an R-tree-based approach to extend the aforementioned approaches to the case where the entities are geo-tagged. Fourth, it performs comprehensive experiments with both real and synthetic datasets to demonstrate that our proposed schemes are indeed effective in providing good top-k result recall within acceptable query response times.
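The bitmap-based candidate filtering lends itself to a small sketch: each keyword maps to a bitmask over entity ids, and intersecting the bitmasks of the queried keywords yields the entities whose parent documents contain all keywords. The data and function names below are illustrative, not from the paper.

```python
# Hedged sketch of bitmap-based candidate identification for keyword queries.
from collections import defaultdict

def build_bitmaps(parent_docs):
    """parent_docs: {entity_id: set_of_keywords}; returns {keyword: int bitmask}."""
    bitmaps = defaultdict(int)
    for entity_id, keywords in parent_docs.items():
        for kw in keywords:
            bitmaps[kw] |= 1 << entity_id
    return bitmaps

def candidate_entities(bitmaps, query_keywords, num_entities):
    """AND the bitmasks of all queried keywords and decode the surviving ids."""
    mask = (1 << num_entities) - 1
    for kw in query_keywords:
        mask &= bitmaps.get(kw, 0)
    return [e for e in range(num_entities) if mask >> e & 1]

docs = {0: {"cloud", "storage"}, 1: {"cloud", "ml"}, 2: {"storage", "ml"}}
bitmaps = build_bitmaps(docs)
print(candidate_entities(bitmaps, ["cloud", "ml"], num_entities=3))  # [1]
```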


Author(s): Luyan Xu, Xuan Zhou

Abstract: Evaluation of interactive search systems and the study of users' struggling search behaviors require a significant number of search tasks. However, the generation of such tasks is inherently difficult, as each task is supposed to trigger struggling search behavior rather than simple search behavior. To the best of our knowledge, there has not been a commonly used task set for research on struggling search. Moreover, the ever-changing landscape of information needs would render old task sets less ideal, if not unusable, for evaluation. To deal with this problem, we propose a crowd-powered task generation method and develop a platform to efficiently generate struggling search tasks on the basis of online wikis such as Wikipedia. Our experiments and analysis show that the generated tasks are qualified to emulate struggling search behaviors consisting of "repeated similar queries" and "quick-back clicks," and that tasks of diverse topics, high quality and high difficulty can be created using this method. For the benefit of the community, we have publicly released the task generation platform TaskGenie, a task set of 80 topically diverse struggling search tasks with "baselines," and the corresponding anonymized user behavior logs.
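The two struggling-search signals named above lend themselves to a simple log check; the sketch below is a hedged illustration with assumed log fields and thresholds, not the TaskGenie platform's logic.

```python
# Illustrative detectors for "repeated similar queries" and "quick-back clicks".
from difflib import SequenceMatcher

def repeated_similar_queries(queries, similarity=0.8):
    """Count consecutive query pairs that are near-duplicates."""
    return sum(
        SequenceMatcher(None, a, b).ratio() >= similarity
        for a, b in zip(queries, queries[1:])
    )

def quick_back_clicks(click_dwell_seconds, max_dwell=10):
    """Count result clicks where the user came back within max_dwell seconds."""
    return sum(dwell <= max_dwell for dwell in click_dwell_seconds)

session_queries = ["jaguar top speed", "jaguar animal top speed", "how fast is a jaguar"]
session_dwells = [4, 35, 6]   # seconds spent on each clicked result
print(repeated_similar_queries(session_queries), quick_back_clicks(session_dwells))
```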


Author(s): Deepak P, Savitha Sam Abraham

Abstract: An outlier detection method may be considered fair over specified sensitive attributes if the results of outlier detection are not skewed toward particular groups defined on such sensitive attributes. In this paper, we consider the task of fair outlier detection. Our focus is on fair outlier detection over multiple multi-valued sensitive attributes (e.g., gender, race, religion, nationality and marital status, among others), a task that has broad applications across modern data scenarios. We propose a fair outlier detection method, FairLOF, that is inspired by the popular LOF formulation for neighborhood-based outlier detection. We outline ways in which unfairness could be induced within LOF and develop three heuristic principles to enhance fairness, which form the basis of the FairLOF method. As this is a novel task, we develop an evaluation framework for fair outlier detection and use it to benchmark FairLOF on the quality and fairness of its results. Through an extensive empirical evaluation over real-world datasets, we illustrate that FairLOF achieves significant improvements in fairness, at sometimes marginal degradations in result quality, as measured against the fairness-agnostic LOF method. We also show that a generalization of our method, named FairLOF-Flex, opens possibilities for further deepening fairness in outlier detection beyond what is offered by FairLOF.
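For orientation, the sketch below runs plain LOF from scikit-learn and reports per-group flag rates over a synthetic sensitive attribute, the kind of disparity a fairness evaluation for outlier detection starts from; it is not the FairLOF algorithm itself.

```python
# Vanilla LOF plus a group-wise flag-rate check (illustrative, not FairLOF).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy feature matrix
groups = rng.choice(["A", "B"], size=200)     # a synthetic sensitive attribute

labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)   # -1 marks outliers
for g in ("A", "B"):
    mask = groups == g
    rate = np.mean(labels[mask] == -1)
    print(f"group {g}: flagged {rate:.1%} as outliers")
```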


Author(s): Yong-Feng Ge, Jinli Cao, Hua Wang, Zhenxiang Chen, Yanchun Zhang

Abstract: By breaking sensitive associations between attributes, database fragmentation can protect the privacy of outsourced data storage. Database fragmentation algorithms need prior knowledge of the sensitive associations in the tackled database, which they set as the optimization objective; thus, the effectiveness of these algorithms is limited by that prior knowledge. Inspired by the anonymity degree measurement in anonymity techniques such as k-anonymity, an anonymity-driven database fragmentation problem is defined in this paper. For this problem, a set-based adaptive distributed differential evolution (S-ADDE) algorithm is proposed. S-ADDE adopts an island model to maintain population diversity. Two set-based operators, i.e., set-based mutation and set-based crossover, are designed, in which the continuous domain of traditional differential evolution is transferred to the discrete domain of the anonymity-driven database fragmentation problem. Moreover, in the set-based mutation operator, each individual's mutation strategy is adaptively selected according to its performance. The experimental results demonstrate that the proposed S-ADDE is significantly better than the compared approaches, and the effectiveness of the proposed operators is verified.
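A toy illustration of set-based variation on a fragmentation, where an individual assigns each attribute to a fragment, might look as follows; the operators are simplified stand-ins for intuition, not the S-ADDE operators defined in the paper.

```python
# Simplified set-based variation over a fragmentation (attribute -> fragment id).
import random

def set_based_mutation(individual, num_fragments, rng=random):
    """Move one randomly chosen attribute to a different fragment."""
    child = dict(individual)
    attr = rng.choice(list(child))
    child[attr] = rng.choice([f for f in range(num_fragments) if f != child[attr]])
    return child

def set_based_crossover(parent_a, parent_b, rng=random):
    """For each attribute, inherit its fragment assignment from either parent."""
    return {a: rng.choice((parent_a[a], parent_b[a])) for a in parent_a}

attrs = ["name", "dob", "zip", "diagnosis"]
p1 = {a: i % 2 for i, a in enumerate(attrs)}
p2 = {a: (i + 1) % 2 for i, a in enumerate(attrs)}
print(set_based_crossover(p1, p2))
print(set_based_mutation(p1, num_fragments=2))
```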


Author(s): Rubina Sarki, Khandakar Ahmed, Hua Wang, Yanchun Zhang, Jiangang Ma, ...

Abstract: Diabetic eye disease (DED) is a cluster of eye problems that affect diabetic patients. Identifying DED in retinal fundus images is a crucial activity because early diagnosis and treatment can ultimately minimize the risk of visual impairment. The retinal fundus image plays a significant role in early DED classification and identification. The development of an accurate diagnostic model from retinal fundus images depends highly on image quality and quantity. This paper presents a methodical study of the significance of image processing for DED classification. The proposed automated classification framework for DED comprises several steps: image quality enhancement, image segmentation (region of interest), image augmentation (geometric transformation), and classification. The optimal results were obtained using traditional image processing methods together with a newly built convolutional neural network (CNN) architecture. The new CNN combined with the traditional image processing approach delivered the best accuracy on the DED classification problem. The experiments conducted showed adequate accuracy, specificity, and sensitivity.
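As a rough sketch of the final classification step, the PyTorch model below shows a small CNN that preprocessed fundus images could be fed into; the layer sizes and the number of output classes are assumptions, not the architecture built in the paper.

```python
# Minimal CNN classifier sketch for preprocessed fundus image crops.
import torch
import torch.nn as nn

class SmallFundusCNN(nn.Module):
    def __init__(self, num_classes=4):  # number of DED classes is a placeholder
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallFundusCNN()
batch = torch.randn(8, 3, 128, 128)   # stand-in for preprocessed fundus crops
print(model(batch).shape)             # torch.Size([8, 4])
```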


Author(s): Harika Abburi, Pulkit Parikh, Niyati Chhaya, Vasudeva Varma

Abstract: Sexism, a pervasive form of oppression, causes profound suffering through its various manifestations. Given the increasing number of experiences of sexism shared online, categorizing these recollections automatically can support the battle against sexism, since it can facilitate effective analyses by gender studies researchers and government representatives engaged in policy making. In this paper, we examine the fine-grained, multi-label classification of accounts (reports) of sexism. To the best of our knowledge, we consider substantially more categories of sexism than any related prior work through our 23-class problem formulation. Moreover, we present the first semi-supervised work for the multi-label classification of accounts describing any type(s) of sexism. We devise self-training-based techniques tailor-made for the multi-label nature of the problem to utilize unlabeled samples for augmenting the labeled set. We identify high textual diversity with respect to the existing labeled set as a desirable quality for candidate unlabeled instances and develop methods for incorporating it into our approach. We also explore ways of infusing class imbalance alleviation for multi-label classification into our semi-supervised learning, independently and in conjunction with the method involving diversity. In addition to the data augmentation methods, we develop a neural model which combines biLSTM and attention with a domain-adapted BERT model in an end-to-end trainable manner. Further, we formulate a multi-level training approach in which models are sequentially trained using categories of sexism of different levels of granularity. Moreover, we devise a loss function that exploits any label confidence scores associated with the data. Several of the proposed methods outperform various baselines on a recently released dataset for multi-label sexism categorization across several standard metrics.
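One round of the self-training idea with a diversity filter can be sketched as follows; the TF-IDF plus logistic-regression model, the thresholds, and the two-label toy data are assumptions for illustration, not the paper's BERT-based setup.

```python
# One self-training round for multi-label text classification with a
# diversity filter over unlabeled candidates (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics.pairwise import cosine_similarity

labeled = ["comment on her looks at work", "paid less than male colleague"]
Y = np.array([[1, 0], [0, 1]])                # two illustrative label columns
unlabeled = ["remarks about appearance in a meeting",
             "salary gap with men on the team"]

vec = TfidfVectorizer().fit(labeled + unlabeled)
Xl, Xu = vec.transform(labeled), vec.transform(unlabeled)

clf = OneVsRestClassifier(LogisticRegression()).fit(Xl, Y)
proba = clf.predict_proba(Xu)                 # per-label confidence

confident = proba.max(axis=1) >= 0.6          # keep confidently labeled samples
diversity = 1 - cosine_similarity(Xu, Xl).max(axis=1)   # distance to labeled set
selected = np.where(confident & (diversity >= 0.5))[0]
print("pseudo-labelled indices:", selected)
print("pseudo-labels:", (proba[selected] >= 0.5).astype(int))
```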


Author(s): Xin Wang, Bohan Li, Shiyu Yang

Author(s): Rhea Mahajan, Vibhakar Mansotra

Abstract: Twitter is one of the most popular micro-blogging and social networking platforms, where users post their opinions, preferences, activities, thoughts, views, etc., in the form of tweets within a limit of 280 characters. In order to study and analyse the social behavior and activities of a user across a region, it becomes necessary to identify the location of the tweet. This paper aims to predict the geolocation of real-time tweets at the city level, collected over a period of 30 days, by using a combination of a convolutional neural network and a bidirectional long short-term memory network, extracting features within the tweets and features associated with the tweets. We have also compared our results with previous baseline models, and the findings of our experiment show a significant improvement over baseline methods, achieving an accuracy of 92.6 with a median error of 22.4 km for city-level prediction.
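A minimal PyTorch sketch of a CNN combined with a bidirectional LSTM over tweet tokens is shown below; the vocabulary size, embedding dimension, and number of candidate cities are placeholders, and the additional tweet-associated features described in the paper are omitted.

```python
# CNN + bidirectional LSTM text classifier sketch for city-level geolocation.
import torch
import torch.nn as nn

class CnnBiLstmGeolocator(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=100, num_cities=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(64, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 64, num_cities)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, embed, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)   # (batch, seq, 64)
        _, (h, _) = self.bilstm(x)                     # h: (2, batch, 64)
        return self.out(torch.cat([h[0], h[1]], dim=1))

tweets = torch.randint(0, 20000, (16, 40))             # 16 tweets, 40 tokens each
print(CnnBiLstmGeolocator()(tweets).shape)              # torch.Size([16, 50])
```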

