ACM Transactions on Knowledge Discovery from Data
Latest Publications


TOTAL DOCUMENTS

693
(FIVE YEARS 317)

H-INDEX

45
(FIVE YEARS 7)

Published By Association For Computing Machinery

1556-4681

2022 ◽  
Vol 16 (3) ◽  
pp. 1-21
Author(s):  
Heli Sun ◽  
Yang Li ◽  
Bing Lv ◽  
Wujie Yan ◽  
Liang He ◽  
...  

Graph representation learning aims at learning low-dimension representations for nodes in graphs, and has been proven very useful in several downstream tasks. In this article, we propose a new model, Graph Community Infomax (GCI), that can adversarial learn representations for nodes in attributed networks. Different from other adversarial network embedding models, which would assume that the data follow some prior distributions and generate fake examples, GCI utilizes the community information of networks, using nodes as positive(or real) examples and negative(or fake) examples at the same time. An autoencoder is applied to learn the embedding vectors for nodes and reconstruct the adjacency matrix, and a discriminator is used to maximize the mutual information between nodes and communities. Experiments on several real-world and synthetic networks have shown that GCI outperforms various network embedding methods on community detection tasks.


2022 ◽  
Vol 16 (3) ◽  
pp. 1-32
Author(s):  
Junchen Jin ◽  
Mark Heimann ◽  
Di Jin ◽  
Danai Koutra

While most network embedding techniques model the proximity between nodes in a network, recently there has been significant interest in structural embeddings that are based on node equivalences , a notion rooted in sociology: equivalences or positions are collections of nodes that have similar roles—i.e., similar functions, ties or interactions with nodes in other positions—irrespective of their distance or reachability in the network. Unlike the proximity-based methods that are rigorously evaluated in the literature, the evaluation of structural embeddings is less mature. It relies on small synthetic or real networks with labels that are not perfectly defined, and its connection to sociological equivalences has hitherto been vague and tenuous. With new node embedding methods being developed at a breakneck pace, proper evaluation, and systematic characterization of existing approaches will be essential to progress. To fill in this gap, we set out to understand what types of equivalences structural embeddings capture. We are the first to contribute rigorous intrinsic and extrinsic evaluation methodology for structural embeddings, along with carefully-designed, diverse datasets of varying sizes. We observe a number of different evaluation variables that can lead to different results (e.g., choice of similarity measure, classifier, and label definitions). We find that degree distributions within nodes’ local neighborhoods can lead to simple yet effective baselines in their own right and guide the future development of structural embedding. We hope that our findings can influence the design of further node embedding methods and also pave the way for more comprehensive and fair evaluation of structural embedding methods.


2022 ◽  
Vol 16 (3) ◽  
pp. 1-37
Author(s):  
Robert A. Sowah ◽  
Bernard Kuditchar ◽  
Godfrey A. Mills ◽  
Amevi Acakpovi ◽  
Raphael A. Twum ◽  
...  

Class imbalance problem is prevalent in many real-world domains. It has become an active area of research. In binary classification problems, imbalance learning refers to learning from a dataset with a high degree of skewness to the negative class. This phenomenon causes classification algorithms to perform woefully when predicting positive classes with new examples. Data resampling, which involves manipulating the training data before applying standard classification techniques, is among the most commonly used techniques to deal with the class imbalance problem. This article presents a new hybrid sampling technique that improves the overall performance of classification algorithms for solving the class imbalance problem significantly. The proposed method called the Hybrid Cluster-Based Undersampling Technique (HCBST) uses a combination of the cluster undersampling technique to under-sample the majority instances and an oversampling technique derived from Sigma Nearest Oversampling based on Convex Combination, to oversample the minority instances to solve the class imbalance problem with a high degree of accuracy and reliability. The performance of the proposed algorithm was tested using 11 datasets from the National Aeronautics and Space Administration Metric Data Program data repository and University of California Irvine Machine Learning data repository with varying degrees of imbalance. Results were compared with classification algorithms such as the K-nearest neighbours, support vector machines, decision tree, random forest, neural network, AdaBoost, naïve Bayes, and quadratic discriminant analysis. Tests results revealed that for the same datasets, the HCBST performed better with average performances of 0.73, 0.67, and 0.35 in terms of performance measures of area under curve, geometric mean, and Matthews Correlation Coefficient, respectively, across all the classifiers used for this study. The HCBST has the potential of improving the performance of the class imbalance problem, which by extension, will improve on the various applications that rely on the concept for a solution.


2022 ◽  
Vol 16 (3) ◽  
pp. 1-26
Author(s):  
Jerry Chun-Wei Lin ◽  
Youcef Djenouri ◽  
Gautam Srivastava ◽  
Yuanfa Li ◽  
Philip S. Yu

High-utility sequential pattern mining (HUSPM) is a hot research topic in recent decades since it combines both sequential and utility properties to reveal more information and knowledge rather than the traditional frequent itemset mining or sequential pattern mining. Several works of HUSPM have been presented but most of them are based on main memory to speed up mining performance. However, this assumption is not realistic and not suitable in large-scale environments since in real industry, the size of the collected data is very huge and it is impossible to fit the data into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns based on large-scale databases. Two properties are then developed to hold the correctness and completeness of the discovered patterns in the developed framework. In addition, two data structures called sidset and utility-linked list are utilized in the developed framework to accelerate the computation for mining the required patterns. From the results, we can observe that the designed model has good performance in large-scale datasets in terms of runtime, memory, efficiency of the number of distributed nodes, and scalability compared to the serial HUSP-Span approach.


2022 ◽  
Vol 16 (4) ◽  
pp. 1-20
Author(s):  
Junkun Yuan ◽  
Anpeng Wu ◽  
Kun Kuang ◽  
Bo Li ◽  
Runze Wu ◽  
...  

Instrumental variables (IVs), sources of treatment randomization that are conditionally independent of the outcome, play an important role in causal inference with unobserved confounders. However, the existing IV-based counterfactual prediction methods need well-predefined IVs, while it’s an art rather than science to find valid IVs in many real-world scenes. Moreover, the predefined hand-made IVs could be weak or erroneous by violating the conditions of valid IVs. These thorny facts hinder the application of the IV-based counterfactual prediction methods. In this article, we propose a novel Automatic Instrumental Variable decomposition (AutoIV) algorithm to automatically generate representations serving the role of IVs from observed variables (IV candidates). Specifically, we let the learned IV representations satisfy the relevance condition with the treatment and exclusion condition with the outcome via mutual information maximization and minimization constraints, respectively. We also learn confounder representations by encouraging them to be relevant to both the treatment and the outcome. The IV and confounder representations compete for the information with their constraints in an adversarial game, which allows us to get valid IV representations for IV-based counterfactual prediction. Extensive experiments demonstrate that our method generates valid IV representations for accurate IV-based counterfactual prediction.


2022 ◽  
Vol 16 (4) ◽  
pp. 1-22
Author(s):  
Mu Yuan ◽  
Lan Zhang ◽  
Xiang-Yang Li ◽  
Lin-Zhuo Yang ◽  
Hui Xiong

Labeling data (e.g., labeling the people, objects, actions, and scene in images) comprehensively and efficiently is a widely needed but challenging task. Numerous models were proposed to label various data and many approaches were designed to enhance the ability of deep learning models or accelerate them. Unfortunately, a single machine-learning model is not powerful enough to extract various semantic information from data. Given certain applications, such as image retrieval platforms and photo album management apps, it is often required to execute a collection of models to obtain sufficient labels. With limited computing resources and stringent delay, given a data stream and a collection of applicable resource-hungry deep-learning models, we design a novel approach to adaptively schedule a subset of these models to execute on each data item, aiming to maximize the value of the model output (e.g., the number of high-confidence labels). Achieving this lofty goal is nontrivial since a model’s output on any data item is content-dependent and unknown until we execute it. To tackle this, we propose an Adaptive Model Scheduling framework, consisting of (1) a deep reinforcement learning-based approach to predict the value of unexecuted models by mining semantic relationship among diverse models, and (2) two heuristic algorithms to adaptively schedule the model execution order under a deadline or deadline-memory constraints, respectively. The proposed framework does not require any prior knowledge of the data, which works as a powerful complement to existing model optimization technologies. We conduct extensive evaluations on five diverse image datasets and 30 popular image labeling models to demonstrate the effectiveness of our design: our design could save around 53% execution time without loss of any valuable labels.


2022 ◽  
Vol 16 (4) ◽  
pp. 1-22
Author(s):  
Chang Liu ◽  
Jie Yan ◽  
Feiyue Guo ◽  
Min Guo

Although machine learning (ML) algorithms have been widely used in forecasting the trend of stock market indices, they failed to consider the following crucial aspects for market forecasting: (1) that investors’ emotions and attitudes toward future market trends have material impacts on market trend forecasting (2) the length of past market data should be dynamically adjusted according to the market status and (3) the transition of market statutes should be considered when forecasting market trends. In this study, we proposed an innovative ML method to forecast China's stock market trends by addressing the three issues above. Specifically, sentimental factors (see Appendix [1] for full trans) were first collected to measure investors’ emotions and attitudes. Then, a non-stationary Markov chain (NMC) model was used to capture dynamic transitions of market statutes. We choose the state-of-the-art (SOTA) method, namely, Bidirectional Encoder Representations from Transformers ( BERT ), to predict the state of the market at time t , and a long short-term memory ( LSTM ) model was used to estimate the varying length of past market data in market trend prediction, where the input of LSTM (the state of the market at time t ) was the output of BERT and probabilities for opening and closing of the gates in the LSTM model were based on outputs of the NMC model. Finally, the optimum parameters of the proposed algorithm were calculated using a reinforced learning-based deep Q-Network. Compared to existing forecasting methods, the proposed algorithm achieves better results with a forecasting accuracy of 61.77%, annualized return of 29.25%, and maximum losses of −8.29%. Furthermore, the proposed model achieved the lowest forecasting error: mean square error (0.095), root mean square error (0.0739), mean absolute error (0.104), and mean absolute percent error (15.1%). As a result, the proposed market forecasting model can help investors obtain more accurate market forecast information.


2022 ◽  
Vol 16 (4) ◽  
pp. 1-16
Author(s):  
Fereshteh Jafariakinabad ◽  
Kien A. Hua

The syntactic structure of sentences in a document substantially informs about its authorial writing style. Sentence representation learning has been widely explored in recent years and it has been shown that it improves the generalization of different downstream tasks across many domains. Even though utilizing probing methods in several studies suggests that these learned contextual representations implicitly encode some amount of syntax, explicit syntactic information further improves the performance of deep neural models in the domain of authorship attribution. These observations have motivated us to investigate the explicit representation learning of syntactic structure of sentences. In this article, we propose a self-supervised framework for learning structural representations of sentences. The self-supervised network contains two components; a lexical sub-network and a syntactic sub-network which take the sequence of words and their corresponding structural labels as the input, respectively. Due to the n -to-1 mapping of words to their structural labels, each word will be embedded into a vector representation which mainly carries structural information. We evaluate the learned structural representations of sentences using different probing tasks, and subsequently utilize them in the authorship attribution task. Our experimental results indicate that the structural embeddings significantly improve the classification tasks when concatenated with the existing pre-trained word embeddings.


2022 ◽  
Vol 16 (4) ◽  
pp. 1-25
Author(s):  
Hanrui Wu ◽  
Michael K. Ng

Multi-source domain adaptation is a challenging topic in transfer learning, especially when the data of each domain are represented by different kinds of features, i.e., Multi-source Heterogeneous Domain Adaptation (MHDA). It is important to take advantage of the knowledge extracted from multiple sources as well as bridge the heterogeneous spaces for handling the MHDA paradigm. This article proposes a novel method named Multiple Graphs and Low-rank Embedding (MGLE), which models the local structure information of multiple domains using multiple graphs and learns the low-rank embedding of the target domain. Then, MGLE augments the learned embedding with the original target data. Specifically, we introduce the modules of both domain discrepancy and domain relevance into the multiple graphs and low-rank embedding learning procedure. Subsequently, we develop an iterative optimization algorithm to solve the resulting problem. We evaluate the effectiveness of the proposed method on several real-world datasets. Promising results show that the performance of MGLE is better than that of the baseline methods in terms of several metrics, such as AUC, MAE, accuracy, precision, F1 score, and MCC, demonstrating the effectiveness of the proposed method.


2022 ◽  
Vol 16 (4) ◽  
pp. 1-33
Author(s):  
Danlu Liu ◽  
Yu Li ◽  
William Baskett ◽  
Dan Lin ◽  
Chi-Ren Shyu

Risk patterns are crucial in biomedical research and have served as an important factor in precision health and disease prevention. Despite recent development in parallel and high-performance computing, existing risk pattern mining methods still struggle with problems caused by large-scale datasets, such as redundant candidate generation, inability to discover long significant patterns, and prolonged post pattern filtering. In this article, we propose a novel dynamic tree structure, Risk Hierarchical Pattern Tree (RHPTree), and a top-down search method, RHPSearch, which are capable of efficiently analyzing a large volume of data and overcoming the limitations of previous works. The dynamic nature of the RHPTree avoids costly tree reconstruction for the iterative search process and dataset updates. We also introduce two specialized search methods, the extended target search (RHPSearch-TS) and the parallel search approach (RHPSearch-SD), to further speed up the retrieval of certain items of interest. Experiments on both UCI machine learning datasets and sampled datasets of the Simons Foundation Autism Research Initiative (SFARI)—Simon’s Simplex Collection (SSC) datasets demonstrate that our method is not only faster but also more effective in identifying comprehensive long risk patterns than existing works. Moreover, the proposed new tree structure is generic and applicable to other pattern mining problems.


Sign in / Sign up

Export Citation Format

Share Document