Crouching TIGER, hidden structure: Exploring the nature of linguistic data using TIGER values

Author(s):  
Kaj Syrjänen ◽  
Luke Maurits ◽  
Unni Leino ◽  
Terhi Honkola ◽  
Jadranka Rota ◽  
...  

Abstract In recent years, techniques such as Bayesian inference of phylogeny have become a standard part of the quantitative linguistic toolkit. While these tools successfully model the tree-like component of a linguistic dataset, real-world datasets generally include a combination of tree-like and nontree-like signals. Alongside developing techniques for modeling nontree-like data, an important requirement for future quantitative work is to build a principled understanding of this structural complexity of linguistic datasets. Some techniques exist for exploring the general structure of a linguistic dataset, such as NeighborNets, δ scores, and Q-residuals; however, these methods are not without limitations or drawbacks. In general, the question of what kinds of historical structure a linguistic dataset can contain and how these might be detected or measured remains critically underexplored from an objective, quantitative perspective. In this article, we propose TIGER values, a metric that estimates the internal consistency of a genetic dataset, as an additional metric for assessing how tree-like a linguistic dataset is. We use TIGER values to explore simulated language data ranging from very tree-like to completely unstructured, and also use them to analyze a cognate-coded basic vocabulary dataset of Uralic languages. As a point of comparison for the TIGER values, we also explore the same data using δ scores, Q-residuals, and NeighborNets. Our results suggest that TIGER values are capable of both ranking tree-like datasets according to their degree of treelikeness and distinguishing datasets with tree-like structure from datasets with nontree-like structure. Consequently, we argue that TIGER values serve as a useful metric for measuring the historical heterogeneity of datasets. Our results also highlight the complexities in measuring treelikeness from linguistic data, and how the metrics approach this question from different perspectives.
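
For readers unfamiliar with the δ scores used as a comparison metric in the abstract above, the following is a minimal sketch of the commonly cited quartet definition from Holland et al. (2002). That definition is not stated in the abstract itself, and the function names `quartet_delta` and `mean_delta`, as well as the toy distance matrices, are hypothetical. The sketch averages over all quartets; published analyses often report per-taxon averages instead.

```python
from itertools import combinations


def quartet_delta(d, x, y, u, v):
    """Delta score of one quartet, given a symmetric distance matrix d.

    The three ways of pairing four taxa give three distance sums.
    On a perfectly tree-like (additive) matrix the two largest sums
    are equal, so the score is 0; it approaches 1 as the quartet
    becomes less tree-like.
    """
    sums = sorted(
        [d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]],
        reverse=True,
    )
    largest, middle, smallest = sums
    if largest == smallest:
        # All three pairings agree (star-like quartet): defined as 0.
        return 0.0
    return (largest - middle) / (largest - smallest)


def mean_delta(d):
    """Average quartet delta over every set of four taxa in d."""
    n = len(d)
    if n < 4:
        raise ValueError("at least four taxa are needed to form a quartet")
    scores = [quartet_delta(d, *q) for q in combinations(range(n), 4)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): distances on the tree ((A,B),(C,D))
# with every branch of length 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]

# Toy data (hypothetical): distances whose three pairing sums
# are 2, 4 and 6, so no tree fits them exactly.
conflicting = [
    [0, 1, 2, 3],
    [1, 0, 3, 2],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
]

print(mean_delta(tree_like))
print(mean_delta(conflicting))
```

A lower average indicates a more tree-like distance matrix, which is the property the abstract compares against TIGER values.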

2021 ◽  
Vol 14 (10) ◽  
pp. 1859-1871
Author(s):  
Paolo Ciaccia ◽  
Davide Martinenghi ◽  
Riccardo Torlone

When composing multiple preferences characterizing the most suitable results for a user, several issues may arise. Indeed, preferences can be partially contradictory, suffer from a mismatch with the level of detail of the actual data, and even lack natural properties such as transitivity. In this paper we formally investigate the problem of retrieving the best results complying with multiple preferences expressed in a logic-based language. Data are stored in relational tables with taxonomic domains, which allow the specification of preferences also over values that are more generic than those in the database. In this framework, we introduce two operators that rewrite preferences for enforcing the important properties of transitivity, which guarantees soundness of the result, and specificity, which solves all conflicts among preferences. Although, as we show, these two properties cannot be fully achieved together, we use our operators to identify the only two alternatives that ensure transitivity and minimize the residual conflicts. Building on this finding, we devise a technique, based on an original heuristic, for selecting the best results according to the two possible alternatives. We finally show, with a number of experiments over both synthetic and real-world datasets, the effectiveness and practical feasibility of the overall approach.
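
To illustrate what enforcing transitivity on a preference relation means in its simplest form, the sketch below computes a plain transitive closure over explicit "a is preferred to b" pairs. This is not the authors' operator, which works on logic-based preference formulas over taxonomic domains; the function name `transitive_closure` and the example values are hypothetical.

```python
def transitive_closure(prefs):
    """Return the transitive closure of a set of (better, worse) pairs.

    Uses the Warshall scheme: for each possible intermediate item k,
    add (i, j) whenever both (i, k) and (k, j) are already present.
    """
    closure = set(prefs)
    items = {item for pair in prefs for item in pair}
    for k in items:
        for i in items:
            for j in items:
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    return closure


# Hypothetical stated preferences: red wine over white wine,
# white wine over beer.
stated = {("red wine", "white wine"), ("white wine", "beer")}
closed = transitive_closure(stated)

# The closure adds the implied preference of red wine over beer.
print(sorted(closed))
```

If the stated preferences contain a cycle, the closure will include pairs of the form (a, a), which is one simple way to spot the kind of contradiction the abstract mentions.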


Author(s):  
Lahari Poddar ◽  
Wynne Hsu ◽  
Mong Li Lee

User opinions expressed in the form of ratings can influence an individual's view of an item. However, the true quality of an item is often obfuscated by user biases, and the observed ratings do not make obvious how much importance different users place on different aspects of an item. We propose a probabilistic modeling of the observed aspect ratings to infer (i) each user's aspect bias and (ii) the latent intrinsic quality of an item. We model multi-aspect ratings as ordered discrete data and encode the dependency between different aspects by using a latent Gaussian structure. We handle the Gaussian-Categorical non-conjugacy using a stick-breaking formulation coupled with Pólya-Gamma auxiliary variable augmentation for a simple, fully Bayesian inference. On two real-world datasets, we demonstrate the predictive ability of our model and its effectiveness in learning explainable user biases to provide insights towards a more reliable product quality estimation.


2021 ◽  
Vol 21 (3) ◽  
pp. 1-17
Author(s):  
Wu Chen ◽  
Yong Yu ◽  
Keke Gai ◽  
Jiamou Liu ◽  
Kim-Kwang Raymond Choo

In existing ensemble learning algorithms (e.g., random forest), each base learner’s model needs the entire dataset for sampling and training. However, this may not be practical in many real-world applications, and it incurs additional computational costs. To achieve better efficiency, we propose a decentralized framework: Multi-Agent Ensemble. The framework leverages edge computing to facilitate ensemble learning techniques by focusing on the balancing of access restrictions (small sub-dataset) and accuracy enhancement. Specifically, network edge nodes (learners) are utilized to model classifications and predictions in our framework. Data is then distributed to multiple base learners who exchange data via an interaction mechanism to achieve improved prediction. The proposed approach relies on a training model rather than conventional centralized learning. Findings from the experimental evaluations using 20 real-world datasets suggest that Multi-Agent Ensemble outperforms other ensemble approaches in terms of accuracy even though the base learners require fewer samples (i.e., significant reduction in computation costs).


Data ◽  
2020 ◽  
Vol 6 (1) ◽  
pp. 1
Author(s):  
Ahmed Elmogy ◽  
Hamada Rizk ◽  
Amany M. Sarhan

In data mining, outlier detection is a major challenge as it has an important role in many applications such as medical data, image processing, fraud detection, intrusion detection, and so forth. A wide variety of clustering-based approaches have been developed to detect outliers. However, they are by nature time-consuming, which restricts their use in real-time applications. Furthermore, outlier detection requests are handled one at a time, which means that each request is initiated individually with a particular set of parameters. In this paper, the first clustering-based outlier detection framework, On the Fly Clustering Based Outlier Detection (OFCOD), is presented. OFCOD enables analysts to find outliers effectively, on time and on request, even within huge datasets. The proposed framework has been tested and evaluated using two real-world datasets with different features and applications: one with 699 records, and another with five million records. The experimental results show that the proposed framework outperforms other existing approaches across several evaluation metrics.


Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 680
Author(s):  
Hanyang Lin ◽  
Yongzhao Zhan ◽  
Zizheng Zhao ◽  
Yuzhong Chen ◽  
Chen Dong

There is a wealth of information in real-world social networks. In addition to the topology information, the vertices or edges of a social network often have attributes, with many of the overlapping vertices belonging to several communities simultaneously. It is challenging to fully utilize the additional attribute information to detect overlapping communities. In this paper, we first propose an overlapping community detection algorithm based on an augmented attribute graph. An improved weight adjustment strategy for attributes is embedded in the algorithm to help detect overlapping communities more accurately. Second, we enhance the algorithm to automatically determine the number of communities by a node-density-based fuzzy k-medoids process. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed algorithms can effectively detect overlapping communities with fewer parameters compared to the baseline methods.


2021 ◽  
Vol 15 (3) ◽  
pp. 1-33
Author(s):  
Wenjun Jiang ◽  
Jing Chen ◽  
Xiaofei Ding ◽  
Jie Wu ◽  
Jiawei He ◽  
...  

In online systems, including e-commerce platforms, many users rely on the reviews or comments written by previous consumers when making decisions, but they have limited time to read many reviews. A review summary that contains all the important features of user-generated reviews is therefore desirable. In this article, we study “how to generate a comprehensive review summary from a large number of user-generated reviews.” This can be implemented by text summarization, which has two main types of approach: extractive and abstractive. Both approaches can deal with both supervised and unsupervised scenarios, but the former may generate redundant and incoherent summaries, while the latter can avoid redundancy but usually can only deal with short sequences. Moreover, both approaches may neglect sentiment information. To address these issues, we propose comprehensive Review Summary Generation frameworks to deal with the supervised and unsupervised scenarios. We design two different preprocessing models, re-ranking and selecting, to identify the important sentences while preserving the users’ sentiment in the original reviews. These sentences can then be used to generate review summaries with text summarization methods. Experimental results on seven real-world datasets (Idebate, Rotten Tomatoes, Amazon, Yelp, and three unlabelled product review datasets in Amazon) demonstrate that our work performs well in review summary generation. Moreover, the re-ranking and selecting models show different characteristics.


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Jibing Wu ◽  
Lianfei Yu ◽  
Qun Zhang ◽  
Peiteng Shi ◽  
Lihua Liu ◽  
...  

Heterogeneous information networks, which consist of multiple types of objects with various semantically meaningful links among them, are omnipresent in real-world applications. Community discovery is an effective method to extract the hidden structures in networks. Usually, heterogeneous information networks are time-evolving: their objects and links are dynamic and vary gradually. In such time-evolving heterogeneous information networks, community discovery is a challenging topic and considerably more difficult than in traditional static homogeneous information networks. In contrast to communities in traditional approaches, which contain only one type of object and link, communities in heterogeneous information networks contain multiple types of dynamic objects and links. Recently, some studies have focused on dynamic heterogeneous information networks and achieved some satisfactory results. However, they assume that heterogeneous information networks usually follow some simple schemas, such as the bityped network and star network schemas. In this paper, we propose a multityped community discovery method for time-evolving heterogeneous information networks with general network schemas. A tensor decomposition framework, which integrates tensor CP factorization with a temporal evolution regularization term, is designed to model the multityped communities and address their evolution. Experimental results on both synthetic and real-world datasets demonstrate the efficiency of our framework.


2017 ◽  
Vol 27 (1) ◽  
pp. 169-180 ◽  
Author(s):  
Marton Szemenyei ◽  
Ferenc Vajda

Abstract Dimension reduction and feature selection are fundamental tools for machine learning and data mining. Most existing methods, however, assume that objects are represented by a single vectorial descriptor. In reality, some description methods assign unordered sets or graphs of vectors to a single object, where each vector is assumed to have the same number of dimensions, but is drawn from a different probability distribution. Moreover, some applications (such as pose estimation) may require the recognition of individual vectors (nodes) of an object. In such cases it is essential that the nodes within a single object remain distinguishable after dimension reduction. In this paper we propose new discriminant analysis methods that are able to satisfy two criteria at the same time: separating between classes and between the nodes of an object instance. We analyze and evaluate our methods on several different synthetic and real-world datasets.


2019 ◽  
Vol 2019 (1) ◽  
pp. 26-46 ◽  
Author(s):  
Thee Chanyaswad ◽  
Changchang Liu ◽  
Prateek Mittal

Abstract A key challenge facing the design of differential privacy in the non-interactive setting is to maintain the utility of the released data. To overcome this challenge, we utilize the Diaconis-Freedman-Meckes (DFM) effect, which states that most projections of high-dimensional data are nearly Gaussian. Hence, we propose the RON-Gauss model that leverages the novel combination of dimensionality reduction via random orthonormal (RON) projection and the Gaussian generative model for synthesizing differentially-private data. We analyze how RON-Gauss benefits from the DFM effect, and present multiple algorithms for a range of machine learning applications, including both unsupervised and supervised learning. Furthermore, we rigorously prove that (a) our algorithms satisfy the strong ɛ-differential privacy guarantee, and (b) RON projection can lower the level of perturbation required for differential privacy. Finally, we illustrate the effectiveness of RON-Gauss under three common machine learning applications – clustering, classification, and regression – on three large real-world datasets. Our empirical results show that (a) RON-Gauss outperforms previous approaches by up to an order of magnitude, and (b) loss in utility compared to the non-private real data is small. Thus, RON-Gauss can serve as a key enabler for real-world deployment of privacy-preserving data release.


Electronics ◽  
2021 ◽  
Vol 10 (22) ◽  
pp. 2868
Author(s):  
Wenxuan Zhao ◽  
Yaqin Zhao ◽  
Liqi Feng ◽  
Jiaxi Tang

The purpose of image dehazing is to reduce the image degradation caused by suspended particles in order to support high-level visual tasks. Besides the atmospheric scattering model, convolutional neural networks (CNNs) have been used for image dehazing. However, the existing image dehazing algorithms are limited in the face of unevenly distributed haze and dense haze in real-world scenes. In this paper, we propose a novel end-to-end convolutional neural network called attention enhanced serial Unet++ dehazing network (AESUnet) for single image dehazing. We attempt to build a serial Unet++ structure that adopts a serial strategy of two pruned Unet++ blocks based on residual connection. Compared with the simple Encoder–Decoder structure, the serial Unet++ module can better use the features extracted by encoders and promote contextual information fusion in different resolutions. In addition, we apply several improvements to the Unet++ module, such as pruning, introducing the convolutional module with ResNet structure, and a residual learning strategy. Thus, the serial Unet++ module can generate more realistic images with less color distortion. Furthermore, following the serial Unet++ blocks, an attention mechanism is introduced to pay different attention to haze regions with different concentrations by learning weights in the spatial domain and channel domain. Experiments are conducted on two representative groups of datasets: the large-scale synthetic dataset RESIDE and the small-scale real-world datasets I-HAZY and O-HAZY. The experimental results show that the proposed dehazing network is not only comparable to state-of-the-art methods on the RESIDE synthetic dataset, but also surpasses them by a very large margin on the I-HAZY and O-HAZY real-world datasets.

