Data Summarization
Recently Published Documents

TOTAL DOCUMENTS: 124 (five years: 40)
H-INDEX: 10 (five years: 2)

Author(s): Soumya Dutta, Humayra Tasnim, Terece L. Turton, James Ahrens

2021, Vol 37 (3), pp. 239-278
Author(s): Nguyen Cat Ho

The study is based on the view that relationships exist between real-world structures and the information they provide. Such relationships are essential because natural language plays a vital role in capturing and conveying information and in accumulating knowledge that contains useful high-level information. Consequently, natural language must contain certain semantic structures, including the semantic structures of linguistic (L-) variables, which are as fundamental as the structures of mathematical variables. In this context, the fact that the word domains of L-variables can be formalized axiomatically as algebraic semantics-based structures, called hedge algebras (HAs), is still novel and is essential for developing computational methods that simulate human problem-solving capabilities within a natural-language-based formalism. Hedge algebras were founded in 1990. Since then, the HA-formalism has been significantly developed and applied to solve application problems in many distinct fields, such as fuzzy control, data classification and regression, robotics, linguistic time-series forecasting, and linguistic data summarization. This study surveys the distinguishing fundamental features of the HA-formalism, its applicability in problem-solving, and its performance.
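The term ordering that a hedge algebra axiomatizes can be illustrated with a toy sketch. The primary terms ("small", "large"), the hedges ("Very", "Little"), and the hard-coded ordering below are illustrative assumptions, not the formal HA construction, which derives the ordering axiomatically:

```python
# Toy linguistic term domain for one L-variable, ordered as a hedge
# algebra would order it: for the negative primary term "small",
# "Very" intensifies (pushes further down) and "Little" weakens;
# the reverse holds for the positive term "large".
order = ["Very small", "small", "Little small",
         "Little large", "large", "Very large"]

def less_than(a: str, b: str) -> bool:
    # Compare two hedged terms by their position in the domain ordering.
    return order.index(a) < order.index(b)
```

In a real HA, this ordering is not enumerated by hand but follows from axioms about how hedges act on terms, which is what makes the structure usable for computation over words.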


Information, 2021, Vol 12 (10), pp. 392
Author(s): Sinead A. Williamson, Jette Henderson

Understanding how two datasets differ can help us determine whether one dataset under-represents certain sub-populations, and provides insights into how well models will generalize across datasets. Representative points selected by a maximum mean discrepancy (MMD) coreset can provide interpretable summaries of a single dataset, but are not easily compared across datasets. In this paper, we introduce dependent MMD coresets, a data summarization method for collections of datasets that facilitates comparison of distributions. We show that dependent MMD coresets are useful for understanding multiple related datasets and understanding model generalization between such datasets.
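The dependent coreset construction itself is beyond a short snippet, but the maximum mean discrepancy it builds on can be sketched in a few lines of NumPy. The RBF kernel, bandwidth, and sample sizes below are illustrative choices, and the estimator is the simple biased one:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2).
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd_squared(X, Y, gamma=1.0):
    # Biased estimate of squared MMD: ||mean embedding of X - mean embedding of Y||^2.
    return (rbf_kernel(X, X, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. a shifted distribution.
same = mmd_squared(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
diff = mmd_squared(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
```

A shifted distribution yields a larger MMD than a matched one, which is the signal an MMD coreset preserves when it selects representative points.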


2021, Vol 9 (1)
Author(s): Jiawei Chen, Geoffrey Brown, Adam Fudickar

Bio-loggers are widely used for studying the movement and behavior of animals. However, some sensors provide more data than is practical to store given experiment or bio-logger design constraints. One approach to overcoming this limitation is to use data collection strategies, such as non-continuous recording or data summarization, that record data more efficiently but need to be validated for correctness. In this paper we address two fundamental questions: how can researchers determine suitable parameters and behaviors for bio-logger sensors, and how do they validate their choices? We present a methodology that uses software-based simulation of bio-loggers to validate various data collection strategies using recorded data and synchronized, annotated video. The use of simulation allows for fast and repeatable tests, which facilitates the validation of data collection methods as well as the configuration of bio-loggers in preparation for experiments. We demonstrate this methodology using accelerometer loggers for recording the activity of the small songbird Junco hyemalis hyemalis.
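A non-continuous recording strategy of the kind such a simulation would validate can be sketched as a duty-cycling filter applied to a recorded trace. The sampling rate and cycle parameters below are hypothetical, not the study's settings:

```python
import numpy as np

def duty_cycle(signal, fs, record_s, period_s):
    # Simulate non-continuous recording: keep only the samples captured
    # during the first `record_s` seconds of every `period_s`-second cycle.
    t = np.arange(len(signal)) / fs
    mask = (t % period_s) < record_s
    return signal[mask]

# Synthetic 1-minute accelerometer trace at a hypothetical 50 Hz.
fs = 50
rng = np.random.default_rng(1)
full_trace = rng.normal(0, 1, 60 * fs)

# Record 10 s out of every 30 s: storage drops to one third.
sub_trace = duty_cycle(full_trace, fs, record_s=10, period_s=30)
```

Running both the full and the subsampled trace through the same downstream activity metric, and comparing against annotated video, is how a strategy like this would be checked for correctness before deployment.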


Author(s): Kai R. Larsen, Daniel S. Becker

Access to additional, relevant data will lead to better predictions from algorithms until we reach the point where more observations (cases) no longer help to detect the signal, the feature(s), or conditions that inform the target. In addition to obtaining more observations, we can also look for additional features of interest that we do not currently have, at which point it will invariably be necessary to integrate data from different sources. This section introduces this process of data integration, starting with two methods, "joins" (to access more features) and "unions" (to access more observations), and continues on to cover regular expressions, data summarization, crosstabs, data reduction and splitting, and data wrangling in all its flavors.
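The two methods can be sketched with pandas; the datasets and column names below are invented for illustration, and the section itself is not tied to any particular tool:

```python
import pandas as pd

# A join adds *features*: match rows from two sources on a shared key.
patients = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 29]})
labs = pd.DataFrame({"id": [1, 2], "glucose": [5.1, 6.4]})
joined = patients.merge(labs, on="id", how="left")  # keeps all patients

# A union adds *observations*: stack datasets that share the same columns.
clinic_a = pd.DataFrame({"id": [1, 2], "age": [34, 51]})
clinic_b = pd.DataFrame({"id": [3], "age": [29]})
unioned = pd.concat([clinic_a, clinic_b], ignore_index=True)
```

The left join keeps every patient and leaves a missing value where no lab record matches, which is exactly the kind of gap later wrangling steps have to handle.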


2021, pp. 49-76
Author(s): Justin C. Touchon

The third chapter introduces the dataset that readers will work with for the remainder of the book: the RxP data. Students are taught how to import the data and how to use indexing and packages like dplyr to navigate this large, real, and somewhat messy set of ecological data. Basic plotting is introduced as a tool for exploratory data analysis. Lastly, students are taught how to summarize a dataset quickly and efficiently.
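The chapter works in R with dplyr; purely for illustration, the same quick group-and-summarize pattern looks like this in Python's pandas (the `rxp` frame below is a made-up stand-in for the RxP data, not the book's actual dataset):

```python
import pandas as pd

# Hypothetical miniature of the RxP-style ecological data:
# one row per animal, with a treatment group and a measured mass.
rxp = pd.DataFrame({
    "treatment": ["control", "control", "predator", "predator"],
    "mass": [0.41, 0.38, 0.29, 0.33],
})

# Summarize quickly: per-group mean and sample size.
summary = rxp.groupby("treatment")["mass"].agg(["mean", "count"])
```

This is the pandas analogue of a dplyr `group_by()` followed by `summarise()`.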


PLoS ONE, 2021, Vol 16 (6), pp. e0252862
Author(s): Daniel Fernández-Álvarez, Johannes Frey, Jose Emilio Labra Gayo, Daniel Gayo-Avello, Sebastian Hellmann

The amount, size, complexity, and importance of Knowledge Graphs (KGs) have increased during the last decade. Many different communities have chosen to publish their datasets using Linked Data principles, which favors the integration of this information with many other sources published using the same principles and technologies. Such a scenario requires the development of Linked Data summarization techniques. The concept of a class is one of the core elements used to define the ontologies that sustain most existing KGs. Moreover, classes are an excellent tool for referring to an abstract idea that groups many individuals (or instances) in the context of a given KG, which is handy when producing summaries of its content. Rankings of class importance are a powerful summarization tool that can be used both to obtain a high-level view of the content of a given KG and to prioritize many different actions over the data (data quality checking, visualization, relevance for search engines…). In this paper, we analyze existing techniques to measure class importance and propose a novel approach called ClassRank. We compare the class usage in SPARQL logs of different KGs with the importance rankings produced by the evaluated approaches. Then, we discuss the strengths and weaknesses of the evaluated techniques. Our experimentation suggests that ClassRank outperforms state-of-the-art approaches for measuring class importance.
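ClassRank's exact scoring is not spelled out in this abstract, but the kind of naive instance-count baseline such approaches are compared against can be sketched over toy triples (the triples and the `rdf:type` predicate spelling below are illustrative assumptions):

```python
from collections import Counter

# Toy (subject, predicate, object) triples from a hypothetical KG.
triples = [
    ("alice", "rdf:type", "Person"),
    ("bob", "rdf:type", "Person"),
    ("acme", "rdf:type", "Company"),
    ("alice", "worksFor", "acme"),
]

# Naive class-importance baseline: rank classes by instance count.
counts = Counter(obj for subj, pred, obj in triples if pred == "rdf:type")
ranking = [cls for cls, _ in counts.most_common()]
```

A baseline like this ignores how instances are actually used in the graph, which is the gap that usage-aware measures such as ClassRank aim to close.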

