Multi-Document Summarization by Extended Graph Text Representation and Importance Refinement

Author(s): Uri Mirchev, Mark Last

Automatic multi-document summarization is aimed at recognizing important text content in a collection of topic-related documents and representing it in the form of a short abstract or extract. This chapter presents a novel approach to the multi-document summarization problem, focusing on the generic summarization task. The proposed SentRel (Sentence Relations) multi-document summarization algorithm assigns importance scores to documents and sentences in a collection based on two aspects: static and dynamic. In the static aspect, the significance score is recursively inferred from a novel, tripartite graph representation of the text corpus. In the dynamic aspect, the significance score is continuously refined with respect to the current summary content. The resulting summary is generated in the form of complete sentences exactly as they appear in the summarized documents, ensuring the summary's grammatical correctness. The proposed algorithm is evaluated on the TAC 2011 dataset, using DUC 2001 for training and DUC 2004 for parameter tuning. The SentRel ROUGE-1 and ROUGE-2 scores are comparable to those of state-of-the-art summarization systems, which require a different set of textual entities.
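
The static/dynamic split described above lends itself to a compact illustration. The sketch below is not the SentRel algorithm itself: it replaces the tripartite graph with a plain sentence-similarity graph and the importance refinement with a simple redundancy penalty, and the function name select_summary and all parameter values are hypothetical.

```python
# Illustrative sketch only, in the spirit of SentRel's static/dynamic split:
# a static graph-based sentence score combined with a dynamic redundancy
# penalty against the summary built so far.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_summary(sentences, max_sentences=3, redundancy_weight=0.7):
    # Static aspect: score sentences once by PageRank over a similarity graph.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    static_score = nx.pagerank(nx.from_numpy_array(sim))

    summary, remaining = [], list(range(len(sentences)))
    while remaining and len(summary) < max_sentences:
        # Dynamic aspect: penalize overlap with sentences already selected.
        def refined(i):
            redundancy = max((sim[i][j] for j in summary), default=0.0)
            return static_score[i] - redundancy_weight * redundancy
        best = max(remaining, key=refined)
        summary.append(best)
        remaining.remove(best)
    return [sentences[i] for i in summary]
```

The point of the loop is that the static scores are computed once, while the refined score is recomputed every time a sentence enters the summary, mirroring the static versus dynamic distinction in the abstract.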

2010, Vol 6 (2), pp. 41-58
Author(s): Jing Lu, Weiru Chen, Malcolm Keech

Structural relation patterns have been introduced recently to extend the search for complex patterns often hidden behind large sequences of data. This has motivated a novel approach to sequential patterns post-processing and a corresponding data mining method was proposed for Concurrent Sequential Patterns (ConSP). This article refines the approach in the context of ConSP modelling, where a companion graph-based model is devised as an extension of previous work. Two new modelling methods are presented here together with a construction algorithm, to complete the transformation of concurrent sequential patterns to a ConSP-Graph representation. Customer orders data is used to demonstrate the effectiveness of ConSP mining while synthetic sample data highlights the strength of the modelling technique, illuminating the theories developed.
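
To make the notion of a ConSP-Graph more concrete, here is a minimal sketch that turns a concurrent pattern into a directed graph, assuming the pattern is supplied as a list of branches, each branch being a sequential pattern (an ordered list of items). The node and edge scheme and the function name consp_graph are illustrative only and do not reproduce the construction algorithm of the article.

```python
# A minimal sketch: each sequential pattern becomes a chain of nodes, and all
# branches fork from a shared node to indicate that they occur concurrently.
import networkx as nx

def consp_graph(branches):
    g = nx.DiGraph()
    g.add_node("FORK")                        # branches run concurrently after this point
    for b, pattern in enumerate(branches):
        prev = "FORK"
        for pos, item in enumerate(pattern):
            node = f"{b}:{pos}:{item}"        # keep branch/position so items can repeat
            g.add_node(node, item=item)
            g.add_edge(prev, node)            # edge direction follows sequence order
            prev = node
    return g

# Example: sequential patterns <a, b, c> and <d, b> occurring concurrently.
print(consp_graph([["a", "b", "c"], ["d", "b"]]).edges)
```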


Author(s): Nevena Lazic, Amarnag Subramanya, Michael Ringgaard, Fernando Pereira

We present Plato, a probabilistic model for entity resolution that includes a novel approach for handling noisy or uninformative features and supplements labeled training data derived from Wikipedia with a very large unlabeled text corpus. Training and inference in the proposed model can easily be distributed across many servers, allowing it to scale to over 10⁷ entities. We evaluate Plato on three standard datasets for entity resolution. Our approach achieves the best results to date on TAC KBP 2011 and is highly competitive on both the CoNLL 2003 and TAC KBP 2012 datasets.
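
Plato's inference details are not given here, so the following is only a generic, naive-Bayes-style sketch of probabilistic candidate scoring for entity resolution, with smoothing as a crude stand-in for the paper's handling of noisy or uninformative features. All names, priors, and feature probabilities below are made up for illustration.

```python
# Illustrative only: score each candidate entity for a mention by combining a
# prior with per-feature likelihoods; this is not Plato's actual model,
# training procedure, or distributed inference.
import math
from collections import defaultdict

def score(candidate, mention_features, prior, feat_prob, smoothing=1e-6):
    # log P(entity) + sum_f log P(f | entity); smoothing keeps unseen (noisy or
    # uninformative) features from zeroing out a candidate.
    s = math.log(prior.get(candidate, smoothing))
    for f in mention_features:
        s += math.log(feat_prob[candidate].get(f, smoothing))
    return s

prior = {"Paris_(city)": 0.9, "Paris_Hilton": 0.1}
feat_prob = defaultdict(dict, {
    "Paris_(city)": {"france": 0.4, "capital": 0.3},
    "Paris_Hilton": {"actress": 0.5, "hotel": 0.2},
})
mention = ["france", "capital"]
print(max(prior, key=lambda c: score(c, mention, prior, feat_prob)))
```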


2021, Vol 9 (1), pp. 1270-1282
Author(s): Venkateswara Rao P, A.P Siva kumar

An emerging trend in technical research is to use user-generated data collected from community media to probe public opinion and scientific communication on employment and care issues. This study examines data collected from a question-and-answer social website as a distinct source for exploring the key factors that influence public preferences regarding technical knowledge and opinions. Using a web search engine, topic modeling, and regression modeling, the study quantifies the effect of an answer's textual and auxiliary features on the number of votes the answer receives. Compared with previous studies based on open estimates, the model results show that Quora users are more likely to talk only about technology. Existing methods such as CNNMF and NMF can fail when the keywords in a query do not match the text content of large documents containing the relevant questions, and they suffer from further limitations. Moreover, users are often not experts and submit ambiguous queries, which leads to mixed results and additional problems for existing methods. To address these issues, this article proposes a Hadoop-based, semantically aware distributed non-negative matrix factorization model (HDiSANNMF) for discovering topics in short texts. The model effectively incorporates the semantic correlations of word context, learning the semantic connections between words and their contexts without relying on the grammatical structure of the corpus. The study also reorganizes the main results and reviews modern distributed topic modeling techniques for handling data with a growing number of attributes, together with the time and space required to build the models. Finally, the article briefly describes the structure of public question-and-answer activity worldwide and tracks, in real time, the development of the main topics of housing and employment opportunities for next-generation technologies.
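
As a rough, single-machine illustration of the NMF side of the proposal, the sketch below runs plain non-negative matrix factorization over TF-IDF vectors with scikit-learn. The semantic word-context correlations and the Hadoop-based distribution that define HDiSANNMF are not reproduced, and the toy documents are invented.

```python
# A minimal, single-machine sketch of NMF-based topic discovery; HDiSANNMF's
# semantic augmentation and Hadoop distribution are not reproduced here.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "remote jobs and hiring trends in tech",
    "affordable housing near new tech campuses",
    "interview questions for machine learning roles",
]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
doc_topics = nmf.fit_transform(X)           # document-topic weights
terms = tfidf.get_feature_names_out()
for k, row in enumerate(nmf.components_):   # topic-term weights
    top = [terms[i] for i in row.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```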


2021, Vol 4
Author(s): David Gordon, Panayiotis Petousis, Henry Zheng, Davina Zamanzadeh, Alex A.T. Bui

We present a novel approach for imputing missing data that incorporates temporal information into bipartite graphs through an extension of graph representation learning. Missing data are abundant in several domains, particularly when observations are made over time. Most imputation methods make strong assumptions about the distribution of the data. While newer methods may relax some assumptions, they may not consider temporality. Moreover, when such methods are extended to handle time, they may not generalize without retraining. We propose using a joint bipartite graph approach to incorporate temporal sequence information. Specifically, the observation nodes and edges with temporal information are used in message passing to learn node and edge embeddings and to inform the imputation task. Our proposed method, temporal setting imputation using graph neural networks (TSI-GNN), captures sequence information that can then be used within an aggregation function of a graph neural network. To the best of our knowledge, this is the first effort to use a joint bipartite graph approach that captures sequence information to handle missing data. We use several benchmark datasets to test the performance of our method under a variety of conditions, comparing it to both classic and contemporary methods. We further provide insight into managing the size of the generated TSI-GNN model. Our analysis shows that incorporating temporal information into a bipartite graph improves the representation at 30% and 60% missing rates, specifically when using a nonlinear model for downstream prediction tasks on regularly sampled datasets, and is competitive with existing temporal methods under different scenarios.
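
The bipartite construction can be pictured with a small numeric sketch: observations and features become the two node sets, observed values become edges carrying a timestamp, and missing entries are filled by aggregating over each feature's edges. This is only a hand-rolled, single-step stand-in for TSI-GNN's learned message passing; the time weighting below is an arbitrary placeholder for a proper temporal encoding.

```python
# Hedged sketch: a bipartite observation/feature graph with time information on
# the edges, plus one round of weighted mean aggregation to fill missing values.
import numpy as np

# rows = observations over time, columns = features; NaN marks missing data
X = np.array([[1.0, 2.0, np.nan],
              [1.5, np.nan, 3.0],
              [np.nan, 2.5, 3.5]])
timestamps = np.array([0, 1, 2])

# Edges of the bipartite graph: (observation node, feature node, value, time).
edges = [(i, j, X[i, j], timestamps[i])
         for i in range(X.shape[0]) for j in range(X.shape[1])
         if not np.isnan(X[i, j])]

# One message-passing step: each feature node aggregates its observed values,
# weighting later time steps slightly higher (a stand-in for temporal encoding).
imputed = X.copy()
for j in range(X.shape[1]):
    vals = [(v, 1.0 + 0.1 * t) for (_, jj, v, t) in edges if jj == j]
    est = sum(v * w for v, w in vals) / sum(w for _, w in vals)
    imputed[np.isnan(X[:, j]), j] = est
print(imputed)
```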


2017, Vol 2017, pp. 1-9
Author(s): Qi Ding, Xiafu Peng, Xunyu Zhong, Xiaoqiang Hu

A novel approach to fault diagnosis for a class of nonlinear uncertain systems in triangular form is proposed in this paper. It is based on the extended state observer (ESO) of the active disturbance rejection controller and on linearization by dynamic compensation. First, an ESO is designed to jointly estimate the states and the combined effect of uncertainty, faults, and the nonlinear function of the nonlinear uncertain system; an estimate of the nonlinear function is then derived from the state estimates and the system model. Next, linearization by dynamic compensation is employed to linearize the system by cancelling the nonlinear function with its estimate. An observer-based residual generator is designed on the basis of the resulting linearized model for fault diagnosis, and a threshold technique is adopted to improve the robustness of the diagnosis. The method is practical and simple in construction and parameter tuning. We also describe the construction of the ESO and give a succinct proof of its convergence. Finally, a numerical example is presented to illustrate the validity of the proposed fault diagnosis scheme.
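
The core mechanism, estimating a lumped disturbance with an ESO and flagging a fault when the estimate crosses a threshold, can be illustrated numerically. The sketch below uses a plain double integrator, hand-picked observer gains, and an arbitrary threshold; it is not the observer design, convergence analysis, or example from the paper.

```python
# Minimal linear ESO demo: z3 tracks the lumped disturbance (here, an injected
# actuator fault), and a threshold on z3 serves as the fault-detection residual.
import numpy as np

dt, T = 0.001, 5.0
beta1, beta2, beta3 = 300.0, 3.0e4, 1.0e6   # gains from placing observer poles at -100
threshold = 0.5
detected = None

x1 = x2 = 0.0            # true plant states (a double integrator driven by u)
z1 = z2 = z3 = 0.0       # ESO states; z3 estimates the lumped disturbance/fault
for k in range(int(T / dt)):
    t = k * dt
    u = np.sin(t)
    fault = 1.0 if t > 3.0 else 0.0         # abrupt actuator fault injected at t = 3 s
    # plant: x1' = x2, x2' = u + fault (forward-Euler integration)
    x1, x2 = x1 + dt * x2, x2 + dt * (u + fault)
    # ESO driven by the output estimation error e = x1 - z1
    e = x1 - z1
    z1, z2, z3 = (z1 + dt * (z2 + beta1 * e),
                  z2 + dt * (z3 + u + beta2 * e),
                  z3 + dt * (beta3 * e))
    # residual check: the disturbance estimate exceeding the threshold flags a fault
    if detected is None and abs(z3) > threshold:
        detected = t
print(f"fault injected at 3.00 s, flagged at {detected:.2f} s; "
      f"final disturbance estimate {z3:.2f} (true magnitude 1.0)")
```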


2019, Vol 36 (1), pp. 112-121
Author(s): Cunliang Geng, Yong Jung, Nicolas Renaud, Vasant Honavar, Alexandre M J J Bonvin, ...

Motivation: Protein complexes play critical roles in many aspects of biological function. Three-dimensional (3D) structures of protein complexes are critical for gaining insights into the structural basis of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determination of 3D protein complex structures, computational docking has evolved as a valuable tool to predict 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge.
Results: Here we present iScore, a novel approach to scoring docked conformations that combines HADDOCK energy terms with a score obtained using a graph representation of the protein–protein interfaces and a measure of evolutionary conservation. It achieves a scoring performance competitive with, or superior to, that of state-of-the-art scoring functions on two independent datasets: (i) docking-software-specific models and (ii) the CAPRI score set generated by a wide variety of docking approaches (i.e. docking-software-non-specific). iScore ranks among the top scoring approaches on the CAPRI score set (13 targets) when compared with the 37 scoring groups in CAPRI. The results demonstrate the utility of combining evolutionary, topological, and energetic information for scoring docked conformations. This work represents the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes.
Availability and implementation: The iScore code is freely available from GitHub: https://github.com/DeepRank/iScore (DOI: 10.5281/zenodo.2630567). The docking models used are available from SBGrid: https://data.sbgrid.org/dataset/684.
Supplementary information: Supplementary data are available at Bioinformatics online.
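
As a toy illustration of the general idea of combining a graph-based interface score with energy terms, the sketch below z-normalizes a few per-model terms and ranks docked models by a weighted sum. The feature names, values, and weights are placeholders; iScore itself trains a graph-kernel model on interface graphs and combines it with HADDOCK energy terms rather than using a fixed formula like this one.

```python
# Hedged sketch of the final ranking step only: lower combined score = better.
import numpy as np

def combined_score(models, weights):
    # models: list of dicts with per-conformation terms (all "lower is better")
    keys = sorted(weights)
    mat = np.array([[m[k] for k in keys] for m in models], dtype=float)
    # z-normalize each term so the weighted sum is scale-free
    mat = (mat - mat.mean(axis=0)) / (mat.std(axis=0) + 1e-9)
    w = np.array([weights[k] for k in keys])
    return mat @ w

models = [
    {"graph_score": -1.2, "evdw": -35.0, "eelec": -120.0, "desolv": 4.0},
    {"graph_score": -0.3, "evdw": -20.0, "eelec": -60.0,  "desolv": 9.0},
]
weights = {"graph_score": 1.0, "evdw": 0.3, "eelec": 0.3, "desolv": 0.2}
scores = combined_score(models, weights)
print("best model index:", int(np.argmin(scores)))
```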


2019, Vol 8 (2S8), pp. 1366-1371

Topic modeling, such as LDA, is considered a useful tool for the statistical analysis of text document collections and other text-based data. Topic modeling has recently become an attractive research field due to its wide range of applications. However, traditional topic models such as LDA still have disadvantages, owing to the shortcomings of the bag-of-words (BOW) representation as well as their low performance on large text corpora. In this paper, we therefore present a novel topic model, called LDA-GOW, which combines a word co-occurrence model, also called the graph-of-words (GOW) model, with the traditional LDA topic discovery model. The LDA-GOW topic model not only extracts more informative topics from text but is also able to scale the topic discovery process to large text corpora. We compare our proposed model with the traditional LDA topic model on several standard datasets, including WebKB, Reuters-R8, and annotated scientific documents collected from the ACM digital library, to demonstrate its effectiveness. Across all experiments, our proposed LDA-GOW model achieves approximately 70.86% accuracy.
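
A crude way to see how a graph-of-words view can feed a topic model is sketched below: each document's tokens are augmented with co-occurrence pairs drawn from a sliding window before running standard LDA with scikit-learn. This is not the LDA-GOW model from the paper; the window size, pairing scheme, and toy documents are arbitrary choices.

```python
# Illustrative sketch: sliding-window word co-occurrence pairs (a simple
# graph-of-words view) appended to each document, then plain LDA on top.
from itertools import combinations
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def gow_tokens(text, window=3):
    words = text.lower().split()
    pairs = set()
    for i in range(len(words)):
        for a, b in combinations(words[i:i + window], 2):
            pairs.add("_".join(sorted((a, b))))   # undirected co-occurrence edge
    return " ".join(words + sorted(pairs))

docs = ["graph based topic models", "topic models for text mining",
        "mining frequent subgraphs in graph data"]
X = CountVectorizer().fit_transform([gow_tokens(d) for d in docs])
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))   # per-document topic mixtures
```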


2016, Vol 25 (01), pp. 1660002
Author(s): Guangbing Yang

Oft-decried information overload is a serious problem that negatively impacts the comprehension of information in the digital age. Text summarization is a helpful process that can be used to alleviate this problem. With the aim of enhancing the performance of multi-document summarization, this study proposes a novel approach that analyzes the multi-document summarization problem with a mixture model, consisting of a contextual topic model from a Bayesian hierarchical topic modeling family for selecting candidate summary sentences, and a machine learning regression model for generating the summary. By investigating hierarchical topics and their correlations with respect to the lexical co-occurrences of words, the proposed contextual topic model can determine the relevance of sentences more effectively, recognize latent topics, and arrange them hierarchically. The quantitative evaluation results from a practical application demonstrate that a system implementing this model can significantly improve the performance of summarization and make it comparable to state-of-the-art summarization systems.
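
The two-stage mixture idea, topic features for candidate selection followed by a regression model for scoring, can be sketched as follows. Plain LDA stands in for the paper's contextual hierarchical topic model, ridge regression for its regression component, and the sentences and relevance targets are invented toy data rather than DUC material.

```python
# Hedged sketch: topic-model features plus a regression model to score and
# select summary sentences; not the contextual topic model from the paper.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

sentences = ["the flood displaced thousands of residents",
             "officials announced an emergency relief fund",
             "local sports results were also reported",
             "relief supplies reached the flooded districts"]
relevance = np.array([0.9, 0.8, 0.1, 0.7])      # toy training targets

X = CountVectorizer().fit_transform(sentences)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
positions = np.arange(len(sentences)).reshape(-1, 1) / len(sentences)
features = np.hstack([topics, positions])       # topic mixture + position feature

reg = Ridge().fit(features, relevance)          # regression stage
scores = reg.predict(features)
summary = [sentences[i] for i in np.argsort(scores)[::-1][:2]]
print(summary)
```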

