Large-Scale Information Extraction from Emails with Data Constraints

Author(s):  
Rajeev Gupta ◽  
Ranganath Kondapally ◽  
Siddharth Guha
2002 ◽  
Vol 8 (2-3) ◽  
pp. 209-233 ◽  
Author(s):  
Olivier Ferret ◽  
Brigitte Grau

Topic analysis is important for many applications dealing with texts, such as text summarization or information extraction. However, it can be done with great precision only if it relies on structured knowledge, which is difficult to produce on a large scale. In this paper, we propose using bootstrapping to solve this problem: a first topic analysis based on a weakly structured source of knowledge, a collocation network, is used for learning explicit topic representations that then support a more precise and reliable topic analysis.
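A minimal sketch of the bootstrapping idea is given below, assuming a toy co-occurrence window, a simple cohesion score and an arbitrary threshold; these choices are purely illustrative and are not the parameters or representations used in the paper.

```python
# Sketch: bootstrap explicit topic representations from a collocation network.
# Window size, cohesion score and threshold are illustrative assumptions.
from collections import Counter, defaultdict
from itertools import combinations

def build_collocation_network(docs, window=10):
    """Count how often word pairs co-occur within a sliding window."""
    net = defaultdict(Counter)
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window]:
                if v != w:
                    net[w][v] += 1
                    net[v][w] += 1
    return net

def cohesion(segment, net):
    """Average collocation strength between the word pairs of a segment."""
    pairs = list(combinations(set(segment), 2))
    if not pairs:
        return 0.0
    return sum(net[a][b] for a, b in pairs) / len(pairs)

def bootstrap_topics(docs, net, threshold=2.0):
    """First pass: keep topically cohesive segments and turn their vocabularies
    into explicit topic representations (bags of words) for later, more precise
    analysis passes."""
    topics = []
    for tokens in docs:
        if cohesion(tokens, net) >= threshold:
            topics.append(Counter(tokens))
    return topics
```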


2013 ◽  
Vol 427-429 ◽  
pp. 2122-2125
Author(s):  
Wen Zhao Liu ◽  
Li Min Niu ◽  
Jun Jie Chen

Aiming at automatic corpus collection for large-scale lexicography, this paper describes a Web-based corpus collection technology. It then introduces a book-information corpus tool, explaining how search engine and information extraction techniques are used in the system.
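A rough sketch of such a Web-based collection pipeline is shown below; the SEARCH_URL endpoint, its response format and the extraction patterns are hypothetical placeholders, not the tool described in the paper.

```python
# Sketch: query a search engine, fetch result pages, extract book information.
# SEARCH_URL and the regexes are illustrative assumptions only.
import re
import requests

SEARCH_URL = "https://example.org/search"  # hypothetical search endpoint

def collect_pages(query, limit=20):
    resp = requests.get(SEARCH_URL, params={"q": query, "n": limit}, timeout=10)
    resp.raise_for_status()
    # assume the endpoint returns one result URL per line
    return resp.text.splitlines()[:limit]

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S | re.I)
ISBN_RE = re.compile(r"ISBN[-:\s]*([\d\-Xx]{10,17})")

def extract_book_info(url):
    html = requests.get(url, timeout=10).text
    title = TITLE_RE.search(html)
    isbn = ISBN_RE.search(html)
    return {"url": url,
            "title": title.group(1).strip() if title else None,
            "isbn": isbn.group(1) if isbn else None}
```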


2022 ◽  
Vol 18 (1) ◽  
pp. 0-0

Social media data have become an integral part of business data and should be integrated into the decisional process, enabling decisions based on information that better reflects the true state of the business in any field. However, social media data are unstructured and generated at very high frequency, exceeding the capacity of the data warehouse. In this work, we propose to extend the data warehousing process with a staging area whose core is a large-scale system implementing an information extraction process on the Storm and Hadoop frameworks, to better manage the volume and frequency of these data. For structured information extraction, mainly events, we combine techniques from NLP, linguistic rules and machine learning. Finally, we propose a data warehouse conceptual model for event modeling and its integration with the enterprise data warehouse through an intermediate table called a Bridge table. For application and experiments, we focus on extracting drug abuse events from Twitter data and modeling them in the Event Data Warehouse.
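The combination of linguistic rules with machine learning for drug-abuse event extraction could look roughly like the sketch below; the lexicon, trigger pattern and classifier choice are assumptions for illustration, not the paper's actual resources or models.

```python
# Sketch: rule-based candidate detection followed by an ML filter for
# drug-abuse event extraction from tweets. Lexicon, patterns and training
# data are illustrative placeholders.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

DRUG_LEXICON = {"heroin", "cocaine", "fentanyl", "meth"}            # assumed
EVENT_PATTERN = re.compile(r"\b(overdos\w+|relaps\w+|arrest\w+)\b", re.I)

def rule_candidates(tweets):
    """Linguistic rules: keep tweets mentioning a drug and an event trigger."""
    for text in tweets:
        tokens = set(text.lower().split())
        trigger = EVENT_PATTERN.search(text)
        if tokens & DRUG_LEXICON and trigger:
            yield text, trigger.group(1)

def train_event_filter(labeled_texts, labels):
    """ML stage: classify candidate tweets as true drug-abuse events or noise."""
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(labeled_texts, labels)
    return clf
```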


Author(s):  
Bonan Min ◽  
Shuming Shi ◽  
Ralph Grishman ◽  
Chin-Yew Lin

The Web brings an open-ended set of semantic relations. Discovering the significant types is very challenging. Unsupervised algorithms have been developed to extract relations from a corpus without knowing the relation types in advance, but most rely on tagging arguments of predefined types. One recently reported system is able to jointly extract relations and their argument semantic classes, taking a set of relation instances extracted by an open IE (Information Extraction) algorithm as input. However, it cannot handle polysemy of relation phrases and fails to group many similar (“synonymous”) relation instances because of the sparseness of features. In this paper, the authors present a novel unsupervised algorithm that provides a more general treatment of the polysemy and synonymy problems. The algorithm incorporates various knowledge sources which they will show to be very effective for unsupervised relation extraction. Moreover, it explicitly disambiguates polysemous relation phrases and groups synonymous ones. While maintaining approximately the same precision, the algorithm achieves significant improvement on recall compared to the previous method. It is also very efficient. Experiments on a real-world dataset show that it can handle 14.7 million relation instances and extract a very large set of relations from the Web.
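The core idea of splitting polysemous relation phrases into senses and then grouping synonymous senses can be sketched as follows, using argument semantic classes as features; the clustering method, feature design and similarity threshold are illustrative assumptions, not the authors' algorithm.

```python
# Sketch: cluster each relation phrase's instances into senses (polysemy), then
# merge sense clusters of different phrases with similar centroids (synonymy).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sense_clusters(phrase_instances, n_senses=2):
    """phrase_instances: {relation_phrase: [(arg1_class, arg2_class), ...]}.
    Returns the owning phrase and centroid vector of every sense cluster."""
    vec = DictVectorizer()
    vec.fit([{"a1": a1, "a2": a2}
             for pairs in phrase_instances.values() for a1, a2 in pairs])
    owners, centroids = [], []
    for phrase, pairs in phrase_instances.items():
        X = vec.transform([{"a1": a1, "a2": a2} for a1, a2 in pairs]).toarray()
        k = min(n_senses, len(pairs))
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        for c in range(k):
            owners.append(phrase)
            centroids.append(X[labels == c].mean(axis=0))
    return owners, np.vstack(centroids)

def group_synonyms(owners, centroids, threshold=0.8):
    """Pair up sense clusters of different phrases whose centroids are similar."""
    sims = cosine_similarity(centroids)
    return [(owners[i], owners[j])
            for i in range(len(owners)) for j in range(i + 1, len(owners))
            if owners[i] != owners[j] and sims[i, j] >= threshold]
```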


2009 ◽  
Vol 50 (4) ◽  
Author(s):  
Luis Sarmento

Abstract In this paper we will present a simple, yet effective, method for extracting terminology from technical text. The method is based on the observation that for technical domains it is much simpler to describe what a valid terminological unit cannot be than what it can possibly be. Our method relies on a set of filters that exclude multi-word units according to simple rules regarding their context and internal lexical structure, and it does not require any special pre-processing such as POS tagging. Rules were hand-coded in a simple incremental process and may be ported to several languages with little effort. Additionally, the method is able to process more than two million words per minute on a standard computer. Although the method was originally intended for semiautomatic terminological extraction, we believe that it can also be applied in fully automated procedures, making it appropriate for large-scale information extraction. We will start by explaining our main motivation for building this method and we will describe its role in a larger framework, the Corpógrafo. We will then present the process of building the current method, from the first very simple approaches to the current version, pointing out the problems encountered at each step. We will then present results of applying the current version of the extraction method to specific domain corpora in English. Finally, we will present future plans and explain how we are currently in the process of building a small semantic lexicon for helping future large-scale information extraction procedures.
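A toy version of such an exclusion-filter extractor might look as follows; the stop list and rules shown are a small illustrative subset, not the hand-coded filters of the Corpógrafo.

```python
# Sketch: generate candidate multi-word units and exclude those violating
# simple rules on their internal lexical structure; no POS tagging required.
import re
from collections import Counter

STOP_EDGE = {"the", "a", "an", "of", "and", "or", "in", "to", "for", "with"}
BAD_TOKEN = re.compile(r"^[\d\W]+$")          # tokens of digits/punctuation only

def is_excluded(candidate):
    tokens = candidate.split()
    if tokens[0] in STOP_EDGE or tokens[-1] in STOP_EDGE:
        return True                           # terms rarely start/end with function words
    if any(BAD_TOKEN.match(t) for t in tokens):
        return True
    return False

def extract_terms(text, n_min=2, n_max=4, min_freq=2):
    words = text.lower().split()
    counts = Counter(" ".join(words[i:i + n])
                     for n in range(n_min, n_max + 1)
                     for i in range(len(words) - n + 1))
    return [c for c, f in counts.most_common()
            if f >= min_freq and not is_excluded(c)]
```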


2022 ◽  
Vol 8 ◽  
pp. e835
Author(s):  
David Schindler ◽  
Felix Bensmann ◽  
Stefan Dietze ◽  
Frank Krüger

Science across all disciplines has become increasingly data-driven, leading to additional needs with respect to software for collecting, processing and analysing data. Thus, transparency about software used as part of the scientific process is crucial to understanding the provenance of individual research data and insights, is a prerequisite for reproducibility and can enable macro-analysis of the evolution of scientific methods over time. However, missing rigor in software citation practices renders the automated detection and disambiguation of software mentions a challenging problem. In this work, we provide a large-scale analysis of software usage and citation practices facilitated through an unprecedented knowledge graph of software mentions and affiliated metadata, generated through supervised information extraction models trained on a unique gold standard corpus and applied to more than 3 million scientific articles. Our information extraction approach distinguishes different types of software and mentions, disambiguates mentions and significantly outperforms the state of the art, leading to the most comprehensive corpus of 11.8 M software mentions, described through a knowledge graph consisting of more than 300 M triples. Our analysis provides insights into the evolution of software usage and citation patterns across various fields, ranks of journals, and impact of publications. While, to the best of our knowledge, this is the most comprehensive analysis of software use and citation to date, all data and models are shared publicly to facilitate further research into the scientific use and citation of software.
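As a hedged illustration of how detected software mentions can be turned into knowledge-graph triples, consider the sketch below; the dictionary-lookup detector stands in for the paper's supervised extraction models, and the entity URIs are invented.

```python
# Sketch: turn software mentions found in article sentences into simple
# subject-predicate-object triples. The detector and URIs are illustrative.
import re

KNOWN_SOFTWARE = {"spss": "SPSS", "stata": "Stata", "tensorflow": "TensorFlow"}  # assumed
VERSION_RE = re.compile(r"\b(?:v(?:ersion)?\s*)?(\d+(?:\.\d+)+)\b", re.I)

def extract_mentions(sentence):
    """Return (canonical_name, version_or_None) pairs found in a sentence."""
    found = []
    for token in re.findall(r"[A-Za-z][\w\-]*", sentence):
        name = KNOWN_SOFTWARE.get(token.lower())
        if name:
            version = VERSION_RE.search(sentence)
            found.append((name, version.group(1) if version else None))
    return found

def to_triples(article_id, sentence):
    """Emit triples linking an article to the software it mentions."""
    triples = []
    for name, version in extract_mentions(sentence):
        sw = f"ex:software/{name}"
        triples.append((f"ex:article/{article_id}", "ex:mentions", sw))
        if version:
            triples.append((sw, "ex:hasVersion", version))
    return triples
```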


Author(s):  
B. Xing ◽  
J. Li ◽  
H. Zhu ◽  
P. Wei ◽  
Y. Zhao

Green tide, a kind of marine natural disaster, has appeared every year along the Qingdao coast since the large-scale bloom in 2008, bringing great losses to the region. It is therefore of great value to obtain real-time, dynamic information about green tide distribution. In this study, optical and microwave remote sensing methods are employed for green tide monitoring. A specific remote sensing data processing flow and a green tide information extraction algorithm are designed according to the different characteristics of the optical and microwave data. For extracting the spatial distribution of the green tide, an automatic algorithm for delineating distribution boundaries is designed based on the principle of mathematical morphology dilation/erosion, and key issues in information extraction, including the division of green tide regions, the derivation of basic distributions, the limitation of distribution boundaries, and the elimination of islands, are solved. Green tide distribution boundaries are thus generated automatically from the results of remote sensing information extraction. Finally, a green tide monitoring system is built through IDL/GIS secondary development in an integrated RS and GIS environment, achieving the integration of RS monitoring and information extraction.
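The dilation/erosion boundary-extraction step can be illustrated with the following sketch over a binary green tide mask; the cleanup steps, structuring sizes and area threshold are assumptions, not the settings used in the study.

```python
# Sketch: clean a binary green-tide mask (1 = algae pixel), drop tiny regions,
# and take the distribution boundary as dilation minus erosion.
import numpy as np
from scipy import ndimage

def green_tide_boundary(mask, min_region_px=50):
    mask = ndimage.binary_fill_holes(mask)               # remove interior holes
    labels, n = ndimage.label(mask)                      # connected regions
    sizes = ndimage.sum(mask, labels, range(1, n + 1))   # region sizes in pixels
    keep = np.isin(labels, 1 + np.flatnonzero(sizes >= min_region_px))
    dilated = ndimage.binary_dilation(keep, iterations=1)
    eroded = ndimage.binary_erosion(keep, iterations=1)
    return dilated & ~eroded                             # thin boundary band
```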


Author(s):  
Martin Atzmueller ◽  
Stephanie Beer ◽  
Frank Puppe

For large-scale data mining, utilizing data from ubiquitous and mixed-structured data sources, the extraction and integration into a comprehensive data-warehouse is usually of prime importance. Then, appropriate methods for validation and potential refinement are essential. This chapter describes an approach for integrating data mining, information extraction, and validation with collaborative knowledge management and capture in order to improve the data acquisition processes. For collaboration, a semantic wiki-enabled system for knowledge and experience management is presented. The proposed approach applies information extraction techniques together with pattern mining methods for initial data validation and is applicable for heterogeneous sources, i.e., capable of integrating structured and unstructured data. The methods are integrated into an incremental process providing for continuous validation options. The approach has been developed in a health informatics context: The results of a medical application demonstrate that pattern mining and the applied rule-based information extraction methods are well suited for discovering, extracting and validating clinically relevant knowledge, as well as the applicability of the knowledge capture approach. The chapter presents experiences using a case-study in the medical domain of sonography.
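A minimal sketch of pairing rule-based extraction with pattern-based validation is given below; the regular expressions and plausibility ranges are illustrative stand-ins for the clinical knowledge used in the sonography case study.

```python
# Sketch: rule-based extraction of numeric findings from free-text reports,
# validated against plausibility patterns (e.g. mined from structured data).
import re

EXTRACTION_RULES = {
    "liver_size_cm": re.compile(r"liver\s+(?:size|span)[:\s]+(\d+(?:\.\d+)?)\s*cm", re.I),
    "spleen_size_cm": re.compile(r"spleen\s+(?:size|span)[:\s]+(\d+(?:\.\d+)?)\s*cm", re.I),
}

VALIDATION_PATTERNS = {
    # attribute -> plausible value range (illustrative, not clinical reference values)
    "liver_size_cm": (8.0, 20.0),
    "spleen_size_cm": (5.0, 15.0),
}

def extract(report_text):
    values = {}
    for attr, pattern in EXTRACTION_RULES.items():
        m = pattern.search(report_text)
        if m:
            values[attr] = float(m.group(1))
    return values

def validate(values):
    """Flag extracted values falling outside the mined plausibility patterns."""
    issues = []
    for attr, value in values.items():
        low, high = VALIDATION_PATTERNS[attr]
        if not (low <= value <= high):
            issues.append((attr, value))
    return issues
```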

