content extraction Latest Research Papers

An Incremental Acquisition Method for Web Forensics

International Journal of Digital Crime and Forensics ◽

10.4018/ijdcf.2021110116 ◽

2021 ◽

Vol 13 (6) ◽

pp. 1-13

Author(s):

Guangxuan Chen ◽

Guangxiao Chen ◽

Lei Zhang ◽

Qiang Liu

Keyword(s):

Real Time ◽

Digital Forensics ◽

Recall Rate ◽

Web Pages ◽

Web Page ◽

Content Extraction ◽

Data Redundancy ◽

Repeated Acquisition ◽

Low Efficiency ◽

Acquisition Method

To address repeated acquisition, data redundancy, and low efficiency in website forensics, this paper proposes an incremental acquisition method oriented to dynamic websites. The method achieves incremental collection on dynamically updated websites by acquiring and parsing web pages, deduplicating URLs, denoising pages, extracting page content, and hashing. Experiments show that the algorithm has relatively high acquisition precision and recall, and can be combined with other data to perform effective digital forensics on dynamically updated real-time websites.
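The URL deduplication and hashing steps named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`normalize_url`, `content_digest`, `IncrementalStore`), the normalization rules, and the choice of SHA-256 are all assumptions made for illustration, and the page text is assumed to be already denoised and extracted.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings deduplicate.

    Assumed rules: lowercase scheme and host, drop the fragment,
    treat an empty path as "/".
    """
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def content_digest(text):
    """Hash extracted page text, collapsing runs of whitespace first."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IncrementalStore:
    """Remembers the last content digest seen for each normalized URL."""

    def __init__(self):
        self._digests = {}

    def should_acquire(self, url, extracted_text):
        """Return True only if the page is new or its content changed."""
        key = normalize_url(url)
        digest = content_digest(extracted_text)
        if self._digests.get(key) == digest:
            return False  # unchanged since the last visit, skip it
        self._digests[key] = digest
        return True
```

A crawler would call `should_acquire` after extraction and store the page only when it returns `True`, which is what avoids repeated acquisition of unchanged pages.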

A secure and efficient certificateless content extraction signature with privacy protection

PLoS ONE ◽

10.1371/journal.pone.0258907 ◽

2021 ◽

Vol 16 (11) ◽

pp. e0258907

Author(s):

Can Zhao ◽

Jiabing Liu ◽

Fuyong Zheng ◽

Dejun Wang ◽

Bo Meng

Keyword(s):

Privacy Protection ◽

Formal Analysis ◽

Discrete Logarithm ◽

Random Oracle Model ◽

Random Oracle ◽

Bilinear Pairing ◽

Security And Privacy ◽

Analysis Tool ◽

Content Extraction ◽

Key Aspects

Efficiency and privacy are the key aspects of content extraction signatures. In this study, we propose a Secure and Efficient and Certificateless Content Extraction Signature with Privacy Protection (SECCESPP), in which elliptic curve scalar multiplication replaces the inefficient bilinear pairing of certificateless public key cryptosystems, and the signcryption idea is borrowed to implement privacy protection for signed messages. The correctness of the SECCESPP scheme is demonstrated by the consistency of the message and the accuracy of the equation. The security of the SECCESPP scheme is demonstrated based on the elliptic curve discrete logarithm problem in the random oracle model, and its privacy is formally analyzed with the formal analysis tool ProVerif. Theoretical and experimental analysis show that the SECCESPP scheme is more efficient than other schemes.
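For readers unfamiliar with the primitive that replaces bilinear pairing here, the following is a minimal sketch of elliptic curve scalar multiplication using double-and-add. It is not the SECCESPP scheme: the curve y^2 = x^3 + 2x + 3 over the integers modulo 97 and the base point (3, 6) are toy values chosen only for illustration, and the code is neither constant-time nor suitable for real cryptography. It requires Python 3.8 or later for modular inverses via `pow`.

```python
# Toy curve y^2 = x^3 + A*x + B over the integers mod P_MOD (illustrative only).
P_MOD = 97
A = 2
B = 3
INFINITY = None  # the group identity (point at infinity)


def on_curve(point):
    """Check that a point satisfies the curve equation."""
    if point is INFINITY:
        return True
    x, y = point
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def add(p, q):
    """Add two curve points using the standard chord-and-tangent rules."""
    if p is INFINITY:
        return q
    if q is INFINITY:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INFINITY  # p + (-p); also covers doubling a point with y == 0
    if p == q:
        # Tangent slope for point doubling.
        slope = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        # Chord slope for two distinct points.
        slope = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, point):
    """Compute k * point with double-and-add (k is a non-negative integer)."""
    result = INFINITY
    addend = point
    while k > 0:
        if k & 1:
            result = add(result, addend)
        addend = add(addend, addend)
        k >>= 1
    return result
```

Double-and-add needs a number of point operations proportional to the bit length of `k`, which is the kind of cost the abstract contrasts with pairing evaluation.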

Evaluation of Deep Learning Techniques for Content Extraction in Spanish Colonial Notary Records

10.1145/3475720.3484443 ◽

2021 ◽

Author(s):

Nouf Alrasheed ◽

Shivika Prasanna ◽

Ryan Rowland ◽

Praveen Rao ◽

Viviana Grieco ◽

...

Keyword(s):

Deep Learning ◽

Spanish Colonial ◽

Content Extraction ◽

Learning Techniques

Feature Extraction Model with Group-Based Classifier for Content Extraction from Video Data

Revue d'Intelligence Artificielle ◽

10.18280/ria.350407 ◽

2021 ◽

Vol 35 (4) ◽

pp. 325-330

Author(s):

Gowrisankar Kalakoti ◽

Prabakaran G

Keyword(s):

Selection Procedure ◽

Video Data ◽

Video Content ◽

Content Extraction ◽

Object Locations ◽

Core Issue ◽

Video Content Extraction ◽

Information Image ◽

Extraction Model ◽

Subsequent Improvement

In present-day computer vision, locating multiple objects in video is a critical task. Recognising and distinguishing the various elements of a video quickly and reliably is essential for interacting with the objects in a scene. The core issue is that, in theory, every part of a video's content must be scanned for elements at many different scales to ensure that no significant element is missed. Classifying the content of a given region takes time and effort, and both the time and the computation that can be spent on classification are limited. The proposed method applies two techniques for accelerating the standard detector and demonstrates their effect on both detection accuracy and speed. The main enhancement of our group-based classifier accelerates the grouping of sub-features by framing the problem as a selection procedure over consecutive features. The subsequent improvement provides better multiscale features to distinguish objects of all sizes without rescaling the input image from a video. A video is a sequence of successive images separated by a constant time interval, so it can provide more information about its contents as the scene changes over time. Consequently, handling such content and its features manually is impractical. In the proposed work, a Group-based Video Content Extraction Classifier (GbCCE) extracts content from a video by extracting relevant features using a group-based classifier. The proposed method is distinct from conventional approaches, and the findings indicate that it produces better output.

VisualTextRank: Unsupervised Graph-based Content Extraction for Automating Ad Text to Image Search

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining ◽

10.1145/3447548.3467126 ◽

2021 ◽

Author(s):

Shaunak Mishra ◽

Mikhail Kuznetsov ◽

Gaurav Srivastava ◽

Maxim Sviridenko

Keyword(s):

Image Search ◽

Content Extraction

Page-Level Main Content Extraction From Heterogeneous Webpages

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3451168 ◽

2021 ◽

Vol 15 (6) ◽

pp. 1-105

Author(s):

Julián Alarte ◽

Josep Silva

Keyword(s):

Data Mining ◽

Real Time ◽

Computing Time ◽

Content Adaptation ◽

Object Model ◽

Storage Space ◽

Extraction Techniques ◽

Content Extraction ◽

Noisy Information ◽

A New Technique

The main content of a webpage is often surrounded by other boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. In addition, detecting and extracting the main content is useful in different areas, such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a Document Object Model-based page-level technique, so it only needs to load a single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique using a suite of real heterogeneous benchmarks, producing very good results compared with other well-known content extraction techniques.
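To make the idea of a DOM-based, page-level extractor concrete, here is a minimal sketch that needs only one page as input. It does not reproduce the technique of the paper above. It substitutes a simple, widely used heuristic: pick the element whose direct `<p>` children hold the most text that is not inside links. It also handles text only, whereas the paper covers images and animations as well. All class and function names are assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """A DOM element whose children are text strings or other Nodes."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def text(self):
        return " ".join(
            c if isinstance(c, str) else c.text() for c in self.children
        )


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from a single HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get an end tag
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS and data.strip():
            self.current.children.append(data)


def iter_nodes(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from iter_nodes(child)


def paragraph_score(node):
    """Characters of text in direct <p> children, minus text inside links."""
    score = 0
    for child in node.children:
        if isinstance(child, Node) and child.tag == "p":
            link_chars = sum(
                len(n.text()) for n in iter_nodes(child) if n.tag == "a"
            )
            score += len(child.text()) - link_chars
    return score


def extract_main_content(html):
    """Return the text of the best-scoring element, or "" if none qualifies."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_nodes(builder.root), key=paragraph_score)
    if paragraph_score(best) == 0:
        return ""
    return " ".join(best.text().split())
```

Because the score depends only on nodes of the page being processed, no second page from the same site is needed, which is the property that distinguishes page-level techniques from template-comparison approaches.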

Survey Paper on Web Content Extraction & Classification

2021 6th International Conference for Convergence in Technology (I2CT) ◽

10.1109/i2ct51068.2021.9417947 ◽

2021 ◽

Author(s):

Dipali Shete ◽

Sachin Bojewar ◽

Ankit Sanghvi

Keyword(s):

Web Content ◽

Content Extraction ◽

Survey Paper

Lecture Information Service Based on Multiple Features Fusion

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194021400076 ◽

2021 ◽

Vol 31 (04) ◽

pp. 545-562

Author(s):

Zhongguo Yang ◽

Mingzhu Zhang ◽

Zhongmei Zhang ◽

Han Li ◽

Chen Liu ◽

...

Keyword(s):

Information Service ◽

Extraction Methods ◽

Visual Similarity ◽

Multiple Features ◽

Content Extraction ◽

Features Fusion ◽

Open Information Extraction ◽

Extraction Algorithm ◽

Effectiveness And Efficiency ◽

The University

Information services remain a hot topic, especially now that the Web is accessible anywhere. At universities, lecture information is very important for students and teachers who want to take part in academic meetings. Therefore, lecture news extraction is an important task. Many open information extraction methods have been proposed, but because websites are highly heterogeneous, the task remains challenging. In this paper, we propose a method based on fusing multiple features to locate lecture news on a university website. These features include the link relationship between a parent webpage and its child webpages, visual similarity, and the semantics of webpages. Additionally, this paper provides an information service based on a main content extraction algorithm for extracting the lecture information. Stable and invariant features enable the proposed method to adapt to various kinds of campus websites. Experiments conducted on 50 websites show the effectiveness and efficiency of the provided service.

ChexMix: A Literature Content Extraction Tool for Bioentities

10.1101/2021.03.09.434525 ◽

2021 ◽

Author(s):

Heejung Yang ◽

Beomjun Park ◽

Jinyoung Park ◽

Jiho Lee ◽

Hyeon Seok Jang ◽

...

Keyword(s):

Native Plants ◽

Biomedical Literature ◽

Medical Subject Headings ◽

Unique Identifier ◽

Content Extraction ◽

As Species ◽

Species Analysis ◽

Mining Tool ◽

Taxonomic Tree ◽

Text Mining Tool

Biomedical databases grow by more than a thousand new publications every day. The large volume of biomedical literature that is being published at an unprecedented rate hinders the discovery of relevant knowledge from keywords of interest to gather new insights and form hypotheses. A text-mining tool, PubTator, helps to automatically annotate bioentities, such as species, chemicals, genes, and diseases, from PubMed abstracts and full-text articles. However, the manual re-organization and analysis of bioentities is a non-trivial and highly time-consuming task. ChexMix was designed to extract the unique identifiers of bioentities from query results. Herein, ChexMix was used to construct a taxonomic tree with allied species among Korean native plants and to extract the medical subject headings unique identifier of the bioentities, which co-occurred with the keywords in the same literature. ChexMix discovered the allied species related to a keyword of interest and experimentally proved its usefulness for multi-species analysis.
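The co-occurrence step described in this abstract, collecting the identifiers of bioentities that appear in the same document as a keyword, can be sketched as follows. This is not ChexMix's code or data model: the record layout, the function name `cooccurring_entities`, and the identifiers (placeholders, not real MeSH or taxonomy IDs) are all hypothetical.

```python
from collections import Counter

# Hypothetical annotated records, standing in for annotator output.
# The identifiers are placeholders, not real MeSH or taxonomy IDs.
DOCS = [
    {"text": "Extract of plant A reduces inflammation.",
     "entities": {"SPECIES:1", "DISEASE:7"}},
    {"text": "Plant B grows in coastal regions.",
     "entities": {"SPECIES:2"}},
    {"text": "Compound X from plant A targets inflammation pathways.",
     "entities": {"SPECIES:1", "CHEMICAL:4"}},
]


def cooccurring_entities(docs, keyword):
    """Count entity identifiers in documents whose text mentions the keyword."""
    counts = Counter()
    needle = keyword.lower()
    for doc in docs:
        if needle in doc["text"].lower():
            # Each identifier counts once per matching document.
            counts.update(doc["entities"])
    return counts


counts = cooccurring_entities(DOCS, "inflammation")
```

Identifiers with the highest counts are the ones most often mentioned alongside the keyword, which is the kind of signal that could then be mapped onto a taxonomic tree.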

Cyanidin 3-O-galactoside: A Natural Compound with Multiple Health Benefits

International Journal of Molecular Sciences ◽

10.3390/ijms22052261 ◽

2021 ◽

Vol 22 (5) ◽

pp. 2261

Author(s):

Zhongxin Liang ◽

Hongrui Liang ◽

Yizhan Guo ◽

Dong Yang

Keyword(s):

Food Additives ◽

Antioxidant Properties ◽

Extraction Methods ◽

Uridine Diphosphate ◽

Content Extraction ◽

Physiological Functions ◽

Wide Range ◽

Potential Applications ◽

Multiple Health ◽

Uridine Diphosphate Galactose

Cyanidin 3-O-galactoside (Cy3Gal) is one of the most widespread anthocyanins and positively impacts the health of animals and humans. Since it is available from a wide range of natural sources, such as fruits (apples and berries in particular), substantial studies have investigated its biosynthesis, chemical stability, natural occurrence and content, extraction methods, physiological functions, and potential applications. In this review, we present the previous studies on these aspects of Cy3Gal. In conclusion, Cy3Gal shares a common biosynthesis pathway and analogous stability with other anthocyanins. Galactosyltransferase utilizing uridine diphosphate galactose (UDP-galactose) and cyanidin as substrates is unique to Cy3Gal biosynthesis. Extraction employing different methods reveals chokeberry as the most practical natural source for mass production of this compound. The antioxidant properties and other health effects, including anti-inflammatory, anticancer, antidiabetic, anti-toxicity, cardiovascular, and nervous protective capacities, are highlighted in purified Cy3Gal and in its combination with other polyphenols. These unique properties of Cy3Gal are discussed and compared with those of other anthocyanins of related structure for an in-depth evaluation of its potential value as a food additive or health supplement. Emphasis is laid on the description of its physiological functions confirmed via various approaches.
