ANALYSIS OF AUTOMATED MODERN WEB CRAWLING AND TESTING TOOLS AND THEIR POSSIBLE EMPLOYMENT FOR INFORMATION EXTRACTION

2012 ◽  
Vol 4 (1) ◽  
pp. 31-34
Author(s):  
Tomas Grigalis ◽  
Leonardas Marozas ◽  
Lukas Radvilavičius

The World Wide Web has become an enormous repository of data. Extracting, integrating and reusing this kind of data has a wide range of applications, including meta-searching, comparison shopping, business intelligence tools and security analysis of information in websites. However, reaching information in modern Web 2.0 pages is a difficult task: the HTML tree is often modified dynamically by JavaScript code, new data are added through asynchronous requests to the web server, and elements are positioned with the help of cascading style sheets. The article reviews automated web testing tools for information extraction tasks.

Santrauka (translated from Lithuanian): With the Internet having become an enormous database of information, the problem of information collection arises: how to choose, from an extremely large number of information sources, one that can provide the user with suitable, relevant information of interest. It is also important to be able to analyze modern websites from a security standpoint, for example to search them for hidden embedded malicious code, which can only be done after collecting information from the website. Moreover, the new WEB 2.0 generation of the Internet forces changes to conventional information collection methods, because Flash, JavaScript, Ajax and other new technologies prevent information from being collected merely by analyzing the usual HTML code. This article analyzes tools intended for the automated browsing and testing of complex modern websites that can be used for information collection.
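To illustrate the idea, browser-automation testing tools such as Selenium can render a JavaScript-heavy page before extraction, where parsing the raw HTML response would miss dynamically inserted content. A minimal sketch, assuming Selenium with a local ChromeDriver; the URL and CSS selector are hypothetical placeholders, not taken from the article:

```python
# A minimal sketch of using a browser-automation testing tool (Selenium) for
# information extraction from a JavaScript-rendered page. The URL and CSS
# selector are hypothetical placeholders; the article does not prescribe them.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run without a visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/products")  # placeholder URL
    # Wait until content inserted by asynchronous JavaScript is present,
    # instead of parsing the initial (incomplete) HTML response.
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-name"))
    )
    names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".product-name")]
    print(names)
finally:
    driver.quit()
```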

2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the web grows significantly. Web information comes in several forms: structured, semi-structured and unstructured. The majority of web information is presented in web pages, and that information is semi-structured; yet the information required for a given context is scattered across different web documents. It is difficult to analyze large volumes of semi-structured information presented in web pages and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various web sources and to analyze the extracted data effectively. The framework is applicable to any application domain; manufacturing, sales, tourism and e-learning are a few example applications. The framework was implemented and tested for effectiveness, and the results are promising.
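As a rough illustration of how the three stages named above could fit together, here is a minimal Python sketch: crawl pages, extract semi-structured records, and run a simple analysis over them. The URL, the assumed table markup, and the summary statistics are illustrative assumptions; the paper does not specify an implementation.

```python
# Minimal sketch of a crawl -> extract -> analyze pipeline, illustrating how
# the framework's stages could fit together. URL and markup are hypothetical.
import statistics
import requests
from bs4 import BeautifulSoup

def crawl(urls):
    """Fetch raw HTML for each page (the crawling stage)."""
    return {url: requests.get(url, timeout=10).text for url in urls}

def extract(html):
    """Pull semi-structured records out of a page (the extraction stage)."""
    soup = BeautifulSoup(html, "html.parser")
    # Assume each record is a table row: <tr><td>item</td><td>price</td></tr>.
    records = []
    for row in soup.select("table.listing tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:
            records.append({"item": cells[0], "price": float(cells[1])})
    return records

def analyze(records):
    """A stand-in for the data-mining stage: summarize extracted prices."""
    prices = [r["price"] for r in records]
    return {"count": len(prices), "mean_price": statistics.mean(prices)} if prices else {}

pages = crawl(["https://example.com/listing"])  # placeholder source
all_records = [r for html in pages.values() for r in extract(html)]
print(analyze(all_records))
```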


2021 ◽  
Vol 5 (EICS) ◽  
pp. 1-23
Author(s):  
Markku Laine ◽  
Yu Zhang ◽  
Simo Santala ◽  
Jussi P. P. Jokinen ◽  
Antti Oulasvirta

Over the past decade, responsive web design (RWD) has become the de facto standard for adapting web pages to the wide range of devices used for browsing. While RWD has improved the usability of web pages, it is not without drawbacks and limitations: designers and developers must manually design web layouts for multiple screen sizes and implement the associated adaptation rules, and its "one responsive design fits all" approach lacks support for personalization. This paper presents a novel approach for the automated generation of responsive and personalized web layouts. Given an existing web page design and preferences related to design objectives, our integer programming-based optimizer generates a consistent set of web designs. Where relevant data is available, these can be further personalized automatically for the user and the browsing device. The paper also presents techniques for runtime adaptation of the generated designs into a fully responsive grid layout for web browsing. Results from our ratings-based online studies with end users (N = 86) and designers (N = 64) show that the proposed approach can automatically create high-quality responsive web layouts for a variety of real-world websites.
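For flavor, here is a toy version of a grid-layout optimizer expressed as an integer program (using the PuLP library). The 12-column grid, the per-element width preferences, and the objective of minimizing deviation from those preferences are illustrative assumptions, not the paper's actual formulation.

```python
# Toy sketch of an integer-programming layout optimizer, loosely inspired by
# the approach described above. The 12-column grid, element preferences, and
# objective are illustrative assumptions, not the paper's formulation.
# Requires: pip install pulp
import pulp

GRID_COLS = 12
# Hypothetical elements with preferred widths (in grid columns), grouped by row.
rows = [
    [("nav", 12)],
    [("sidebar", 3), ("content", 9)],
    [("related", 5), ("comments", 7)],
]

prob = pulp.LpProblem("grid_layout", pulp.LpMinimize)
widths, deviations = {}, {}
for row in rows:
    for name, pref in row:
        w = pulp.LpVariable(f"w_{name}", lowBound=1, upBound=GRID_COLS, cat="Integer")
        d = pulp.LpVariable(f"d_{name}", lowBound=0)
        # Linearize |w - pref| with two inequalities.
        prob += d >= w - pref
        prob += d >= pref - w
        widths[name], deviations[name] = w, d
    # Elements sharing a row must exactly fill the grid.
    prob += pulp.lpSum(widths[name] for name, _ in row) == GRID_COLS

prob += pulp.lpSum(deviations.values())  # minimize total deviation from preferences
prob.solve(pulp.PULP_CBC_CMD(msg=False))
for name, w in widths.items():
    print(name, int(w.value()))
```

A real formulation would add constraints for element ordering, minimum sizes per device class, and personalization terms; the point here is only the shape of the optimization problem.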


2007 ◽  
Vol 10 (2) ◽  
pp. 157-179 ◽  
Author(s):  
Srinivas Vadrevu ◽  
Fatih Gelgi ◽  
Hasan Davulcu

Author(s):  
Carmen Domínguez-Falcón ◽  
Domingo Verano-Tacoronte ◽  
Marta Suárez-Fuentes

Purpose: The strong regulation of the Spanish pharmaceutical sector encourages pharmacies to modify their business model, giving the customer a more relevant role by integrating 2.0 tools. However, research on the implementation of these tools is still quite limited, especially in terms of customer-oriented web page design. This paper aims to analyze the online presence of Spanish community pharmacies by studying the profile of their web pages and classifying them by their degree of customer orientation.
Design/methodology/approach: In total, 710 community pharmacies were analyzed, of which 160 had web pages. Using items drawn from the literature, a content analysis was performed to evaluate the presence of these items on the web pages. Then, after analyzing the item scores, a cluster analysis was conducted to classify the pharmacies according to the degree of development of their online customer orientation strategy.
Findings: The number of pharmacies with a web page is quite low. The development of these websites is limited, and they play a more informational than relational role. The statistical analysis makes it possible to classify the pharmacies into four groups according to their level of development.
Practical implications: Pharmacists should make increasing use of their websites to facilitate real two-way communication with customers and other stakeholders, maintaining relationships with them by incorporating Web 2.0 and social media (SM) platforms.
Originality/value: This study analyses, from a marketing perspective, the degree of Web 2.0 adoption and the characteristics of the websites in terms of aiding communication and interaction with customers in the Spanish pharmaceutical sector.
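In outline, the classification step can be reproduced with standard tools. Below is a hedged sketch using scikit-learn's k-means on hypothetical binary item scores; the paper's actual item list and clustering method are not detailed in the abstract, and k = 4 simply mirrors the four groups reported above.

```python
# Sketch of classifying pharmacies by customer-orientation scores with
# k-means. The item matrix is invented for illustration, and k=4 mirrors the
# abstract's four groups; the authors' actual items and method may differ.
import numpy as np
from sklearn.cluster import KMeans

# Rows: pharmacies. Columns: presence (1) / absence (0) of customer-oriented
# web-page items (e.g. contact form, online ordering, social media links).
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scores)
for pharmacy, label in enumerate(kmeans.labels_):
    print(f"pharmacy {pharmacy}: group {label}")
```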


Author(s):  
Erma Susanti ◽  
Khabib Mustofa

Abstract. Information extraction is a field of natural language processing concerned with converting unstructured text into structured information. Much of the information on the Internet is transmitted in unstructured form via websites, which has led to the need for technology that can analyze text and distill relevant knowledge into structured information. The main content of a web page is one example of unstructured information. Various approaches to information extraction have been developed by many researchers, using both manual and automatic methods, but their performance still needs improvement with respect to extraction accuracy and speed. This research proposes an information extraction approach that combines bootstrapping with Ontology-Based Information Extraction (OBIE). The bootstrapping approach, which starts from a small seed of labelled data, is used to minimize human intervention in the extraction process, while an ontology guides the extraction of classes, properties and instances, providing semantic content for the Semantic Web. Combining the two approaches is expected to increase both the speed of the extraction process and the accuracy of the results. The information extraction system is applied to a case study using the "LonelyPlanet" dataset. Keywords: information extraction, ontology, bootstrapping, Ontology-Based Information Extraction, OBIE, performance
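To make the combination concrete, the following is a heavily simplified sketch of one bootstrapping round feeding an ontology: seed instances induce lexical patterns, the patterns extract new instances, and the extractions are stored as ontology triples (here with rdflib). The corpus, patterns and vocabulary are toy assumptions, not the paper's implementation.

```python
# Simplified sketch of bootstrapping + ontology-based IE: seeds -> patterns ->
# new instances -> ontology triples. Corpus, patterns and vocabulary are toy
# assumptions for illustration only.
import re
from rdflib import Graph, Literal, Namespace, RDF

corpus = (
    "Ubud is a town in Bali. Kuta is a town in Bali. "
    "Seminyak is a town in Bali. Paris is a city in France."
)
seeds = {"Ubud"}          # small labelled seed set
patterns = set()

# Step 1: induce contextual patterns from seed occurrences.
for seed in seeds:
    for match in re.finditer(re.escape(seed) + r" is a (\w+) in (\w+)", corpus):
        patterns.add(r"(\w+) is a %s in %s" % match.groups())

# Step 2: apply patterns to harvest new instances (one bootstrapping round).
instances = set(seeds)
for pattern in patterns:
    instances.update(re.findall(pattern, corpus))

# Step 3: store extractions as ontology individuals (the OBIE output).
EX = Namespace("http://example.org/travel#")
g = Graph()
for name in instances:
    g.add((EX[name], RDF.type, EX.Town))
    g.add((EX[name], EX.locatedIn, Literal("Bali")))

print(sorted(instances))   # ['Kuta', 'Seminyak', 'Ubud']
```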


2009 ◽  
Vol 15 (2) ◽  
pp. 241-271 ◽  
Author(s):  
YAOYONG LI ◽  
KALINA BONTCHEVA ◽  
HAMISH CUNNINGHAM

Abstract. Support Vector Machines (SVM) have been used successfully in many Natural Language Processing (NLP) tasks. The novel contribution of this paper is in investigating two techniques for making SVM more suitable for language learning tasks. Firstly, we propose an SVM with uneven margins (SVMUM) model to deal with the problem of imbalanced training data. Secondly, SVM active learning is employed in order to alleviate the difficulty in obtaining labelled training data. The algorithms are presented and evaluated on several Information Extraction (IE) tasks, where they achieved better performance than the standard SVM and the SVM with passive learning, respectively. Moreover, by combining SVMUM with the active learning algorithm, we achieve the best reported results on the seminars and jobs corpora, which are benchmark data sets used for evaluation and comparison of machine learning algorithms for IE. In addition, we also evaluate the token-based classification framework for IE with three different entity tagging schemes. In comparison to previous methods dealing with the same problems, our methods are both effective and efficient, which are valuable features for real-world applications. Due to the similarity in the formulation of the learning problem for IE and for other NLP tasks, the two techniques are likely to be beneficial in a wide range of applications.
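Neither SVMUM nor the paper's active learning procedure ships with common libraries under those names, but both ideas can be approximated with scikit-learn: class weighting as a rough stand-in for uneven margins on imbalanced data, and uncertainty sampling (querying the pool examples closest to the decision boundary) as a simple form of SVM active learning. A sketch under those assumptions:

```python
# Sketch approximating the paper's two ideas with scikit-learn:
# class_weight="balanced" stands in for uneven margins on imbalanced data,
# and uncertainty sampling emulates SVM active learning. Not the authors' code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data (90% / 10% classes) standing in for IE tokens.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Start with a small labelled set containing both classes; the rest is a pool.
labelled = list(np.where(y == 0)[0][:10]) + list(np.where(y == 1)[0][:10])
pool = [i for i in range(len(X)) if i not in labelled]

for round_ in range(5):
    clf = SVC(kernel="linear", class_weight="balanced")  # uneven-margin stand-in
    clf.fit(X[labelled], y[labelled])
    # Uncertainty sampling: query the pool examples closest to the hyperplane.
    margins = np.abs(clf.decision_function(X[pool]))
    queries = [pool[i] for i in np.argsort(margins)[:10]]
    labelled.extend(queries)                 # "annotate" the queried examples
    pool = [i for i in pool if i not in queries]
    print(f"round {round_}: {len(labelled)} labelled examples")
```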


2021 ◽  
Vol 13 (6) ◽  
pp. 73-84
Author(s):  
G. R. Tabeeva ◽  
Z. Katsarava ◽  
A. V. Amelin ◽  
A. V. Sergeev ◽  
K. V. Skorobogatykh ◽  
...  

Migraine is the second leading cause of maladjustment, and the burden of migraine is determined by its impact on the ability to work, social activity and family relationships.
Objective: to identify the behavior patterns of Russian patients with migraine, the factors affecting their quality of life, and their level of awareness of the disease, based on a semantic analysis of messages in Web 2.0.
Patients and methods: The study is based on the results of semantic processing (automated analysis of natural-language texts that takes their meaning into account) of anonymized messages from 6566 unique authors (patients and their relatives) on social networks and forums (over 73 thousand messages over 10 years, 2010–2020). The study relied exclusively on the data stated in the messages, so complete data for several parameters was not available for analysis. No personal data about the authors of the messages was collected or used; sex was determined from the text of the analyzed message. Only open data from social networks and forums on the Internet was used.
Results and discussion: A landscape of the problems of people complaining of migraine was formed. Factors affecting quality of life were grouped into four main groups ("Lifestyle restrictions by triggers of migraine attacks", "Loss of opportunity to work", "Serious psychological problems", "Family planning issues"); additional, rarer but acute problems were also identified. The analyzed messages show an average of 9.4 migraine days per month, with 21.8% of patients reporting daily migraines. Moreover, most patients have been suffering from attacks for 10 years or more, and 9% of patients for 30 years or more. The analysis of diagnostic patterns showed that in most cases patients independently resorted to additional examination methods, while only 13.1% of patients had experience of adequate preventive therapy.
Conclusion: Based on text messages about migraine in open sources on the Internet, the study demonstrated a wide range of unmet needs and quality-of-life problems, both for patients themselves and for their caregivers, as well as a significant social and economic burden of this disease (including a long-term burden on the economy, which can be used as an argument for reimbursing the cost of migraine therapy).


Author(s):  
Sutirtha Kumar Guha ◽  
Anirban Kundu ◽  
Rana Dattagupta

In this chapter, a domain-based ranking methodology for the cloud environment is proposed. Web pages from the cloud are clustered into a 'Primary Domain' and a 'Secondary Domain'. 'Primary Domain' Web pages are fetched based on direct matches with the query keywords and are ranked using a Relevancy Factor (RF) and a Turbulence Factor (TF). The 'Secondary Domain' is constructed from Nearest Keywords and Similar Web pages: Nearest Keywords are keywords similar to the matched keywords, and Similar Web pages are pages containing those Nearest Keywords. Matched Web pages of the 'Primary' and 'Secondary' domains are ranked separately. With this proposed approach, a wider range of Web pages from the cloud becomes available and is ranked more efficiently.
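The abstract does not give the formulas for the Relevancy Factor or the Turbulence Factor, so the sketch below substitutes a plain keyword-overlap score and shows only the control flow: pages matching the query keywords directly form the 'Primary Domain', pages reached via Nearest Keywords form the 'Secondary Domain', and the two sets are ranked separately.

```python
# Control-flow sketch of the primary/secondary domain split. The scoring
# function is a simple keyword-overlap stand-in; the chapter's Relevancy
# Factor (RF) and Turbulence Factor (TF) formulas are not given in the abstract.
pages = {
    "p1": {"cloud", "ranking", "web"},
    "p2": {"cloud", "storage"},
    "p3": {"grid", "sorting", "web"},
    "p4": {"weather", "forecast"},
}
nearest = {"cloud": {"grid"}, "ranking": {"sorting"}}  # hypothetical keyword neighbours

def rank(domain, keywords):
    # Stand-in score: shared-keyword count (replace with RF/TF in the full model).
    return sorted(domain, key=lambda p: len(pages[p] & keywords), reverse=True)

query = {"cloud", "ranking"}
primary = {p for p, kws in pages.items() if query & kws}
expanded = set().union(*(nearest.get(k, set()) for k in query))
secondary = {p for p, kws in pages.items() if expanded & kws} - primary

print("Primary:", rank(primary, query))        # direct keyword matches
print("Secondary:", rank(secondary, expanded))  # matches via nearest keywords
```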


E-Marketing ◽  
2012 ◽  
pp. 1268-1288
Author(s):  
Benjamin Hughes

The use of Web 2.0 Internet tools for healthcare is noted for its great potential to address a wide range of healthcare issues and improve overall delivery. However, Web 2.0 has attracted various criticisms, including, in its application to healthcare, the charge that it is more marketing and hype than a real departure from previous medical Internet or eHealth trends. Authors have noted that there is scant evidence demonstrating it to be a cost-efficient mechanism for improving patient outcomes. Moreover, investments in Web 2.0 for health, and in the wider concept of eHealth, are becoming increasingly significant. Given the uncertainty surrounding its value, this chapter therefore aims to critically examine the issues associated with the emerging use of Web 2.0 for health. The authors look at how it not only distinguishes itself from previous eHealth trends but also enhances them, examine the impact on eHealth investment and management from a policy perspective, and discuss how research can aid this management.

