Data Extraction and Scraping Information Using R

2021 ◽  
Vol 8 (3) ◽  
pp. 140-144
Author(s):  
G Midhu Bala ◽  
K Chitra

Web scraping is the process of automatically extracting data from multiple web pages on the World Wide Web. It is an actively developing field that shares common goals with text processing, the semantic web vision, semantic understanding, machine learning, artificial intelligence and human-computer interaction. Current web scraping solutions range from ad-hoc approaches requiring considerable human effort to fully automated systems that extract the required unstructured information and convert it into structured form, each with its own limitations. This paper describes a method for developing a web scraper using R programming that locates files on a website, extracts the filtered data and stores it. The modules used and the algorithm for automating navigation of a website via its links are described, and the extracted data can subsequently be used for data analytics.
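
The paper implements this workflow in R; purely as an illustration of the same idea (locate links on a page, keep the ones matching a filter, and store the extracted data), the following is a minimal Python sketch. The start URL, filter keyword and output file are hypothetical.

    # Minimal sketch of the link-locating and filtering idea (hypothetical URL/keyword/output).
    import csv
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    START_URL = "https://example.com/reports/"   # hypothetical start page
    KEYWORD = "2021"                             # hypothetical filter

    def scrape(start_url, keyword, out_path="extracted.csv"):
        html = requests.get(start_url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        rows = []
        # Locate links on the page and keep only those matching the filter.
        for a in soup.find_all("a", href=True):
            if keyword in a.get_text():
                rows.append([a.get_text(strip=True), urljoin(start_url, a["href"])])
        # Store the filtered data for later analysis.
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows([["title", "url"], *rows])

    scrape(START_URL, KEYWORD)

A full crawler would additionally recurse over the discovered links and add politeness controls such as request delays and robots.txt checks.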

2020 ◽  
pp. 5-9
Author(s):  
Manasvi Srivastava ◽  
Vikas Yadav ◽  
Swati Singh ◽  
...  

The Internet is the largest source of information created by humanity. It contains a variety of material in formats such as text, audio and video. Web scraping is one way of gathering this material: a set of techniques for obtaining information from websites instead of copying it manually. Many web-based data extraction methods are designed to solve specific problems and work on ad-hoc domains, and various tools and technologies have been developed to facilitate web scraping. Unfortunately, the appropriateness and ethics of using these tools are often overlooked. Hundreds of web scraping programs are available today, most of them written for Java, Python and Ruby, and both open-source and commercial software exist. Web-based tools such as YahooPipes, Google Web Scrapers and the OutWit extension for Firefox are good starting points for beginners. Web extraction essentially replaces the manual copy-and-edit process, providing an easier and better way to collect data from a web page, convert it into the desired format and save it to a local or archive directory. In this paper, among the various kinds of scraping, we focus on techniques that extract the content of a web page. In particular, we apply these scraping techniques to a set of diseases together with their symptoms and precautions.
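
As an illustration of this kind of content extraction (not the paper's own code), the sketch below fetches a hypothetical disease page, keeps only its title and paragraph text, and saves the result locally as JSON.

    # Minimal sketch: extract the textual content of a (hypothetical) disease page
    # and save it locally as JSON; the URL and output path are assumptions.
    import json
    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.org/diseases/influenza"   # hypothetical page

    def extract_content(url, out_path="influenza.json"):
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        record = {
            "title": soup.find("h1").get_text(strip=True) if soup.find("h1") else "",
            # Keep only the page's paragraph text, dropping markup and scripts.
            "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
        }
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(record, f, ensure_ascii=False, indent=2)

    extract_content(URL)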


2013 ◽  
Vol 7 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Dr Sunitha Abburu ◽  
G. Suresh Babu

Day by day, the volume of information available on the web is growing significantly. Information on the web exists in several forms: structured, semi-structured and unstructured. The majority of web information is presented in web pages, and the information presented in web pages is semi-structured. However, the information required for a given context is scattered across different web documents, and it is difficult to analyze such large volumes of semi-structured information and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction and data mining technologies for better information analysis that supports effective decision making. It enables people and organizations to extract information from various web sources and to analyze the extracted data effectively. The framework is applicable to any application domain; manufacturing, sales, tourism and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.
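
A minimal sketch of such a crawl, extract and analyze pipeline is shown below; the stub functions, the example URL and the trivial aggregation are illustrative assumptions, not the authors' implementation.

    # Sketch of the crawl -> extract -> analyze pipeline the framework describes.
    import requests
    from bs4 import BeautifulSoup
    from collections import Counter

    def crawl(urls):
        # Web crawling: fetch the semi-structured pages of interest.
        return {u: requests.get(u, timeout=10).text for u in urls}

    def extract(pages):
        # Information extraction: pull a simple record out of each page.
        records = []
        for url, html in pages.items():
            soup = BeautifulSoup(html, "html.parser")
            title = soup.title.get_text(strip=True) if soup.title else ""
            records.append({"url": url, "title": title})
        return records

    def analyze(records):
        # Data consolidation / mining: a trivial aggregation used for reporting.
        return Counter(r["title"].split()[0].lower() for r in records if r["title"])

    report = analyze(extract(crawl(["https://example.com"])))  # hypothetical source
    print(report)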


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Irvin Dongo ◽  
Yudith Cardinale ◽  
Ana Aguilera ◽  
Fabiola Martinez ◽  
Yuni Quintero ◽  
...  

Purpose: This paper aims to perform an exhaustive revision of relevant and recent related studies, which reveals that both extraction methods (Twitter API and Web scraping) are currently used to analyze credibility on Twitter. Thus, there is clear evidence of the need for different options to extract different data for this purpose. Nevertheless, none of these studies performs a comparative evaluation of both extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both alternatives of data extraction and implements a previously proposed credibility model, by adding a qualitative evaluation and a Twitter Application Programming Interface (API) performance analysis from different locations.

Design/methodology/approach: As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of the shared information. To do so, several proposals use either the Twitter API or Web scraping to extract the data on which the analysis is performed. Qualitative and quantitative evaluations are performed to discover the advantages and disadvantages of both extraction methods.

Findings: The study demonstrates the differences in accuracy and efficiency of both extraction methods and highlights further problems in this area that must be addressed to pursue true transparency and legitimacy of information on the Web.

Originality/value: Results report that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e. the tweet). Moreover, concerning time performance, Web scraping is faster than the Twitter API and more flexible in terms of the data it can obtain; however, Web scraping is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from its central server in San Francisco.
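
As an illustration of why normalization matters here (an assumed procedure, not the authors' code), the sketch below normalizes a tweet obtained via the API and its scraped counterpart so that both yield the same text before credibility features are computed.

    # Illustrative sketch: normalize tweet text from both sources before comparison.
    import html
    import re
    import unicodedata

    def normalize_tweet(text):
        text = html.unescape(text)                        # scraped HTML entities -> characters
        text = unicodedata.normalize("NFKC", text)        # unify Unicode forms
        text = re.sub(r"https?://\S+", "<url>", text)     # URLs differ between sources (t.co vs full link)
        text = re.sub(r"\s+", " ", text).strip().lower()  # collapse whitespace, ignore case
        return text

    api_text = "Breaking: vaccine update &amp; new data https://t.co/abc123"        # invented example
    scraped_text = "Breaking: vaccine update & new data  https://example.com/article"
    print(normalize_tweet(api_text) == normalize_tweet(scraped_text))  # True after normalization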


2021 ◽  
Vol 1 (2) ◽  
pp. 65-77
Author(s):  
T. E. Vildanov ◽  
N. S. Ivanov ◽  

This article explores both popular and newly developed tools for extracting data from websites and converting it into a form suitable for analysis. The paper compares Python libraries, with performance as the key criterion. The results are grouped by site, tool used and number of iterations, and are then presented in graphical form. The scientific novelty of the research lies in the application domain of the data extraction tools: semi-structured data are obtained and transformed from the websites of bookmakers and betting exchanges. The article also describes new tools that are currently not in great demand in the field of parsing and web scraping. As a result of the study, quantitative metrics were obtained for all the tools used, and the libraries most suitable for rapid, large-scale extraction and processing of information were identified.
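
A sketch of the kind of benchmark described, assuming BeautifulSoup with interchangeable parser back ends and a hypothetical bookmaker page, is shown below; the paper's own selection of libraries, sites and iteration counts differs.

    # Time several parsing back ends over repeated iterations on the same page.
    import time
    import requests
    from bs4 import BeautifulSoup

    html_doc = requests.get("https://example.com/odds", timeout=10).text  # hypothetical bookmaker page
    results = {}
    for parser in ("html.parser", "lxml", "html5lib"):    # lxml/html5lib must be installed
        start = time.perf_counter()
        for _ in range(100):                              # number of iterations
            soup = BeautifulSoup(html_doc, parser)
            rows = soup.find_all("tr")                    # e.g. odds table rows
        results[parser] = time.perf_counter() - start

    for parser, seconds in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{parser:12s} {seconds:.2f} s / 100 iterations")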


Author(s):  
Erma Susanti ◽  
Khabib Mustofa

Information extraction is a field of natural language processing concerned with converting unstructured text into structured information. Much of the information on the Internet is transmitted in unstructured form via websites, which has created the need for technology that can analyze text and discover relevant knowledge as structured information; a typical example of unstructured information is the main content of a web page. Various approaches to information extraction have been developed, using either manual or automatic methods, but their performance still needs improvement with respect to extraction accuracy and speed. This research proposes an information extraction approach that combines bootstrapping with Ontology-Based Information Extraction (OBIE). The bootstrapping approach, which starts from a small seed of labelled data, is used to minimize human intervention in the extraction process, while an ontology guides the extraction of classes, properties and instances in order to provide semantic content for the Semantic Web. Combining the two approaches is expected to increase the speed of the extraction process and the accuracy of its results. The information extraction system is applied in a case study on the "LonelyPlanet" dataset.

Keywords: information extraction, ontology, bootstrapping, Ontology-Based Information Extraction, OBIE, performance
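
The following is a highly simplified sketch of the bootstrapping idea: starting from a small seed of labelled instances of an ontology class, induce contextual patterns and use them to find new instances. The class name, seed, corpus and pattern scheme are invented for illustration and do not reflect the paper's implementation.

    # Toy bootstrapping loop guided by an ontology class (all data hypothetical).
    import re

    ontology_class = "Hotel"                       # ontology class to populate from text
    seeds = {"Grand Palace Hotel"}                 # small set of labelled examples

    corpus = [
        "We stayed at the Grand Palace Hotel near the old town.",
        "We stayed at the Riverside Inn near the station.",
    ]

    patterns, instances = set(), set(seeds)
    for _ in range(2):                             # a few bootstrapping iterations
        # 1) Induce simple contextual patterns from known instances.
        for sentence in corpus:
            for inst in instances:
                if inst in sentence:
                    patterns.add(re.escape(sentence.split(inst)[0][-15:]) + r"(.+?) near")
        # 2) Apply the patterns to find new candidate instances of the class.
        for sentence in corpus:
            for pat in patterns:
                for match in re.findall(pat, sentence):
                    instances.add(match.strip())

    print(ontology_class, "instances:", instances)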


2021 ◽  
Vol 11 (4) ◽  
pp. 267-273
Author(s):  
Wen-Juan Hou ◽  
Bamfa Ceesay

Information extraction (IE) is the process of automatically identifying structured information in unstructured or partially structured text. IE can involve several activities, such as named entity recognition, event extraction, relationship discovery and document classification, with the overall goal of translating text into a more structured form. Information on how the effect of a drug changes when it is taken in combination with a second drug is known as a drug–drug interaction (DDI). DDIs can delay, decrease or enhance the absorption of drugs and thus decrease or increase their efficacy, or cause adverse effects. Recent research has produced several adaptations of recurrent neural networks (RNNs) for text. In this study, we highlight significant challenges of using RNNs in biomedical text processing and propose an automatic DDI extraction method aimed at overcoming some of them. Our results show that the system is competitive with other systems on the DDI extraction task.
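
As a sketch of the kind of RNN model involved (an assumed architecture, not necessarily the authors' exact system), a bidirectional LSTM classifier over a tokenized sentence containing a candidate drug pair might look as follows in PyTorch.

    # Assumed BiLSTM sketch for classifying a candidate drug pair into DDI classes.
    import torch
    import torch.nn as nn

    class DDIClassifier(nn.Module):
        def __init__(self, vocab_size=10000, emb_dim=100, hidden=128, n_classes=5):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_classes)   # 2*hidden: both directions

        def forward(self, token_ids):
            embedded = self.emb(token_ids)                # (batch, seq, emb_dim)
            _, (h, _) = self.rnn(embedded)                # final hidden states
            sentence = torch.cat([h[-2], h[-1]], dim=1)   # concatenate fwd/bwd states
            return self.out(sentence)                     # DDI class logits

    model = DDIClassifier()
    logits = model(torch.randint(1, 10000, (2, 30)))      # 2 dummy sentences, 30 tokens
    print(logits.shape)                                   # torch.Size([2, 5])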


Author(s):  
A. Dragun

The general issue of forest use has been highly contentious in Victoria, and considerable human effort has been exerted to establish the "best" use of forests. This economic, bureaucratic and political contemplation has yielded a multitude of different policy prescriptions with quite variable efficiency and equity outcomes. However, a feature of the analysis is that nowhere, on the grounds of either efficiency or equity, is forestry logging the clearly desired outcome. Yet in the face of insurmountable evidence against logging, governments in Victoria prevaricate over making a formal decision not to log the forests; in fact, the ad hoc approach to forest management favours the established forest interests. Clearly, the narrow economic power and interests of a few logging companies are sufficient to counterbalance the much greater, but diffuse, well-being of the many citizens in the state.


Author(s):  
Myneni Madhu Bala ◽  
Venkata Krishnaiah Ravilla ◽  
Kamakshi Prasad V ◽  
Akhil Dandamudi

This chapter mainly discusses the dynamic behavior of railway passengers, using Twitter data collected during regular and emergency situations. Social network data provides dynamic and realistic data in many fields; in line with the chapter's theme, Twitter data from the railway domain can be used to enhance railway services. Using this data, a comprehensive framework for modeling passenger tweets that incorporates passenger opinions about the facilities provided by railways is discussed. The major issues of dynamic data extraction, preparation of Twitter text content and text processing for determining sentiment levels are presented through two case studies: sentiment analysis of passengers' opinions about the quality of railway services, and identification of passenger travel demands using geotagged Twitter data. The sentiment analysis determines whether passenger opinions about railway facilities are positive or negative based on their journey experiences.
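
As an illustration of the sentiment-scoring step (the chapter's own pipeline may differ), the sketch below scores invented passenger tweets with NLTK's VADER analyzer and maps the compound score to a positive, negative or neutral label.

    # Score example passenger tweets with VADER (tweets are invented).
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    tweets = [
        "Train arrived on time and the coach was clean, great service!",
        "Stuck for two hours with no announcement, worst journey ever.",
    ]
    for tweet in tweets:
        score = sia.polarity_scores(tweet)["compound"]     # -1 (negative) .. +1 (positive)
        label = "positive" if score >= 0.05 else "negative" if score <= -0.05 else "neutral"
        print(f"{label:8s} {score:+.2f}  {tweet}")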


Semantic Web ◽  
2020 ◽  
pp. 1-31
Author(s):  
Genet Asefa Gesese ◽  
Russa Biswas ◽  
Mehwish Alam ◽  
Harald Sack

Knowledge Graphs (KGs) are composed of structured information about a particular domain in the form of entities and relations. In addition to this structured information, KGs help facilitate interconnectivity and interoperability between different resources represented in the Linked Data Cloud. KGs have been used in a variety of applications such as entity linking, question answering and recommender systems. However, KG applications suffer from high computational and storage costs. Hence arises the need for representations that map high-dimensional KGs into low-dimensional embedding spaces while preserving structural as well as relational information. This paper surveys KG embedding models that consider not only the structured information contained in a KG in the form of entities and relations but also its unstructured information represented as literals, such as text, numerical values and images. Along with a theoretical analysis and comparison of the methods proposed so far for generating KG embeddings with literals, an empirical evaluation of the different methods under identical settings is performed for the general task of link prediction.
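
As a toy illustration of the general idea behind embedding models with literals (dimensions, data and the combination function are arbitrary assumptions, not a specific surveyed model), the sketch below extends a TransE-style scoring function with a vector derived from a numeric literal attached to the head entity.

    # Toy TransE-style scorer extended with a numeric literal for the head entity.
    import torch
    import torch.nn as nn

    class TransELiteral(nn.Module):
        def __init__(self, n_entities, n_relations, dim=50):
            super().__init__()
            self.ent = nn.Embedding(n_entities, dim)
            self.rel = nn.Embedding(n_relations, dim)
            self.lit = nn.Linear(1, dim)          # maps a numeric literal into the embedding space

        def forward(self, h, r, t, h_literal):
            # Combine the structural entity vector with its literal-derived vector.
            head = self.ent(h) + self.lit(h_literal)
            # TransE-style score: smaller distance means a more plausible triple.
            return torch.norm(head + self.rel(r) - self.ent(t), p=1, dim=1)

    model = TransELiteral(n_entities=100, n_relations=10)
    score = model(torch.tensor([3]), torch.tensor([1]), torch.tensor([7]),
                  torch.tensor([[1975.0]]))       # e.g. a birth-year literal
    print(score)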


2018 ◽  
Vol 18 (3) ◽  
pp. 383-425
Author(s):  
Hirad Abtahi ◽  
Shehzad Charania

When establishing the ICC, the sole permanent international criminal court, States ensured that they would play a legislative role larger and more direct than they had in the ad hoc and hybrid courts and tribunals. States Parties have, however, acknowledged that, given the time the judges spend interpreting and applying the ICC legal framework, they are uniquely placed to identify and propose measures designed to expedite the criminal process. Accordingly, the ICC has followed a dual track. First, it has pursued an amendment track, which requires States Parties' direct approval of ICC-proposed amendments to the Rules of Procedure and Evidence. Second, it has implemented practice changes that do not require State involvement. This interactive process between the Court and States Parties reflects their common goal of expediting criminal proceedings. The future of this process will rely on striking the right equilibrium between the respective roles of the States Parties and the Court.

