Exploring the Potentialities of Automatic Extraction of University Webometric Information

2020 ◽  
Vol 5 (4) ◽  
pp. 43-55
Author(s):  
Gianpiero Bianchi ◽  
Renato Bruni ◽  
Cinzia Daraio ◽  
Antonio Laureti Palma ◽  
Giulio Perani ◽  
...  

Abstract

Purpose: The main objective of this work is to show the potential of recently developed approaches for automatic knowledge extraction directly from university websites. The information extracted automatically can be updated more frequently than once per year and is safe from manipulation or misinterpretation. Moreover, this approach gives us flexibility in collecting indicators of the efficiency of university websites and of their effectiveness in disseminating key content. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions that allow new insights for “profiling” the analyzed universities.

Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining and web usage mining. The information needed to compute our indicators was extracted from the universities’ websites using web scraping and text mining techniques. The scraped information was stored in a NoSQL database in semi-structured form, allowing it to be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data were also collected by means of batch interrogations of search engines (Bing, www.bing.com) or from a leading provider of web analytics (SimilarWeb, http://www.similarweb.com). The information extracted from the web was combined with university structural information taken from the European Tertiary Education Register (https://eter.joanneum.at/#/home), a database collecting information on Higher Education Institutions (HEIs) at the European level. All of the above was used to cluster 79 Italian universities on the basis of structural and digital indicators.

Findings: The main findings of this study concern the evaluation of universities’ potential for digitalization, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of the quality and impact of universities’ websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features by applying clustering techniques to them.

Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications: The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value: This work applies to university websites, for the first time, some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).
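The content-mining step described above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' pipeline: the URL and HTML fragment are hypothetical stand-ins for a scraped university page, and a real system would fetch live pages with an HTTP client and persist the records in a NoSQL store such as MongoDB.

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from a page, skipping <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

# Hypothetical fragment standing in for a scraped university page.
html = ("<html><body><h1>Department of Physics</h1>"
        "<p>PhD admissions are open.</p>"
        "<script>trackVisit();</script></body></html>")

parser = TextExtractor()
parser.feed(html)

# Semi-structured record, ready for insertion into a document store.
record = {"url": "https://example-university.edu/physics", "text": parser.chunks}
doc = json.dumps(record)
```

Keeping the record semi-structured (rather than flattening it to plain text) is what lets new indicators be defined later over the same stored documents.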

2021 ◽  
Vol 7 (3) ◽  
pp. a1en
Author(s):  
Marcello Tenorio de Farias ◽  
Alan César Belo Angeluci ◽  
Brasilina Passarelli

With the spread of access to and use of information through the web and social networks, retrieving information from large volumes of data has become infeasible by manual methods. This applied study reports the development and use of a prototype tool, Discovery Stars, for automatically scraping online reviews posted on Google Maps. The retrieved data allowed us to investigate how these reviews can influence the behavior of the platform's users. Among the results, it was observed that reading and posting reviews affect the opinion formation and motivations of Google Maps users.
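Once reviews are scraped, the kind of aggregation that shapes user perception is straightforward to compute. The sketch below uses invented records; the field names are illustrative and are not Discovery Stars' actual schema.

```python
from collections import Counter

# Hypothetical records, shaped like the output of a review scraper;
# field names are illustrative, not Discovery Stars' actual schema.
reviews = [
    {"place": "Cafe A", "stars": 5, "text": "Great coffee"},
    {"place": "Cafe A", "stars": 4, "text": "Friendly staff"},
    {"place": "Cafe A", "stars": 2, "text": "Slow service"},
]

distribution = Counter(r["stars"] for r in reviews)        # how many of each rating
average = sum(r["stars"] for r in reviews) / len(reviews)  # the headline score users see
```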


Author(s):  
Harmandeep Kaur ◽  
Kamaljit Kaur Dhillon

This article examines the use of the Naive Bayes (NB) classifier. It shows that the NB algorithm improves web mining tasks as measured by classification accuracy, and compares the performance of Naive Bayes with other classification techniques. The study explores the possibility of building a classification model for assigning scholarships to students: focusing on the accuracy of the system, many factors were analyzed, and several of them were found to be effective when accuracy was considered.
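A minimal sketch of the technique described: a categorical Naive Bayes classifier with add-one (Laplace) smoothing, built from the standard library. The student features and labels below are invented for illustration and are not taken from the article.

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of (feature_tuple, label). Counts label priors and,
    per feature position, how often each value co-occurs with each label."""
    priors = Counter(label for _, label in rows)
    cond = defaultdict(Counter)  # (feature_index, label) -> value counts
    for features, label in rows:
        for i, value in enumerate(features):
            cond[(i, label)][value] += 1
    return priors, cond

def predict(priors, cond, features):
    """Pick the label maximizing P(label) * prod_i P(value_i | label),
    with add-one (Laplace) smoothing for unseen values."""
    total = sum(priors.values())
    best_label, best_p = None, -1.0
    for label, n in priors.items():
        p = n / total
        for i, value in enumerate(features):
            counts = cond[(i, label)]
            n_values = len(set(counts) | {value})
            p *= (counts[value] + 1) / (n + n_values)
        if p > best_p:
            best_label, best_p = label, p
    return best_label

# Invented training data: (GPA band, family income band) -> decision.
data = [
    (("high_gpa", "low_income"), "award"),
    (("high_gpa", "mid_income"), "award"),
    (("low_gpa", "low_income"), "no_award"),
    (("low_gpa", "high_income"), "no_award"),
]
priors, cond = train_nb(data)
decision = predict(priors, cond, ("high_gpa", "low_income"))
```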


2021 ◽  
Vol 2089 (1) ◽  
pp. 012048
Author(s):  
Kishor Kumar Reddy C ◽  
P R Anisha ◽  
Nhu Gia Nguyen ◽  
G Sreelatha

Abstract

This research uses machine learning and Natural Language Processing (NLP) together with the Natural Language Toolkit (NLTK) to develop a text summarization tool that follows the extractive approach to generate an accurate and fluent summary. The aim of the tool is to efficiently extract a concise and coherent version of a long text or input document, keeping only the main points and avoiding any repetition of text or information already mentioned earlier in the text. The text to be summarized can be obtained from the web through web scraping or entered manually on the platform, i.e. the tool. The summarization process can be quite beneficial for users, as shortening long texts helps them refer to the input quickly and understand points that might otherwise be beyond their grasp.
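The extractive approach described above can be sketched with a simple word-frequency heuristic: score each sentence by how frequent its content words are in the whole document, then keep the top-scoring sentences in their original order. This is a generic standard-library illustration, not the authors' NLTK-based implementation, and the stopword list is deliberately tiny.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "that"}

def summarize(text, k=2):
    """Score each sentence by the average corpus frequency of its
    non-stopword tokens, then return the top-k sentences in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = [w for w in re.findall(r"[a-z']+", text.lower())
              if w not in STOPWORDS]
    freq = Counter(tokens)
    scored = []
    for idx, sentence in enumerate(sentences):
        words = [w for w in re.findall(r"[a-z']+", sentence.lower())
                 if w not in STOPWORDS]
        score = sum(freq[w] for w in words) / (len(words) or 1)
        scored.append((score, idx, sentence))
    top = sorted(scored, reverse=True)[:k]
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

text = ("Web mining studies the web. Web mining extracts knowledge. "
        "Cats sleep a lot. Web data grows fast.")
summary = summarize(text, k=2)
```

The off-topic sentence about cats scores lowest and is dropped, which is the basic mechanism by which extractive summarizers keep only the main points.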


Author(s):  
Mohamed Atef Mosa

Due to the great growth of data on the web, mining to extract the most informative data as a conceptual brief would be beneficial for many users. There is therefore great enthusiasm for developing automatic text summarization approaches. In this chapter, the authors highlight the use of swarm intelligence (SI) optimization techniques, for the first time, in solving the text summarization problem. In addition, they offer a convincing justification of why nature-inspired heuristic algorithms, especially ant colony optimization (ACO), are well suited to complicated optimization tasks. Moreover, they observe that text summarization had not previously been formalized as a multi-objective optimization (MOO) task, despite the many conflicting objectives that need to be achieved, and that SI had not been employed before to support real-time tasks. A novel framework for short text summarization is therefore proposed to address these issues. Ultimately, this chapter should encourage researchers to give further consideration to SI algorithms for summarization tasks.
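The conflicting objectives behind the MOO framing (cover the relevant content, yet avoid redundancy) can be illustrated with a simple greedy trade-off. This is a generic maximal-marginal-relevance-style sketch, not the ACO formulation the chapter proposes, and the sentences and query below are invented.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def greedy_select(sentences, query, k=2, lam=0.5):
    """Greedily balance relevance to the query (objective 1) against
    redundancy with already-selected sentences (objective 2)."""
    chosen, candidates = [], list(sentences)
    while candidates and len(chosen) < k:
        def score(s):
            relevance = jaccard(s, query)
            redundancy = max((jaccard(s, c) for c in chosen), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen

sentences = [
    "ant colony optimization finds good paths",
    "ant colony optimization finds good routes",
    "swarm intelligence inspires new heuristics",
]
picked = greedy_select(sentences, query="ant colony optimization", k=2)
```

The second, near-duplicate sentence is penalized for redundancy, so the more diverse third sentence is chosen instead; an SI approach such as ACO would search combinations of sentences against the same competing objectives rather than picking greedily.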


2018 ◽  
Vol 69 (12) ◽  
pp. 1446-1459 ◽  
Author(s):  
Tharindu Rukshan Bandaragoda ◽  
Daswin De Silva ◽  
Damminda Alahakoon ◽  
Weranja Ranasinghe ◽  
Damien Bolton

2017 ◽  
Author(s):  
Morgan N. Price ◽  
Adam P. Arkin

Abstract

Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources that link protein sequences to scientific articles (Swiss-Prot, GeneRIF, and EcoCyc). PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.


Data Mining ◽  
2013 ◽  
pp. 1312-1319
Author(s):  
Marco Scarnò

CASPUR allows many academic Italian institutions located in the centre-south of Italy to access more than 7 million articles through a digital library platform. The behaviour of its users was analyzed by considering their “traces”, which are stored in the web server log files. Using several web mining and data mining techniques, the author discovered a gradual and dynamic change in the way articles are accessed. In particular, there is evidence of an increase in journal browsing in comparison with the searching mode. This phenomenon is interpreted through the idea that browsing better meets users' needs when they want to keep abreast of the latest advances in their scientific field, compared with a more generic search inside the digital library.
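The “traces” referred to above are web server log entries. The sketch below parses one line in the widely used Common Log Format and applies a toy heuristic to label the access as browsing or searching; the log line and URL patterns are illustrative, not CASPUR's actual paths.

```python
import re

# Common Log Format: ip ident user [timestamp] "method path protocol" status
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def access_mode(path):
    """Toy heuristic: query-driven URLs count as 'search', everything
    else (e.g. journal/issue paths) as 'browse'. Patterns are illustrative."""
    if "search" in path or "?q=" in path:
        return "search"
    return "browse"

line = ('10.0.0.7 - - [12/Mar/2013:10:15:32 +0100] '
        '"GET /journals/physics/vol12/issue3 HTTP/1.1" 200')
m = LOG_RE.match(line)
mode = access_mode(m.group("path"))
```

Classifying each request this way and counting the two modes over time is the basic ingredient for observing the browsing-versus-searching shift the study reports.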


F1000Research ◽  
2014 ◽  
Vol 3 ◽  
pp. 49 ◽  
Author(s):  
Fabian Schreiber

Summary: Phylogenetic trees are widely used to represent the evolution of gene families. As the history of a gene family can be complex (including many gene duplications), its visualisation can become a difficult task. A good, accurate visualisation of phylogenetic trees, especially on the web, makes trees easier to understand and interpret, helping to reveal the mechanisms that shape the evolution of a specific set of genes or species. Here, I present treeWidget, a modular BioJS component to visualise phylogenetic trees on the web. Through its modularity, treeWidget can be easily customized to display sequence information, e.g. protein domains and alignment conservation patterns.

Availability: http://github.com/biojs/biojs; http://dx.doi.org/10.5281/zenodo.7707

