Automatically evaluating the quality of textual descriptions in cultural heritage records

Author(s):  
Matteo Lorenzini ◽  
Marco Rospocher ◽  
Sara Tonelli

Abstract: Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. In order to be effectively exploited, metadata information has to meet some quality standards, typically defined in the collection usage guidelines. As manually checking the quality of metadata in a repository may not be affordable, especially in large collections, in this paper we specifically address the problem of automatically assessing the quality of metadata, focusing in particular on textual descriptions of cultural heritage items. We describe a novel approach based on machine learning that tackles this problem by framing it as a binary text classification task aimed at evaluating the accuracy of textual descriptions. We report our assessment of different classifiers using a new dataset that we developed, containing more than 100K descriptions. The dataset was extracted from different collections and domains from the Italian digital library “Cultura Italia” and was annotated with accuracy information in terms of compliance with the cataloguing guidelines. The results empirically confirm that our proposed approach can effectively support curators (F1 ~ 0.85) in assessing the quality of the textual descriptions of the records in their collections and provide some insights into how training data, specifically their size and domain, can affect classification performance.
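
The abstract frames quality assessment as binary text classification over descriptions. A minimal sketch of that framing using scikit-learn is given below; the file name, column names and the TF-IDF/linear-SVM pipeline are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: metadata-description quality as binary text classification.
# Assumptions: a CSV with "description" and "accurate" (0/1) columns; TF-IDF features
# and a linear SVM stand in for whichever classifiers the paper actually evaluates.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

df = pd.read_csv("descriptions.csv")  # hypothetical export of an annotated dataset
X_train, X_test, y_train, y_test = train_test_split(
    df["description"], df["accurate"],
    test_size=0.2, random_state=42, stratify=df["accurate"],
)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5),  # word unigrams and bigrams
    LinearSVC(),
)
clf.fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```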

2019 ◽  
Vol 1 ◽  
pp. 1-1
Author(s):  
Marta Kuźma ◽  
Albina Mościcka

Abstract: Digital libraries are created and managed mainly by traditional libraries, archives and museums. They collect, process, and make available digitized collections and data about them. These collections often constitute cultural heritage and include, among others: books (including old prints), magazines, manuscripts, photographs, maps, atlases, postcards and graphics. An example of such a library is the National Library of Poland, which collects and provides digital access to about 55,000 maps.

The effective use of cultural heritage resources and information from the National Library of Poland creates both prerequisites and challenges for multidisciplinary research and cross-sectoral cooperation. These resources are an unlimited source of knowledge, constituting value in themselves but also providing data for many new studies, including interdisciplinary studies of the past. The information necessary for such research is usually distributed across a wide spectrum of fields, formats and languages, reflecting different points of view, and the key task is to find it in digital libraries.

The growth of digital library collections requires high-quality metadata to make the materials collected by libraries fully accessible and to enable their integration and sharing between institutions. Consequently, three main metadata quality criteria have been defined to enable metadata management and evaluation: accuracy, consistency, and completeness (Park, 2009; Park and Tosaka, 2010). Different aspects of metadata quality can also be defined as: accessibility, accuracy, availability, compactness, comprehensiveness, content, consistency, cost, data structure, ease of creation, ease of use, cost efficiency, flexibility, fitness for use, informativeness, quantity, reliability, standard, timeliness, transfer, and usability (Moen et al., 1998). This list indicates where errors in metadata occur; such errors can hinder or completely prevent access to materials available through a digital library.

Archival maps have always been present in libraries. In the digital age, geographical space has begun to exist in libraries in two respects: as collections of old maps, and as a geographic reference for sources other than cartographic materials. Despite much experience in this field, the main problem remains that most libraries do not add coordinates to the metadata, which are required to enable and support geographical search (Southall and Pridal, 2012).

At the preliminary stage of such research, the concept of the study is formed and the source materials necessary for its realization are collected. When using archival maps for such studies, it is important to carry out detailed literature studies, covering the cartographic assumptions, the course and accuracy of the cartographic work, the printing method, the scope of updates in subsequent editions, and the period in which a given map was created. The usefulness of cartographic materials also depends on the intended purpose of the map. Awareness of the above issues allows researchers to avoid errors frequently made by non-cartographers, such as comparing maps at different scales and treating them as a basis for formulating very detailed yet erroneous conclusions. Thus, one of the key tasks is to find materials that are comparable in terms of scale and that cover the same area and space in the historical period of interest.

The research aim is to evaluate the quality of topographic map metadata provided by the National Library of Poland, which is the basis for effective access to cartographic resources.

The first research question is: how should topographic maps be described in metadata to enable finding them in the National Library of Poland? In other words, what kind of map-specific information should be saved in metadata (and in what way) to properly characterize a spatially related object?

The second research question is: which topographic maps have metadata of sufficient quality to give users the best chance of finding the cartographic materials necessary for their research?

The paper presents the results of research on defining criteria and features for metadata evaluation, that is, how archival maps are described. For maps, this is a set of map features recorded in the metadata, including the geographic location, map scale, map orientation, and cartographic presentation methods. The evaluation thus concerns the quality of metadata or, in other words, the accessibility of archival cartographic resources.
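
The kind of evaluation described above can be read as a completeness check over map-specific metadata fields. A minimal sketch follows; the field names, the equal weighting and the sample record are illustrative assumptions, not the authors' actual evaluation scheme.

```python
# Sketch: score each map record by whether map-specific metadata fields are present.
# Field names and the simple unweighted score are assumptions for illustration only.
from typing import Dict, List

MAP_FIELDS: List[str] = ["geographic_location", "scale", "orientation", "presentation_method"]

def completeness(record: Dict[str, str]) -> float:
    """Fraction of the map-specific fields that are filled in."""
    filled = sum(1 for field in MAP_FIELDS if record.get(field, "").strip())
    return filled / len(MAP_FIELDS)

records = [
    {"title": "Topographic map 1:100 000 (sample record)", "scale": "1:100 000",
     "geographic_location": "52.23 N, 21.01 E", "orientation": "", "presentation_method": ""},
]
for r in records:
    print(r["title"], "->", completeness(r))
```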


Author(s):  
Sarmad Mahar ◽  
Sahar Zafar ◽  
Kamran Nishat

Headnotes are precise explanations and summaries of the legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes, which help the reader quickly determine the issues discussed in a case. Headnotes comprise two parts: the first states the topic discussed in the judgment, and the second contains a summary of that judgment. In this work, we design, develop and evaluate headnote prediction using machine learning, without human involvement. We divided this task into a two-step process. In the first step, we predict the law points used in the judgment using text classification algorithms. In the second step, we generate a summary of the judgment using text summarization techniques. To achieve this, we created a databank by extracting data from different law sources in Pakistan and labelled the training data based on Pakistani law websites. We tested different feature extraction methods on the judiciary data to improve our system and, using these methods, developed a dictionary of legal terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-grams and without stemming. Using active learning, our system can continuously improve its accuracy as users of the system provide additional labelled examples.
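
The first step (law-point prediction) is described as Linear Support Vector Classification over tri-grams without stemming. A minimal sketch of that configuration is shown below; the file name, column names and TF-IDF weighting are assumptions added for illustration.

```python
# Sketch of the law-point classification step: LinearSVC over uni- to tri-gram features,
# with no stemming applied. Dataset path and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

df = pd.read_csv("judgments.csv")  # hypothetical: judgment text plus labelled law point
X_train, X_test, y_train, y_test = train_test_split(
    df["judgment_text"], df["law_point"], test_size=0.2, random_state=0
)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # up to tri-grams, no stemmer
    LinearSVC(),
)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```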


2021 ◽  
Author(s):  
Enzo Losi ◽  
Mauro Venturini ◽  
Lucrezia Manservigi ◽  
Giuseppe Fabio Ceschini ◽  
Giovanni Bechini ◽  
...  

Abstract: A gas turbine trip is an unplanned shutdown whose most relevant consequences are business interruption and a reduction of the equipment's remaining useful life. Thus, understanding the underlying causes of gas turbine trips would allow their occurrence to be predicted, in order to maximize gas turbine profitability and improve availability. In the ever-competitive Oil & Gas sector, data mining and machine learning are increasingly being employed to support deeper insight into and improved operation of gas turbines. Among the various machine learning tools, Random Forests are an ensemble learning method consisting of an aggregation of decision tree classifiers. This paper presents a novel methodology that exploits the information embedded in the data and develops Random Forest models to predict gas turbine trips from a timeframe of historical data acquired from multiple sensors. The novel approach exploits time series segmentation to increase the amount of training data, thus reducing overfitting. First, data are transformed according to a feature engineering methodology developed in a separate work by the same authors. Then, Random Forest models are trained and tested on unseen observations to demonstrate the benefits of the novel approach. The superiority of the novel approach is proved on two real-world case studies, involving field data taken during three years of operation of two fleets of Siemens gas turbines located in different regions. The novel methodology achieves Precision, Recall and Accuracy values in the range of 75–85%, thus demonstrating the industrial feasibility of the predictive methodology.
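
The key idea, segmenting each sensor window into shorter slices to multiply the training examples before fitting a Random Forest, can be sketched as below. The synthetic data, segment length, per-sensor statistics and labels are assumptions; the authors' actual feature engineering is described in their separate work.

```python
# Sketch: split each multi-sensor time window into shorter segments, summarize each
# segment with simple statistics, and train a Random Forest to flag impending trips.
# All data here are synthetic placeholders, not the paper's feature-engineering pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n_windows, window_len, n_sensors, seg_len = 200, 120, 8, 30

X_raw = rng.normal(size=(n_windows, window_len, n_sensors))  # synthetic sensor windows
y_raw = rng.integers(0, 2, size=n_windows)                   # 1 = window precedes a trip

# Time-series segmentation: each window yields window_len // seg_len training samples.
segments, labels = [], []
for window, label in zip(X_raw, y_raw):
    for start in range(0, window_len, seg_len):
        seg = window[start:start + seg_len]
        feats = np.concatenate([seg.mean(axis=0), seg.std(axis=0)])  # per-sensor stats
        segments.append(feats)
        labels.append(label)

X, y = np.array(segments), np.array(labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print("precision:", precision_score(y_te, pred), "recall:", recall_score(y_te, pred))
```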


2017 ◽  
Vol 14 (2) ◽  
Author(s):  
Müşerref Duygu Saçar Demirci ◽  
Jens Allmer

Abstract: MicroRNAs (miRNAs) are involved in the post-transcriptional regulation of protein abundance and thus have a great impact on the resulting phenotype. It is, therefore, no wonder that they have been implicated in many diseases ranging from virus infections to cancer. This impact on the phenotype leads to a great interest in establishing the miRNAs of an organism. Experimental methods are complicated, which has led to the development of computational methods for pre-miRNA detection. Such methods generally employ machine learning to establish models for the discrimination between miRNAs and other sequences. Positive training data for model establishment stem, for the most part, from miRBase, the miRNA registry. The quality of the entries in miRBase has been questioned, though. This uncertain quality has led to the development of filtering strategies that attempt to produce high-quality positive datasets, which can in turn lead to a scarcity of positive data. To analyze the quality of filtered data, we developed a machine learning model and found that it can reliably establish data quality based on intrinsic measures. Additionally, we analyzed which features describing pre-miRNAs can discriminate between low- and high-quality data. Both models are applicable to data from miRBase and can be used for establishing high-quality positive data. This will facilitate the development of better miRNA detection tools, which will make the prediction of miRNAs in disease states more accurate. Finally, we applied both models to all miRBase data and provide the list of high-quality hairpins.
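
A sketch in the spirit of the model described, classifying hairpins as high or low quality from intrinsic features, is given below. The features (GC content, hairpin length, a folding-energy estimate), the synthetic values and the Random Forest choice are assumptions, not the descriptors or learner used in the paper.

```python
# Sketch: quality classification of pre-miRNA hairpins from intrinsic features.
# Features, labels and data are synthetic placeholders for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 500
features = np.column_stack([
    rng.uniform(0.3, 0.7, n),   # GC content
    rng.integers(50, 180, n),   # hairpin length (nt)
    rng.normal(-35, 10, n),     # minimum free energy estimate (kcal/mol)
])
quality = rng.integers(0, 2, n)  # 1 = passes a filtering strategy, 0 = filtered out

clf = RandomForestClassifier(n_estimators=200, random_state=1)
print("CV accuracy:", cross_val_score(clf, features, quality, cv=5).mean())
```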


Atmosphere ◽  
2019 ◽  
Vol 10 (5) ◽  
pp. 251 ◽  
Author(s):  
Wael Ghada ◽  
Nicole Estrella ◽  
Annette Menzel

Rain microstructure parameters assessed by disdrometers are commonly used to classify rain into convective and stratiform. However, different types of disdrometer result in different values for these parameters. This in turn potentially deteriorates the quality of rain type classifications. Thies disdrometer measurements at two sites in Bavaria in southern Germany were combined with cloud observations to construct a set of clear convective and stratiform intervals. This reference dataset was used to study the performance of classification methods from the literature based on the rain microstructure. We also explored the possibility of improving the performance of these methods by tuning the decision boundary. We further identified highly discriminant rain microstructure parameters and used these parameters in five machine-learning classification models. Our results confirm the potential of achieving high classification performance by applying the concepts of machine learning compared to already available methods. Machine-learning classification methods provide a concrete and flexible procedure that is applicable regardless of the geographical location or the device. The suggested procedure for classifying rain types is recommended prior to studying rain microstructure variability or any attempts at improving radar estimations of rain intensity.
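
A minimal sketch of the kind of comparison described, several off-the-shelf classifiers trained on rain microstructure parameters to separate convective from stratiform intervals, follows. The feature names, synthetic data and toy decision boundary are placeholders, not the disdrometer variables or models used in the study.

```python
# Sketch: compare simple classifiers on (synthetic) rain microstructure features for
# convective vs. stratiform classification. Data and features are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 400
X = np.column_stack([
    rng.gamma(2.0, 2.0, n),    # rain intensity (mm/h)
    rng.normal(1.2, 0.4, n),   # mean drop diameter (mm)
    rng.poisson(300, n),       # drop count per interval
])
y = (X[:, 0] > 4).astype(int)  # toy boundary: 1 = convective, 0 = stratiform

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=7)),
                    ("SVM", SVC())]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```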


Molecules ◽  
2019 ◽  
Vol 24 (7) ◽  
pp. 1409 ◽  
Author(s):  
Christina Baek ◽  
Sang-Woo Lee ◽  
Beom-Jin Lee ◽  
Dong-Hyun Kwak ◽  
Byoung-Tak Zhang

Recent research in DNA nanotechnology has demonstrated that biological substrates can be used for computing at a molecular level. However, in vitro demonstrations of DNA computations use preprogrammed, rule-based methods which lack the adaptability that may be essential for developing molecular systems that function in dynamic environments. Here, we introduce an in vitro molecular algorithm that ‘learns’ molecular models from training data, opening the possibility of ‘machine learning’ in wet molecular systems. Our algorithm enables enzymatic weight updates by targeting internal loop structures in DNA, and ensemble learning based on the hypernetwork model. This novel approach allows massively parallel processing of DNA with enzymes for specific structural selection, enabling learning in an iterative manner. We also introduce an intuitive method of DNA data construction that dramatically reduces the number of unique DNA sequences needed to cover the large search space of feature sets. By combining molecular computing and machine learning, the proposed algorithm takes a step closer to molecular computing technologies that will provide future access to more intelligent molecular systems.
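
To make the hypernetwork-style ensemble learning concrete, below is a highly simplified in-silico analogue: hyperedges are small random feature subsets with weights, prediction is a weighted vote of matching hyperedges, and training reinforces hyperedges that vote correctly. This only illustrates the learning idea; it does not represent the enzymatic DNA implementation reported in the paper, and all data and parameters are invented for illustration.

```python
# Toy in-silico analogue of hypernetwork-style ensemble learning (not the wet-lab protocol).
import numpy as np

rng = np.random.default_rng(3)
n_features, n_edges, order = 12, 400, 3

X = rng.integers(0, 2, size=(300, n_features))  # binary training patterns
y = (X[:, 0] ^ X[:, 1]).astype(int)             # toy target

edges = [(rng.choice(n_features, order, replace=False),  # feature indices
          rng.integers(0, 2, order),                     # required feature values
          rng.integers(0, 2))                            # label the hyperedge votes for
         for _ in range(n_edges)]
weights = np.ones(n_edges)

def predict(x):
    votes = np.zeros(2)
    for w, (idx, vals, label) in zip(weights, edges):
        if np.array_equal(x[idx], vals):  # hyperedge matches the pattern
            votes[label] += w
    return int(votes.argmax())

for _ in range(5):  # a few learning epochs
    for x, target in zip(X, y):
        for i, (idx, vals, label) in enumerate(edges):
            if np.array_equal(x[idx], vals):
                weights[i] *= 1.1 if label == target else 0.9  # reinforce or suppress

print("training accuracy:", np.mean([predict(x) == t for x, t in zip(X, y)]))
```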


2014 ◽  
Vol 70 (6) ◽  
pp. 970-996 ◽  
Author(s):  
Paula Goodale ◽  
Paul David Clough ◽  
Samuel Fernando ◽  
Nigel Ford ◽  
Mark Stevenson

Purpose – The purpose of this paper is to investigate the effects of cognitive style on navigating a large digital library of cultural heritage information; specifically, the paper focuses on the wholist/analytic dimension as studied in the field of educational informatics. The hypothesis is that wholist and analytic users have characteristically different approaches when they explore, search and interact with digital libraries, which may have implications for system design. Design/methodology/approach – A detailed interactive information retrieval (IIR) evaluation of a large cultural heritage digital library was undertaken, along with Riding's Cognitive Styles Analysis (CSA) test. Participants carried out a range of information tasks, and the authors analysed their task performance, interactions and attitudes. Findings – The hypothesis on the differences in performance and behaviour between wholist and analytic users is supported. However, the authors also find that user attitudes towards the system are opposite to expectations and that users give positive feedback for functionality that supports activities in which they are cognitively weaker. Research limitations/implications – There is scope for testing the results in a larger-scale study and/or with different systems. In particular, the findings on user attitudes warrant further investigation. Practical implications – The findings on user attitudes suggest that systems which support areas of weakness in users’ cognitive abilities are valued, indicating an opportunity to offer diverse functionality to support different cognitive weaknesses. Originality/value – A model is proposed suggesting a converse relationship between behaviour and attitudes: users display search/navigation behaviour mapped onto the strengths of their cognitive style, but place greater value on interface features that support aspects in which they are weaker.


Author(s):  
Muhammad Zulqarnain ◽  
Rozaida Ghazali ◽  
Muhammad Ghulam Ghouse ◽  
Muhammad Faheem Mushtaq

Text classification has become a serious problem for large organizations that need to manage vast amounts of online data, and it has been extensively applied in Natural Language Processing (NLP) tasks. Text classification can help users effectively manage and exploit meaningful information, which needs to be classified into various categories for further use. To classify texts as accurately as possible, our research aims to develop a deep learning approach that obtains better performance in text classification than other RNN approaches. The main challenges in text classification are enhancing classification accuracy and coping with the sparsity of the data, since semantic sensitivity to context often hinders classification performance. To overcome these weaknesses, in this paper we propose a unified structure to investigate the effects of word embeddings and the Gated Recurrent Unit (GRU) for text classification on two benchmark datasets (Google snippets and TREC). The GRU is a well-known type of recurrent neural network (RNN) capable of processing sequential data through its recurrent architecture. Semantically related words are typically close to each other in embedding spaces. First, the words in posts are transformed into vectors via a word embedding technique. Then, the word sequences in sentences are fed to the GRU to extract the contextual semantics between words. The experimental results show that the proposed GRU model can effectively learn word usage in the context of texts given sufficient training data; the quantity and quality of the training data significantly affect performance. We compared the performance of the proposed approach with traditional recurrent approaches (RNN, MV-RNN and LSTM); the proposed approach obtains better results on the two benchmark datasets in terms of accuracy and error rate.
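
The described architecture, word embeddings feeding a GRU whose final hidden state is classified, can be sketched as follows. PyTorch is used here although the paper does not specify a framework, and the vocabulary size, dimensions and dummy batch are illustrative assumptions.

```python
# Sketch of an embedding -> GRU -> linear classifier for sentence classification.
# Hyperparameters and the dummy batch are placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class GRUTextClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=128, num_classes=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.gru(embedded)         # hidden: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))      # class logits

model = GRUTextClassifier()
dummy_batch = torch.randint(1, 5000, (8, 20))  # 8 padded sentences of 20 token ids
print(model(dummy_batch).shape)                # torch.Size([8, 6])
```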


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Yi Li ◽  
Chance M. Nowak ◽  
Uyen Pham ◽  
Khai Nguyen ◽  
Leonidas Bleris

Abstract: Herein, we implement and assess machine learning architectures to ascertain models that differentiate healthy from apoptotic cells using exclusively forward (FSC) and side (SSC) scatter flow cytometry information. To generate training data, colorectal cancer HCT116 cells were subjected to miR-34a treatment and then classified using a conventional Annexin V/propidium iodide (PI)-staining assay. The apoptotic cells were defined as Annexin V-positive cells, which include early and late apoptotic cells, necrotic cells, as well as other dying or dead cells. In addition to the fluorescent signal, we collected cell size and granularity information from the FSC and SSC parameters. Both parameters are subdivided into area, height, and width, thus providing a total of six numerical features that informed and trained our models. A collection of logistic regression, random forest, k-nearest neighbor, multilayer perceptron, and support vector machine models was trained and tested for classification performance in predicting cell states using only the six aforementioned numerical features. Out of 1046 candidate models, a multilayer perceptron was chosen, with 0.91 live precision, 0.93 live recall, 0.92 live F-measure and 0.97 live area under the ROC curve when applied to standardized data. We discuss and highlight differences in classifier performance and compare the results to the standard practice of forward and side scatter gating, typically performed to select cells based on size and/or complexity. We demonstrate that our model, a ready-to-use module for any flow cytometry-based analysis, can provide automated, reliable, and stain-free classification of healthy and apoptotic cells using exclusively size and granularity information.
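
The chosen setup, a multilayer perceptron trained on the six standardized scatter features, can be sketched as below. The column names, file name and MLP hyperparameters are illustrative assumptions, not the specific model selected from the 1046 candidates.

```python
# Sketch: standardize the six FSC/SSC features and train an MLP to separate
# Annexin V-positive (apoptotic) from live cells. Names and settings are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

FEATURES = ["FSC-A", "FSC-H", "FSC-W", "SSC-A", "SSC-H", "SSC-W"]

df = pd.read_csv("cytometry_events.csv")  # hypothetical export: scatter features + label
X_tr, X_te, y_tr, y_te = train_test_split(
    df[FEATURES], df["annexin_v_positive"],
    test_size=0.2, random_state=0, stratify=df["annexin_v_positive"],
)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
mlp.fit(X_tr, y_tr)
print(classification_report(y_te, mlp.predict(X_te)))
```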


2021 ◽  
Vol 19 (2) ◽  
Author(s):  
Piotr Nawrocki ◽  
Patryk Osypanka

Abstract: Predicting demand for computing resources in any system is a vital task since it allows the optimized management of resources. To some degree, cloud computing reduces the urgency of accurate prediction as resources can be scaled on demand, which may, however, result in excessive costs. Numerous methods of optimizing cloud computing resources have been proposed, but such optimization commonly degrades system responsiveness which results in quality of service deterioration. This paper presents a novel approach, using anomaly detection and machine learning to achieve cost-optimized and QoS-constrained cloud resource configuration. The utilization of these techniques enables our solution to adapt to different system characteristics and different QoS constraints. Our solution was evaluated using a system located in Microsoft’s Azure cloud environment, and its efficiency in other providers’ computing clouds was estimated as well. Experiment results demonstrate a cost reduction ranging from 51% to 85% (for PaaS/IaaS) over the tested period.
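
A minimal sketch of the general idea, filtering anomalous utilization samples before training a demand predictor that can drive cost/QoS-aware scaling, is given below. The synthetic workload, the IsolationForest/Random Forest pairing and the lag-feature construction are assumptions, not the paper's specific algorithms or Azure configuration.

```python
# Sketch: anomaly-filtered training of a resource-demand predictor on a synthetic workload.
# All models, thresholds and data are illustrative placeholders.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(5)
t = np.arange(2000)
load = 50 + 20 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 3, t.size)  # daily CPU pattern
load[rng.choice(t.size, 20, replace=False)] += 60                        # injected anomalies

# Lag features: predict demand one step ahead from the previous 12 samples.
lags = 12
X = np.column_stack([load[i:i - lags] for i in range(lags)])
y = load[lags:]

mask = IsolationForest(random_state=5).fit_predict(X) == 1  # keep non-anomalous samples
reg = RandomForestRegressor(random_state=5).fit(X[mask], y[mask])
print("MAE:", mean_absolute_error(y, reg.predict(X)))
```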

