Machine Learning on Graph-Structured Data

Author(s):  
Claudio D. T. Barros ◽  
Daniel N. R. da Silva ◽  
Fabio A. M. Porto

Several real-world complex systems have graph-structured data, including social networks, biological networks, and knowledge graphs. The continuous growth in the quantity and quality of these graphs calls for learning models that unlock the potential of this data for tasks such as node classification, graph classification, and link prediction. This tutorial presents machine learning on graphs, focusing on how representation learning, from traditional approaches (e.g., matrix factorization and random walks) to deep neural architectures, supports those tasks. We also introduce representation learning over dynamic graphs and knowledge graphs. Lastly, we discuss open problems, such as scalability and distributed network embedding systems.
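As a concrete instance of the random-walk family of approaches mentioned above, the sketch below learns DeepWalk-style node embeddings by feeding truncated random walks to a skip-gram model. The graph, walk lengths, and hyperparameters are illustrative assumptions, not the tutorial's prescriptions.

```python
# A minimal DeepWalk-style sketch: random walks + skip-gram embeddings.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_len=20, seed=0):
    """Generate truncated uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in G.nodes():
            walk = [node]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])  # Word2Vec expects strings
    return walks

G = nx.karate_club_graph()
walks = random_walks(G)
# Treat walks as "sentences" and learn skip-gram (sg=1) node embeddings.
model = Word2Vec(walks, vector_size=32, window=5, min_count=1, sg=1, epochs=5)
print(model.wv.most_similar("0"))  # nodes embedded near node 0
```

The learned vectors can then feed downstream node classification or link prediction, which is exactly how such embeddings are used in practice.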

2016 ◽  
Vol 71 (2) ◽  
pp. 160-171 ◽  
Author(s):  
A. A. Baranov ◽  
L. S. Namazova-Baranova ◽  
I. V. Smirnov ◽  
D. A. Devyatkin ◽  
A. O. Shelmanov ◽  
...  

The paper presents a system for intelligent analysis of clinical information. The authors describe methods implemented in the system for clinical information retrieval, intelligent diagnosis of chronic diseases, assessment of the importance of patient features, and detection of hidden dependencies between features. Results of the experimental evaluation of these methods are also presented.

Background: Healthcare facilities generate a large flow of both structured and unstructured data containing important information about patients. Test results are usually retained as structured data, but some data are retained as natural language texts (medical history, the results of physical examination, and the results of other examinations, such as ultrasound, ECG, or X-ray studies). Many tasks arising in clinical practice can be automated by applying methods for intelligent analysis of the accumulated structured and unstructured data, which leads to improvement of healthcare quality.

Aims: The creation of a complex system for intelligent data analysis in a multidisciplinary pediatric center.

Materials and methods: The authors propose methods for information extraction from clinical texts in Russian, carried out on the basis of deep linguistic analysis. The methods retrieve terms for diseases, symptoms, body regions, and drugs, and can recognize additional attributes such as "negation" (the disease is absent), "no patient" (the disease refers to a family member rather than the patient), "severity of illness", "disease course", and "body region to which the disease refers". The authors use a set of handcrafted templates and various machine learning techniques to retrieve information with the help of a medical thesaurus. The extracted information is used to solve the problem of automatic diagnosis of chronic diseases. A machine learning method for classification of patients with similar nosology and a method for determining the most informative patient features are also proposed.

Results: The authors processed anonymized health records from the pediatric center to evaluate the proposed methods. The results show the applicability of the information extracted from the texts to practical problems. Records of patients with allergic, glomerular, and rheumatic diseases were used for experimental assessment of the automatic diagnosis method. The authors also determined the most appropriate machine learning methods for patient classification for each group of diseases, as well as the most informative disease signs. Using the additional information extracted from clinical texts together with structured data improves the quality of diagnosis of chronic diseases. The authors also obtained characteristic combinations of disease signs.

Conclusions: The proposed methods have been implemented in the intelligent data processing system of a multidisciplinary pediatric center. The experimental results show the ability of the system to improve the quality of pediatric healthcare.
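To make the extracted attributes concrete, here is a minimal sketch of one way such terms might be represented and flagged. The cue list and token window are illustrative assumptions; the actual system analyzes Russian text via deep linguistic analysis, not keyword windows.

```python
# Sketch: extracted clinical terms with the attributes described above.
# Negation cues and the 3-token window are illustrative assumptions.
from dataclasses import dataclass, field

NEGATION_CUES = {"no", "denies", "without", "absent"}

@dataclass
class ClinicalTerm:
    text: str                     # surface form of the extracted term
    category: str                 # disease, symptom, body region, or drug
    negated: bool = False         # "negation": the condition is absent
    not_patient: bool = False     # "no patient": refers to a family member
    attributes: dict = field(default_factory=dict)  # severity, course, region

def extract(tokens: list[str], term: str, category: str) -> ClinicalTerm:
    """Mark a term as negated if a cue appears within 3 tokens before it."""
    idx = tokens.index(term)
    window = tokens[max(0, idx - 3):idx]
    return ClinicalTerm(term, category, negated=bool(NEGATION_CUES & set(window)))

tokens = "patient denies chest pain , reports cough".split()
print(extract(tokens, "pain", "symptom"))   # negated=True
print(extract(tokens, "cough", "symptom"))  # negated=False
```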


2021 ◽  
Vol 13 (4) ◽  
pp. 1-35
Author(s):  
Gabriel Amaral ◽  
Alessandro Piscopo ◽  
Lucie-aimée Kaffee ◽  
Odinaldo Rodrigues ◽  
Elena Simperl

Wikidata is one of the most important sources of structured data on the web, built by a worldwide community of volunteers. As a secondary source, its contents must be backed by credible references; this is particularly important because Wikidata explicitly encourages editors to add claims for which there is no broad consensus, as long as they are corroborated by references. Despite this essential link between content and references, however, Wikidata's ability to systematically assess and assure the quality of its references remains limited. To address this, we carry out a mixed-methods study to determine the relevance, ease of access, and authoritativeness of Wikidata references, at scale and in different languages, using online crowdsourcing, descriptive statistics, and machine learning. Building on our previous work, we run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages. We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata. The findings help us ascertain the quality of references in Wikidata and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web. We also discuss ongoing editorial practices that could encourage the use of higher-quality references in a more immediate way. All data and code used in the study are available on GitHub for feedback and further improvement and deployment by the research community.
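For readers who want to inspect the raw material of such a study, the sketch below pulls the references attached to one Wikidata claim via the public SPARQL endpoint. The entity and property are arbitrary examples, and this is a sketch of the data model rather than the authors' pipeline.

```python
# Sketch: fetch the references attached to Wikidata claims.
# Statements link to reference nodes via prov:wasDerivedFrom.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?statement ?refProp ?refValue WHERE {
  wd:Q42 p:P69 ?statement .               # Douglas Adams, "educated at"
  ?statement prov:wasDerivedFrom ?ref .   # statement -> reference node
  ?ref ?refProp ?refValue .               # reference properties (URL, etc.)
} LIMIT 20
"""

resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "reference-quality-sketch/0.1"})
for row in resp.json()["results"]["bindings"]:
    print(row["refProp"]["value"], "->", row["refValue"]["value"])
```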


F1000Research ◽  
2021 ◽  
Vol 10 ◽  
pp. 881
Author(s):  
Sini Govindapillai ◽  
Lay-Ki Soon ◽  
Su-Cheng Haw

A knowledge graph (KG) publishes a machine-readable representation of knowledge on the Web. Structured data in a knowledge graph is published using the Resource Description Framework (RDF), where knowledge is represented as triples (subject, predicate, object). Because knowledge graphs contain erroneous, outdated, or conflicting data, the quality of their facts cannot be guaranteed; the provenance of knowledge can therefore assist in building trust in these knowledge graphs. In this paper, we provide an analysis of two popular, general knowledge graphs, Wikidata and YAGO4, with regard to the representation of provenance and context data. Since plain RDF does not support statement-level metadata for provenance and contextualization, an alternative mechanism, RDF reification, is employed by most knowledge graphs. The trustworthiness of facts in a knowledge graph can be enhanced by adding metadata such as the source of the information and the location and time of the fact's occurrence. Wikidata employs qualifiers to attach metadata to facts, while YAGO4 collects metadata from Wikidata qualifiers. RDF reification increases the volume of data, as several statements are required to represent a single fact; however, facts in Wikidata and YAGO4 can be fetched without using reification. Another limitation for applications that use provenance data is that not all facts in these knowledge graphs are annotated with it. Structured data in knowledge graphs is noisy, so provenance data can increase the reliability of knowledge graph data. To the best of our knowledge, this is the first paper to investigate how, and to what extent, two prominent KGs, Wikidata and YAGO4, add such metadata.
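To illustrate the overhead the authors describe, the following sketch (using the Python rdflib library, a tooling choice of ours rather than the paper's) reifies a single fact: one triple becomes four reification triples, plus the provenance statements attached to the statement node.

```python
# Sketch of classic RDF reification with rdflib; the namespace,
# resources, and metadata values are hypothetical examples.
from rdflib import Graph, Literal, Namespace, BNode, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# The plain fact: a single triple.
g.add((EX.Berlin, EX.populationEstimate, Literal(3645000)))

# Reification: four additional triples describe the fact itself...
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Berlin))
g.add((stmt, RDF.predicate, EX.populationEstimate))
g.add((stmt, RDF.object, Literal(3645000)))

# ...so that provenance metadata can be attached to the statement node.
g.add((stmt, EX.source, URIRef("http://example.org/census2019")))
g.add((stmt, EX.retrieved, Literal("2019-12-31")))

print(g.serialize(format="turtle"))
```

The five-to-one expansion (plus metadata) is exactly the growth in data volume the abstract refers to.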


2017 ◽  
Author(s):  
Nathaniel R. Greenbaum ◽  
Yacine Jernite ◽  
Yoni Halpern ◽  
Shelley Calder ◽  
Larry A. Nathanson ◽  
...  

Objective: To determine the effect of contextual autocomplete, a user interface that uses machine learning, on the efficiency and quality of documentation of presenting problems (chief complaints) in the emergency department (ED).

Materials and Methods: We used contextual autocomplete, a user interface that ranks concepts by their predicted probability, to help nurses enter data about a patient's reason for visiting the ED. Predicted probabilities were calculated using a previously derived model based on triage vital signs and a brief free-text note. We evaluated the percentage and quality of structured data captured using a prospective before-and-after study design.

Results: A total of 279,231 patient encounters were analyzed. Structured data capture improved from 26.2% to 97.2% (p<0.0001). During the post-implementation period, presenting problems were more complete (3.35 vs. 3.66; p=0.0004), equally precise (3.59 vs. 3.74; p=0.1), and higher in overall quality (3.38 vs. 3.72; p=0.0002). Our system reduced the mean number of keystrokes required to document a presenting problem from 11.6 to 0.6 (p<0.0001), a 95% improvement.

Discussion: We have demonstrated a technique that captures structured data on nearly all patients. We estimate that our system reduces the number of man-hours required annually to type presenting problems at our institution from 92.5 to 4.8.

Conclusion: Implementation of a contextual autocomplete system resulted in improved structured data capture, ontology usage compliance, and data quality.
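The core interface idea, ranking matching concepts by contextual probability instead of alphabetically, can be sketched as follows. The probabilities and concept names are hypothetical stand-ins for the authors' triage-based model.

```python
# Sketch: contextual autocomplete ranks ontology concepts by predicted
# probability. The model output is assumed to be precomputed upstream.
from typing import Dict, List

def rank_suggestions(prefix: str, candidate_probs: Dict[str, float]) -> List[str]:
    """Return concepts matching the typed prefix, best-probability first."""
    matches = [c for c in candidate_probs if c.lower().startswith(prefix.lower())]
    # Contextual ranking: order by predicted probability, not alphabetically.
    return sorted(matches, key=lambda c: candidate_probs[c], reverse=True)

# Hypothetical probabilities for a patient presenting with chest discomfort.
probs = {"Chest pain": 0.42, "Cough": 0.07, "Chest wall injury": 0.03}
print(rank_suggestions("ch", probs))  # ['Chest pain', 'Chest wall injury']
```

Because the most likely concept surfaces first, a nurse typically confirms it with a single keystroke, which is how the keystroke count can drop from 11.6 to 0.6.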


Author(s):  
Shuo Zhang ◽  
Lei Xie

Graph Neural Networks (GNNs) are powerful tools for representation learning on graph-structured data. Most GNNs use a message-passing scheme in which the embedding of a node is iteratively updated by aggregating information from its neighbors. To better capture the influence of individual nodes, attention mechanisms have become a popular way to assign trainable weights to nodes during aggregation. Although attention-based GNNs have achieved remarkable results in various tasks, a clear understanding of their discriminative capacity has been missing. In this work, we present a theoretical analysis of the representational properties of GNNs that adopt the attention mechanism as an aggregator. Our analysis identifies all cases in which such attention-based GNNs always fail to distinguish certain distinct structures; these failures arise because attention-based aggregation ignores cardinality information. To improve the performance of attention-based GNNs, we propose Cardinality Preserved Attention (CPA) models that can be applied to any kind of attention mechanism. Our experiments on node and graph classification confirm our theoretical analysis and show the competitive performance of our CPA models. The code is available online: https://github.com/zetayue/CPA.
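The failure mode and one possible fix can be seen in a few lines: because attention weights are normalized to sum to 1, the weighted sum cannot tell a neighborhood from a duplicated copy of itself. The sketch below restores cardinality by scaling the pooled vector by |N(v)|, a simple variant in the spirit of CPA; the paper's exact formulations differ.

```python
# Sketch: why normalized attention loses cardinality, and a simple remedy.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(h_neighbors, scores, preserve_cardinality=True):
    """Aggregate neighbor embeddings with attention weights."""
    alpha = softmax(scores)                        # weights sum to 1
    pooled = (alpha[:, None] * h_neighbors).sum(axis=0)
    if preserve_cardinality:
        pooled = pooled * len(h_neighbors)         # reintroduce |N(v)|
    return pooled

h = np.array([[1.0, 0.0], [0.0, 1.0]])
s = np.array([0.3, 0.3])
# Duplicating every neighbor leaves plain attention unchanged...
print(attention_aggregate(h, s, preserve_cardinality=False))            # [0.5 0.5]
print(attention_aggregate(np.vstack([h, h]), np.concatenate([s, s]),
                          preserve_cardinality=False))                  # [0.5 0.5]
# ...but the cardinality-preserving version tells the two apart.
print(attention_aggregate(h, s))                                        # [1. 1.]
print(attention_aggregate(np.vstack([h, h]), np.concatenate([s, s])))   # [2. 2.]
```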


Author(s):  
Feidu Akmel ◽  
Ermiyas Birihanu ◽  
Bahir Siraj

Software systems are software products or applications that support business domains such as manufacturing, aviation, healthcare, insurance, and so on. Software quality is a means of measuring how software is designed and how well it conforms to that design. Among the variables we look for in software quality are correctness, product quality, scalability, completeness, and absence of bugs. However, the quality standards used by one organization differ from those of another, so it is better to apply software metrics to measure software quality. Attributes gathered from source code through software metrics can serve as input to a software defect predictor. Software defects are errors introduced by software developers and stakeholders. Finally, in this study we review the application of machine learning to software defect data gathered from previous research works.
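As a sketch of the metrics-to-predictor pipeline described above, the following trains a classifier on per-module code metrics. The metric set and the synthetic labels are illustrative assumptions, not data from the study.

```python
# Sketch: metrics-based defect prediction. Column meanings are hypothetical;
# real studies often use NASA MDP-style datasets of per-module metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.poisson(120, n),   # lines of code
    rng.poisson(8, n),     # cyclomatic complexity
    rng.poisson(3, n),     # coupling between objects
])
# Synthetic labels for illustration only: complex modules fail more often.
y = (X[:, 1] + rng.normal(0, 2, n) > 9).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```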


2020 ◽  
Author(s):  
Saeed Nosratabadi ◽  
Amir Mosavi ◽  
Puhong Duan ◽  
Pedram Ghamisi ◽  
Ferdinand Filip ◽  
...  

This paper provides a state-of-the-art investigation of advances in data science in emerging economic applications. The analysis covers novel data science methods in four classes: deep learning models, hybrid deep learning models, hybrid machine learning models, and ensemble models. Application domains include a wide and diverse range of economics research, from the stock market, marketing, and e-commerce to corporate banking and cryptocurrency. The PRISMA method, a systematic literature review methodology, was used to ensure the quality of the survey. The findings reveal that the trends follow the advancement of hybrid models, which, based on the accuracy metric, outperform other learning algorithms. It is further expected that the trends will converge toward the advancement of sophisticated hybrid deep learning models.


2020 ◽  
Vol 20 (9) ◽  
pp. 720-730
Author(s):  
Iker Montes-Bageneta ◽  
Urtzi Akesolo ◽  
Sara López ◽  
Maria Merino ◽  
Eneritz Anakabe ◽  
...  

Aims: Computational modelling may help us detect the most important factors governing waste generation in order to optimize it. Background: The generation of hazardous organic waste in teaching and research laboratories poses a substantial management problem for universities. Methods: In this work, we report on the experimental measurement of waste generation in the chemical education laboratories of our department. We measured the waste generated in the teaching laboratories of the Organic Chemistry Department II (UPV/EHU) in the second semester of the 2017/2018 academic year. To identify the anthropogenic and social factors related to waste generation, a questionnaire was administered to all students of the Experimentation in Organic Chemistry (EOC) and Organic Chemistry II (OC2) subjects. It assessed their prior knowledge about waste, their awareness of the problem of separating organic waste, and their correct use of the containers. These results, together with the volumetric data, were analyzed with statistical analysis software. We obtained two Perturbation-Theory Machine Learning (PTML) models including chemical, operational, and academic factors. The dataset analyzed included 6050 cases of laboratory practices vs. reference practices. Results: These models predict the values of acetone waste with R² = 0.88 and of non-halogenated waste with R² = 0.91. Conclusion: This work opens the door to the implementation of more sustainable techniques and a circular economy, with the aim of improving the quality of university education processes.
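The perturbation-theory idea behind PTML, describing each case by how its descriptors deviate from the reference practices and regressing the target on those deviations, can be sketched as follows. The feature construction and synthetic data are simplifying assumptions, not the paper's exact specification.

```python
# Sketch of the perturbation-theory idea behind PTML models: features are
# deviations of case descriptors from reference-case expectations.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
descriptors = rng.normal(size=(n, 3))         # chemical / operational / academic
reference_mean = descriptors.mean(axis=0)     # expected values of reference cases
perturbations = descriptors - reference_mean  # deviation from the reference

# Synthetic target for illustration: waste volume driven by the perturbations.
waste = 5.0 + perturbations @ np.array([1.2, -0.4, 0.7]) + rng.normal(0, 0.3, n)

model = LinearRegression().fit(perturbations, waste)
print("R^2 =", round(model.score(perturbations, waste), 2))
```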


2019 ◽  
Vol 19 (1) ◽  
pp. 4-16 ◽  
Author(s):  
Qihui Wu ◽  
Hanzhong Ke ◽  
Dongli Li ◽  
Qi Wang ◽  
Jiansong Fang ◽  
...  

Over the past decades, peptides as therapeutic candidates have received increasing attention in drug discovery, especially antimicrobial peptides (AMPs), anticancer peptides (ACPs), and anti-inflammatory peptides (AIPs). Peptides are considered capable of modulating various complex diseases that were previously untreatable. In recent years, the critical problem of antimicrobial resistance has driven the pharmaceutical industry to look for new therapeutic agents. Compared to small organic drugs, peptide-based therapy exhibits high specificity and minimal toxicity, so peptides are widely used in the design and discovery of new potent drugs. Currently, large-scale screening of peptide activity with traditional approaches is costly, time-consuming, and labor-intensive. Hence, in silico methods, mainly machine learning approaches, have been introduced to predict peptide activity thanks to their accuracy and effectiveness. In this review, we document recent progress in machine learning-based prediction of peptide activity, which will be of great benefit to the discovery of potential active AMPs, ACPs, and AIPs.
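A common baseline in this literature encodes each peptide by its amino-acid composition and trains a standard classifier. The sketch below illustrates that pipeline with toy sequences; the labels and model choice are illustrative assumptions, not from the review, which covers many richer encodings.

```python
# Sketch: amino-acid-composition features + a linear classifier for
# peptide activity prediction. Training pairs are toy examples.
from collections import Counter
from sklearn.linear_model import LogisticRegression

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq: str) -> list[float]:
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]

# Toy training data: (sequence, is_antimicrobial) pairs, illustrative only.
train = [("KWKLFKKIEK", 1), ("GIGKFLHSAK", 1), ("AAAAGGGSSS", 0), ("DEDEDEDEDE", 0)]
X = [composition(s) for s, _ in train]
y = [label for _, label in train]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([composition("KLWKKILKKL")]))  # [P(inactive), P(active)]
```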

