An Agglomerative-adapted Partition Approach for Large-scale Graphs

2019, Vol 4 (1), pp. 3
Author(s): Chen Tao, Rongrong Shan, Hui Li, Dongsheng Wang, Wei Liu

In recent years, an increasing number of knowledge bases have been built using linked data, and their datasets have grown substantially. It is neither reasonable to store a large volume of triple data in a single graph, nor appropriate to store RDF in named graphs partitioned by class URIs, because the many joins required between graphs can cause performance problems. This paper presents an agglomerative-adapted partition approach for large-scale graphs, organized as a bottom-up merging process. The proposed algorithm partitions triple data at three levels: blank nodes, associated nodes, and inference nodes. Blank nodes and the classes/nodes involved in reasoning rules are better stored together with an optimal neighboring node in the same partition than split across separate partitions. The merging of associated nodes starts from the node with the smallest merging cost and repeats until the target number of partitions is reached. Finally, the feasibility and rationality of the merging algorithm are analyzed in detail through bibliographic cases. In summary, the partitioning methods proposed in this paper can be applied to distributed storage, data retrieval, data export, and semantic reasoning over large-scale triple graphs. In future work, we will investigate setting the number of partitions automatically with machine learning algorithms.
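
A minimal sketch of the bottom-up merging idea described in the abstract: start from singleton partitions and repeatedly merge the cheapest pair until the target number of partitions is reached. The cost function and edge-cut heuristic here are illustrative assumptions, not the cost model used in the paper.

```python
# Illustrative sketch of bottom-up (agglomerative) partition merging.
# The cost function is an assumption for illustration only.

def edge_cut(part_a, part_b, edges):
    """Number of triples crossing between two candidate partitions."""
    return sum(1 for s, o in edges
               if (s in part_a and o in part_b) or (s in part_b and o in part_a))

def merge_cost(part_a, part_b, edges):
    # Assumed cost: penalise the size of the merged partition,
    # reward pairs connected by many cross edges.
    return (len(part_a) + len(part_b)) - 2 * edge_cut(part_a, part_b, edges)

def agglomerative_partition(nodes, edges, k):
    """Merge singleton partitions bottom-up until only k partitions remain."""
    partitions = [{n} for n in nodes]          # start from one node per partition
    while len(partitions) > k:
        best = None
        for i in range(len(partitions)):
            for j in range(i + 1, len(partitions)):
                cost = merge_cost(partitions[i], partitions[j], edges)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        partitions[i] |= partitions[j]         # merge the cheapest pair
        del partitions[j]
    return partitions

# Example: a tiny subject-object graph partitioned into 2 groups.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(agglomerative_partition(nodes, edges, k=2))
```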

2021, Vol 28 (1), pp. e100251
Author(s): Ian Scott, Stacey Carter, Enrico Coiera

Machine learning algorithms are being used to screen and diagnose disease, prognosticate and predict therapeutic responses. Hundreds of new algorithms are being developed, but whether they improve clinical decision making and patient outcomes remains uncertain. If clinicians are to use algorithms, they need to be reassured that key issues relating to their validity, utility, feasibility, safety and ethical use have been addressed. We propose a checklist of 10 questions that clinicians can ask of those advocating for the use of a particular algorithm, but which do not expect clinicians, as non-experts, to demonstrate mastery over what can be highly complex statistical and computational concepts. The questions are: (1) What is the purpose and context of the algorithm? (2) How good were the data used to train the algorithm? (3) Were there sufficient data to train the algorithm? (4) How well does the algorithm perform? (5) Is the algorithm transferable to new clinical settings? (6) Are the outputs of the algorithm clinically intelligible? (7) How will this algorithm fit into and complement current workflows? (8) Has use of the algorithm been shown to improve patient care and outcomes? (9) Could the algorithm cause patient harm? and (10) Does use of the algorithm raise ethical, legal or social concerns? We provide examples where an algorithm may raise concerns and apply the checklist to a recent review of diagnostic imaging applications. This checklist aims to assist clinicians in assessing algorithm readiness for routine care and in identifying situations where further refinement and evaluation are required prior to large-scale use.


2020, Vol 8 (Suppl 3), pp. A62-A62
Author(s): Dattatreya Mellacheruvu, Rachel Pyke, Charles Abbott, Nick Phillips, Sejal Desai, ...

Background: Accurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended this work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies.

Methods: In-house immunopeptidomic data were generated using stably transfected HLA-null K562 cell lines that express a single HLA allele of interest, followed by immunoprecipitation using the W6/32 antibody and LC-MS/MS. Public immunopeptidomics data were downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at a 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples using the ImmunoID NeXT Platform.

Results: We have generated large-scale, high-quality immunopeptidomics data using approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles to create our primary models. Briefly, our primary ‘binding’ algorithm models MHC-peptide binding using peptide and binding-pocket features, while our primary ‘presentation’ model uses additional features to model antigen processing and presentation. Both primary models have significantly higher precision across all recall values in multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve the performance of our model, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data were integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples.

Conclusions: Improving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance than a state-of-the-art public algorithm and furthers this objective.
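
A minimal sketch of the general idea behind resolving peptide-to-allele mappings in multi-allelic data with a primary model, as mentioned in the Results. The scoring interface and the confidence threshold are hypothetical assumptions for illustration; this is not the SHERPA implementation.

```python
# Illustrative deconvolution of multi-allelic immunopeptidomics data.
# `score` is a hypothetical stand-in for a trained primary presentation model
# returning a presentation likelihood in [0, 1].

from typing import Callable, Dict, List, Optional, Tuple

def assign_peptides(
    peptides: List[str],
    sample_alleles: List[str],
    score: Callable[[str, str], float],
    min_score: float = 0.5,            # assumed confidence threshold
) -> Dict[str, Optional[str]]:
    """Assign each peptide to the highest-scoring allele of the sample,
    or to None if no allele scores above the threshold."""
    assignments: Dict[str, Optional[str]] = {}
    for pep in peptides:
        best: Tuple[float, Optional[str]] = (0.0, None)
        for allele in sample_alleles:
            s = score(pep, allele)
            if s > best[0]:
                best = (s, allele)
        assignments[pep] = best[1] if best[0] >= min_score else None
    return assignments

# Toy usage with a placeholder scoring function standing in for the primary model.
toy_score = lambda pep, allele: 0.9 if (len(pep) == 9 and allele == "HLA-A*02:01") else 0.1
print(assign_peptides(["SLYNTVATL", "GILGFVFTLT"],
                      ["HLA-A*02:01", "HLA-B*07:02"],
                      toy_score))
```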


2021, Vol 54 (2), pp. 35-47
Author(s): Svetlana N. Dvoryatkina, Arseny M. Lopukhin

The study addresses the complex and large-scale problem of adapting the theory of risk management to the education system. A comprehensive analysis of domestic and international studies revealed the lack of a theoretical framework and of a general methodological vision of the problem of riskiness and risk-taking in the educational sphere. At the same time, effective management of educational activities, ensuring the competitiveness of the individual in the labor market and their potential for active participation in the life of society, is only possible on the basis of the modern paradigm of risk management, which integrates achievements in the pedagogical, economic, mathematical and computer sciences. A new methodological element of the study is the fractal approach, which frames the quantitative and qualitative analysis and assessment of the risk that professional competencies and the complex educational and cognitive constructs of subject activity fail to form. The fractal model for assessing the formation of knowledge and competencies and its risk landscape, taking into account subject and cognitive divergence, will ensure an effective structure for storing knowledge in the educational process, minimize the time needed to build spatial and engineering knowledge bases, and deepen the solution of the problem of predicting educational risks. New methods of risk modeling based on machine learning algorithms and factor analysis, methods for constructing neural integrators, and quantitative methods with and without regard to the probability distribution will ensure the accuracy and speed of risk assessment and prediction, and will make it possible to identify new patterns of risk activity and further ways to develop the theory of risk. The strategies and tools presented will help minimize unplanned chaos and the cascade of negative consequences of risky situations, including the COVID-19 epidemic.


Author(s): Manjunath Thimmasandra Narayanapppa, T. P. Puneeth Kumar, Ravindra S. Hegadi

Recent technological advancements have led to the generation of huge volumes of data from distinct domains (scientific sensors, health care, user-generated data, financial companies, the internet, and supply chain systems) over the past decade. To capture the significance of this emerging trend, the term big data was coined. In addition to its huge volume, big data exhibits several unique characteristics compared with traditional data. For instance, big data is generally unstructured and requires more real-time analysis. This development calls for new system platforms for data acquisition, storage, and transmission, as well as large-scale data processing mechanisms. In recent years, the interest of the analytics industry has been expanding towards big data analytics to uncover the potential concealed in big data, such as hidden patterns or unknown correlations. The main goal of this chapter is to explore the importance of machine learning algorithms and of the computational environment, including the hardware and software, required to perform analytics on big data.


Author(s): Bradford William Hesse

The presence of large-scale data systems can be felt, consciously or not, in almost every facet of modern life, whether through the simple act of selecting travel options online, purchasing products from online retailers, or navigating through the streets of an unfamiliar neighborhood using global positioning system (GPS) mapping. These systems operate through the momentum of big data, a term introduced by data scientists to describe a data-rich environment enabled by a superconvergence of advanced computer-processing speeds and storage capacities; advanced connectivity between people and devices through the Internet; the ubiquity of smart, mobile devices and wireless sensors; and the creation of accelerated data flows among systems in the global economy. Some researchers have suggested that big data represents the so-called fourth paradigm in science, wherein the first paradigm was marked by the evolution of the experimental method, the second was brought about by the maturation of theory, the third was marked by an evolution of statistical methodology as enabled by computational technology, while the fourth extended the benefits of the first three, but also enabled the application of novel machine-learning approaches to an evidence stream that exists in high volume, high velocity, high variety, and differing levels of veracity. In public health and medicine, the emergence of big data capabilities has followed naturally from the expansion of data streams from genome sequencing, protein identification, environmental surveillance, and passive patient sensing. In 2001, the National Committee on Vital and Health Statistics published a road map for connecting these evidence streams to each other through a national health information infrastructure. Since then, the road map has spurred national investments in electronic health records (EHRs) and motivated the integration of public surveillance data into analytic platforms for health situational awareness. More recently, the boom in consumer-oriented mobile applications and wireless medical sensing devices has opened up the possibility for mining new data flows directly from altruistic patients. In the broader public communication sphere, the ability to mine the digital traces of conversation on social media presents an opportunity to apply advanced machine learning algorithms as a way of tracking the diffusion of risk communication messages. In addition to utilizing big data for improving the scientific knowledge base in risk communication, there will be a need for health communication scientists and practitioners to work as part of interdisciplinary teams to improve the interfaces to these data for professionals and the public. Too much data, presented in disorganized ways, can lead to what some have referred to as “data smog.” Much work will be needed for understanding how to turn big data into knowledge, and just as important, how to turn data-informed knowledge into action.


10.29007/lt5p, 2019
Author(s): Sophie Siebert, Frieder Stolzenburg

Commonsense reasoning is an everyday task that is intuitive for humans but hard to implement on computers. It requires large knowledge bases from which to draw the necessary data, although these data are often incomplete or even inconsistent. While machine learning algorithms perform rather well on such tasks, the reasoning process remains a black box. To close this gap, our system CoRg aims to be both explainable and well-performing, combining an explainable deductive derivation process with a machine learning component. We conduct our experiments on the COPA question-answering benchmark using the ontologies WordNet, Adimen-SUMO, and ConceptNet. The knowledge is fed into the theorem prover Hyper, and the resulting models are then analyzed using machine learning algorithms to derive the most probable answer.
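
A minimal sketch of the general pipeline shape: features are extracted from the models a theorem prover constructs for each answer candidate, and a classifier ranks the candidates. The run_prover interface and the feature set are hypothetical assumptions for illustration; they do not reproduce the CoRg system.

```python
# Illustrative sketch: rank two COPA candidates by features extracted from
# prover output and a trained classifier. Hypothetical interfaces throughout.

import numpy as np
from sklearn.linear_model import LogisticRegression

def candidate_features(premise: str, candidate: str, run_prover) -> np.ndarray:
    """run_prover is assumed to return the set of atoms in the model that the
    prover (e.g. Hyper) derives from background knowledge, premise and candidate."""
    model = run_prover(premise, candidate)
    premise_model = run_prover(premise, None)
    return np.array([
        len(model),                      # size of the constructed model
        len(model & premise_model),      # overlap with the premise-only model
        len(model - premise_model),      # atoms contributed by the candidate
    ], dtype=float)

def pick_answer(premise, cand_a, cand_b, run_prover,
                clf: LogisticRegression) -> str:
    """Return the candidate the classifier considers more probable."""
    fa = candidate_features(premise, cand_a, run_prover)
    fb = candidate_features(premise, cand_b, run_prover)
    pa, pb = clf.predict_proba([fa, fb])[:, 1]   # probability of being correct
    return cand_a if pa >= pb else cand_b
```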


Author(s): Emir Kocer, Tsz Wai Ko, Jörg Behler

In the past two decades, machine learning potentials (MLPs) have reached a level of maturity that now enables applications to large-scale atomistic simulations of a wide range of systems in chemistry, physics, and materials science. Different machine learning algorithms have been used with great success in the construction of these MLPs. In this review, we discuss an important group of MLPs relying on artificial neural networks to establish a mapping from the atomic structure to the potential energy. In spite of this common feature, there are important conceptual differences among MLPs, which concern the dimensionality of the systems, the inclusion of long-range electrostatic interactions, global phenomena like nonlocal charge transfer, and the type of descriptor used to represent the atomic structure, which can be either predefined or learnable. A concise overview is given along with a discussion of the open challenges in the field. Expected final online publication date for the Annual Review of Physical Chemistry, Volume 73 is April 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
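
A minimal sketch, in PyTorch, of the core construction behind the neural-network potentials discussed above: the total energy is written as a sum of atomic energies, each predicted by an element-specific network from a descriptor of the atom's local environment. The random descriptors are placeholders for symmetry functions or learnable features; this is a structural illustration, not a usable potential.

```python
# Sketch of a high-dimensional neural network potential:
# E_total = sum over atoms of E_atomic(descriptor of local environment).

import torch
import torch.nn as nn

class AtomicNet(nn.Module):
    """Element-specific feed-forward network mapping a descriptor to an atomic energy."""
    def __init__(self, n_desc: int, n_hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_desc, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, n_hidden), nn.Tanh(),
            nn.Linear(n_hidden, 1),
        )

    def forward(self, descriptor: torch.Tensor) -> torch.Tensor:
        return self.net(descriptor).squeeze(-1)    # one energy per atom

class HDNNP(nn.Module):
    """Total energy as a sum of atomic contributions, one network per element."""
    def __init__(self, elements, n_desc: int):
        super().__init__()
        self.nets = nn.ModuleDict({el: AtomicNet(n_desc) for el in elements})

    def forward(self, species, descriptors):
        # species: list of element symbols; descriptors: (n_atoms, n_desc) tensor
        energies = torch.stack([self.nets[el](d) for el, d in zip(species, descriptors)])
        return energies.sum()

# Toy usage: a water-like molecule with placeholder 8-dimensional descriptors.
model = HDNNP(elements=["H", "O"], n_desc=8)
print(model(["O", "H", "H"], torch.randn(3, 8)))   # untrained, arbitrary energy
```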

