Unitarism vs. Individuality and a New Digital Agenda: The Power of Decentralized Web

2021 ◽  
Vol 3 ◽  
Author(s):  
Anastassia Lauterbach

Discussions around Covid-19 apps and models demonstrated that the primary challenges for AI and data science centered on governance and ethics. Personal information was used to build data sets, and it was unclear how this information could be utilized in large-scale models to provide predictions and insights while observing privacy requirements. Most people expected a lot from technology but were unwilling to sacrifice part of their privacy to build it. Conversely, regulators and policy makers require AI and data science practitioners to ensure public health and national security while avoiding these privacy-related conflicts. Their choices vary widely from country to country and are driven more by cultural factors than by machine learning capabilities. The question is whether the current ways of designing technology and working with data sets are sustainable and lead to good outcomes for individuals and their communities. At the same time, Covid-19 made it obvious that economies and societies cannot succeed without far-reaching digital policies touching every aspect of how we provide and receive education, live, and work. Most regions, businesses, and individuals struggled to benefit from the competitive capabilities modern data technologies could bring. This opinion paper suggests how Germany and Europe can rethink their digital policy by recognizing the value of data, introducing Data IDs for consumers and businesses, committing to support innovation in decentralized data technologies, and introducing the concepts of Data Trusts and compulsory education around data starting from early school age. It also discusses the advantages of data tokens in shaping a new ecosystem for decentralized data exchange. Furthermore, it emphasizes the need to develop and promote technologies that work with small data sets and handle data in compliance with privacy regulations, keeping in mind the environmental costs of betting on big data and large-scale machine learning models. Finally, it calls for innovation to become an integral part of any data scientist's job.

2021 ◽  
Vol 16 ◽  
Author(s):  
Yuqing Qian ◽  
Hao Meng ◽  
Weizhong Lu ◽  
Zhijun Liao ◽  
Yijie Ding ◽  
...  

Background: The identification of DNA-binding proteins (DBPs) is an important research field, and experiment-based methods for detecting DBPs are time-consuming and labor-intensive. Objective: To solve the problem of large-scale DBP identification, several machine learning methods have been proposed; however, their predictive accuracy is insufficient. Our aim is to develop a sequence-based machine learning model to predict DBPs. Methods: In our study, we extract six types of features (NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We use Multiple Kernel Learning based on the Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we construct a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, Laplacian Support Vector Machines (LapSVM) are employed to train the predictive model. Our method is tested on the PDB186, PDB1075, PDB2272, and PDB14189 data sets. Results: Compared with other methods, our model achieves the best results on the benchmark data sets. Conclusion: Accuracies of 87.1% and 74.2% are achieved on PDB186 (independent test of PDB1075) and PDB2272 (independent test of PDB14189), respectively.
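As a rough illustration of the multi-feature, multi-kernel setup described above, the sketch below combines several per-descriptor kernels into a single precomputed kernel and trains a standard SVM on it. The uniform kernel weights and the plain SVC are placeholders; the paper's HSIC-based kernel weighting, hypergraph construction, and Laplacian SVM are not reproduced here.

    # Minimal sketch, assuming each descriptor (e.g. NMBAC, MCD, PsePSSM) is already
    # extracted as a (n_samples, d_i) array. Not the published MKL-HSIC/LapSVM method.
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    def combine_kernels(feature_blocks, weights):
        """Weighted sum of per-descriptor RBF kernels (a simple stand-in for MKL)."""
        K = sum(w * rbf_kernel(X) for w, X in zip(weights, feature_blocks))
        return K / np.sum(weights)

    def train_dbp_classifier(feature_blocks, y, weights=None):
        if weights is None:
            weights = np.ones(len(feature_blocks))   # uniform weights as a placeholder
        K = combine_kernels(feature_blocks, weights) # (n_samples, n_samples) train kernel
        clf = SVC(kernel="precomputed", C=1.0)       # LapSVM replaced by a standard SVM
        clf.fit(K, y)
        return clf                                   # prediction needs K(test, train)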


2020 ◽  
pp. 87-94
Author(s):  
Pooja Sharma ◽  

Artificial intelligence and machine learning, two iterations of automation, are based on data, whether small or large. The larger the data, the more effective an AI or machine learning tool will be; the smaller the data, the less effective it tends to be. With larger pools of data, large businesses and multinational corporations have effectively been building, developing, and adopting refined AI- and machine learning-based decision systems. This chapter explores whether small businesses, with only small data in hand, are well placed to use and adopt AI- and machine learning-based tools for their day-to-day business operations.


2016 ◽  
Vol 20 (1) ◽  
pp. 51-68
Author(s):  
Michael Falgoust ◽  

Unprecedented advances in the ability to store, analyze, and retrieve data are the hallmark of the information age. Along with the enhanced capability to identify meaningful patterns in large data sets, contemporary data science renders many classical models of privacy protection ineffective. Addressing these issues through privacy-sensitive design is insufficient because advanced data science is mutually exclusive with preserving privacy. The special privacy problem posed by data analysis has so far escaped even the leading accounts of informational privacy. Here, I argue that accounts of privacy must include norms about information processing in addition to norms about information flow. Ultimately, users need the resources to control how and when personal information is processed and the knowledge to make informed decisions about that control. While privacy is an insufficient design constraint, value-sensitive design around control and transparency can support privacy in the information age.


2019 ◽  
Vol 31 (2) ◽  
pp. 329-338 ◽  
Author(s):  
Jian Hu ◽  
Haiwan Zhu ◽  
Yimin Mao ◽  
Canlong Zhang ◽  
Tian Liang ◽  
...  

Landslide hazard prediction is a difficult, time-consuming process when traditional methods are used. This paper presents a method that uses machine learning to predict landslide hazard levels automatically. Due to the difficulty of obtaining and effectively processing rainfall data in landslide hazard prediction, and to the M-chameleon algorithm's limitations in dealing with large-scale data sets, a new method based on an uncertain DM-chameleon algorithm (a developed M-chameleon) is proposed to assess landslide susceptibility. First, the method designs a new two-phase clustering algorithm based on M-chameleon that effectively processes large-scale data sets. Second, a new E-H distance is designed by combining the Euclidean and Hausdorff distances, enabling the method to manage uncertain data effectively; an uncertain data model is presented at the same time to quantify triggering factors. Finally, the landslide hazard prediction model is constructed and verified using data from the Baota district of the city of Yan’an, China. The experimental results show that the uncertain DM-chameleon machine learning algorithm can effectively improve the accuracy of landslide prediction and has high feasibility. Furthermore, relationships between hazard factors and landslide hazard levels can be extracted from the clustering results.
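As an illustration of the kind of combined distance described above, the sketch below blends the Euclidean distance between the centroids of two uncertain objects with the symmetric Hausdorff distance between their sample sets. The set-of-samples representation and the weighting parameter alpha are assumptions for illustration; the paper's exact E-H formulation is not reproduced.

    # Minimal sketch of a combined Euclidean/Hausdorff ("E-H") distance between two
    # uncertain objects, each given as an array of sample points.
    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    def eh_distance(A, B, alpha=0.5):
        """Blend centroid Euclidean distance with the symmetric Hausdorff distance."""
        euclid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
        hausdorff = max(directed_hausdorff(A, B)[0], directed_hausdorff(B, A)[0])
        return alpha * euclid + (1.0 - alpha) * hausdorff

    # Example: two uncertain observations, each a cloud of 3-D samples.
    A = np.random.rand(20, 3)
    B = np.random.rand(25, 3) + 0.5
    print(eh_distance(A, B))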


2018 ◽  
Vol 1 (1) ◽  
pp. 263-274 ◽  
Author(s):  
Marylyn D. Ritchie

Biomedical data science has experienced an explosion of new data over the past decade. Abundant genetic and genomic data are increasingly available in large, diverse data sets due to the maturation of modern molecular technologies. Along with these molecular data, dense, rich phenotypic data are also available in comprehensive clinical data sets from health care provider organizations, clinical trials, population health registries, and epidemiologic studies. The methods and approaches for interrogating these large genetic/genomic and clinical data sets continue to evolve rapidly, as our understanding of the questions and challenges continues to emerge. In this review, the state-of-the-art methodologies for genetic/genomic analysis along with complex phenomics will be discussed. This field is changing and adapting to the novel data types made available, as well as to technological advances in computation and machine learning. Thus, I will also discuss the future challenges in this exciting and innovative space. The promises of precision medicine rely heavily on the ability to marry complex genetic/genomic data with clinical phenotypes in meaningful ways.


2012 ◽  
Vol 10 (05) ◽  
pp. 1250009
Author(s):  
WILLIAM KRIVAN ◽  
NICK ARNOLD ◽  
CECILE MORALES ◽  
DARRICK CARTER

Although synthesizing and utilizing individual peptides and DNA primers has become relatively inexpensive, massively parallel probing and next-generation sequencing approaches have dramatically increased the number of molecules that can be subjected to screening; this, in turn, requires vast numbers of peptides and therefore results in significant expense. To alleviate this issue, pools of related molecules are often used to downselect prior to testing individual sequences. A computational selection process for creating pools of related sequences at large scale has not been reported for peptides. In the case of PCR primers, there have been successful attempts to address this problem by designing degenerate primers that can be produced at the same cost as conventional, unique primers and then used to amplify several different genomic regions. We present an algorithm, "FlexGrePPS" (Flexible Greedy Peptide Pool Search), that can create a near-optimal set of peptide pools. This approach is also applicable to nucleotide sequences and outperforms most DNA primer selection programs. For proteomic compression with FlexGrePPS, the main body of the work presented here, we demonstrate the feasibility of computing an exhaustive cover of pathogenic proteomes with degenerate peptides that lend themselves to antigenic screening. Furthermore, we present preliminary data demonstrating the experimental utility of highly degenerate peptides for antigenic screening. FlexGrePPS provides a near-optimal solution for proteomic compression, and there are no programs available for comparison. We also demonstrate the computational performance of our GreedyPrime implementation, a modified version of FlexGrePPS applicable to the design of degenerate primers, which is comparable to existing programs for degenerate primer design. Specifically, we focus on comparisons with PAMPS and DPS-DIP, software tools that have recently been shown to be superior to other methods. FlexGrePPS forms the foundation of a novel antigenic screening methodology based on the representation of an entire proteome by near-optimal degenerate peptide pools. Our preliminary wet-lab data indicate that the approach is likely to prove successful in comprehensive wet-lab studies, and hence will dramatically reduce the expense of antigenic screening and make whole-proteome screening feasible. Although FlexGrePPS was designed for computational performance in order to handle vast data sets, we find, surprisingly, that even for small data sets the primer-design version of FlexGrePPS, GreedyPrime, offers similar or even superior results on MP-DPD and most MDPD instances compared to existing methods; despite their much longer run times, other approaches did not fare significantly better in reducing the original data sets to degenerate primers. The FlexGrePPS and GreedyPrime programs are available at no charge under the GNU LGPL license at http://sourceforge.net/projects/flexgrepps/.
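A highly simplified sketch of the greedy pooling idea is shown below: fixed-length peptides are merged into degenerate patterns (a set of allowed residues per position) as long as the total degeneracy stays under a budget. The merge criterion and the degeneracy limit are illustrative assumptions, not the published FlexGrePPS algorithm.

    # Minimal sketch of greedy degenerate-peptide pooling under a degeneracy budget.
    from math import prod

    def degeneracy(pattern):
        return prod(len(s) for s in pattern)        # product of per-position set sizes

    def merge(pattern, peptide):
        return [s | {a} for s, a in zip(pattern, peptide)]

    def greedy_pools(peptides, max_degeneracy=512):
        pools = []
        for pep in peptides:                        # peptides: equal-length strings
            for pool in pools:
                candidate = merge(pool, pep)
                if degeneracy(candidate) <= max_degeneracy:
                    pool[:] = candidate             # absorb the peptide into this pool
                    break
            else:
                pools.append([{a} for a in pep])    # start a new degenerate pattern
        return pools

    pools = greedy_pools(["ACDEF", "ACDEY", "WCDEF", "MKLQR"], max_degeneracy=8)
    print(len(pools), [degeneracy(p) for p in pools])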


2021 ◽  
Author(s):  
Sergey Levchenko ◽  
Yaqiong Zhong ◽  
Xiaojuan Hu ◽  
Debalaya Sarker ◽  
Qingrui Xia ◽  
...  

Abstract: Thermoelectric (TE) materials are among the very few sustainable yet feasible energy-harvesting solutions available at present. This promise is contingent on identifying or designing materials with higher efficiency than presently available ones. However, due to the vastness of the chemical space of materials, only a small fraction of it has been scanned experimentally and/or computationally so far. Employing compressed-sensing based symbolic regression in an active-learning framework, we have not only identified a trend in materials’ compositions for superior TE performance, but have also predicted and experimentally synthesized several extremely high-performing novel TE materials. Among these, we found polycrystalline p-type Cu0.45Ag0.55GaTe2 to possess an experimental figure of merit as high as ~2.8 at 827 K. This is a breakthrough in the field, because all previously known thermoelectric materials with a comparable figure of merit are either unstable or much more difficult to synthesize, rendering them unusable in large-scale applications. The presented methodology demonstrates the importance and tremendous potential of physically informed descriptors in materials science, in particular for the relatively small data sets typically available from experiments under well-controlled conditions.
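For context, the dimensionless thermoelectric figure of merit quoted above (~2.8 at 827 K) is conventionally defined as follows; this is the standard textbook expression, not a formula taken from the paper:

    % S: Seebeck coefficient, \sigma: electrical conductivity, T: absolute temperature,
    % \kappa_e, \kappa_l: electronic and lattice contributions to the thermal conductivity
    zT = \frac{S^{2}\sigma T}{\kappa_{e} + \kappa_{l}}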


2017 ◽  
Author(s):  
Christoph Sommer ◽  
Rudolf Hoefler ◽  
Matthias Samwer ◽  
Daniel W. Gerlich

Abstract: Supervised machine learning is a powerful and widely used method to analyze high-content screening data. Despite its accuracy, efficiency, and versatility, supervised machine learning has drawbacks, most notably its dependence on a priori knowledge of expected phenotypes and time-consuming classifier training. We provide a solution to these limitations with CellCognition Explorer, a generic novelty-detection and deep-learning framework. Application to several large-scale screening data sets on nuclear and mitotic cell morphologies demonstrates that CellCognition Explorer enables the discovery of rare phenotypes without user training, which has broad implications for improved assay development in high-content screening.
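The general novelty-detection idea can be sketched as follows: fit a one-class model on features of predominantly normal cells and flag outliers as candidate rare phenotypes. The synthetic feature matrix, the IsolationForest detector, and the contamination level are placeholders; CellCognition Explorer itself couples novelty detection with learned deep features and is not reproduced here.

    # Minimal sketch of novelty detection on per-cell morphology features.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    normal_cells = rng.normal(0, 1, size=(1000, 32))   # mostly normal cell features
    odd_cells = rng.normal(4, 1, size=(10, 32))        # a rare phenotype
    X = np.vstack([normal_cells, odd_cells])

    detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
    labels = detector.predict(X)                       # -1 marks putative novel phenotypes
    print("flagged as novel:", int((labels == -1).sum()))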


Electronics ◽  
2021 ◽  
Vol 10 (17) ◽  
pp. 2123 ◽  
Author(s):  
Lingfei Mo ◽  
Minghao Wang

LogicSNN, a unified spiking neural network (SNN) logical operation paradigm, is proposed in this paper. First, we define logical variables under the semantics of SNNs. Then, we design the network structure of this paradigm and use spike-timing-dependent plasticity (STDP) for training. Following this paradigm, six kinds of basic SNN binary logical operation modules and three kinds of combined logical networks based on these basic modules are implemented. Through these experiments, the rationality, cascading characteristics, and potential for building large-scale networks of this paradigm are verified. This study fills a gap in the logical operation of SNNs and provides a possible way to realize more complex machine learning capabilities.
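For readers unfamiliar with spike-timing-dependent plasticity, a toy pair-based STDP update is sketched below: the synaptic weight is potentiated when the presynaptic spike precedes the postsynaptic one and depressed otherwise. The amplitudes and time constant are illustrative values, not the parameters used in LogicSNN.

    # Minimal sketch of a pair-based STDP weight update.
    import numpy as np

    def stdp_update(w, t_pre, t_post, a_plus=0.05, a_minus=0.055, tau=20.0):
        dt = t_post - t_pre                        # spike-time difference in ms
        if dt > 0:
            w += a_plus * np.exp(-dt / tau)        # pre before post: potentiation
        else:
            w -= a_minus * np.exp(dt / tau)        # post before pre: depression
        return float(np.clip(w, 0.0, 1.0))         # keep the weight in [0, 1]

    print(stdp_update(0.5, t_pre=10.0, t_post=15.0))   # potentiation
    print(stdp_update(0.5, t_pre=15.0, t_post=10.0))   # depression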

