Learning for Tail Label Data: A Label-Specific Feature Approach

Tail label data (TLD) is prevalent in real-world tasks, and large-scale multi-label learning (LMLL) is the main scheme for learning from it. Previous LMLL studies typically must also account for extensive head label data (HLD), and thus fail to guide the learning behavior on TLD. In many applications such as recommender systems, however, predicting tail labels is necessary because it provides important supplementary information. We call this kind of problem "tail label learning". In this paper, we propose a novel method for the tail label learning problem. Based on the observation that the raw feature representation in LMLL data usually benefits HLD and may not be suitable for TLD, we construct effective and rich label-specific features by exploring the labeled data distribution and leveraging label correlations. Specifically, we employ clustering analysis to explore discriminative features for each tail label, which replace the original high-dimensional and sparse features. In addition, because positive examples of TLD are scarce, we encode knowledge from HLD by exploiting label correlations to enhance the label-specific features. Experimental results verify the superiority of the proposed method in terms of performance on TLD.
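The clustering step described in the abstract can be illustrated with a small sketch. The abstract does not give the authors' exact procedure, so this assumes one common construction: for a single label, cluster its positive and negative examples separately and re-encode every example by its distances to the resulting centroids. The function names `kmeans` and `label_specific_features` and the parameter `k` are hypothetical, and the label-correlation step that brings in HLD knowledge is not shown.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns k centroids."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def label_specific_features(X, y, k=2):
    """Re-encode each row of X as distances to centroids fitted for one label.

    X: list of feature vectors; y: 0/1 relevance of a single (tail) label.
    Assumption: centroids come from separate clusterings of positives and
    negatives; this is an illustrative choice, not the paper's stated method.
    """
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    k_pos = min(k, len(pos))  # tail labels may have very few positives
    k_neg = min(k, len(neg))
    centers = kmeans(pos, k_pos) + kmeans(neg, k_neg)
    return [[math.dist(x, c) for c in centers] for x in X]
```

With `k=1` each example is mapped to two numbers: its distance to the mean of the positives and its distance to the mean of the negatives. These dense, low-dimensional features can then be fed to any binary classifier for that label.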

PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing

Bioinformatics ◽ 10.1093/bioinformatics/btz964 ◽ 2020 ◽ Vol 36 (8) ◽ pp. 2337-2344 ◽ Cited By ~ 1

Author(s): Gleb Goussarov ◽ Ilse Cleenwerck ◽ Mohamed Mysara ◽ Natalie Leys ◽ Pieter Monsieurs ◽ ...

Keyword(s): Large Scale ◽ Bacterial Species ◽ Supplementary Information ◽ Nucleotide Identity ◽ Average Nucleotide Identity ◽ Bacterial Genomes ◽ Short Oligonucleotide ◽ Novel Approach ◽ Novel Method ◽ Alignment Step

Abstract

Motivation: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short oligonucleotide-based methods offer a faster alternative, but at the expense of accuracy. Here, we aim to address this shortcoming by providing software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances.

Results: Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based method TETRA and average nucleotide identity for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable to large-scale analyses.

Availability and implementation: The method introduced here was implemented, together with other existing methods, in GenDisCal, a dependency-free program written in C and available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is available.

Supplementary information: Supplementary data are available at Bioinformatics online.
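To make the idea of an alignment-free, frequency-based distance concrete, here is a minimal sketch. It is not the PaSiT method or GenDisCal's implementation, whose details the abstract does not give; it only assumes the generic recipe of counting overlapping k-mers (k=4 for tetranucleotides), normalizing to relative frequencies, and comparing two profiles with a Euclidean distance. The function names are hypothetical, and refinements used by real tools (such as reverse-complement handling or statistical normalization) are omitted.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=4):
    """Relative frequency of every DNA k-mer in seq (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing anything other than A, C, G, T are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def frequency_distance(seq_a, seq_b, k=4):
    """Euclidean distance between two k-mer frequency profiles.

    Assumption: Euclidean distance is used purely for illustration.
    """
    return math.dist(kmer_frequencies(seq_a, k), kmer_frequencies(seq_b, k))
```

Each genome is reduced to a fixed-length vector (256 entries for k=4, 4096 for k=6), so comparing two genomes costs one pass over each sequence plus a vector comparison. That is why such methods scale to large collections where pairwise alignment does not.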

Protein–ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data

Bioinformatics ◽ 10.1093/bioinformatics/btaa110 ◽ 2020 ◽ Vol 36 (10) ◽ pp. 3018-3027 ◽ Cited By ~ 2

Author(s): Chun-Qiu Xia ◽ Xiaoyong Pan ◽ Hong-Bin Shen

Keyword(s): Ligand Binding ◽ Large Scale ◽ Protein Structures ◽ Imbalanced Data ◽ Machine Learning Algorithms ◽ Feature Representation ◽ Structure Alignment ◽ Supplementary Information ◽ Binding Residue ◽ Binding Residue Prediction

Abstract

Motivation: Knowledge of protein–ligand binding residues is important for understanding the functions of proteins and their interaction mechanisms. Given an experimentally solved protein structure, accurately identifying the potential binding sites of a specific ligand on the protein remains a challenging problem. Compared with structure-alignment-based methods, machine learning algorithms provide a flexible alternative that is less dependent on annotated homogeneous protein structures. Several factors are important for an efficient protein–ligand prediction model, e.g. a discriminative feature representation and an effective learning architecture that can deal with both large-scale and severely imbalanced data.

Results: In this study, we propose a novel deep-learning-based method called DELIA for protein–ligand binding residue prediction. In DELIA, a hybrid deep neural network is designed to integrate 1D sequence-based features with 2D structure-based amino acid distance matrices. To overcome the severe data imbalance between binding and nonbinding residues, strategies of oversampling in mini-batch, random undersampling and stacking ensemble are designed to enhance the model. Experimental results on five benchmark datasets demonstrate the effectiveness of the proposed DELIA pipeline.

Availability and implementation: The web server of DELIA is available at www.csbio.sjtu.edu.cn/bioinf/delia/.

Supplementary information: Supplementary data are available at Bioinformatics online.
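One of the imbalance strategies named above, oversampling in mini-batch, can be sketched as follows. The abstract does not specify DELIA's exact scheme, so this assumes a simple variant: every negative (nonbinding) example is visited once per epoch, while positives (binding residues) are drawn with replacement so that each batch contains a fixed share of them. The function name `balanced_batches` and the parameter `pos_fraction` are hypothetical.

```python
import random


def balanced_batches(labels, batch_size, pos_fraction=0.5, seed=0):
    """Yield batches of example indices in which positives are oversampled.

    Negatives are visited once per epoch in shuffled order; positives are
    drawn with replacement so each batch holds a fixed number of them.
    Assumption: this is an illustrative scheme, not DELIA's actual sampler.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(labels) if t == 1]
    neg = [i for i, t in enumerate(labels) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    n_pos = max(1, round(batch_size * pos_fraction))
    n_neg = batch_size - n_pos
    if n_neg < 1:
        raise ValueError("batch_size too small for the requested fraction")
    rng.shuffle(neg)
    for start in range(0, len(neg), n_neg):
        batch = neg[start:start + n_neg] + rng.choices(pos, k=n_pos)
        rng.shuffle(batch)
        yield batch
```

Compared with duplicating positives in the dataset up front, sampling per batch keeps the class ratio stable within every gradient step. The last batch of an epoch may contain fewer negatives than the others.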

Variation Generalized Feature Learning via Intra-view Variation Adaptation

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2019/116 ◽ 2019

Author(s): Jiawei Li ◽ Mang Ye ◽ Andy Jinhua Ma ◽ Pong C Yuen

Keyword(s): Large Scale ◽ Learning Algorithm ◽ Feature Learning ◽ Information Loss ◽ Positive Sample ◽ Feature Representation ◽ Fusion Method ◽ Learning Problem ◽ Detection Algorithms ◽ Novel Variation

This paper addresses the variation generalized feature learning problem in unsupervised video-based person re-identification (re-ID). With advanced tracking and detection algorithms, large-scale intra-view positive samples can be easily collected by assuming that the image frames within a tracking sequence belong to the same person. Existing methods either directly use the intra-view positives to model cross-view variations, or simply minimize the intra-view variations to capture the invariant component at the cost of some discriminative information. In this paper, we propose a Variation Generalized Feature Learning (VGFL) method to learn an adaptable feature representation from intra-view positives. The proposed method can learn a discriminative re-ID model without any manually annotated cross-view positive sample pairs, and it can address unseen testing variations with a novel variation generalized feature learning algorithm. In addition, an Adaptability-Discriminability (AD) fusion method is introduced to learn adaptable video-level features. Extensive experiments on different datasets demonstrate the effectiveness of the proposed method.

Morphing projections: a new visual technique for fast and interactive large-scale analysis of biomedical datasets

Bioinformatics ◽ 10.1093/bioinformatics/btaa989 ◽ 2020

Author(s): Ignacio Díaz ◽ José M Enguita ◽ Ana González ◽ Diego García ◽ Abel A Cuadrado ◽ ...

Keyword(s): Data Analytics ◽ Domain Knowledge ◽ Large Scale ◽ User Interaction ◽ Machine Learning Algorithms ◽ Supplementary Information ◽ High Dimensional ◽ Biomedical Data ◽ Efficient Manner ◽ Large Scale Analysis

Abstract

Motivation: Biomedical research entails analyzing high dimensional records of biomedical features with hundreds or thousands of samples each. This often also involves complementary clinical metadata, as well as broad user domain knowledge. Common data analytics software makes use of machine learning algorithms or data visualization tools. However, these are frequently one-way analyses that leave the user little room to reconfigure the steps in light of the observed results. In other cases, reconfigurations involve large latencies, requiring algorithms to be retrained or a long pipeline of actions to be rerun. The complex and multiway nature of the problem nonetheless suggests that user interaction feedback is a key element to boost the cognitive process of analysis, and that it must be both broad and fluid.

Results: In this article, we present a technique for biomedical data analytics based on blending meaningful views in an efficient manner, which provides a natural, smooth way to transition among different but complementary representations of data and knowledge. Our hypothesis is that the confluence of diverse complementary information from different domains on a highly interactive interface allows the user to discover relevant relationships or generate new hypotheses to be investigated by other means. We illustrate the potential of this approach with three case studies involving gene expression data and clinical metadata, as representative examples of high dimensional, multidomain, biomedical data.

Availability and implementation: Code and a demo app to reproduce the results are available at https://gitlab.com/idiazblanco/morphing-projections-demo-and-dataset-preparation.

Supplementary information: Supplementary data are available at Bioinformatics online.

A Fast Clustering Algorithm for Large-scale and High Dimensional Data

ACTA AUTOMATICA SINICA ◽ 10.3724/sp.j.1004.2009.00859 ◽ 2009 ◽ Vol 35 (7) ◽ pp. 859-866

Author(s): Ming LIU ◽ Xiao-Long WANG ◽ Yuan-Chao LIU

Keyword(s): Large Scale ◽ Clustering Algorithm ◽ High Dimensional Data ◽ High Dimensional

Enzymatic Synthesis of 15N-L-aspartic Acid Using Recombinant Aspartase from Escherichia Coli K12

Revista de Chimie ◽ 10.37358/rc.08.11.2004 ◽ 2008 ◽ Vol 59 (11) ◽ Cited By ~ 1

Author(s): Iulia Lupan ◽ Sergiu Chira ◽ Maria Chiriac ◽ Nicolae Palibroda ◽ Octavian Popescu

Keyword(s): Amino Acids ◽ Aspartic Acid ◽ Enzymatic Synthesis ◽ Large Scale ◽ Recombinant DNA ◽ Industrial Enzymes ◽ Recombinant DNA Technology ◽ Bacterial Fermentation ◽ Natural Protein ◽ Novel Method

Amino acids are obtained by bacterial fermentation, extraction from natural protein or enzymatic synthesis from specific substrates. With the introduction of recombinant DNA technology, it has become possible to apply more rational approaches to the enzymatic synthesis of amino acids. Aspartase (L-aspartate ammonia-lyase) catalyzes the reversible deamination of L-aspartic acid to yield fumaric acid and ammonia. It is one of the most important industrial enzymes used to produce L-aspartic acid on a large scale. Here we describe a novel method for [15N] L-aspartic acid synthesis from fumarate and ammonia (15NH4Cl) using a recombinant aspartase.

A Novel Method to Predict Drug-Target Interactions Based on Large-Scale Graph Representation Learning

Cancers ◽ 10.3390/cancers13092111 ◽ 2021 ◽ Vol 13 (9) ◽ pp. 2111

Author(s): Bo-Wei Zhao ◽ Zhu-Hong You ◽ Lun Hu ◽ Zhen-Hao Guo ◽ Lei Wang ◽ ...

Keyword(s): Drug Target ◽ Large Scale ◽ Computational Models ◽ Structural Information ◽ Characteristic Curve ◽ Representation Learning ◽ Graph Representation ◽ Convolutional Network ◽ Novel Method

Identification of drug-target interactions (DTIs) is a significant step in the drug discovery or repositioning process. Compared with time-consuming and labor-intensive in vivo experimental methods, computational models can provide high-quality DTI candidates quickly. In this study, we propose a novel method called LGDTI to predict DTIs based on large-scale graph representation learning. LGDTI can capture both the local and the global structural information of the graph. Specifically, the first-order neighbor information of nodes is aggregated by a graph convolutional network (GCN), while the high-order neighbor information of nodes is learned by the graph embedding method DeepWalk. Finally, the two kinds of features are fed into a random forest classifier to train and predict potential DTIs. The results show that our method obtained an area under the receiver operating characteristic curve (AUROC) of 0.9455 and an area under the precision-recall curve (AUPR) of 0.9491 under 5-fold cross-validation. Moreover, we compare the presented method with some existing state-of-the-art methods. These results imply that LGDTI can efficiently and robustly capture undiscovered DTIs. The proposed model is also expected to bring new inspiration and provide novel perspectives to relevant researchers.
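For readers unfamiliar with the AUROC figure quoted above, the following sketch shows one standard way to compute it: as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. This is a generic definition for illustration, not code from LGDTI, and the function name `auroc` is hypothetical. The quadratic pair loop is fine for small examples but too slow for large evaluation sets.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    scores: predicted scores (higher means more likely positive).
    labels: 0/1 ground truth. Ties between a positive and a negative
    contribute 0.5 to the count.
    """
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this definition a value of 0.5 corresponds to random ranking and 1.0 to a ranking that places every true interaction above every non-interaction.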

A Novel Method Facilitating the Simple and Low-Cost Preparation of Human Osteochondral Slice Explants for Large-Scale Native Tissue Analysis

International Journal of Molecular Sciences ◽ 10.3390/ijms22126394 ◽ 2021 ◽ Vol 22 (12) ◽ pp. 6394

Author(s): Jacob Spinnen ◽ Lennard K. Shopperly ◽ Carsten Rendenbach ◽ Anja A. Kühl ◽ Ufuk Sentürk ◽ ...

Keyword(s): Gene Expression ◽ Cell Viability ◽ Large Scale ◽ Marker Gene ◽ Cell Swelling ◽ Tissue Cell ◽ Joint Research ◽ Sample Number ◽ TNF-α ◽ Novel Method

For in vitro modeling of human joints, osteochondral explants represent an acceptable compromise between conventional cell culture and animal models. However, the scarcity of native human joint tissue poses a challenge for experiments requiring high numbers of samples and makes the method rather unsuitable for toxicity analyses and dosing studies. To scale their application, we developed a novel method that allows the preparation of up to 100 explant cultures from a single human sample with a simple setup. Explants were cultured for 21 days, stimulated with TNF-α or TGF-β3, and analyzed for cell viability, gene expression and histological changes. Tissue cell viability remained stable at >90% for three weeks. Proteoglycan levels and gene expression of COL2A1, ACAN and COMP were maintained for 14 days before decreasing. TNF-α and TGF-β3 caused dose-dependent changes in cartilage marker gene expression as early as 7 days. Histologically, cultures under TNF-α stimulation showed a 32% reduction in proteoglycans, detachment of collagen fibers and cell swelling after 7 days. In conclusion, thin osteochondral slice cultures behaved analogously to conventional punch explants despite cell stress exerted during fabrication. In pharmacological testing, both the shorter diffusion distance and the lack of need for serum in the culture suggest a positive effect on sensitivity. The ease of fabrication and the scalability of the sample number make this manufacturing method a promising platform for large-scale preclinical testing in joint research.

A Novel Method for the Preparation of Fibrous CeO2–ZrO2–Y2O3 Compacts for Thermochemical Cycles

Crystals ◽ 10.3390/cryst11080885 ◽ 2021 ◽ Vol 11 (8) ◽ pp. 885

Author(s): Nicole Knoblauch ◽ Peter Mechnich

Keyword(s): Large Scale ◽ Preparation Method ◽ Cost Effective ◽ High Porosity ◽ Thermochemical Cycles ◽ Redox Cycles ◽ Redox Kinetics ◽ Direct Infiltration ◽ Novel Method ◽ Functional Components

Zirconium-yttrium co-doped ceria (Ce0.85Zr0.13Y0.02O1.99) compacts consisting of fibers with diameters in the range of 8–10 µm have been successfully prepared by direct infiltration of commercial YSZ fibers with a cerium oxide matrix and subsequent sintering. The resulting chemically homogeneous fiber-compacts are sinter-resistant up to 1923 K and retain a high porosity of around 58 vol% and a permeability of 1.6–3.3 × 10⁻¹⁰ m² at a pressure gradient of 100–500 kPa. The fiber-compacts show high potential for application in thermochemical redox cycling due to their fast redox kinetics. A first evaluation of the redox kinetics shows that the relaxation time of oxidation is five times shorter than that of dense samples of the same composition. The improved gas exchange due to the high porosity also allows higher reduction rates, which enable higher hydrogen yields in thermochemical water-splitting redox cycles. The presented cost-effective fiber-compact preparation method is considered very promising for manufacturing large-scale functional components for solar-thermal high-temperature reactors.

A Novel Method for Multispectral Image Pansharpening Based on High Dimensional Model Representation

Expert Systems with Applications ◽ 10.1016/j.eswa.2020.114512 ◽ 2020 ◽ pp. 114512

Author(s): Evrim Korkmaz Özay ◽ Burcu Tunga

Keyword(s): Multispectral Image ◽ Model Representation ◽ High Dimensional ◽ High Dimensional Model Representation ◽ Dimensional Model ◽ Novel Method
