Improved Constrained k-Means Algorithm for Clustering with Domain Knowledge

Witnessing the tremendous development of machine learning technology, emerging machine learning applications impose challenges of using domain knowledge to improve the accuracy of clustering provided that clustering suffers a compromising accuracy rate despite its advantage of fast procession. In this paper, we model domain knowledge (i.e., background knowledge or side information), respecting some applications as must-link and cannot-link sets, for the sake of collaborating with k-means for better accuracy. We first propose an algorithm for constrained k-means, considering only must-links. The key idea is to consider a set of data points constrained by the must-links as a single data point with a weight equal to the weight sum of the constrained points. Then, for clustering the data points set with cannot-link, we employ minimum-weight matching to assign the data points to the existing clusters. At last, we carried out a numerical simulation to evaluate the proposed algorithms against the UCI datasets, demonstrating that our method outperforms the previous algorithms for constrained k-means as well as the traditional k-means regarding the clustering accuracy rate although with a slightly compromised practical runtime.

Download Full-text

Smooth input preparation for quantum and quantum-inspired machine learning

Quantum Machine Intelligence ◽

10.1007/s42484-021-00045-x ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Zhikuan Zhao ◽

Jack K. Fitzsimons ◽

Patrick Rebentrost ◽

Vedran Dunjko ◽

Joseph F. Fitzsimons

Keyword(s):

Machine Learning ◽

Quantum Algorithms ◽

Machine Learning Algorithms ◽

Low Rank ◽

Smoothed Analysis ◽

Machine Learning Applications ◽

Data Points ◽

Input Model ◽

The Cost ◽

Input Perturbation

AbstractMachine learning has recently emerged as a fruitful area for finding potential quantum computational advantage. Many of the quantum-enhanced machine learning algorithms critically hinge upon the ability to efficiently produce states proportional to high-dimensional data points stored in a quantum accessible memory. Even given query access to exponentially many entries stored in a database, the construction of which is considered a one-off overhead, it has been argued that the cost of preparing such amplitude-encoded states may offset any exponential quantum advantage. Here we prove using smoothed analysis that if the data analysis algorithm is robust against small entry-wise input perturbation, state preparation can always be achieved with constant queries. This criterion is typically satisfied in realistic machine learning applications, where input data is subjective to moderate noise. Our results are equally applicable to the recent seminal progress in quantum-inspired algorithms, where specially constructed databases suffice for polylogarithmic classical algorithm in low-rank cases. The consequence of our finding is that for the purpose of practical machine learning, polylogarithmic processing time is possible under a general and flexible input model with quantum algorithms or quantum-inspired classical algorithms in the low-rank cases.

Download Full-text

Programming Language Support for Implementing Machine Learning Algorithms

Handbook of Research on Applications and Implementations of Machine Learning Techniques - Advances in Computational Intelligence and Robotics ◽

10.4018/978-1-5225-9902-9.ch021 ◽

2020 ◽

pp. 402-421

Author(s):

Anitha Elavarasi S. ◽

Jayanthi J.

Keyword(s):

Machine Learning ◽

Programming Languages ◽

Programming Language ◽

Domain Knowledge ◽

Learning Algorithm ◽

Mathematical Optimization ◽

Machine Learning Algorithms ◽

Application Development ◽

Machine Learning Applications ◽

And Control

Machine learning provides the system to automatically learn without human intervention and improve their performance with the help of previous experience. It can access the data and use it for learning by itself. Even though many algorithms are developed to solve machine learning issues, it is difficult to handle all kinds of inputs data in-order to arrive at accurate decisions. The domain knowledge of statistical science, probability, logic, mathematical optimization, reinforcement learning, and control theory plays a major role in developing machine learning based algorithms. The key consideration in selecting a suitable programming language for implementing machine learning algorithm includes performance, concurrence, application development, learning curve. This chapter deals with few of the top programming languages used for developing machine learning applications. They are Python, R, and Java. Top three programming languages preferred by data scientist are (1) Python more than 57%, (2) R more than 31%, and (3) Java used by 17% of the data scientists.

Download Full-text

Computational analysis of B cell receptor repertoires in COVID-19 patients using deep embedded representations of protein sequences

10.21203/rs.3.rs-857976/v1 ◽

2021 ◽

Author(s):

Inyoung Kim ◽

Sang Yoon Byun ◽

Sangyeup Kim ◽

Sangyoon Choi ◽

Jinsung Noh ◽

...

Keyword(s):

B Cell ◽

Cell Receptor ◽

B Cell Receptor ◽

Classification Model ◽

Accuracy Rate ◽

New Approach ◽

Single Data Point ◽

Nucleotide Insertions ◽

Single Data ◽

Vector Representations

Abstract Analyzing B cell receptor (BCR) repertoires is immensely useful in evaluating one’s immunological status. Conventionally, repertoire analysis methods have focused on comprehensive assessments of clonal compositions, including V(D)J segment usage, nucleotide insertions/deletions, and amino acid distributions. Here, we introduce a novel computational approach that applies deep-learning-based protein embedding techniques to analyze BCR repertoires. By selecting the most frequently occurring BCR sequences in a given repertoire and computing the sum of the vector representations of these sequences, we represent an entire repertoire as a 100-dimensional vector and eventually as a single data point in vector space. We demonstrate that this new approach enables us to not only accurately cluster BCR repertoires of coronavirus disease 2019 (COVID-19) patients and healthy subjects but also efficiently track minute changes in immune status over time as patients undergo treatment. Furthermore, using the distributed representations, we successfully trained an XGBoost classification model that achieved a mean accuracy rate of over 87% given a repertoire of CDR3 sequences.

Download Full-text

Information modeling and knowledge extraction for machine learning applications in industrial production systems

Machine Learning for Cyber Physical Systems - Technologien für die intelligente Automation ◽

10.1007/978-3-662-62746-4_8 ◽

2020 ◽

pp. 73-81

Author(s):

Stefan Windmann ◽

Christian Kühnert

Keyword(s):

Machine Learning ◽

Domain Knowledge ◽

Structural Information ◽

Semantic Annotation ◽

Information Model ◽

Knowledge Extraction ◽

Information Modeling ◽

Sensor Data ◽

Tool Chain ◽

Machine Learning Applications

AbstractIn this paper, a new information model for machine learning applications is introduced, which allows for a consistent acquisition and semantic annotation of process data, structural information and domain knowledge from industrial productions systems. The proposed information model is based on Industry 4.0 components and IEC 61360 component descriptions. To model sensor data, components of the OGC SensorThings model such as data streams and observations have been incorporated in this approach. Machine learning models can be integrated into the information model in terms of existing model serving frameworks like PMML or Tensorflowgraph. Based on the proposed information model, a tool chain for automatic knowledge extraction is introduced and the automatic classification of unstructured text is investigated as a particular application case for the proposed tool chain.

Download Full-text

Collective annotation patterns in learning from crowds

Intelligent Data Analysis ◽

10.3233/ida-200009 ◽

2020 ◽

Vol 24 ◽

pp. 63-86

Author(s):

Francisco Mena ◽

Ricardo Ñanculef ◽

Carlos Valle

Keyword(s):

Machine Learning ◽

Large Scale ◽

Ground Truth ◽

Experimental Results ◽

Ground Truth Data ◽

Satisfactory Performance ◽

Machine Learning Applications ◽

Data Points ◽

Confusion Matrices

The lack of annotated data is one of the major barriers facing machine learning applications today. Learning from crowds, i.e. collecting ground-truth data from multiple inexpensive annotators, has become a common method to cope with this issue. It has been recently shown that modeling the varying quality of the annotations obtained in this way, is fundamental to obtain satisfactory performance in tasks where inexpert annotators may represent the majority but not the most trusted group. Unfortunately, existing techniques represent annotation patterns for each annotator individually, making the models difficult to estimate in large-scale scenarios. In this paper, we present two models to address these problems. Both methods are based on the hypothesis that it is possible to learn collective annotation patterns by introducing confusion matrices that involve groups of data point annotations or annotators. The first approach clusters data points with a common annotation pattern, regardless the annotators from which the labels have been obtained. Implicitly, this method attributes annotation mistakes to the complexity of the data itself and not to the variable behavior of the annotators. The second approach explicitly maps annotators to latent groups that are collectively parametrized to learn a common annotation pattern. Our experimental results show that, compared with other methods for learning from crowds, both methods have advantages in scenarios with a large number of annotators and a small number of annotations per annotator.

Download Full-text

Hypertension Prediction Using Machine Learning Technique

International Journal of Strategic Decision Sciences ◽

10.4018/ijsds.2020070103 ◽

2020 ◽

Vol 11 (3) ◽

pp. 52-62

Author(s):

Youngkeun Choi ◽

Jae Choi

Keyword(s):

Machine Learning ◽

Predictive Performance ◽

Learning Technology ◽

Full Model ◽

Accuracy Rate ◽

Medical Problems ◽

Machine Learning Technique ◽

Learning Technique ◽

Average Glucose

Machine learning technology is used in advanced data analysis and optimization approaches for different kinds of medical problems. Hypertension is complicated, and every year it causes a lot of many severe illnesses such as stroke and heart disease. This study essentially had two primary goals. Firstly, this paper intends to understand the role of variables in hypertension modeling better. Secondly, the study seeks to evaluate the predictive performance of the decision trees. Based on these results, first, age, BMI, and average glucose level influence hypertension significantly, while other variables have an influence. Second, for the full model, the accuracy rate is 0.905, which implies that the error rate is 0.095. Among the patients who were predicted not to have hypertension, the accuracy that would not have hypertension was 90.51%, and the accuracy that had strike was 30.77% among the patients who were predicted to have hypertension.

Download Full-text

A mathematical model for describing and predicting the lactation curve of Merino ewes

Animal Science ◽

10.1017/s1357729800013564 ◽

1995 ◽

Vol 61 (1) ◽

pp. 95-101 ◽

Cited By ~ 14

Author(s):

P. C. N. Groenewald ◽

A. V. Ferreira ◽

H. J. van der Merwe ◽

S. C. Slippers

Keyword(s):

Mathematical Model ◽

Economic Evaluations ◽

Lactation Period ◽

Estimated Parameters ◽

Single Data Point ◽

Lactation Curve ◽

Data Points ◽

Single Data ◽

Wood Model ◽

Grossman Model

AbstractThe milk production of 63 5-year-old Merino ewes was measured over a 16-week period after lambing. The purpose was to find a suitable mathematical model to represent the lactation curve of Merino sheep and to estimate the parameters of the model for an individual ewe from a single data point in early lactation. Three models were considered, the three-parameter Wood model, yn = nb exp(a + en), the four-parameter Morant model, yn = exp(a + bn + en2 + d/n and the six-parameter Grossman model, yn = a1b1 - [tanh2 (b1n - c1))] + a2b2[1 – tanh2 (b2(n -c2,))].The Grossman model was found to be inappropriate for the available data, while there seems to be little difference in the suitability of the other two models. The Wood and Morant models both seem adequate to represent the lactation curve. A pattern in the estimated residuals suggests possible autocorrelations in the errors, but this is inconclusive due to the limited number of data points per animal.The correlation between the estimated parameters of the model and the daily yield measured during the 1st week of lactation enabled us to use linear regression to estimate the lactation curve of an individual animal based on the 1st week's yield. Confidence and prediction intervals for the yield during the rest of the lactation period may then also be constructed. This makes it possible to extend incomplete milk records for use in genetic evaluation, formulation of rations and economic evaluations.

Download Full-text

scikit-activeml: A Library and Toolbox for Active Learning Algorithms

10.20944/preprints202103.0194.v1 ◽

2021 ◽

Author(s):

Daniel Kottke ◽

Marek Herde ◽

Tuan Pham Minh ◽

Alexander Benz ◽

Pascal Mergard ◽

...

Keyword(s):

Machine Learning ◽

Active Learning ◽

Learning Algorithm ◽

Learning Algorithms ◽

Unlabeled Data ◽

Training Data ◽

Partially Labeled Data ◽

Difficult Time ◽

Machine Learning Applications ◽

Data Points

Machine learning applications often need large amounts of training data to perform well. Whereas unlabeled data can be easily gathered, the labeling process is difficult, time-consuming, or expensive in most applications. Active learning can help solve this problem by querying labels for those data points that will improve the performance the most. Thereby, the goal is that the learning algorithm performs sufficiently well with fewer labels. We provide a library called scikit-activeml that covers the most relevant query strategies and implements tools to work with partially labeled data. It is programmed in Python and builds on top of scikit-learn.

Download Full-text

An End-to-End Rumor Detection Model Based on Feature Aggregation

Complexity ◽

10.1155/2021/6659430 ◽

2021 ◽

Vol 2021 ◽

pp. 1-16

Author(s):

Aoshuang Ye ◽

Lina Wang ◽

Run Wang ◽

Wenqi Wang ◽

Jianpeng Ke ◽

...

Keyword(s):

Machine Learning ◽

Social Network ◽

Language Processing ◽

Domain Knowledge ◽

Learning Ability ◽

Feature Engineering ◽

Learning Technology ◽

Detection Model ◽

Feature Aggregation ◽

Rumor Detection

The social network has become the primary medium of rumor propagation. Moreover, manual identification of rumors is extremely time-consuming and laborious. It is crucial to identify rumors automatically. Machine learning technology is widely implemented in the identification and detection of misinformation on social networks. However, the traditional machine learning methods profoundly rely on feature engineering and domain knowledge, and the learning ability of temporal features is insufficient. Furthermore, the features used by the deep learning method based on natural language processing are heavily limited. Therefore, it is of great significance and practical value to study the rumor detection method independent of feature engineering and effectively aggregate heterogeneous features to adapt to the complex and variable social network. In this paper, a deep neural network- (DNN-) based feature aggregation modeling method is proposed, which makes full use of the knowledge of propagation pattern feature and text content feature of social network event without feature engineering and domain knowledge. The experimental results show that the feature aggregation model has achieved 94.4% of accuracy as the best performance in recent works.

Download Full-text

Application of Machine Learning in Pipeline Integrity Reliability

Volume 2: Computer Technology and Bolted Joints ◽

10.1115/pvp2019-93623 ◽

2019 ◽

Author(s):

Eunice Yin ◽

Phil Fernandes ◽

Janine Woo ◽

Doug Langer ◽

Sherif Hassanien

Keyword(s):

Machine Learning ◽

Management Practices ◽

Large Population ◽

Classification Model ◽

Learning Technology ◽

Learning Models ◽

Simulation Techniques ◽

Depth Analysis ◽

Machine Learning Applications ◽

Machine Learning Models

Abstract Probabilistic analysis is becoming increasingly adopted by pipeline integrity management practices in recent years. The practice employs reliability engineering methods to address pipeline integrity and safety concerns. At present, the industry is beginning to pair reliability methods with numerical methods to estimate probabilities of failure (PoF) for individual defects, or features, in a pipeline. The effort required for this can be intensive, since it must be repeated on hundreds of thousands of features, which need to be analyzed on a regular basis. This poses a challenge for pipeline reliability engineers, given limited human and computational resources. In the meantime, machine learning applications in many industries have grown significantly due to advancements in algorithms and raw computing power. With massive amounts of raw data available from inline inspection (ILI) tools, and artificial data available through simulation techniques, pipeline integrity reliability becomes a promising field in which to apply machine learning technology to fast-track PoF estimation. Since a large population of reported features have low PoFs and pose low risk to integrity and safety, they can be safely screened out using fast machine learning models to free up engineers who can be dedicated to in-depth analysis of more critical features, which could have a much larger impact on pipeline operational safety. In this paper, two machine learning models are proposed to address the pipeline integrity reliability challenges. The regression model was able to predict features with low PoFs with 99.99% confidence. The classification model was able to conservatively predict PoFs so that no high PoF feature was misclassified as being low PoF, while correctly filtering out 99.6% of the low PoF features. The proposed approach is presented and validated through pipeline integrity simulated case studies.

Download Full-text