fNIRS-QC: Crowd-Sourced Creation of a Dataset and Machine Learning Model for fNIRS Quality Control

2021 ◽  
Vol 11 (20) ◽  
pp. 9531
Author(s):  
Giulio Gabrieli ◽  
Andrea Bizzego ◽  
Michelle Jin Yee Neoh ◽  
Gianluca Esposito

Despite technological advancements in functional Near-Infrared Spectroscopy (fNIRS) and a rise in the application of fNIRS in neuroscience experimental designs, the processing of fNIRS data remains characterized by a high number of heterogeneous approaches, which compromises the scientific reproducibility and interpretability of the results. For example, manual inspection is still necessary to assess the quality, and the subsequent retention for analysis, of collected fNIRS signals. Machine Learning (ML) approaches are well positioned to provide a unique contribution to fNIRS data processing by automating and standardizing methodological approaches for quality control, where ML models can produce objective and reproducible results. However, any successful ML application is grounded in a high-quality dataset of labeled training data, and unfortunately, no such dataset is currently available for fNIRS signals. In this work, we introduce fNIRS-QC, a platform designed for the crowd-sourced creation of a quality control fNIRS dataset. In particular, we (a) composed a dataset of 4385 fNIRS signals and (b) created a web interface that allows multiple users to manually label the signal quality of 510 10 s fNIRS segments. Finally, (c) a subset of the labeled dataset was used to develop a proof-of-concept ML model to automatically assess the quality of fNIRS signals. The developed ML models can serve as a more objective and efficient quality control check that minimizes error from manual inspection and the need for expertise in signal quality control.
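
The abstract does not include the model itself; below is a minimal sketch of the kind of segment-level quality classifier it describes, assuming 10 s segments and crowd-sourced good/bad labels. The feature set, the choice of classifier, and all variable names are placeholders, not the authors' pipeline.

```python
# Minimal sketch of a segment-level fNIRS quality classifier (illustrative only).
# Assumes `segments` is an (n_segments, n_samples) array of 10 s fNIRS segments
# and `labels` holds the crowd-sourced quality annotations (0 = bad, 1 = good).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def extract_features(segment: np.ndarray) -> np.ndarray:
    """Hand-crafted features; placeholders for whatever the real model uses."""
    diff = np.diff(segment)
    return np.array([
        segment.mean(),                                           # baseline level
        segment.std(),                                            # overall variability
        np.abs(diff).max(),                                       # largest jump (motion spikes)
        np.percentile(segment, 95) - np.percentile(segment, 5),   # signal range
    ])

def train_quality_model(segments: np.ndarray, labels: np.ndarray):
    X = np.vstack([extract_features(s) for s in segments])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                              stratify=labels, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))
    return clf
```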

2020 ◽  
pp. short19-1-short19-9
Author(s):  
Alexey Kochkarev ◽  
Alexander Khvostikov ◽  
Dmitry Korshunov ◽  
Andrey Krylov ◽  
Mikhail Boguslavskiy

Data imbalance is a common problem in machine learning and image processing. The lack of training data for the rarest classes can lead to worse learning ability and negatively affect the quality of segmentation. In this paper, we focus on the problem of data balancing for the task of image segmentation. We review major trends in handling unbalanced data and propose a new method for data balancing based on the Distance Transform. The method is designed for use in segmentation convolutional neural networks (CNNs), but it is universal and can be used with any patch-based segmentation machine learning model. The evaluation of the proposed data balancing method is performed on two datasets. The first is the medical dataset LiTS, containing CT images of the liver with tumor abnormalities. The second is a geological dataset containing photographs of polished sections of different ores. The proposed algorithm enhances the data balance between classes and improves the overall performance of the CNN model.
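
The abstract does not spell out the balancing rule, so the sketch below only illustrates the general idea of using a distance transform to bias patch sampling toward rare-class regions; the weighting function and function names are assumptions, not the paper's algorithm.

```python
# Sketch of distance-transform-weighted patch sampling for segmentation.
# Illustrative only: biases patch centres toward pixels close to the rare class.
import numpy as np
from scipy.ndimage import distance_transform_edt

def sampling_map(mask: np.ndarray, rare_class: int) -> np.ndarray:
    """Higher weight close to (and inside) regions of the rare class."""
    rare = (mask == rare_class)
    dist = distance_transform_edt(~rare)   # distance to the nearest rare-class pixel
    weights = 1.0 / (1.0 + dist)           # decay with distance (assumed weighting)
    return weights / weights.sum()

def sample_patch_centres(mask, rare_class, n_patches, rng=np.random.default_rng(0)):
    probs = sampling_map(mask, rare_class).ravel()
    idx = rng.choice(mask.size, size=n_patches, p=probs)
    return np.column_stack(np.unravel_index(idx, mask.shape))
```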


2022 ◽  
Author(s):  
Maxat Kulmanov ◽  
Robert Hoehndorf

Motivation: Protein functions are often described using the Gene Ontology (GO), an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require a significant amount of training data and cannot make predictions for GO classes that have only a few or no experimental annotations. Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted. Availability: http://github.com/bio-ontology-research-group/deepgozero
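
The following is a conceptual sketch, not the DeepGOZero implementation: it only illustrates why a class with no annotated proteins can still be scored when its embedding comes from the ontology's axioms rather than from annotated examples. The scoring function and all names are assumptions.

```python
# Conceptual sketch of zero-shot scoring against ontology class embeddings.
import numpy as np

def predict_functions(protein_emb: np.ndarray,
                      class_embs: dict[str, np.ndarray],
                      threshold: float = 0.5) -> list[str]:
    """Score a protein against every GO class embedding, including classes
    that never appeared in the annotation training data."""
    scores = {go_id: 1.0 / (1.0 + np.exp(-protein_emb @ emb))
              for go_id, emb in class_embs.items()}
    return [go_id for go_id, s in scores.items() if s >= threshold]
```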


2021 ◽  
Vol 14 (6) ◽  
pp. 997-1005
Author(s):  
Sandeep Tata ◽  
Navneet Potti ◽  
James B. Wendt ◽  
Lauro Beltrão Costa ◽  
Marc Najork ◽  
...  

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinite ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean and discuss three key data management challenges: 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.
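
Glean's internals are not given in the abstract; the sketch below only illustrates the general pattern of turning labeled documents into training examples against a target schema (challenge 2 above). The data structures, field names, and labeling rule are hypothetical.

```python
# Illustrative sketch of generating training data from labeled documents.
# Not Glean itself: names and structures are placeholders for the general pattern.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str       # a span extracted from the document, e.g. "2021-03-01"
    field: str      # schema field it might fill, e.g. "invoice_date"
    features: dict  # layout/context features consumed by the ML model

def make_training_examples(candidates: list[Candidate],
                           ground_truth: dict[str, str]) -> list[tuple[dict, int]]:
    """Label each candidate positive if it matches the annotated field value."""
    examples = []
    for c in candidates:
        label = int(ground_truth.get(c.field, "").strip() == c.text.strip())
        examples.append((c.features, label))
    return examples
```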


Author(s):  
Monalisa Ghosh ◽  
Chetna Singhal

Video streaming services dominate internet traffic, creating a competitive environment in which providers strive to deliver the best quality of experience (QoE) to users. The standard codecs used in video transmission systems eliminate spatiotemporal redundancies to decrease the bandwidth requirement, which may adversely affect the perceptual quality of videos. Both subjective and objective parameters can be used to rate video quality, so it is essential to construct frameworks that measure the integrity of video as humans would. This chapter focuses on the application of machine learning to evaluate QoE without requiring human effort, achieving accuracies of 86% and 91% with linear regression and support vector regression, respectively. A machine learning model is developed from the subjective scores to forecast the subjective quality of H.264 videos streamed over wireless networks.
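
As a rough illustration of the two regressors named in the chapter, the sketch below fits linear regression and SVR on objective video/network features to predict subjective scores. The feature set, hyperparameters, and cross-validation setup are assumptions, not the chapter's configuration.

```python
# Sketch: predict subjective QoE scores from objective features with the two
# models mentioned in the chapter (linear regression and support vector regression).
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def fit_qoe_models(X, mos_scores):
    """X: objective features (e.g. bitrate, packet loss); mos_scores: subjective ratings."""
    models = {"linear": LinearRegression(),
              "svr": SVR(kernel="rbf", C=10.0, epsilon=0.1)}
    for name, model in models.items():
        r2 = cross_val_score(model, X, mos_scores, cv=5, scoring="r2").mean()
        print(f"{name}: mean CV R^2 = {r2:.3f}")
    return {name: model.fit(X, mos_scores) for name, model in models.items()}
```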


2022 ◽  
pp. 1559-1575
Author(s):  
Mário Pereira Véstias

Machine learning is the study of algorithms and models that enable computing systems to perform tasks based on pattern identification and inference. When it is difficult or infeasible to develop an algorithm for a particular task, machine learning algorithms can provide an output based on previous training data. A well-known branch of machine learning is deep learning. The most recent deep learning models are based on artificial neural networks (ANNs). Several types of artificial neural networks exist, including the feedforward neural network, the Kohonen self-organizing neural network, the recurrent neural network, the convolutional neural network, and the modular neural network, among others. This article focuses on convolutional neural networks (CNNs), with a description of the model, the training and inference processes, and its applicability. It also gives an overview of the most used CNN models and what to expect from the next generation of CNN models.
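
To make the building blocks concrete, here is a minimal convolutional network in Keras showing the convolution, pooling, and fully connected layers the article describes. The layer sizes are arbitrary examples, not a recommended architecture.

```python
# Minimal CNN sketch: convolution + pooling feature extractor followed by a
# fully connected classifier head.
import tensorflow as tf

def build_small_cnn(input_shape=(32, 32, 3), num_classes=10) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```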


Energies ◽  
2020 ◽  
Vol 13 (17) ◽  
pp. 4368 ◽  
Author(s):  
Chun-Wei Chen ◽  
Chun-Chang Li ◽  
Chen-Yu Lin

An energy baseline is an important method for measuring the energy-saving benefits of a chiller system; the benefits can be calculated by comparing the predictions of the baseline model with the actual results. Currently, machine learning is often adopted to build energy-baseline prediction models, with common choices including regression, ensemble learning, and deep learning models. In this study, we first reviewed several machine learning algorithms that were used to establish prediction models. Then, clustering was adopted to preprocess the chiller data. Data mining, K-means clustering, and the gap statistic were used to identify the critical variables for clustering chiller operating modes. Applying these key variables effectively enhanced the quality of the chiller data, and combining the clustering results with the machine learning model improved both the prediction accuracy of the model and the reliability of the energy baselines.
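
The sketch below illustrates the cluster-then-predict idea described above: cluster chiller operating modes with K-means and feed the cluster label to a baseline prediction model. The feature set, the regressor, and the choice of k (found via the gap statistic in the study) are placeholders.

```python
# Sketch: K-means clustering of chiller operating modes combined with a
# machine-learning energy-baseline model. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def clustered_energy_baseline(X, y, n_clusters=3):
    """X: key chiller variables; y: measured energy consumption."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    X_aug = np.column_stack([X, km.labels_])        # append operating-mode label
    model = GradientBoostingRegressor(random_state=0)
    score = cross_val_score(model, X_aug, y, cv=5, scoring="r2").mean()
    print(f"baseline model CV R^2 with {n_clusters} clusters: {score:.3f}")
    return km, model.fit(X_aug, y)
```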


2020 ◽  
Vol 35 (Supplement_3) ◽  
Author(s):  
Jerry Yu ◽  
Andrew Long ◽  
Maria Hanson ◽  
Aleetha Ellis ◽  
Michael Macarthur ◽  
...  

Background and Aims: There are many benefits to performing dialysis at home, including more flexibility and more frequent treatments. A possible barrier to the election of home therapy (HT) by in-center patients is a lack of adequate HT education. To aid efficient education efforts, a predictive model was developed to help identify patients who are more likely to switch from in-center treatment and succeed on HT. Method: We developed a machine learning model to predict which patients treated in-center without prior HT history are most likely to switch to HT in the next 90 days and stay on HT for at least 90 days. Training data were extracted from 2016–2019 for approximately 300,000 patients. We randomly sampled one in-center treatment date per patient and determined whether the patient would switch and succeed on HT. The input features consisted of treatment vitals, laboratory values, absence history, comprehensive assessments, facility information, county-level housing data, and patient characteristics. Patients were excluded if they had less than 30 days on dialysis, due to a lack of data. A machine learning model (XGBoost classifier) was deployed monthly in a pilot with a team of HT educators to investigate the model's utility for identifying HT candidates. Results: Approximately 1,200 patients started home therapy per month at a large dialysis provider, with approximately one-third being in-center patients. The prevalence of switching to and succeeding on HT in this population was 2.54%. The predictive model achieved an area under the curve of 0.87, a sensitivity of 0.77, and a specificity of 0.80 on a hold-out test dataset. The pilot was successfully executed for several months and two major lessons were learned: 1) some patients who reappeared on each month's list should be removed from the list after expressing no interest in HT, and 2) a data collection mechanism should be put in place to capture the reasons why patients are not interested in HT. Conclusion: This quality-improvement initiative demonstrates that predictive modeling can be used to identify patients likely to switch to and succeed on home therapy. Integration of the model into existing workflows requires creating a feedback loop, which can help improve future worklists.
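
As a rough sketch of the modeling step described above (an XGBoost classifier for an outcome with roughly 2.5% prevalence, evaluated by AUC on a hold-out set), the code below shows one common way to handle the class imbalance. Feature engineering, cohort construction, and hyperparameters are omitted or assumed.

```python
# Sketch: imbalanced XGBoost classifier evaluated by AUC on a hold-out set.
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def train_ht_model(X, y):
    """X: patient/treatment features; y: 1 if the patient switched to and succeeded on HT."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    pos_weight = (y_tr == 0).sum() / max((y_tr == 1).sum(), 1)   # rebalance ~2.5% positives
    clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                        scale_pos_weight=pos_weight, eval_metric="auc")
    clf.fit(X_tr, y_tr)
    print("hold-out AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return clf
```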


2017 ◽  
Vol 14 (2) ◽  
Author(s):  
Müşerref Duygu Saçar Demirci ◽  
Jens Allmer

MicroRNAs (miRNAs) are involved in the post-transcriptional regulation of protein abundance and thus have a great impact on the resulting phenotype. It is, therefore, no wonder that they have been implicated in many diseases ranging from virus infections to cancer. This impact on the phenotype leads to a great interest in establishing the miRNAs of an organism. Experimental methods are complicated, which has led to the development of computational methods for pre-miRNA detection. Such methods generally employ machine learning to establish models for the discrimination between miRNAs and other sequences. Positive training data for model establishment, for the most part, stem from miRBase, the miRNA registry. The quality of the entries in miRBase has been questioned, though. This unknown quality led to the development of filtering strategies in attempts to produce high-quality positive datasets, which can lead to a scarcity of positive data. To analyze the quality of filtered data, we developed a machine learning model and found that it is well able to establish data quality based on intrinsic measures. Additionally, we analyzed which features describing pre-miRNAs can discriminate between low- and high-quality data. Both models are applicable to data from miRBase and can be used for establishing high-quality positive data. This will facilitate the development of better miRNA detection tools, which will make the prediction of miRNAs in disease states more accurate. Finally, we applied both models to all miRBase data and provide the list of high-quality hairpins.
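
The sketch below illustrates the two analyses described above in the simplest possible form: a classifier separating low- from high-quality pre-miRNA entries, plus a look at which features drive the separation. The feature names are examples of typical pre-miRNA descriptors, not the paper's exact set, and the classifier choice is an assumption.

```python
# Sketch: pre-miRNA quality classifier and feature-importance inspection.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def quality_model(features: pd.DataFrame, quality_label: pd.Series):
    """features: e.g. hairpin length, GC content, minimum free energy, ...
    quality_label: 1 for high-quality entries, 0 for low-quality entries."""
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(features, quality_label)
    importances = pd.Series(clf.feature_importances_, index=features.columns)
    print(importances.sort_values(ascending=False).head(10))   # most discriminative features
    return clf
```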


2020 ◽  
Vol 34 (01) ◽  
pp. 784-791 ◽  
Author(s):  
Qinbin Li ◽  
Zhaomin Wu ◽  
Zeyi Wen ◽  
Bingsheng He

The Gradient Boosting Decision Tree (GBDT) has been a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds lead to more noise being added to obtain a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the properties of the gradients and the contribution of each tree in a GBDT, we propose to adaptively control the gradients of the training data at each iteration and to clip leaf nodes in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach can achieve much better model accuracy than other baselines.
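
The toy sketch below is not the paper's training algorithm; it only illustrates the two underlying ingredients the paper tightens: bounding sensitivity by clipping per-sample gradients, and spending privacy budget when a leaf value is released with Laplace noise. The leaf-value formula and the sensitivity bound shown are simplified for illustration.

```python
# Toy illustration of gradient clipping and Laplace noise on a GBDT leaf value.
import numpy as np

def clip_gradients(grads: np.ndarray, clip_bound: float) -> np.ndarray:
    """Bound each sample's gradient so the leaf-value sensitivity is known."""
    return np.clip(grads, -clip_bound, clip_bound)

def noisy_leaf_value(grads: np.ndarray, clip_bound: float, epsilon: float,
                     lam: float = 1.0, rng=np.random.default_rng(0)) -> float:
    """Leaf value = -sum(g) / (n + lambda), plus Laplace noise calibrated to a
    simplified sensitivity bound implied by the clipping bound and this leaf's budget."""
    g = clip_gradients(grads, clip_bound)
    value = -g.sum() / (len(g) + lam)
    sensitivity = clip_bound / (len(g) + lam)   # simplified bound for illustration
    return value + rng.laplace(scale=sensitivity / epsilon)
```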


2021 ◽  
Vol 12 ◽  
Author(s):  
Sijie Chen ◽  
Wenjing Zhou ◽  
Jinghui Tu ◽  
Jian Li ◽  
Bo Wang ◽  
...  

Purpose: To establish a suitable machine learning model that identifies the primary lesions of metastatic tumors using an integrated learning approach, thereby improving the accuracy and efficiency of primary lesion diagnosis. Methods: After deleting the features whose expression level is lower than a threshold, we use two methods to perform feature selection and use XGBoost for classification. After the optimal model is selected through 10-fold cross-validation, it is verified on an independent test set. Results: Selecting around 800 genes for training, the R2-score of 10-fold CV on the training data reaches 96.38%, and the R2-score on the test data reaches 83.3%. Conclusion: These findings suggest that, by combining tumor data with machine learning methods, each cancer has its corresponding classification accuracy, which can be used to predict the location of primary metastatic tumors. The machine-learning-based method can be used as an orthogonal diagnostic method for comparing the model's predictions against the actual clinical pathological findings.
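
Below is a rough sketch of the pipeline described above: filter low-expression genes, select roughly 800 features, and classify primary sites with XGBoost under 10-fold cross-validation. The thresholds, the specific feature-selection methods, and the scoring metric are placeholders rather than the study's exact choices.

```python
# Sketch: feature filtering + selection + XGBoost classification of primary sites,
# evaluated with 10-fold cross-validation. Illustrative only.
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def primary_site_classifier(expr: np.ndarray, site_labels: np.ndarray):
    """expr: samples x genes expression matrix; site_labels: integer-encoded primary sites."""
    pipe = Pipeline([
        ("low_expr_filter", VarianceThreshold(threshold=0.1)),     # placeholder filter
        ("select", SelectKBest(f_classif, k=800)),                 # ~800 genes, as in the abstract
        ("xgb", XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)),
    ])
    scores = cross_val_score(pipe, expr, site_labels, cv=10)       # default accuracy scoring
    print("10-fold CV accuracy:", scores.mean())
    return pipe.fit(expr, site_labels)
```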

