Classifying Autism from Crowdsourced Semi-Structured Speech Recordings: A Machine Learning System (Preprint)

2021 ◽  
Author(s):  
Nathan Chi ◽  
Peter Washington ◽  
Aaron Kline ◽  
Arman Husic ◽  
Cathy Hou ◽  
...  

BACKGROUND: Autism spectrum disorder (ASD) is a neurodevelopmental disorder that results in altered behavior, social development, and communication patterns. In past years, autism prevalence has tripled, with 1 in 54 children now affected. Given that traditional diagnosis is a lengthy, labor-intensive process that requires the work of trained physicians, significant attention has been given to developing systems that automatically diagnose and screen for autism. OBJECTIVE: Prosody abnormalities are among the clearest signs of autism, with affected children displaying speech idiosyncrasies (including echolalia, monotonous intonation, atypical pitch, and irregular linguistic stress patterns). In this work, we present a suite of machine learning approaches to detect autism in self-recorded speech audio captured from autistic and neurotypical (NT) children in home environments. METHODS: We consider three methods to detect autism in child speech: first, Random Forests trained on extracted audio features (including Mel-frequency cepstral coefficients); second, convolutional neural networks (CNNs) trained on spectrograms; and third, fine-tuned wav2vec 2.0, a state-of-the-art Transformer-based speech recognition model. We train our classifiers on our novel dataset of cellphone-recorded child speech audio curated from Stanford’s Guess What? mobile game, an app designed to crowdsource videos of autistic and neurotypical children in a natural home environment. RESULTS: The Random Forest classifier achieves 70% accuracy, the fine-tuned wav2vec 2.0 model achieves 77% accuracy, and the CNN achieves 79% accuracy when classifying children’s audio as either ASD or NT. We use five-fold cross-validation to evaluate model performance. CONCLUSIONS: Our models were able to predict autism status when training on a varied selection of home audio clips with inconsistent recording qualities, which may be more generalizable to real-world conditions. The results demonstrate that machine learning methods offer promise in detecting autism automatically from speech without specialized equipment.
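
The five-fold cross-validation mentioned in the abstract above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline: the function names, the toy labels, and the majority-label placeholder "model" are all assumptions introduced here, and a real evaluation would train a Random Forest, CNN, or wav2vec 2.0 classifier inside the loop.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Split indices 0..n_samples-1 into k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def majority_label(labels):
    """Placeholder 'model' (assumption): always predict the most common training label."""
    return max(sorted(set(labels)), key=labels.count)


def cross_validate(labels, k=5, seed=0):
    """Return the held-out accuracy for each of the k folds."""
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then score on fold i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        prediction = majority_label([labels[j] for j in train_idx])
        correct = sum(labels[j] == prediction for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Toy labels only (hypothetical): 12 "ASD" and 8 "NT" clips, not the study's data.
toy_labels = ["ASD"] * 12 + ["NT"] * 8
fold_accuracies = cross_validate(toy_labels)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

Each clip is scored exactly once, by a model that never saw it during training, and the reported figure is the mean of the per-fold accuracies.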

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lei Li ◽  
Desheng Wu

Purpose: The infraction of securities regulations (ISRs) by listed firms in their day-to-day operations and management has become a common problem. This paper proposes several machine learning approaches to forecast the risk of infractions by listed corporates, addressing financial supervision that is neither effective nor precise. Design/methodology/approach: The overall research framework designed for forecasting ISRs includes data collection and cleaning, feature engineering, data split, prediction approach application and model performance evaluation. We select Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machines, Artificial Neural Network and Long Short-Term Memory Networks (LSTMs) as ISR prediction models. Findings: The research results show that the proposed models predict ISRs significantly better with prior infractions than without them, especially for large sample sets. The results also indicate that, when judging whether a company has infractions, we should pay attention to novel artificial intelligence methods, previous infractions of the company, and large data sets. Originality/value: The findings could be utilized, to a certain degree, to address the problem of identifying listed corporates' ISRs. Overall, the results elucidate the value of prior infractions of securities regulations. This shows the importance of including more data sources when constructing distress models rather than only building increasingly complex models on the same data. This is also beneficial to the regulatory authorities.


2021 ◽  
Author(s):  
Astrid Rybner ◽  
Emil Trenckner Jessen ◽  
Marie Damsgaard Mortensen ◽  
Stine Nyhus Larsen ◽  
Ruth Grossman ◽  
...  

Background: Machine learning (ML) approaches show increasing promise to identify vocal markers of Autism Spectrum Disorder (ASD). Nonetheless, it is unclear to what extent such markers generalize to new speech samples collected in diverse settings such as using a different speech task or a different language. Aim: In this paper, we systematically assess the generalizability of ML findings across a variety of contexts. Methods: We re-train a promising published ML model of vocal markers of ASD on novel cross-linguistic datasets following a rigorous pipeline to minimize overfitting, including cross-validated training and ensemble models. We test the generalizability of the models by testing them on i) different participants from the same study, performing the same task; ii) the same participants, performing a different (but similar) task; iii) a different study with participants speaking a different language, performing the same type of task. Results: While model performance is similar to previously published findings when trained and tested on data from the same study (out-of-sample performance), there is considerable variance between studies. Crucially, the models do not generalize well to new similar tasks and not at all to new languages. The ML pipeline is openly shared. Conclusion: Generalizability of ML models of vocal markers - and more generally biobehavioral markers - of ASD is an issue. We outline three recommendations researchers could take in order to be more explicit about generalizability and improve it in future studies.
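
Test scenario (i) in the abstract above, evaluating on different participants from the same study, depends on keeping each participant's recordings entirely on one side of the split. A minimal sketch of such a participant-level split follows; the record layout, field names, and participant IDs are hypothetical and are not taken from the authors' shared pipeline.

```python
def split_by_participant(samples, test_participants):
    """Split so that no participant contributes to both the train and test sets."""
    train = [s for s in samples if s["participant"] not in test_participants]
    test = [s for s in samples if s["participant"] in test_participants]
    return train, test


# Hypothetical recordings: several clips per participant.
samples = [
    {"participant": "p1", "clip": "a"},
    {"participant": "p1", "clip": "b"},
    {"participant": "p2", "clip": "c"},
    {"participant": "p3", "clip": "d"},
    {"participant": "p3", "clip": "e"},
]
train_set, test_set = split_by_participant(samples, test_participants={"p3"})
```

Splitting by clip instead of by participant would let a model recognize individual voices, which tends to inflate out-of-sample performance.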


Author(s):  
Sachin Kumar ◽  
Karan Veer

Aims: The objective of this research is to predict Covid-19 cases in India using machine learning approaches. Background: Covid-19, a respiratory disease caused by a member of the coronavirus family, led to a worldwide pandemic in 2020. The virus was first detected in the city of Wuhan, China, in December 2019, and took less than three months to spread across the globe. Objective: In this paper, we propose a regression model based on the support vector machine (SVM) to forecast the number of deaths, the number of recovered cases, and the total confirmed cases for the next 30 days. Method: For prediction, the data were collected from GitHub and India's Ministry of Health and Family Welfare from March 14, 2020, to December 3, 2020. The model was designed in Python 3.6 in Anaconda to forecast corona trends until September 21, 2020. The proposed methodology is based on predicting values using an SVM-based regression model with polynomial, linear, and RBF kernels. The dataset was divided into train and test datasets with 40% and 60% test size and verified with real data. Model performance is evaluated using mean square error, mean absolute error, and percentage accuracy. Results and Conclusion: The results show that the polynomial model obtained an accuracy score above 95%, the linear model scored above 90%, and the RBF model scored above 85% in predicting cumulative deaths, confirmed cases, and recovered cases.
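
The three evaluation measures named in the abstract above can be sketched in plain Python. Mean square error and mean absolute error follow their standard definitions. The abstract does not define "percentage accuracy", so the version below (100 minus the mean absolute percentage error) is an assumption made for illustration, and the numbers are toy values rather than case counts from the paper.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def percentage_accuracy(actual, predicted):
    """ASSUMPTION: defined here as 100 * (1 - mean absolute percentage error).

    The source abstract does not state its formula. Requires non-zero actual values.
    """
    mape = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)
    return 100.0 * (1.0 - mape)


# Toy values (hypothetical), not data from the study.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
accuracy = percentage_accuracy(actual, predicted)
```

Because mean square error grows with the square of each miss, it penalizes a few large forecast errors more heavily than mean absolute error does.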


2020 ◽  
Vol 12 (1) ◽  
Author(s):  
M. Withnall ◽  
E. Lindelöf ◽  
O. Engkvist ◽  
H. Chen

Neural Message Passing for graphs is a promising and relatively recent approach for applying Machine Learning to networked data. As molecules can be described intrinsically as a molecular graph, it makes sense to apply these techniques to improve molecular property prediction in the field of cheminformatics. We introduce Attention and Edge Memory schemes to the existing message passing neural network framework, and benchmark our approaches against eight different physical–chemical and bioactivity datasets from the literature. We remove the need to introduce a priori knowledge of the task and chemical descriptor calculation by using only fundamental graph-derived properties. Our results consistently perform on par with other state-of-the-art machine learning approaches, and set a new standard on sparse multi-task virtual screening targets. We also investigate model performance as a function of dataset preprocessing, and make some suggestions regarding hyperparameter selection.
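
The basic idea of message passing on a molecular graph can be sketched as follows. This is a deliberately simplified illustration with fixed sum aggregation; it does not implement the Attention or Edge Memory schemes described above, and real message passing networks use learned message and update functions. The atom encoding and function names are assumptions introduced here.

```python
def message_passing_step(node_states, edges):
    """One round: each node adds the sum of its neighbours' state vectors to its own."""
    neighbours = {node: [] for node in node_states}
    for a, b in edges:  # bonds are treated as undirected
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = {}
    for node, state in node_states.items():
        message = [0.0] * len(state)
        for nb in neighbours[node]:
            message = [m + s for m, s in zip(message, node_states[nb])]
        # Fixed update rule (assumption): new state = old state + aggregated message.
        updated[node] = [s + m for s, m in zip(state, message)]
    return updated


def readout(node_states):
    """Sum all node states into one graph-level feature vector."""
    size = len(next(iter(node_states.values())))
    return [sum(state[i] for state in node_states.values()) for i in range(size)]


# Heavy atoms of ethanol (C-C-O), encoded one-hot as [is_carbon, is_oxygen].
atoms = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
bonds = [(0, 1), (1, 2)]

after_one_step = message_passing_step(atoms, bonds)
graph_vector = readout(after_one_step)
```

After one step, each atom's state reflects its immediate bonded neighbours. Repeating the step spreads information further along the graph before the readout produces a molecule-level vector for property prediction.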


Author(s):  
Joseph McGrath ◽  
Jonathon Neville ◽  
Tom Stewart ◽  
John Cronin

Inertial measurement units (IMUs) are becoming increasingly popular in activity classification and workload measurement in sport. This systematic literature review focuses on upper body activity classification in court or field-based sports. The aim of this paper is to provide sport scientists and coaches with an overview of the past research in this area, as well as the processes and challenges involved in activity classification. The SPORTDiscus, PubMed and Scopus databases were searched, resulting in 20 articles. Both manually defined algorithms and machine learning approaches have been used to classify IMU data with varying degrees of success. Manually defined algorithms may offer simplicity and reduced computational demand, whereas machine learning may be beneficial for complex classification problems. Inter-study results show that no one machine learning model is best for activity classification; differences in sensor placement, IMU specification and pre-processing decisions can all affect model performance. Accurate classification of sporting activities could benefit players, coaches and team medical personnel by providing an objective estimate of workload. This could help to prevent injuries, enhance performance and provide valuable data to coaching staff.


Science ◽  
2021 ◽  
Vol 371 (6535) ◽  
pp. eabe8628
Author(s):  
Marshall Burke ◽  
Anne Driscoll ◽  
David B. Lobell ◽  
Stefano Ermon

Accurate and comprehensive measurements of a range of sustainable development outcomes are fundamental inputs into both research and policy. We synthesize the growing literature that uses satellite imagery to understand these outcomes, with a focus on approaches that combine imagery with machine learning. We quantify the paucity of ground data on key human-related outcomes and the growing abundance and improving resolution (spatial, temporal, and spectral) of satellite imagery. We then review recent machine learning approaches to model-building in the context of scarce and noisy training data, highlighting how this noise often leads to incorrect assessment of model performance. We quantify recent model performance across multiple sustainable development domains, discuss research and policy applications, explore constraints to future progress, and highlight research directions for the field.


2020 ◽  
Vol 9 (2) ◽  
pp. 111-118
Author(s):  
Shindy Arti ◽  
Indriana Hidayah ◽  
Sri Suning Kusumawardhani

Machine learning is commonly used to predict and implement pattern recognition and the relationship between variables. Causal machine learning combines approaches for analyzing the causal impact of an intervention on the result, assuming considerably ambiguous variables. The combination of causality and machine learning is adequate for predicting and understanding the cause and effect of the results. This study is a systematic review that aims to identify which causal machine learning approaches are generally used. This paper focuses on what data characteristics are applied to causal machine learning research and how to assess the output of algorithms used in the context of causal machine learning research. The review analyzes 20 papers with various approaches. This study categorizes data characteristics based on the type of data, attribute value, and the data dimension. The Bayesian Network (BN) is commonly used in the context of causality. Meanwhile, the propensity score is the most extensively used in causality research. The variable value will affect algorithm performance. This review can serve as a guide in the selection of a causal machine learning system.


2021 ◽  
Author(s):  
Afef Saihi ◽  
Hussam Alshraideh

Autism spectrum disorder (ASD) is a neurodevelopmental disorder associated with challenges in communication, social interaction, and repetitive behaviors. Getting a clear diagnosis for a child is necessary for starting early intervention and having access to therapy services. However, many barriers hinder the screening of these children for autism at an early stage, which might further delay access to therapeutic interventions. One promising direction for improving the efficiency and accuracy of ASD detection in toddlers is the use of machine learning techniques to build classifiers that serve this purpose. This paper contributes to this area and uses the data developed by Dr. Fadi Fayez Thabtah to train and test various machine learning classifiers for early ASD screening. Based on various attributes, three models were trained and compared: Decision Tree C4.5, Random Forest, and Neural Network. All three models provided very good accuracies on testing data; however, the Neural Network outperformed the other two. This work contributes to the early screening of toddlers by helping identify those who have ASD traits and should pursue formal clinical diagnosis.


Sensors ◽  
2019 ◽  
Vol 19 (2) ◽  
pp. 313 ◽  
Author(s):  
Pengbo Gao ◽  
Yan Zhang ◽  
Linhuan Zhang ◽  
Ryozo Noguchi ◽  
Tofael Ahamed

Unmanned aerial vehicle (UAV)-based spraying systems have recently become important for the precision application of pesticides, using machine learning approaches. Therefore, the objective of this research was to develop a machine learning system that has the advantages of high computational speed and good accuracy for recognizing spray and non-spray areas for UAV-based sprayers. A machine learning system was developed by using the mutual subspace method (MSM) for images collected from a UAV. Two target land types, agricultural croplands and orchard areas, were considered in building two classifiers for distinguishing spray and non-spray areas. The field experiments were conducted in target areas to train and test the system by using a commercial UAV (DJI Phantom 3 Pro) with an onboard 4K camera. The images were collected from low (5 m) and high (15 m) altitudes for croplands and orchards, respectively. The recognition system was divided into offline and online systems. In the offline recognition system, 74.4% accuracy was obtained for the classifiers in recognizing spray and non-spray areas for croplands. In the case of orchards, the average classifier recognition accuracy of spray and non-spray areas was 77%. On the other hand, the online recognition system had an average accuracy of 65.1% for croplands and 75.1% for orchards. The computational time for the online recognition system was minimal, with an average of 0.0031 s for classifier recognition. The developed machine learning system had an average recognition accuracy of 70%, which can be implemented in an autonomous UAV spray system for recognizing spray and non-spray areas for real-time applications.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Maya Varma ◽  
Kelley M. Paskov ◽  
Brianna S. Chrisman ◽  
Min Woo Sun ◽  
Jae-Yoon Jung ◽  
...  

Background: Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders. Results: We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet demonstrates great variability in the features selected from over 230,000 possible variants. In order to improve model stability, we extract the variants assigned non-zero weights in each of 5 cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allowed us to identify 55 variants, which we show to be more stable than the features identified by the original classifier. Conclusion: Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders.
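
The maximum flow computation that the method above builds on can be sketched with the Edmonds-Karp algorithm. This is a generic illustration on a small hypothetical network: it does not reproduce how the authors construct their flow network from cross-validation folds and LD constraints, and the node names and capacities are assumptions chosen only to make the example easy to follow.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along shortest residual paths found by BFS."""
    # Residual capacities, with every reverse edge initialised to 0 if absent.
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in capacity.items():
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path remains
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow: reduce forward capacity, increase reverse capacity.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical network: "s" is the source, "t" the sink, "a" and "b" are intermediate nodes.
network = {
    "s": {"a": 3, "b": 2},
    "a": {"b": 1, "t": 2},
    "b": {"t": 3},
}
flow_value = max_flow(network, "s", "t")
```

In this toy network the flow is limited by the total capacity leaving the source. In a feature-selection setting, edge capacities would instead encode constraints such as which variants may be linked.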

