Machine Learning Crowdfunding

2020 ◽  
Vol 10 (2) ◽  
pp. 1-11
Author(s):  
Evangelos Katsamakas ◽  
Hao Sun

Crowdfunding is a novel and important economic mechanism for funding projects and promoting innovation in the digital economy. This article explores the most recent structured and unstructured data from a crowdfunding platform. It provides an in-depth exploration of the data using text analytics techniques, such as sentiment analysis and topic modeling. It uses novel natural language processing to represent project descriptions, and evaluates machine learning models, including neural network models, to predict project fundraising success. It discusses the findings of the performance evaluation and summarizes lessons for crowdfunding platforms and their users.
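
A minimal sketch of the kind of pipeline the abstract describes: project descriptions are turned into a text representation and a small neural network predicts fundraising success. This is an illustrative assumption, not the authors' code; the file name and the "description"/"funded" columns are hypothetical placeholders, and TF-IDF stands in for the NLP representations the article evaluates.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("projects.csv")            # hypothetical export of platform data
X_text, y = df["description"], df["funded"]

# Represent project descriptions as TF-IDF vectors (a simple stand-in for the
# NLP representations evaluated in the article).
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(X_text)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A small feed-forward neural network as one candidate model for success prediction.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```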

The purpose of the research described in this article is a comparative analysis of the predictive quality of several machine learning and regression models. The model inputs are the consumer characteristics of a used car: brand, transmission type, drive type, engine type, mileage, body type, year of manufacture, seller's region in Ukraine, condition of the car, accident history, average price of analogues in Ukraine, engine volume, number of doors, availability of extra equipment, number of passenger seats, date of first registration, and whether the car was imported from abroad. Qualitative variables have been encoded as binary variables or by mean target encoding. Information about more than 200 thousand cars has been used for modeling. All models have been built and evaluated in Python using the Sklearn, CatBoost, statsmodels and Keras libraries. The following regression and machine learning models were considered in the course of the study: linear regression; polynomial regression; decision tree; neural network; models based on the k-nearest neighbors, random forest and gradient boosting algorithms; and an ensemble of models. The article presents the best option from each class of models in terms of quality (according to the R2, MAE, MAD and MAPE criteria). It has been found that the price of a passenger car is best predicted by non-linear models. The modeling results show that the dependence between the price of a car and its characteristics is best described by an ensemble of models comprising a neural network and models using the random forest and gradient boosting algorithms. The ensemble of models showed an average relative approximation error of 11.2% and an average relative forecast error of 14.34%. All nonlinear models for car price have approximately the same predictive quality (MAPE values differing by less than 2%) in this research.
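
A minimal sketch, not the study's code, of two steps the abstract names: mean target encoding of a categorical feature and a gradient-boosting price model evaluated with the listed metrics. The file name and column names ("brand", "mileage", "engine_volume", "year", "price") are illustrative placeholders.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error

df = pd.read_csv("cars_ua.csv")                       # hypothetical dataset
train, test = train_test_split(df, test_size=0.2, random_state=0)
train, test = train.copy(), test.copy()

# Mean target encoding: replace each brand with the average price of that brand,
# computed on the training split only to avoid target leakage.
brand_means = train.groupby("brand")["price"].mean()
for part in (train, test):
    part["brand_enc"] = part["brand"].map(brand_means).fillna(train["price"].mean())

features = ["brand_enc", "mileage", "engine_volume", "year"]
model = GradientBoostingRegressor(random_state=0)
model.fit(train[features], train["price"])

pred = model.predict(test[features])
print("R2  :", r2_score(test["price"], pred))
print("MAE :", mean_absolute_error(test["price"], pred))
print("MAPE:", mean_absolute_percentage_error(test["price"], pred))
```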


2021 ◽  
Author(s):  
Mohammed Ayub ◽  
SanLinn Kaka

Abstract Manual first-break picking from a large volume of seismic data is extremely tedious and costly. Deployment of machine learning models makes the process fast and cost effective. However, these machine learning models require highly representative and effective features for accurate automatic picking. Therefore, a First-Break (FB) picking classification model that uses an effective minimum number of features and promises performance efficiency is proposed. Variants of Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) can retain contextual information from long previous time steps. We deploy this advantage for FB picking, as seismic traces are amplitude values of vibration along the time axis. We use the behavioral fluctuation of amplitude as input features for LSTM and GRU. The models are trained on noisy data and tested for generalization on original traces not seen during the training and validation process. In order to analyze real-time suitability, the performance is benchmarked using accuracy, F1-measure and three other established metrics. We have trained two RNN models and two deep Neural Network models for FB classification using only amplitude values as features. Both LSTM and GRU achieve accuracy and F1-measure scores of 94.20%. With the same features, the Convolutional Neural Network (CNN) has an accuracy of 93.58% and an F1-score of 93.63%, while the Deep Neural Network (DNN) model has an accuracy of 92.83% and an F1-measure of 92.59%. From the experiment results, we see significantly superior performance of LSTM and GRU over CNN and DNN when using the same features. To test the robustness of the LSTM and GRU models, their performance is compared with a DNN model trained using nine features derived from seismic traces, and the performance superiority of the RNN models is again observed. Therefore, it is safe to conclude that RNN models (LSTM and GRU) are capable of classifying FB events efficiently even when using a minimum number of features that are not computationally expensive. The novelty of our work is the capability of automatic FB classification with RNN models that incorporate contextual behavioral information without the need for sophisticated feature extraction or engineering techniques, which in turn can help reduce cost and make the classification model more robust and faster.
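
A minimal Keras sketch, under assumed shapes and layer sizes rather than the authors' architecture, of an LSTM classifier that labels fixed-length windows of trace amplitudes as first-break or not. The synthetic arrays stand in for real seismic traces; swapping the LSTM layer for a GRU layer gives the second RNN variant discussed above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window_len = 64                                                # samples per window (assumed)
X = np.random.randn(1000, window_len, 1).astype("float32")     # stand-in amplitude windows
y = np.random.randint(0, 2, size=(1000,))                      # stand-in FB / non-FB labels

model = keras.Sequential([
    keras.Input(shape=(window_len, 1)),       # raw amplitude values only
    layers.LSTM(32),                           # replace with layers.GRU(32) for the GRU variant
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # first-break vs. non-first-break
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```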


2020 ◽  
pp. 1-22 ◽  
Author(s):  
D. Sykes ◽  
A. Grivas ◽  
C. Grover ◽  
R. Tobin ◽  
C. Sudlow ◽  
...  

Abstract Using natural language processing, it is possible to extract structured information from raw text in the electronic health record (EHR) at reasonably high accuracy. However, the accurate distinction between negated and non-negated mentions of clinical terms remains a challenge. EHR text includes cases where diseases are stated not to be present or are only hypothesised, meaning a disease can be mentioned in a report without being reported as present. This makes tasks such as document classification and summarisation more difficult. We have developed the rule-based EdIE-R-Neg, part of an existing text mining pipeline called EdIE-R (Edinburgh Information Extraction for Radiology reports, https://www.ltg.ed.ac.uk/software/edie-r/) developed to process brain imaging reports, and two machine learning approaches: one using a bidirectional long short-term memory network and another using a feedforward neural network. These were developed on data from the Edinburgh Stroke Study (ESS) and tested on data from routine reports from NHS Tayside (Tayside). Both datasets consist of written reports from medical scans. These models are compared with two existing rule-based models: pyConText (Harkema et al. 2009. Journal of Biomedical Informatics 42(5), 839–851), a python implementation of a generalisation of NegEx, and NegBio (Peng et al. 2017. NegBio: A high-performance tool for negation and uncertainty detection in radiology reports. arXiv e-prints, p. arXiv:1712.05898), which identifies negation scopes through patterns applied to a syntactic representation of the sentence. On both the test set of the dataset from which our models were developed and the largely similar Tayside test set, the neural network models and our custom-built rule-based system outperformed the existing methods. EdIE-R-Neg scored highest on F1 score, particularly on the test set of the Tayside dataset, from which no development data were used in these experiments, showing the power of custom-built rule-based systems for negation detection on datasets of this size. The performance gap of the machine learning models to EdIE-R-Neg on the Tayside test set was reduced by adding development Tayside data to the ESS training set, demonstrating the adaptability of the neural network models.
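
A toy NegEx-style illustration, not EdIE-R-Neg or pyConText themselves, of the core idea behind rule-based negation detection: a clinical term is flagged as negated when it falls inside the scope of a preceding negation cue. The cue list and fixed scope window here are deliberately simplified assumptions.

```python
import re

NEGATION_CUES = ["no", "not", "without", "no evidence of", "denies"]
SCOPE_WINDOW = 5  # tokens after a cue considered within the negation scope

def is_negated(sentence: str, term: str) -> bool:
    """Return True if `term` occurs within the scope of a negation cue."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    term_positions = [i for i, t in enumerate(tokens) if t == term.lower()]
    for cue in NEGATION_CUES:
        cue_tokens = cue.split()
        for i in range(len(tokens) - len(cue_tokens) + 1):
            if tokens[i:i + len(cue_tokens)] == cue_tokens:
                scope_end = i + len(cue_tokens) + SCOPE_WINDOW
                if any(i < p < scope_end for p in term_positions):
                    return True
    return False

print(is_negated("There is no evidence of acute infarct", "infarct"))   # True
print(is_negated("Findings consistent with acute infarct", "infarct"))  # False
```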


Author(s):  
Son

Extracting keywords from documents is an essential task in natural language processing. A challenge of this task is to define a reasonable set of keywords from which all relevant documents can be found. This paper proposes a new approach that exploits word-level handcrafted features and machine learning models to select a single document's most important keywords. To evaluate the proposed solution, we compare our results with the latest supervised and unsupervised automatic keyword extraction methods. Experimental results show that our model achieves the best results on the 9/20 data corpus, indicating that our proposed approach is promising.
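
A minimal sketch, under assumed features, of supervised keyword selection in the spirit of the abstract: candidate words receive handcrafted word-level features (relative frequency, position of first occurrence, word length) and a classifier scores them. The toy document and gold labels are illustrative placeholders, not the paper's corpora or feature set.

```python
from collections import Counter
from sklearn.linear_model import LogisticRegression

def word_features(doc_tokens):
    """Handcrafted word-level features for each unique token in a document."""
    counts = Counter(doc_tokens)
    first_pos = {}
    for i, tok in enumerate(doc_tokens):
        first_pos.setdefault(tok, i)
    n = len(doc_tokens)
    return {tok: [counts[tok] / n,        # relative frequency
                  first_pos[tok] / n,     # relative first position
                  len(tok)]               # word length
            for tok in counts}

doc = "machine learning models extract keywords from documents using features".split()
feats = word_features(doc)

# Hypothetical gold labels: 1 if the word is a keyword of this document.
labels = {tok: int(tok in {"machine", "learning", "keywords"}) for tok in feats}

X = list(feats.values())
y = [labels[tok] for tok in feats]
clf = LogisticRegression().fit(X, y)

scores = dict(zip(feats, clf.predict_proba(X)[:, 1]))
print(sorted(scores, key=scores.get, reverse=True)[:3])   # top-3 candidate keywords
```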


2020 ◽  
Vol 12 (6) ◽  
pp. 962 ◽  
Author(s):  
Changyu Liu ◽  
Xiaodong Huang ◽  
Xubing Li ◽  
Tiangang Liang

To improve the poor accuracy of the MODIS (Moderate Resolution Imaging Spectroradiometer) daily fractional snow cover product over the complex terrain of the Tibetan Plateau (RMSE = 0.30), unmanned aerial vehicle and machine learning technologies are employed to map the MODIS-based fractional snow cover over this terrain. Three machine learning models, including random forest, support vector machine, and back-propagation artificial neural network models, are trained and compared in this study. The results indicate that, compared with the MODIS daily fractional snow cover product, introducing a highly accurate snow map acquired by unmanned aerial vehicles as a reference into the machine learning models can significantly improve the MODIS fractional snow cover mapping accuracy. The random forest model shows the best accuracy among the three machine learning models, with an RMSE (root-mean-square error) of 0.23, especially over forestland and shrubland, with RMSEs of 0.13 and 0.18, respectively. Although the accuracies of the support vector machine and back-propagation artificial neural network models are worse over forestland and shrubland, their average errors are still better than that of MOD10A1. Different fractional snow cover gradients also affect the accuracy of the machine learning algorithms. Nevertheless, the random forest model remains stable across different fractional snow cover gradients and is, therefore, the best machine learning algorithm for MODIS fractional snow cover mapping in Tibetan Plateau areas with complex terrain and severely fragmented snow cover.
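
A minimal sketch, with assumed feature names and synthetic arrays, of the kind of regression described above: per-pixel MODIS predictors as inputs and the UAV-derived fractional snow cover of the same pixels as the reference target, with accuracy reported as RMSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Stand-ins for per-pixel MODIS predictors (e.g., band reflectances, NDSI) and
# UAV-derived fractional snow cover in [0, 1].
X = np.random.rand(5000, 6)
y = np.clip(X[:, 0] * 0.8 + 0.1 * np.random.randn(5000), 0, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

rf = RandomForestRegressor(n_estimators=200, random_state=1)
rf.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))
print(f"RMSE: {rmse:.2f}")
```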


2021 ◽  
Author(s):  
Flávio Arthur Oliveira Santos ◽  
Cleber Zanchettin ◽  
Leonardo Nogueira Matos ◽  
Paulo Novais

Abstract Robustness is a significant constraint in machine learning models. The performance of the algorithms must not deteriorate when training and testing with slightly different data. Deep neural network models achieve awe-inspiring results in a wide range of computer vision applications. Still, in the presence of noise or region occlusion, some models exhibit inaccurate performance even on data handled during training. Moreover, some experiments suggest that deep learning models sometimes use incorrect parts of the input information to perform inference. Active image augmentation (ADA) is an augmentation method that uses interpretability methods to augment the training data and improve the model's robustness against the problems described above. Although ADA presented interesting results, its original version only used the vanilla backpropagation interpretability method to train the U-Net model. In this work, we propose an extensive experimental analysis of the interpretability method's impact on ADA. We use five interpretability methods: vanilla backpropagation, guided backpropagation, gradient-weighted class activation mapping (GradCam), guided GradCam and InputXGradient. The results show that all methods achieve similar performance at the end of training, but when ADA is combined with GradCam, the U-Net model converges impressively fast.
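
A minimal PyTorch illustration, an assumption-laden sketch rather than the ADA implementation, of the underlying idea of interpretability-guided augmentation: a vanilla-backpropagation saliency map of the input is computed, the least relevant pixels are occluded, and the occluded image can be added back as an augmented training example. The model, input size and occlusion rule are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder model
image = torch.rand(1, 1, 28, 28, requires_grad=True)          # placeholder input
target_class = 3

# Vanilla backpropagation: gradient of the target class score w.r.t. input pixels.
score = model(image)[0, target_class]
score.backward()
saliency = image.grad.abs().squeeze()          # (28, 28) saliency map

# Occlude the 50% least salient pixels to build the augmented training example.
mask = (saliency >= saliency.median()).float()
augmented = image.detach() * mask              # broadcasts over batch/channel dims
print(augmented.shape)
```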


2019 ◽  
Vol 45 (2) ◽  
pp. 293-337 ◽  
Author(s):  
Hao Zhang ◽  
Richard Sproat ◽  
Axel H. Ng ◽  
Felix Stahlberg ◽  
Xiaochang Peng ◽  
...  

Machine learning, including neural network techniques, has been applied to virtually every domain in natural language processing. One problem that has been somewhat resistant to effective machine learning solutions is text normalization for speech applications such as text-to-speech synthesis (TTS). In this application, one must decide, for example, that 123 is verbalized as one hundred twenty three in 123 pages but as one twenty three in 123 King Ave. For this task, state-of-the-art industrial systems depend heavily on hand-written language-specific grammars. We propose neural network models that treat text normalization for TTS as a sequence-to-sequence problem, in which the input is a text token in context and the output is the verbalization of that token. We find that the most effective model, in accuracy and efficiency, is one where the sentential context is computed once and the results of that computation are combined with the computation of each token in sequence to compute the verbalization. This model allows for a great deal of flexibility in representing the context, and also allows us to integrate tagging and segmentation into the process. These models perform very well overall, but occasionally they will predict wildly inappropriate verbalizations, such as reading 3 cm as three kilometers. Although rare, such verbalizations are a major issue for TTS applications. We thus use finite-state covering grammars to guide the neural models away from such “unrecoverable” errors, either during both training and decoding or during decoding only. Such grammars can largely be learned from data.
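
A toy illustration, not the paper's system, of how a covering grammar can constrain a neural normalizer at decoding time: the grammar enumerates acceptable verbalizations of a measure token, and any model hypothesis outside that set is rejected in favour of the best in-grammar hypothesis. The grammar here is a hand-written stub for two units; real covering grammars over-generate deliberately and, as the abstract notes, can largely be learned from data.

```python
def covering_grammar(token: str) -> set:
    """Acceptable verbalizations of a simple measure expression such as '3 cm'."""
    number, unit = token.split()
    units = {"cm": ["centimeter", "centimeters"], "km": ["kilometer", "kilometers"]}
    numbers = {"3": "three"}
    return {f"{numbers[number]} {u}" for u in units[unit]}

def constrained_decode(token: str, ranked_hypotheses: list) -> str:
    """Pick the highest-ranked model hypothesis that the grammar accepts."""
    allowed = covering_grammar(token)
    for hyp in ranked_hypotheses:          # hypotheses ordered by model score
        if hyp in allowed:
            return hyp
    return sorted(allowed)[0]              # fall back to any in-grammar reading

# The model wrongly prefers "three kilometers" for "3 cm"; the grammar steers
# decoding back to an acceptable verbalization.
print(constrained_decode("3 cm", ["three kilometers", "three centimeters"]))
```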


2021 ◽  
Author(s):  
Abul Hasan ◽  
Mark Levene ◽  
David Weston ◽  
Renate Fromson ◽  
Nicolas Koslover ◽  
...  

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources, in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and to be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report Macro- and Micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19, when the models are trained on human labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from the concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. We also highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.
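
A minimal sketch, with synthetic posts and assumed class names, of the final stage of such a pipeline: each post's extracted concepts are turned into a binary concept vector and a support vector machine assigns one of three triage categories. The concept strings and labels are illustrative placeholders, not the study's annotation scheme.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Hypothetical concept sets produced by the upstream CRF + rule-based steps.
posts_concepts = [
    {"cough", "fever:mild"},
    {"shortness_of_breath", "fever:high", "duration:5d"},
    {"loss_of_smell"},
    {"chest_pain", "shortness_of_breath"},
]
triage_labels = ["stay_home", "seek_advice", "stay_home", "emergency"]   # assumed classes

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(posts_concepts)      # one column per extracted concept

clf = SVC(kernel="linear").fit(X, triage_labels)
new_post = mlb.transform([{"fever:high", "shortness_of_breath"}])
print(clf.predict(new_post))
```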


2020 ◽  
Author(s):  
Christopher A Hane ◽  
Vijay S Nori ◽  
William H Crown ◽  
Darshak M Sanghavi ◽  
Paul Bleicher

BACKGROUND Clinical trials need efficient tools to assist in recruiting patients at risk of Alzheimer disease and related dementias (ADRD). Early detection can also assist patients with financial planning for long-term care. Clinical notes are an important, underutilized source of information in machine learning models because of the cost of collection and complexity of analysis. OBJECTIVE This study aimed to investigate the use of deidentified clinical notes from multiple hospital systems collected over 10 years to augment retrospective machine learning models of the risk of developing ADRD. METHODS We used 2 years of data to predict the future outcome of ADRD onset. Clinical notes are provided in a deidentified format with specific terms and sentiments. Terms in clinical notes are embedded into a 100-dimensional vector space to identify clusters of related terms and abbreviations that differ across hospital systems and individual clinicians. RESULTS When using clinical notes, the area under the curve (AUC) improved from 0.85 to 0.94, and positive predictive value (PPV) increased from 45.07% (25,245/56,018) to 68.32% (14,153/20,717) in the model at disease onset. Models with clinical notes improved in both AUC and PPV in years 3-6 when notes’ volume was largest; results are mixed in years 7 and 8 with the smallest cohorts. CONCLUSIONS Although clinical notes helped in the short term, the presence of ADRD symptomatic terms years earlier than onset adds evidence to other studies that clinicians undercode diagnoses of ADRD. De-identified clinical notes increase the accuracy of risk models. Clinical notes collected across multiple hospital systems via natural language processing can be merged using postprocessing techniques to aid model accuracy.
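
A minimal sketch, using toy sentences rather than the study's notes, of the term-embedding step described above: terms from deidentified notes are embedded into a 100-dimensional space and nearest neighbours surface related terms and abbreviations that vary across hospital systems and clinicians. Using gensim's word2vec here is an assumption for illustration, not the study's stated method.

```python
from gensim.models import Word2Vec

note_term_sequences = [
    ["memory", "loss", "mci", "follow", "up"],
    ["mild", "cognitive", "impairment", "mci", "noted"],
    ["dementia", "workup", "memory", "decline"],
]

model = Word2Vec(sentences=note_term_sequences, vector_size=100, window=5,
                 min_count=1, epochs=50, seed=7)

# Terms whose embeddings are closest to the abbreviation "mci".
print(model.wv.most_similar("mci", topn=3))
```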

