SimTyper: sound type inference for Ruby using type equality prediction

2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-27
Author(s):  
Milod Kazerounian ◽  
Jeffrey S. Foster ◽  
Bonan Min

Many researchers have explored type inference for dynamic languages. However, traditional type inference computes most general types which, for complex type systems—which are often needed to type dynamic languages—can be verbose, complex, and difficult to understand. In this paper, we introduce SimTyper, a Ruby type inference system that aims to infer usable types—specifically, nominal and generic types—that match the types programmers write. SimTyper builds on InferDL, a recent Ruby type inference system that soundly combines standard type inference with heuristics. The key novelty of SimTyper is type equality prediction, a new, machine learning-based technique that predicts when method arguments or returns are likely to have the same type. SimTyper finds pairs of positions that are predicted to have the same type, yet one has a verbose, overly general solution and the other has a usable solution. It then guesses that the two types are equal, keeping the guess if it is consistent with the rest of the program and discarding it if not. In this way, types inferred by SimTyper are guaranteed to be sound. To perform type equality prediction, we introduce the deep similarity (DeepSim) neural network. DeepSim is a novel machine learning classifier that follows the Siamese network architecture and uses CodeBERT, a pre-trained model, to embed source tokens into vectors that capture tokens and their contexts. DeepSim is trained on 100,000 pairs labeled with type similarity information extracted from 371 Ruby programs with manually documented, but not checked, types. We evaluated SimTyper on eight Ruby programs and found that, compared to standard type inference, SimTyper finds 69% more types that match programmer-written type information. Moreover, DeepSim can predict rare types that appear neither in the Ruby standard library nor in the training data. Our results show that type equality prediction can help type inference systems effectively produce more usable types.
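As a rough illustration of type equality prediction, the sketch below builds a Siamese classifier over CodeBERT embeddings in PyTorch. The class name TypePairClassifier, the pooled first-token embedding, and the small comparison head are illustrative assumptions, not details taken from the DeepSim paper.

```python
# Sketch of a Siamese type-equality classifier in the spirit of DeepSim.
# Assumes PyTorch and HuggingFace transformers; names and layer sizes are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TypePairClassifier(nn.Module):
    def __init__(self, encoder_name="microsoft/codebert-base"):
        super().__init__()
        # Shared (Siamese) encoder: both code positions are embedded by the same CodeBERT model.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Small head over the two embeddings and their element-wise difference.
        self.head = nn.Sequential(nn.Linear(hidden * 3, 256), nn.ReLU(), nn.Linear(256, 1))

    def embed(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]  # summary vector of the first token

    def forward(self, a_ids, a_mask, b_ids, b_mask):
        ea, eb = self.embed(a_ids, a_mask), self.embed(b_ids, b_mask)
        feats = torch.cat([ea, eb, torch.abs(ea - eb)], dim=-1)
        return self.head(feats).squeeze(-1)  # logit: same type vs. different type

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = TypePairClassifier()
a = tokenizer("def area(radius) ... end", return_tensors="pt")
b = tokenizer("def circumference(r) ... end", return_tensors="pt")
logit = model(a.input_ids, a.attention_mask, b.input_ids, b.attention_mask)
prob_same_type = torch.sigmoid(logit)  # probability the two returns share a type
```

In a pipeline like SimTyper's, such a probability would only trigger a guess that two positions share a type; the guess is kept or discarded depending on whether it stays consistent with the rest of the program, which is what preserves soundness.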

Literator ◽  
2008 ◽  
Vol 29 (1) ◽  
pp. 21-42 ◽  
Author(s):  
S. Pilon ◽  
M.J. Puttkammer ◽  
G.B. Van Huyssteen

The development of a hyphenator and compound analyser for Afrikaans

The development of two core-technologies for Afrikaans, viz. a hyphenator and a compound analyser, is described in this article. As no annotated Afrikaans data existed prior to this project to serve as training data for a machine learning classifier, the core-technologies in question were first developed using a rule-based approach. The rule-based hyphenator and compound analyser are evaluated: the hyphenator obtains an f-score of 90.84%, while the compound analyser only reaches an f-score of 78.20%. Since these results are somewhat disappointing and/or insufficient for practical implementation, it was decided that a machine learning technique (memory-based learning) would be used instead. Training data for each of the two core-technologies were then developed using “TurboAnnotate”, an interface designed to improve the accuracy and speed of manual annotation. The hyphenator developed using machine learning was trained with 39 943 words and reaches an f-score of 98.11%, while the f-score of the compound analyser is 90.57% after being trained with 77 589 annotated words. It is concluded that machine learning (specifically memory-based learning) seems an appropriate approach for developing core-technologies for Afrikaans.
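For readers unfamiliar with memory-based learning, the sketch below shows the general flavor of such a hyphenator: each character position is described by its surrounding character window and classified by a nearest-neighbour (memory-based) learner. It uses scikit-learn as a stand-in for TiMBL-style tooling, and the two training words and break positions are made-up placeholders, not the TurboAnnotate data.

```python
# Minimal sketch of a memory-based (k-NN) hyphenation classifier using scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def windows(word, size=3):
    """One feature dict per character position: the surrounding character window."""
    padded = "_" * size + word + "_" * size
    for i in range(len(word)):
        ctx = padded[i:i + 2 * size + 1]
        yield {f"c{j}": ch for j, ch in enumerate(ctx)}

def featurize(word, break_positions):
    # break_positions holds the character indices before which a hyphen may be inserted.
    X = list(windows(word))
    y = [int(i in break_positions) for i in range(len(word))]
    return X, y

# Tiny illustrative training set (placeholder data, not real annotations).
train = [("waterval", {5}), ("sonblom", {3})]
X, y = [], []
for word, breaks in train:
    xs, ys = featurize(word, breaks)
    X += xs
    y += ys

model = make_pipeline(DictVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(X, y)
# A 1 marks a position where a break is predicted before that character.
print(model.predict(list(windows("waterblom"))))
```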


2016 ◽  
Vol 42 (6) ◽  
pp. 782-797 ◽  
Author(s):  
Haifa K. Aldayel ◽  
Aqil M. Azmi

The fact that people freely express their opinions and ideas in no more than 140 characters makes Twitter one of the most prevalent social networking websites in the world. Because Twitter is popular in Saudi Arabia, we believe that tweets are a good source for capturing the public’s sentiment, especially since the country is in a fractious region. After reviewing the challenges and difficulties that Arabic tweets present – using Saudi Arabia as a basis – we propose our solution. A typical problem is the practice of tweeting in dialectal Arabic. Based on our observations, we recommend a hybrid approach that combines semantic orientation and machine learning techniques. In this approach, a lexicon-based classifier labels the training data, a time-consuming task often performed manually. The output of the lexical classifier is then used as training data for an SVM machine learning classifier. The experiments show that our hybrid approach improved the F-measure of the lexical classifier by 5.76% while the accuracy jumped by 16.41%, achieving an overall F-measure of 84% and accuracy of 84.01%, respectively.
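A minimal sketch of the hybrid pipeline is given below: a lexicon-based step assigns coarse labels to unlabeled tweets, and those automatic labels are used to train an SVM. The tiny lexicon and example tweets are placeholders, not the dialect lexicon or data set used in the paper.

```python
# Sketch of the hybrid semantic-orientation + SVM approach (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

POSITIVE = {"جميل", "رائع", "ممتاز"}   # placeholder positive lexicon entries
NEGATIVE = {"سيء", "مزعج"}             # placeholder negative lexicon entries

def lexical_label(tweet):
    """Semantic-orientation step: count lexicon hits and return a coarse label."""
    tokens = tweet.split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score > 0 else "neg" if score < 0 else None

unlabeled_tweets = ["الخدمة ممتاز جدا", "الجو سيء اليوم", "لا رأي"]
auto_labeled = [(t, lexical_label(t)) for t in unlabeled_tweets]
train = [(t, y) for t, y in auto_labeled if y is not None]  # drop undecided tweets

# Machine-learning step: the SVM is trained on the lexicon's output, not on manual labels.
texts, labels = zip(*train)
svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(list(texts), list(labels))
print(svm.predict(["الطقس سيء جدا"]))
```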


2020 ◽  
Author(s):  
Tim Henning ◽  
Benjamin Bergner ◽  
Christoph Lippert

Instance segmentation is a common task in quantitative cell analysis. While there are many machine learning approaches to this task, the training process typically requires a large amount of manually annotated data. We present HistoFlow, a software for annotation-efficient training of deep learning models for cell segmentation and analysis with an interactive user interface. It provides an assisted annotation tool to quickly draw and correct cell boundaries and use biomarkers as weak annotations. It also enables the user to create artificial training data to lower the labeling effort. We employ a universal U-Net neural network architecture that allows accurate instance segmentation and the classification of phenotypes in only a single pass of the network. Transfer learning is available through the user interface to adapt trained models to new tissue types. We demonstrate HistoFlow on fluorescence breast cancer images. The models trained using only artificial data perform comparably to those trained with time-consuming manual annotations. They outperform traditional cell segmentation algorithms and match state-of-the-art machine learning approaches. A user test shows that cells can be annotated six times faster than without the assistance of our annotation tool. Extending a segmentation model for classification of epithelial cells can be done using only 50 to 1500 annotations. Our results show that, contrary to previous assumptions, it is possible to interactively train a deep learning model in a matter of minutes without many manual annotations.
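The following is a compact, illustrative U-Net-style network in PyTorch with two output heads, one for segmentation and one for phenotype classes, to show how a single forward pass can serve both tasks. The depth, channel counts, and head design are assumptions, not the HistoFlow architecture.

```python
# Tiny two-headed U-Net sketch (illustrative, not the HistoFlow model).
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, seg_classes=2, phenotype_classes=3):
        super().__init__()
        self.enc1, self.enc2 = block(in_ch, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)
        # Two heads share the decoder: a segmentation mask and per-pixel phenotype classes.
        self.seg_head = nn.Conv2d(32, seg_classes, 1)
        self.cls_head = nn.Conv2d(32, phenotype_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.seg_head(d1), self.cls_head(d1)

model = TinyUNet()
seg_logits, cls_logits = model(torch.randn(1, 3, 128, 128))
print(seg_logits.shape, cls_logits.shape)  # both maps come from one forward pass
```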


2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Eunchul Yoon ◽  
Soonbum Kwon ◽  
Unil Yun ◽  
Sun-Yong Kim

In this paper, we propose a Doppler spread estimation approach based on machine learning for an OFDM system. We present a carefully designed neural network architecture to achieve good performance in a mixed-channel scenario in which channel characteristic variables such as the Rician K factor, azimuth angle of arrival (AOA) width, mean direction of azimuth AOA, and channel estimation errors are randomly generated. When preprocessing the channel state information (CSI) collected under the mixed-channel scenario, we propose the averaged power spectral density (PSD) sequence as high-quality training data for machine learning-based Doppler spread estimation. We detail the intermediate mathematical derivations of the machine learning process, making it easy to graft the derived results onto other wireless communication technologies. Through simulation, we show that the machine learning approach using the averaged PSD sequence as training data outperforms a machine learning approach that uses the channel frequency response (CFR) sequence as training data, as well as two other existing Doppler estimation approaches.
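To make the preprocessing idea concrete, the sketch below computes a per-subcarrier PSD of a synthetic CSI time series, averages it across subcarriers, and regresses the Doppler spread on that averaged PSD sequence with a small scikit-learn MLP. The toy channel model, dimensions, and regressor are assumptions for illustration and are much simpler than the paper's setup.

```python
# Illustrative averaged-PSD preprocessing for Doppler spread regression.
import numpy as np
from scipy.signal import periodogram
from sklearn.neural_network import MLPRegressor

def averaged_psd(csi, fs):
    """csi: complex array of shape (num_ofdm_symbols, num_subcarriers)."""
    # PSD of each subcarrier's time evolution, then average across subcarriers.
    _, psd = periodogram(csi, fs=fs, axis=0, return_onesided=False)
    return psd.mean(axis=1)

rng = np.random.default_rng(0)
fs, n_sym, n_sc = 1000.0, 128, 64          # symbol rate (Hz), symbols, subcarriers
X, y = [], []
for _ in range(200):
    fd = rng.uniform(10, 200)              # "true" Doppler spread (Hz) for this sample
    t = np.arange(n_sym)[:, None] / fs
    # Toy time-varying channel: per-subcarrier phase rotations up to fd, plus noise.
    phase = 2 * np.pi * fd * t * rng.uniform(0.2, 1.0, size=(1, n_sc))
    noise = 0.05 * (rng.standard_normal((n_sym, n_sc)) + 1j * rng.standard_normal((n_sym, n_sc)))
    X.append(averaged_psd(np.exp(1j * phase) + noise, fs))
    y.append(fd)

model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
model.fit(np.array(X), np.array(y))
print(model.predict(np.array(X[:3])))      # estimated Doppler spreads (Hz)
```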


Author(s):  
Sharareh Bayat ◽  
Mohammad Mohseni ◽  
Delaram Behnami ◽  
Purang Abolmaesumi

Abstract Simulation tools improve various aspects of the additive manufacturing process; however, they come with undesirable computational cost for real-world applications. Finite element analysis (FEA), which solves partial differential equations (PDEs), shows promising capabilities on simple additively manufactured components as an expository problem. Yet PDE-based solutions take significantly long CPU times due to the large number of timesteps required to simulate an additively manufactured part. With modern machine learning (ML) capabilities, a new shift towards the integration of FEA and ML has been introduced, where ML algorithms emulate the behavior of the time-consuming PDE solver for real-time analysis of PDEs in a given application. In this paper, we present a deep learning (DL) model that can substitute for the thermal analysis of the additive manufacturing process. The training data are obtained by sampling the established physical model's behavior over different temperatures, cooling rates, and part geometries. The network architecture is composed of a Long Short-Term Memory (LSTM) network that models the temporal sequence of deposition temperatures derived from the PDEs. The reported R2 value on validation data is 97%, while the Mean Absolute Error (MAE) is 0.04. This paper compares the performance of the PDE and DL forecasts for the thermal results. We show that DL models are promising for simulation of the additive manufacturing process and can be reliable alternatives to computationally expensive FEM tools.
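A minimal sketch of such an LSTM surrogate in PyTorch is shown below, together with the MAE and R2 metrics mentioned above. The input features, sequence length, and hidden size are placeholders rather than the architecture used in the paper, and the targets stand in for FEA-computed temperatures.

```python
# Sketch of an LSTM surrogate for the thermal history of a deposited part.
import torch
import torch.nn as nn

class ThermalLSTM(nn.Module):
    def __init__(self, in_features=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)      # temperature at each timestep

    def forward(self, x):                    # x: (batch, timesteps, in_features)
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)       # (batch, timesteps)

model = ThermalLSTM()
x = torch.randn(8, 50, 4)                    # e.g. power, speed, cooling rate, geometry code
y_true = torch.randn(8, 50)                  # placeholder for FEA-computed temperatures
y_pred = model(x)

mae = torch.mean(torch.abs(y_pred - y_true))
ss_res = torch.sum((y_true - y_pred) ** 2)
ss_tot = torch.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"MAE={mae.item():.3f}, R2={r2.item():.3f}")
```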


2017 ◽  
Vol 6 (1) ◽  
pp. 7-18
Author(s):  
Md. Hasnat Parvez ◽  
Most. Moriom Khatun ◽  
Sayed Mohsin Reza ◽  
Md. Mahfujur Rahman ◽  
Md. Fazlul Karim Patwary

Bangladesh is one of the most promising developing countries in the IT sector, and people from many disciplines and levels of experience are involved in it. However, no direct analysis of this sector that provides proper guidelines for predicting future IT personnel has been published yet. The problem has no simple solution: training data from the real IT sector are needed, and several classifiers must be trained to obtain reliable results. Machine learning algorithms can be used to predict future potential IT personnel. In this paper, four classifiers, namely Naive Bayes, J48, Bagging, and Random Forest, are evaluated using five different folds for that prediction. The results indicate that Random Forest achieves better accuracy than the other experimented classifiers for future IT personnel prediction. Standard evaluation measures, namely precision, recall, F-measure, and ROC area, are used to evaluate the results.
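A sketch of this kind of comparison with scikit-learn is shown below, using a C4.5-style decision tree as the closest analogue of Weka's J48 and synthetic data in place of the real survey data.

```python
# Comparing Naive Bayes, a J48-style tree, Bagging, and Random Forest with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)  # placeholder data

classifiers = {
    "Naive Bayes": GaussianNB(),
    "J48-style tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F-measure = {scores.mean():.3f}")
```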


SLEEP ◽  
2019 ◽  
Vol 43 (5) ◽  
Author(s):  
Ioannis Exarchos ◽  
Anna A Rogers ◽  
Lauren M Aiani ◽  
Robert E Gross ◽  
Gari D Clifford ◽  
...  

Abstract Despite commercial availability of software to facilitate sleep–wake scoring of electroencephalography (EEG) and electromyography (EMG) in animals, automated scoring of rodent models of abnormal sleep, such as narcolepsy with cataplexy, has remained elusive. We optimize two machine-learning approaches, supervised and unsupervised, for automated scoring of behavioral states in orexin/ataxin-3 transgenic mice, a validated model of narcolepsy type 1, and additionally test them on wild-type mice. The supervised learning approach uses previously labeled data to facilitate training of a classifier for sleep states, whereas the unsupervised approach aims to discover latent structure and similarities in unlabeled data from which sleep stages are inferred. For the supervised approach, we employ a deep convolutional neural network architecture that is trained on expert-labeled segments of wake, non-REM sleep, and REM sleep in EEG/EMG time series data. The resulting trained classifier is then used to infer on the labels of previously unseen data. For the unsupervised approach, we leverage data dimensionality reduction and clustering techniques. Both approaches successfully score EEG/EMG data, achieving mean accuracies of 95% and 91%, respectively, in narcoleptic mice, and accuracies of 93% and 89%, respectively, in wild-type mice. Notably, the supervised approach generalized well on previously unseen data from the same animals on which it was trained but exhibited lower performance on animals not present in the training data due to inter-subject variability. Cataplexy is scored with a sensitivity of 85% and 57% using the supervised and unsupervised approaches, respectively, when compared to manual scoring, and the specificity exceeds 99% in both cases.
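As an illustration of the supervised approach, the sketch below defines a small 1D convolutional classifier over fixed-length EEG/EMG epochs with three output stages (wake, non-REM, REM). The epoch length, channel count, and layer sizes are placeholders, not the architecture used in the study.

```python
# Minimal 1D CNN sketch for scoring EEG/EMG epochs into wake / NREM / REM.
import torch
import torch.nn as nn

class SleepScorer(nn.Module):
    def __init__(self, channels=2, classes=3):          # e.g. one EEG and one EMG channel
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(channels, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.classifier = nn.Linear(32, classes)

    def forward(self, x):                                # x: (batch, channels, samples)
        return self.classifier(self.features(x).squeeze(-1))

model = SleepScorer()
epochs = torch.randn(4, 2, 1000)                         # four placeholder EEG/EMG epochs
print(model(epochs).argmax(dim=1))                       # predicted stage per epoch
```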


Author(s):  
Hyunggu Jun ◽  
Yongchae Cho

Summary In an ideal case, the time-lapse differences in 4D seismic data should reflect only the changes in subsurface geology. In practice, however, undesirable discrepancies are generated for various reasons. Therefore, proper time-lapse processing techniques are required to improve the repeatability of time-lapse seismic data and to capture accurate seismic information for analyzing target changes. In this study, we propose a machine learning-based time-lapse seismic data processing method that improves repeatability. A training data construction method, a training strategy, and a machine learning network architecture based on a convolutional autoencoder are proposed. Uniform manifold approximation and projection (UMAP) is applied to the training and target data to analyze the features corresponding to each data point. When the feature distribution of the training data differs from that of the target data, we apply data augmentation to enhance the diversity of the training data. The method is verified through numerical experiments using both synthetic and field time-lapse seismic data, and the results are analyzed with several methods, including a comparison of repeatability metrics. From the results of the numerical experiments, we conclude that the proposed convolutional autoencoder can enhance the repeatability of time-lapse seismic data and increase the accuracy of observed variations in seismic signals generated by target changes.
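The sketch below shows one plausible shape of such a convolutional autoencoder in PyTorch, here trained to map a monitor trace toward its matched baseline trace so that non-repeatable differences are suppressed. The trace length, layer sizes, and this particular training target are assumptions for illustration, not the published setup.

```python
# Illustrative 1D convolutional autoencoder for time-lapse trace matching.
import torch
import torch.nn as nn

class SeismicCAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, 9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, 9, stride=2, padding=4), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, 8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, 8, stride=2, padding=3))

    def forward(self, x):                       # x: (batch, 1, samples)
        return self.decoder(self.encoder(x))

model = SeismicCAE()
monitor = torch.randn(16, 1, 512)               # placeholder monitor-survey traces
baseline = torch.randn(16, 1, 512)              # placeholder matched baseline traces
loss = nn.functional.mse_loss(model(monitor), baseline)
loss.backward()                                 # one training step of the autoencoder
print(loss.item())
```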


Geophysics ◽  
2020 ◽  
Vol 85 (4) ◽  
pp. WA269-WA277
Author(s):  
Xudong Duan ◽  
Jie Zhang

Picking the first breaks from seismic data is often a challenging problem and still requires significant human effort. We have developed an iterative process that applies a traditional automated seismic picking method to obtain preliminary first breaks and then uses a machine learning (ML) method to identify, remove, and fix poor picks based on a multitrace analysis. The ML method involves constructing a convolutional neural network architecture to help identify poor picks across multiple traces and eliminate them. We then fill in the picks on empty traces with the help of the trained model. To make the training samples applicable to various regions and different data sets, we apply moveout correction with the preliminary picks and handle the picks in the flattened input. We collect 11,239,800 labeled seismic traces. During the training process, the model’s classification accuracy on the training and validation data sets reaches 98.2% and 97.3%, respectively. We also evaluate the precision and recall rate, both of which exceed 94%. For prediction, the results on 2D and 3D data sets that differ from the training data sets are used to demonstrate the feasibility of our method.
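The sketch below illustrates the two key ingredients: flattening traces around their preliminary picks so that picks align across traces, and a small CNN that classifies the pick at the centre of a multi-trace window as good or poor. The window size, network, and random data are illustrative assumptions only.

```python
# Moveout-style flattening around preliminary picks plus a small pick-QC CNN (illustrative).
import numpy as np
import torch
import torch.nn as nn

def flatten_by_picks(gather, picks, window=64):
    """gather: (n_traces, n_samples); picks: preliminary first-break sample per trace.
    Returns a (n_traces, window) array centred on each preliminary pick."""
    half = window // 2
    padded = np.pad(gather, ((0, 0), (half, half)))
    return np.stack([padded[i, p:p + window] for i, p in enumerate(picks)])

class PickQC(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))

    def forward(self, x):            # x: (batch, 1, n_traces, window)
        return self.net(x)           # logits: [good pick, poor pick]

gather = np.random.randn(7, 1000).astype(np.float32)    # placeholder gather
picks = np.random.randint(100, 900, size=7)              # placeholder preliminary picks
flat = flatten_by_picks(gather, picks)                    # picks now aligned across traces
x = torch.from_numpy(flat)[None, None]                    # (1, 1, 7, 64)
print(PickQC()(x).softmax(dim=1))                         # probability of good vs. poor pick
```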


2021 ◽  
Author(s):  
Manesh Chawla ◽  
Amreek Singh

Abstract. Snow avalanches pose a serious hazard to people and property in snow-bound mountains. Snow mass sliding downslope can gain sufficient momentum to destroy buildings, uproot trees, and kill people. Forecasting, and in turn avoiding exposure to, avalanches is a widely practiced measure to mitigate hazard the world over. However, sufficient snow stability data for accurate forecasting are generally difficult to collect. Hence, forecasters infer snow stability largely through intuitive reasoning based upon their knowledge of local weather and terrain and the sparsely available snowpack observations. Machine learning models may add more objectivity to this intuitive inference process. In this paper we propose a data-efficient machine learning classifier based on the Random Forest technique. The model can be trained with significantly less training data than other avalanche forecasting models, and it generates useful outputs to minimise and quantify uncertainty. In addition, the model generates intricate reasoning descriptions that are difficult to derive manually. Furthermore, the model's data requirements can be met through automatic systems. The proposed model advances the field by being inexpensive and convenient for operational use, owing to its data efficiency, its ability to describe its decisions, and its potential to lend autonomy to the forecasting process.
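As a minimal illustration, the sketch below trains a scikit-learn Random Forest on synthetic weather and terrain features and uses the ensemble's class-vote fractions and feature importances as simple stand-ins for the uncertainty estimates and reasoning descriptions discussed above; all feature names and data are placeholders.

```python
# Illustrative Random Forest avalanche-day classifier with a simple uncertainty readout.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Placeholder features: [new snow (cm), wind speed (m/s), air temperature (°C), slope angle (°)]
X = rng.normal(loc=[20, 8, -5, 35], scale=[10, 4, 5, 8], size=(300, 4))
y = (X[:, 0] + X[:, 1] > 30).astype(int)     # toy rule standing in for observed avalanche days

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

day = np.array([[35.0, 12.0, -3.0, 38.0]])
proba = model.predict_proba(day)[0]
print(f"P(avalanche) = {proba[1]:.2f}")      # vote fraction as a crude uncertainty measure
# Feature importances give a coarse 'reasoning description' of what drives the forecast.
print(dict(zip(["new_snow", "wind", "temp", "slope"], model.feature_importances_)))
```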

