From Tax Compliance in Natural Language to Executable Calculations: Combining Lexical-grammar-based Parsing and Machine Learning

Author(s):  
Esme Manandise ◽  
Conrad De Peuter ◽  
Saikat Mukherjee

Regulatory agencies publish tax-compliance content written in natural language intended for human consumption. There has been very little work on automated methods for interpreting this content and for generating executable calculations from it. In this paper, we describe a combination of lexical grammar-based parsing with encoder-decoder architectures for automatically bootstrapping executable calculations from natural language. The combination is particularly suitable for domains such as compliance where training data is scarce and accuracy of interpretation is of high importance. We provide an overview of the implementation for North American income-tax forms.
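
The abstract does not spell out the authors' grammar or encoder-decoder models, but the core idea of compiling a form instruction into an executable calculation can be sketched. In the minimal, hypothetical Python illustration below, a toy lexicon maps instruction verbs to arithmetic operations and a regular expression stands in for the lexical grammar that resolves line references; every name and the lexicon itself are invented for illustration, not the paper's implementation.

```python
import re

# Toy lexicon mapping instruction verbs to operations (invented; the
# paper's lexical grammar and neural components are far richer).
OPS = {
    "add": lambda vals: sum(vals),
    "subtract": lambda vals: vals[1] - vals[0],  # "subtract X from Y" = Y - X
}

def compile_instruction(text):
    """Turn one natural-language form instruction into an executable callable."""
    verb = text.split()[0].lower()
    refs = [int(n) for n in re.findall(r"lines?\s+(\d+)", text, re.IGNORECASE)]
    op = OPS[verb]
    return lambda form: op([form[n] for n in refs])

calc = compile_instruction("Subtract line 3 from line 2")
print(calc({2: 5000.0, 3: 1200.0}))  # -> 3800.0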

Author(s):  
Shishir K. Shandilya ◽  
Suresh Jain

The explosive increase in Internet usage has spurred technologies for automatically mining user-generated content (UGC) from Web documents. These UGC-rich resources raise new opportunities and challenges for opinion extraction and opinion summarization. Opinion extraction allows users to retrieve and analyze people’s opinions scattered across Web documents, while opinion mining aims at understanding, extracting, and classifying consumer opinions about products from the unstructured text of online resources. Search engines perform well when one wants to learn about a product before purchase, but filtering and analyzing the search results is often complex and time-consuming. This has generated a need for intelligent technologies that can process unstructured online text documents through automatic classification, concept recognition, text summarization, and related techniques. Such tools build on traditional natural language techniques, statistical analysis, and machine learning. Automatic knowledge extraction over large text collections such as the Internet remains challenging because of constraints such as the need for large annotated training data, extensive manual preprocessing, and a profusion of domain-specific terms. Ambient Intelligence (AmI) in web-enabled technologies supports and promotes intelligent e-commerce services, enabling personalized, self-configurable, and intuitive applications that turn UGC knowledge into buying confidence. In this chapter, we discuss various approaches to opinion mining that combine Ambient Intelligence, natural language processing, and machine learning methods based on textual and grammatical clues.
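
As a concrete, heavily simplified illustration of the lexical-plus-machine-learning pipeline the chapter surveys, the sketch below trains a toy review classifier with scikit-learn. The four reviews are invented, and TF-IDF over unigrams and bigrams stands in loosely for the "textual and grammatical clues" the chapter discusses; a real system would mine its corpus from Web documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus of user-generated reviews.
reviews = [
    "The battery life is excellent and the screen is sharp",
    "Terrible camera, the photos are blurry and dark",
    "I love this phone, very fast and reliable",
    "Awful build quality, it broke after a week",
]
labels = ["positive", "negative", "positive", "negative"]

# Unigram + bigram TF-IDF as a crude proxy for lexical/grammatical clues,
# with logistic regression as the machine-learning layer.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reviews, labels)
print(model.predict(["the screen is excellent"]))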


2021 ◽  
Vol 15 ◽  
Author(s):  
Nora Hollenstein ◽  
Cedric Renggli ◽  
Benjamin Glaus ◽  
Maria Barrett ◽  
Marius Troendle ◽  
...  

Until recently, human behavioral data from reading has mainly been of interest to researchers seeking to understand human cognition. However, these human language processing signals can also benefit machine learning-based natural language processing tasks. Using EEG brain activity for this purpose remains largely unexplored. In this paper, we present the first large-scale study systematically analyzing the potential of EEG brain activity data for improving natural language processing tasks, with a special focus on which features of the signal are most beneficial. We present a multi-modal machine learning architecture that learns jointly from textual input as well as from EEG features. We find that filtering the EEG signals into frequency bands is more beneficial than using the broadband signal. Moreover, for a range of word embedding types, EEG data improves binary and ternary sentiment classification and outperforms multiple baselines. For more complex tasks such as relation detection, only the contextualized BERT embeddings outperform the baselines in our experiments, which calls for further research. Finally, EEG data proves particularly promising when limited training data is available.
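
The abstract does not specify the architecture, but the fusion idea can be sketched. Below is a minimal PyTorch stand-in: one branch encodes word embeddings, another encodes band-filtered EEG features, and the two are concatenated before a ternary sentiment head. All dimensions are invented, and the paper's actual encoders are not reproduced.

```python
import torch
import torch.nn as nn

class TextEEGClassifier(nn.Module):
    """Sketch of a multi-modal model: a text branch and an EEG branch
    fused by concatenation. Dimensions are illustrative assumptions."""
    def __init__(self, text_dim=300, eeg_dim=40, hidden=64, n_classes=3):
        super().__init__()
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.eeg_branch = nn.Sequential(nn.Linear(eeg_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)  # ternary sentiment

    def forward(self, text_feats, eeg_feats):
        fused = torch.cat([self.text_branch(text_feats),
                           self.eeg_branch(eeg_feats)], dim=-1)
        return self.head(fused)

model = TextEEGClassifier()
logits = model(torch.randn(8, 300), torch.randn(8, 40))  # batch of 8 sentences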


Author(s):  
Dercilio Junior Verly Lopes ◽  
Gabrielly dos Santos Bobadilha ◽  
Greg W. Burgreen ◽  
Edward D. Entsminger

This manuscript reports the feasibility of a sequential convolutional neural network (CNN) machine-learning model that correctly identifies eleven (11) North American softwood species from 14x magnified macroscopic end-grain images. The convolutional network used a large kernel size, max pooling layers, and leaky rectified linear units to accelerate training. To reduce overfitting of training data, we employed L2 regularization, custom initialization, and stratified 5-fold cross-validation. The database consisted of 1,789 wood end-grain images; the training set contained 1,431 images and the validation set approximately 358. In both sets, the input image size was 227 pixels x 227 pixels. Data augmentation was performed on the fly by flipping, rotating, and zooming the images. We evaluated the CNN in terms of precision, sensitivity, specificity, F1-score, and adjusted accuracy. The adjusted accuracy for the entire model was 94.0%. Confusion matrices indicated the lowest performance in correctly classifying Ponderosa pine and the Eastern spruce group, with an average sensitivity of 89.0% each. Even though high validation accuracy (>94.0%) was achieved, we conclude that a much larger dataset is needed to obtain industrially accurate identification of softwoods, mainly because of their visual and macroscopic similarities.
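
A rough Keras sketch of a sequential CNN in the spirit described is shown below, with a large first kernel, max pooling, leaky ReLUs, and L2 regularization on a 227 x 227 input. The layer counts, filter sizes, and RGB input are assumptions; the authors' custom initialization and exact topology are not reproduced.

```python
from tensorflow.keras import layers, models, regularizers

def build_model(n_species=11, l2=1e-4):
    # Assumed topology: large 11x11 first kernel, two conv/pool stages,
    # leaky ReLU activations, and L2 weight decay throughout.
    return models.Sequential([
        layers.Input(shape=(227, 227, 3)),  # RGB input assumed
        layers.Conv2D(32, 11, strides=4, kernel_regularizer=regularizers.l2(l2)),
        layers.LeakyReLU(),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(64, 5, padding="same", kernel_regularizer=regularizers.l2(l2)),
        layers.LeakyReLU(),
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(128, kernel_regularizer=regularizers.l2(l2)),
        layers.LeakyReLU(),
        layers.Dense(n_species, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])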


Author(s):  
Shikha Singhal ◽  
Bharat Hegde ◽  
Prathamesh Karmalkar ◽  
Justna Muhith ◽  
Harsha Gurulingappa

With the growing volume of unstructured data in the healthcare and pharmaceutical industries, there has been a drastic adoption of natural language processing for generating actionable insights from text data sources. One of the key areas of our exploration is the Medical Information function within our organization. We receive a significant number of medical information inquiries in the form of unstructured text. An enterprise-level solution must deal with medical information interactions via multiple communication channels, which are always nuanced with a variety of keywords and emotions unique to the pharmaceutical industry. There is a strong need for an effective solution that leverages the contextual knowledge of the medical information business along with the digital tenets of natural language processing (NLP) and machine learning to build an automated and scalable process that generates real-time insights on conversation categories. Traditional supervised learning methods rely on a huge set of manually labeled training data, and such a dataset is difficult to attain due to high labeling costs. Thus, the solution is incomplete without the ability to self-learn and improve. This necessitates techniques that automatically build relevant training data using a weakly supervised approach from textual inquiries across consumers, healthcare professionals, sales, and service providers. The solution has two fundamental layers of NLP and machine learning. The first layer leverages heuristics and a knowledgebase to identify potential categories and build annotated training data. The second layer, based on machine learning and deep learning, utilizes the training data generated by the heuristic approach to identify the categories and sub-categories associated with each verbatim. Here, we present a novel approach harnessing the power of weakly supervised learning combined with multi-class classification for improved categorization of medical information inquiries.
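
The two-layer design can be sketched as follows. In this minimal, hypothetical Python example, a keyword rulebase stands in for the heuristics-and-knowledgebase layer that weakly labels inquiries, and a scikit-learn classifier plays the role of the second, supervised layer; the rules, categories, and inquiries are all invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Layer 1 (hypothetical heuristics): keyword rules weakly label inquiries.
RULES = {
    "adverse_event": ["side effect", "reaction", "nausea"],
    "dosage": ["dose", "mg", "how often"],
}

def weak_label(text):
    text = text.lower()
    for category, cues in RULES.items():
        if any(cue in text for cue in cues):
            return category
    return None  # abstain when no rule fires

inquiries = [
    "What dose of the tablet should I take daily?",
    "My patient reported nausea after the infusion",
    "Is 50 mg safe with food?",
    "Are skin reactions a known side effect?",
]
labeled = [(t, weak_label(t)) for t in inquiries if weak_label(t)]

# Layer 2: train a supervised classifier on the weakly labeled data.
texts, cats = zip(*labeled)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(list(texts), list(cats))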


2021 ◽  
Vol 50 (7) ◽  
pp. 2059-2077
Author(s):  
Raja Azhan Syah Raja Wahab ◽  
Azuraliza Abu Bakar

The field of digital-economy income-tax compliance is still in its infancy. The limited collection of income taxes has forced the Inland Revenue Board of Malaysia (IRBM) to develop a solution to improve the tax compliance of the digital-economy sector, so that taxpayers report income voluntarily or firm action can be taken. The ability to diagnose taxpayer compliance will help the IRBM collect income tax effectively and deliver revenue to the country. However, extracting the necessary knowledge from a large amount of data is challenging, which motivates a predictive model for detecting taxpayers' compliance levels. This paper proposes descriptive and predictive analytics models for predicting digital-economy income-tax compliance in Malaysia. We first conduct descriptive analytics to explore the data and summarize it for an initial understanding; histograms of the data distribution show that the extracted information gives a clear picture of the factors influencing the classification of digital-economy tax compliance. In predictive modeling, single and ensemble approaches are employed to find the best model and the most important factors contributing to non-compliance among digital-economy retailers. Seven single classifier algorithms were validated on the training data, and three ensemble-based improvements were established, namely wrapper, boosting, and voting methods, together with two parameter-tuning techniques, grid search and evolutionary search. The experimental results show that the ensemble methods improve on the accuracy of the single classification models, with the highest classification accuracy of 87.94% compared to the best single classification model. A knowledge-analysis phase then learns meaningful features and hidden knowledge, categorizing the taxpayer contexts that could potentially influence the degree of tax compliance in the digital economy. Overall, this collection of information has the potential to help stakeholders make future decisions on the tax compliance of the digital economy.
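
To make the ensemble step concrete, the sketch below uses scikit-learn to combine voting and boosting with a grid search over member hyperparameters. The synthetic data stands in for the confidential taxpayer records, and the member classifiers and grids are illustrative rather than the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the taxpayer dataset (features and labels invented).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Soft-voting ensemble over three illustrative single classifiers,
# with gradient boosting as the boosting member.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier()),
        ("logit", LogisticRegression(max_iter=1000)),
        ("boost", GradientBoostingClassifier()),
    ],
    voting="soft",
)

# Grid search over one hyperparameter per member, a minimal analogue
# of the paper's grid-search tuning step.
grid = GridSearchCV(ensemble, {"tree__max_depth": [3, 5],
                               "boost__n_estimators": [50, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_score_)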


2019 ◽  
Author(s):  
Andrew Medford ◽  
Shengchun Yang ◽  
Fuzhu Liu

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx, and OHx species on the oxygen-vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N^1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.
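
The active-learning loop at the heart of the framework can be sketched with scikit-learn's Gaussian process regressor: train on a small seed set, then repeatedly add the pool candidate with the largest predictive uncertainty. In the sketch below, a toy 1-D target stands in for DFT adsorption energies; the paper's fingerprint features, GP implementation, and stopping criteria are not reproduced.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# Toy candidate pool: a noisy 1-D function standing in for DFT energies.
X_pool = rng.uniform(0, 10, size=(200, 1))
y_pool = np.sin(X_pool).ravel() + 0.1 * rng.standard_normal(200)

train_idx = list(rng.choice(200, size=5, replace=False))  # small seed set
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)

for _ in range(20):  # active-learning iterations
    gp.fit(X_pool[train_idx], y_pool[train_idx])
    _, std = gp.predict(X_pool, return_std=True)
    std[train_idx] = -np.inf               # never re-select known points
    train_idx.append(int(np.argmax(std)))  # query the most uncertain candidate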


2018 ◽  
Vol 6 (2) ◽  
pp. 283-286
Author(s):  
M. Samba Siva Rao ◽  
M. Yaswanth ◽  
K. Raghavendra Swamy ◽  
...  
