From Tax Compliance in Natural Language to Executable Calculations: Combining Lexical-grammar-based Parsing and Machine Learning

Author(s):  
Esme Manandise ◽  
Conrad De Peuter ◽  
Saikat Mukherjee

Regulatory agencies publish tax-compliance content written in natural language intended for human consumption. There has been very little work on automated methods for interpreting this content and for generating executable calculations from it. In this paper, we describe a combination of lexical grammar-based parsing with encoder-decoder architectures for automatically bootstrapping executable calculations from natural language. The combination is particularly suitable for domains such as compliance where training data is scarce and accuracy of interpretation is of high importance. We provide an overview of the implementation for North American income-tax forms.
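
The abstract does not spell out the authors' grammar or encoder-decoder models, but the core idea of compiling a form instruction into an executable calculation can be sketched. In the minimal, hypothetical Python illustration below, a toy lexicon maps instruction verbs to arithmetic operations and a regular expression stands in for the lexical grammar that resolves line references; every name and the lexicon itself are invented for illustration, not the paper's implementation.

```python
import re

# Toy lexicon mapping instruction verbs to operations (invented; the
# paper's lexical grammar and neural components are far richer).
OPS = {
    "add": lambda vals: sum(vals),
    "subtract": lambda vals: vals[1] - vals[0],  # "subtract X from Y" = Y - X
}

def compile_instruction(text):
    """Turn one natural-language form instruction into an executable callable."""
    verb = text.split()[0].lower()
    refs = [int(n) for n in re.findall(r"lines?\s+(\d+)", text, re.IGNORECASE)]
    op = OPS[verb]
    return lambda form: op([form[n] for n in refs])

calc = compile_instruction("Subtract line 3 from line 2")
print(calc({2: 5000.0, 3: 1200.0}))  # -> 3800.0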

Author(s):  
Shishir K. Shandilya ◽  
Suresh Jain

The explosive increase in Internet usage has spurred technologies for automatically mining user-generated content (UGC) from Web documents. These UGC-rich resources raise new opportunities and challenges for opinion extraction and opinion summarization. Opinion extraction allows users to retrieve and analyze people’s opinions scattered across Web documents, while opinion mining aims at understanding, extracting, and classifying consumer opinions about products from the unstructured text of online resources. Search engines perform well when one wants to learn about a product before purchase, but filtering and analyzing the search results is often complex and time-consuming. This has generated a need for intelligent technologies that can process unstructured online text documents through automatic classification, concept recognition, text summarization, and related techniques. Such tools build on traditional natural language techniques, statistical analysis, and machine learning. Automatic knowledge extraction over large text collections such as the Internet remains challenging because of constraints such as the need for large annotated training data, extensive manual preprocessing, and a profusion of domain-specific terms. Ambient Intelligence (AmI) in web-enabled technologies supports and promotes intelligent e-commerce services, enabling personalized, self-configurable, and intuitive applications that turn UGC knowledge into buying confidence. In this chapter, we discuss various approaches to opinion mining that combine Ambient Intelligence, natural language processing, and machine learning methods based on textual and grammatical clues.
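
As a concrete, heavily simplified illustration of the lexical-plus-machine-learning pipeline the chapter surveys, the sketch below trains a toy review classifier with scikit-learn. The four reviews are invented, and TF-IDF over unigrams and bigrams stands in loosely for the "textual and grammatical clues" the chapter discusses; a real system would mine its corpus from Web documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus of user-generated reviews.
reviews = [
    "The battery life is excellent and the screen is sharp",
    "Terrible camera, the photos are blurry and dark",
    "I love this phone, very fast and reliable",
    "Awful build quality, it broke after a week",
]
labels = ["positive", "negative", "positive", "negative"]

# Unigram + bigram TF-IDF as a crude proxy for lexical/grammatical clues,
# with logistic regression as the machine-learning layer.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reviews, labels)
print(model.predict(["the screen is excellent"]))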


2021 ◽  
Vol 15 ◽  
Author(s):  
Nora Hollenstein ◽  
Cedric Renggli ◽  
Benjamin Glaus ◽  
Maria Barrett ◽  
Marius Troendle ◽  
...  

Until recently, human behavioral data from reading has mainly been of interest to researchers seeking to understand human cognition. However, these human language processing signals can also benefit machine learning-based natural language processing tasks. Using EEG brain activity for this purpose remains largely unexplored. In this paper, we present the first large-scale study systematically analyzing the potential of EEG brain activity data for improving natural language processing tasks, with a special focus on which features of the signal are most beneficial. We present a multi-modal machine learning architecture that learns jointly from textual input as well as from EEG features. We find that filtering the EEG signals into frequency bands is more beneficial than using the broadband signal. Moreover, for a range of word embedding types, EEG data improves binary and ternary sentiment classification and outperforms multiple baselines. For more complex tasks such as relation detection, only the contextualized BERT embeddings outperform the baselines in our experiments, which calls for further research. Finally, EEG data proves particularly promising when limited training data is available.
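
The abstract does not specify the architecture, but the fusion idea can be sketched. Below is a minimal PyTorch stand-in: one branch encodes word embeddings, another encodes band-filtered EEG features, and the two are concatenated before a ternary sentiment head. All dimensions are invented, and the paper's actual encoders are not reproduced.

```python
import torch
import torch.nn as nn

class TextEEGClassifier(nn.Module):
    """Sketch of a multi-modal model: a text branch and an EEG branch
    fused by concatenation. Dimensions are illustrative assumptions."""
    def __init__(self, text_dim=300, eeg_dim=40, hidden=64, n_classes=3):
        super().__init__()
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.eeg_branch = nn.Sequential(nn.Linear(eeg_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)  # ternary sentiment

    def forward(self, text_feats, eeg_feats):
        fused = torch.cat([self.text_branch(text_feats),
                           self.eeg_branch(eeg_feats)], dim=-1)
        return self.head(fused)

model = TextEEGClassifier()
logits = model(torch.randn(8, 300), torch.randn(8, 40))  # batch of 8 sentences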


Author(s):  
Dercilio Junior Verly Lopes ◽  
Gabrielly dos Santos Bobadilha ◽  
Greg W. Burgreen ◽  
Edward D. Entsminger

This manuscript reports the feasibility of a sequential convolutional neural network (CNN) machine-learning model that correctly identifies eleven (11) North American softwood species from 14x magnified macroscopic end-grain images. The convolutional network used a large kernel size, max pooling layers, and leaky rectified linear units to accelerate training. To reduce overfitting of training data, we employed L2 regularization, custom initialization, and stratified 5-fold cross-validation. The database consisted of 1,789 wood end-grain images; the training set contained 1,431 images and the validation set approximately 358. In both sets, the input image size was 227 pixels x 227 pixels. Data augmentation was performed on the fly by flipping, rotating, and zooming the images. We evaluated the CNN in terms of precision, sensitivity, specificity, F1-score, and adjusted accuracy. The adjusted accuracy for the entire model was 94.0%. Confusion matrices indicated the lowest performance in correctly classifying Ponderosa pine and the Eastern spruce group, with an average sensitivity of 89.0% each. Even though high validation accuracy (>94.0%) was achieved, we conclude that a much larger dataset is needed to obtain industrially accurate identification of softwoods, mainly because of their visual and macroscopic similarities.
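
A rough Keras sketch of a sequential CNN in the spirit described is shown below, with a large first kernel, max pooling, leaky ReLUs, and L2 regularization on a 227 x 227 input. The layer counts, filter sizes, and RGB input are assumptions; the authors' custom initialization and exact topology are not reproduced.

```python
from tensorflow.keras import layers, models, regularizers

def build_model(n_species=11, l2=1e-4):
    # Assumed topology: large 11x11 first kernel, two conv/pool stages,
    # leaky ReLU activations, and L2 weight decay throughout.
    return models.Sequential([
        layers.Input(shape=(227, 227, 3)),  # RGB input assumed
        layers.Conv2D(32, 11, strides=4, kernel_regularizer=regularizers.l2(l2)),
        layers.LeakyReLU(),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(64, 5, padding="same", kernel_regularizer=regularizers.l2(l2)),
        layers.LeakyReLU(),
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(128, kernel_regularizer=regularizers.l2(l2)),
        layers.LeakyReLU(),
        layers.Dense(n_species, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])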


Author(s):  
Shikha Singhal ◽  
Bharat Hegde ◽  
Prathamesh Karmalkar ◽  
Justna Muhith ◽  
Harsha Gurulingappa

With the growing volume of unstructured data in the healthcare and pharmaceutical industries, there has been a drastic adoption of natural language processing for generating actionable insights from text data sources. One of the key areas of our exploration is the Medical Information function within our organization. We receive a significant number of medical information inquiries in the form of unstructured text. An enterprise-level solution must deal with medical information interactions via multiple communication channels, which are always nuanced with a variety of keywords and emotions unique to the pharmaceutical industry. There is a strong need for an effective solution that leverages the contextual knowledge of the medical information business along with the digital tenets of natural language processing (NLP) and machine learning to build an automated and scalable process that generates real-time insights on conversation categories. Traditional supervised learning methods rely on a huge set of manually labeled training data, and such a dataset is difficult to attain due to high labeling costs. Thus, the solution is incomplete without the ability to self-learn and improve. This necessitates techniques that automatically build relevant training data using a weakly supervised approach from textual inquiries across consumers, healthcare professionals, sales, and service providers. The solution has two fundamental layers of NLP and machine learning. The first layer leverages heuristics and a knowledgebase to identify potential categories and build annotated training data. The second layer, based on machine learning and deep learning, utilizes the training data generated by the heuristic approach to identify the categories and sub-categories associated with each verbatim. Here, we present a novel approach harnessing the power of weakly supervised learning combined with multi-class classification for improved categorization of medical information inquiries.
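
The two-layer design can be sketched as follows. In this minimal, hypothetical Python example, a keyword rulebase stands in for the heuristics-and-knowledgebase layer that weakly labels inquiries, and a scikit-learn classifier plays the role of the second, supervised layer; the rules, categories, and inquiries are all invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Layer 1 (hypothetical heuristics): keyword rules weakly label inquiries.
RULES = {
    "adverse_event": ["side effect", "reaction", "nausea"],
    "dosage": ["dose", "mg", "how often"],
}

def weak_label(text):
    text = text.lower()
    for category, cues in RULES.items():
        if any(cue in text for cue in cues):
            return category
    return None  # abstain when no rule fires

inquiries = [
    "What dose of the tablet should I take daily?",
    "My patient reported nausea after the infusion",
    "Is 50 mg safe with food?",
    "Are skin reactions a known side effect?",
]
labeled = [(t, weak_label(t)) for t in inquiries if weak_label(t)]

# Layer 2: train a supervised classifier on the weakly labeled data.
texts, cats = zip(*labeled)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(list(texts), list(cats))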


2021 ◽  
Vol 50 (7) ◽  
pp. 2059-2077
Author(s):  
Raja Azhan Syah Raja Wahab ◽  
Azuraliza Abu Bakar

The field of digital-economy income-tax compliance is still in its infancy. The limited collection of income taxes has forced the Inland Revenue Board of Malaysia (IRBM) to develop a solution to improve the tax compliance of the digital-economy sector, so that taxpayers report income voluntarily or firm action can be taken. The ability to diagnose taxpayer compliance will help the IRBM collect income tax effectively and deliver revenue to the country. However, extracting the necessary knowledge from a large amount of data is challenging, which motivates a predictive model for detecting taxpayers' compliance levels. This paper proposes descriptive and predictive analytics models for predicting digital-economy income-tax compliance in Malaysia. We first conduct descriptive analytics to explore the data and summarize it for an initial understanding; histograms of the data distribution show that the extracted information gives a clear picture of the factors influencing the classification of digital-economy tax compliance. In predictive modeling, single and ensemble approaches are employed to find the best model and the most important factors contributing to non-compliance among digital-economy retailers. Seven single classifier algorithms were validated on the training data, and three ensemble-based improvements were established, namely wrapper, boosting, and voting methods, together with two parameter-tuning techniques, grid search and evolutionary search. The experimental results show that the ensemble methods improve on the accuracy of the single classification models, with the highest classification accuracy of 87.94% compared to the best single classification model. A knowledge-analysis phase then learns meaningful features and hidden knowledge, categorizing the taxpayer contexts that could potentially influence the degree of tax compliance in the digital economy. Overall, this collection of information has the potential to help stakeholders make future decisions on the tax compliance of the digital economy.
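
To make the ensemble step concrete, the sketch below uses scikit-learn to combine voting and boosting with a grid search over member hyperparameters. The synthetic data stands in for the confidential taxpayer records, and the member classifiers and grids are illustrative rather than the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the taxpayer dataset (features and labels invented).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Soft-voting ensemble over three illustrative single classifiers,
# with gradient boosting as the boosting member.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier()),
        ("logit", LogisticRegression(max_iter=1000)),
        ("boost", GradientBoostingClassifier()),
    ],
    voting="soft",
)

# Grid search over one hyperparameter per member, a minimal analogue
# of the paper's grid-search tuning step.
grid = GridSearchCV(ensemble, {"tree__max_depth": [3, 5],
                               "boost__n_estimators": [50, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_score_)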


2019 ◽  
Author(s):  
Andrew Medford ◽  
Shengchun Yang ◽  
Fuzhu Liu

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx, and OHx species on the oxygen-vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N^1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.
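
The active-learning loop at the heart of the framework can be sketched with scikit-learn's Gaussian process regressor: train on a small seed set, then repeatedly add the pool candidate with the largest predictive uncertainty. In the sketch below, a toy 1-D target stands in for DFT adsorption energies; the paper's fingerprint features, GP implementation, and stopping criteria are not reproduced.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# Toy candidate pool: a noisy 1-D function standing in for DFT energies.
X_pool = rng.uniform(0, 10, size=(200, 1))
y_pool = np.sin(X_pool).ravel() + 0.1 * rng.standard_normal(200)

train_idx = list(rng.choice(200, size=5, replace=False))  # small seed set
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)

for _ in range(20):  # active-learning iterations
    gp.fit(X_pool[train_idx], y_pool[train_idx])
    _, std = gp.predict(X_pool, return_std=True)
    std[train_idx] = -np.inf               # never re-select known points
    train_idx.append(int(np.argmax(std)))  # query the most uncertain candidate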


2018 ◽  
Vol 6 (2) ◽  
pp. 283-286
Author(s):  
M. Samba Siva Rao ◽  
M. Yaswanth ◽  
K. Raghavendra Swamy ◽  
...  
