Assessing the Prevalence of Suspicious Activities in Asphalt Pavement Construction Using Algorithmic Logics and Machine Learning

Author(s):  
Mostofa Najmus Sakib

Quality Control (QC) and Quality Assurance (QA) together form a planned, systematic approach to securing satisfactory performance of hot mix asphalt (HMA) construction projects. Government and state highway agencies invest millions of dollars in large-scale HMA construction projects. QC/QA is a statistical approach to verifying the desired construction properties through independent testing. The practice of QC/QA has been encouraged by the Federal Highway Administration (FHWA) since the mid-1960s. However, standard QC/QA practice is often criticized over how effective such statistical tests are and how representative the reported material test results are. Alteration of material testing data in the HMA construction sector can render QC/QA ineffective and obscure the performance of asphalt pavements. The American Society of Civil Engineers estimates that $340 billion is lost globally each year due to corruption in the construction industry. Asphalt pavement construction spans several sectors, including construction and transportation, that are prone to potential suspicious activities. There are approximately 18 billion tons of asphalt pavement on American roads, which makes the cost of potential suspicious activities unacceptably large. The Idaho Transportation Department (ITD) relies on contractor-produced QC test results for payment on HMA pavement projects. In 2017, an FHWA case study found unnatural trends in which 74% of ITD test results did not match the contractor results. ITD's effort to track the accuracy of mix design and volumetric test data set the stage for this research to identify instances of suspicious activity in asphalt pavement projects. The first objective of this research was to develop algorithmic logics to recognize patterns of discrepancy in agency- and contractor-produced QC/QA test results. This was possible with a unique dataset that ITD collected from several dozen HMA projects, in which every data entry into the material testing report file was recorded in the background, without the operators' knowledge. My solution combined these logics into an algorithm that automatically detects and categorizes suspicious instances when multiple data entries are observed. Modern data mining approaches were also used to explore latent insights and screen out suspicious incidences, identifying the likelihood of suboptimal materials being used for paving and of extra payment in HMA pavement projects. I also applied supervised machine learning techniques to detect suspicious cases of data alteration. The second step of this research was to calculate the monetary losses due to data alteration. I replicated ITD's procedure for HMA payment calculation and quantified payment-related parameters and the associated payment for each project under two cases: (1) when the first parameter value categorized as a Suspicious Alteration (S.A.) was used for payment calculation, and (2) when the last S.A. parameter value was used. My findings show that there has been overpayment on construction projects across Idaho due to material testing data alterations. Overall, based on the available audit data, I found that overpayments ranged from $14,000 to $360,000. Further analysis showed that altering each major material testing parameter's value can cause roughly $1,000 to $5,000 of overpayment. I also note that data alteration did not always cause monetary gains; other possible motives include passing Percent Within Limit (PWL) criteria and precision criteria. Throughout this research, I strive to automate a suspicious activity detection system and calculate the associated excess payment.
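A minimal sketch of the detection-and-comparison logic described above, assuming hypothetical audit-log field names and a toy pay-factor formula (not ITD's actual payment specification or the exact algorithm developed in the research):

```python
# Flag parameters with multiple logged entries as candidate suspicious alterations (S.A.)
# and compare payment under the first vs. last entered value. Field names, the pay-factor
# formula, and thresholds below are illustrative placeholders only.
from collections import defaultdict

def group_entries(audit_log):
    """Group logged data entries by (project, lot, parameter), preserving entry order."""
    grouped = defaultdict(list)
    for record in audit_log:  # each record: {"project", "lot", "parameter", "value"}
        key = (record["project"], record["lot"], record["parameter"])
        grouped[key].append(record["value"])
    return grouped

def pay_factor(value, target, tolerance):
    """Toy pay factor: full pay inside tolerance, linearly reduced pay outside."""
    deviation = abs(value - target)
    return 1.00 if deviation <= tolerance else max(0.75, 1.00 - 0.05 * (deviation - tolerance))

def overpayment_estimate(audit_log, target, tolerance, lot_price):
    """Sum the pay difference between the last and first entered values for altered parameters."""
    total = 0.0
    for _, values in group_entries(audit_log).items():
        if len(values) > 1:  # multiple entries -> candidate suspicious alteration
            first_pay = pay_factor(values[0], target, tolerance) * lot_price
            last_pay = pay_factor(values[-1], target, tolerance) * lot_price
            total += last_pay - first_pay
    return total
```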

Author(s):  
V.T Priyanga ◽  
J.P Sanjanasri ◽  
Vijay Krishna Menon ◽  
E.A Gopalakrishnan ◽  
K.P Soman

The widespread use of social media such as Facebook, Twitter, and WhatsApp has changed the way news is created and published; accessing news has become easy and inexpensive. However, the scale of usage and the inability to moderate content have made social media a breeding ground for the circulation of fake news. Fake news is deliberately created either to increase readership or to disrupt order in society for political and commercial benefit. It is of paramount importance to identify and filter out fake news, especially in democratic societies. Most existing methods for detecting fake news rely on traditional supervised machine learning, which has been quite ineffective. In this paper, we analyze word embedding features that can tell fake news apart from true news. We use the LIAR and ISOT data sets. We extract highly correlated news data from the entire data set using cosine similarity and similar metrics, in order to distinguish their domains based on central topics. We then employ autoencoders to detect and differentiate between true and fake news while also exploring their separability through network analysis.
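A minimal sketch of the similarity-screening step described above, assuming TF-IDF vectors as a stand-in for the embeddings used in the paper:

```python
# Screen highly correlated news items with cosine similarity; illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def correlated_pairs(texts, threshold=0.8):
    """Return index pairs of news items whose cosine similarity exceeds the threshold."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(vectors)
    np.fill_diagonal(sims, 0.0)          # ignore self-similarity
    rows, cols = np.where(sims > threshold)
    return [(i, j) for i, j in zip(rows, cols) if i < j]
```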


Diagnostics ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 642
Author(s):  
Yi-Da Wu ◽  
Ruey-Kai Sheu ◽  
Chih-Wei Chung ◽  
Yen-Ching Wu ◽  
Chiao-Chi Ou ◽  
...  

Background: Antinuclear antibody pattern recognition is vital for autoimmune disease diagnosis but labor-intensive to interpret manually. To develop an automated pattern recognition system, we established machine learning models based on the International Consensus on Antinuclear Antibody Patterns (ICAP) at a competent level, including mixed-pattern recognition, and evaluated their consistency with human reading. Methods: 51,694 human epithelial (HEp-2) cell images, with patterns assigned by experienced medical technologists and collected in a medical center, were used to train six machine learning algorithms, which were compared by their performance. We then chose the best-performing model to test consistency with five experienced readers and two beginners. Results: The mean F1 score in each classification of the best-performing model was 0.86, evaluated on Testing Data 1. For the inter-observer agreement test on Testing Data 2, the average agreement was 0.849 among the five experienced readers, 0.844 between the best-performing model and the experienced readers, and 0.528 between the experienced readers and the beginners. The results indicate that the proposed model outperformed beginners and achieved excellent agreement with experienced readers. Conclusions: This study demonstrated that the developed model can reach excellent agreement with experienced human readers using machine learning methods.
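A brief sketch of how such inter-observer agreement might be computed; Cohen's kappa is assumed here as the agreement statistic, which the abstract does not specify:

```python
# Compare a model's pattern labels against a human reader's labels.
from sklearn.metrics import cohen_kappa_score, f1_score

def agreement_report(model_labels, reader_labels):
    """Return agreement (Cohen's kappa) and macro F1 between two label sequences."""
    return {
        "cohen_kappa": cohen_kappa_score(model_labels, reader_labels),
        "macro_f1": f1_score(model_labels, reader_labels, average="macro"),
    }

# Example: agreement_report(["homogeneous", "speckled"], ["homogeneous", "nucleolar"])
```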


2021 ◽  
Vol 10 (7) ◽  
pp. 436
Author(s):  
Amerah Alghanim ◽  
Musfira Jilani ◽  
Michela Bertolotto ◽  
Gavin McArdle

Volunteered Geographic Information (VGI) is often collected by non-expert users. This raises concerns about the quality and veracity of such data. There has been much effort to understand and quantify the quality of VGI. Extrinsic measures, which compare VGI to authoritative data sources such as national mapping agencies, are common, but the cost and slow update frequency of such data hinder the task. On the other hand, intrinsic measures, which compare the data to heuristics or models built from the VGI data itself, are becoming increasingly popular. Supervised machine learning techniques are particularly suitable for intrinsic measures of quality, where they can infer and predict properties of spatial data. In this article we are interested in assessing the quality of semantic information, such as the road type, associated with data in OpenStreetMap (OSM). We have developed a machine learning approach which utilises new intrinsic input features collected from the VGI dataset. Using our proposed approach, we obtained an average classification accuracy of 84.12%, which outperforms existing techniques on the same semantic inference task. The trustworthiness of the data used for developing and training machine learning models is also important. To address this issue, we have developed a new trustworthiness measure that uses direct and indirect characteristics of OSM data, such as its edit history, along with an assessment of the users who contributed the data. An evaluation of the impact of data determined to be trustworthy shows that training on trusted data collected with the new approach improves the prediction accuracy of our machine learning technique. Specifically, our results demonstrate that the classification accuracy of our developed model is 87.75% when applied to a trusted dataset and 57.98% when applied to an untrusted dataset. Consequently, such results can be used to assess the quality of OSM and suggest improvements to the data set.
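A minimal sketch of the intrinsic, supervised setup described above; the feature set and the random-forest model are illustrative assumptions, not the authors' exact approach:

```python
# Predict the semantic road type of an OSM way from intrinsic features
# (e.g., way length, node count, contributor edit count); features are hypothetical.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

def road_type_accuracy(features: np.ndarray, road_type_labels: np.ndarray) -> float:
    """Estimate road-type classification accuracy from intrinsic OSM features."""
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_val_score(model, features, road_type_labels, cv=5, scoring="accuracy")
    return scores.mean()
```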


2017 ◽  
Vol 11 (04) ◽  
pp. 497-511
Author(s):  
Elnaz Davoodi ◽  
Leila Kosseim ◽  
Matthew Mongrain

This paper evaluates the effect of the context of a target word on the identification of complex words in natural language texts. The approach automatically tags words as either complex or not, based on two sets of features: base features that only pertain to the target word, and contextual features that take the context of the target word into account. We experimented with several supervised machine learning models, and trained and tested the approach with the 2016 SemEval Word Complexity Data Set. Results show that when discriminating base features are used, the words around the target word can supplement those features and improve the recognition of complex words.
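A hedged sketch of the base-plus-contextual feature split described above; the specific features and classifier are illustrative, not the paper's exact configuration:

```python
# Tag a target word as complex or not from base features (the word itself)
# plus contextual features (its neighbours); feature choices are placeholders.
from sklearn.linear_model import LogisticRegression

def word_features(tokens, i, window=2):
    """Base features of tokens[i] plus simple features of its surrounding context."""
    target = tokens[i]
    base = [len(target), sum(ch in "aeiou" for ch in target)]          # length, vowel count
    context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
    contextual = [
        sum(len(w) for w in context) / max(len(context), 1),           # mean neighbour length
        len(context),                                                  # available context size
    ]
    return base + contextual

# X = [word_features(sentence, i) for sentence, i in annotated_positions]
# y = [label for ...]                     # 1 = complex, 0 = simple
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```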


2019 ◽  
Vol 8 (4) ◽  
pp. 9155-9158

Classification is a machine learning task that consists of predicting the class membership of unlabeled examples, whose labels are not known, from a representation learned from training examples whose labels are known. Classification tasks span a wide range of domains and real-world applications, including medical diagnosis, bioinformatics, financial engineering, and image recognition, where domain experts can use the learned model to support their decisions. All the classification approaches considered in this paper were evaluated in an appropriate experimental framework in the R programming language, with the major emphasis on the k-nearest neighbor method, support vector machines, and decision trees, applied to a large number of data sets with varied dimensionality and compared against other state-of-the-art methods. The experimental results were verified by statistical tests that support the better performance of these methods. In this paper we survey various data mining classification techniques and compare them using diverse datasets from the University of California, Irvine (UCI) Machine Learning Repository, with accuracy calculations reported on the Iris data set.
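An illustrative sketch of the comparison described above, written in Python for brevity (the paper itself used R), evaluating k-NN, an SVM, and a decision tree on the UCI Iris data set:

```python
# Compare three classifiers on Iris with 10-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```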


2017 ◽  
Author(s):  
Daniel R. Schrider ◽  
Julien Ayroles ◽  
Daniel R. Matute ◽  
Andrew D. Kern

ABSTRACT: Hybridization and gene flow between species appear to be common. Even though it is clear that hybridization is widespread across all surveyed taxonomic groups, the magnitude and consequences of introgression are still largely unknown. Thus it is crucial to develop the statistical machinery required to uncover which genomic regions have recently acquired haplotypes via introgression from a sister population. We developed a novel machine learning framework, called FILET (Finding Introgressed Loci via Extra-Trees), capable of revealing genomic introgression with far greater power than competing methods. FILET works by combining information from a number of population genetic summary statistics, including several new statistics that we introduce, that capture patterns of variation across two populations. We show that FILET is able to identify loci that have experienced gene flow between related species with high accuracy, and in most situations can correctly infer which population was the donor and which was the recipient. Here we describe a data set of outbred diploid Drosophila sechellia genomes, and combine them with data from D. simulans to examine recent introgression between these species using FILET. Although we find that these populations may have split more recently than previously appreciated, FILET confirms that there has indeed been appreciable recent introgression (some of which might have been adaptive) between these species, and reveals that this gene flow is primarily in the direction of D. simulans to D. sechellia.

AUTHOR SUMMARY: Understanding the extent to which species or diverged populations hybridize in nature is crucially important if we are to understand the speciation process. Accordingly, numerous research groups have developed methodology for finding the genetic evidence of such introgression. In this report we develop a supervised machine learning approach for uncovering loci which have introgressed across species boundaries. We show that our method, FILET, has greater accuracy and power than competing methods in discovering introgression, and in addition can detect the directionality associated with the gene flow between species. Using whole genome sequences from Drosophila simulans and Drosophila sechellia, we show that FILET discovers quite extensive introgression between these species that has occurred mostly from D. simulans to D. sechellia. Our work highlights the complex process of speciation even within a well-studied system and points to the growing importance of supervised machine learning in population genetics.
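A minimal sketch of the classifier family FILET builds on (Extra-Trees) applied to per-window summary-statistic feature vectors; the feature layout and label scheme below are placeholders, not FILET's actual design:

```python
# Classify genomic windows as introgressed or not from population-genetic summary statistics.
from sklearn.ensemble import ExtraTreesClassifier

# X: rows = genomic windows, columns = summary statistics (e.g., Fst, dxy, pi, ...)
# y: labels from simulated training data, e.g. 0 = no introgression,
#    1 = introgression pop1 -> pop2, 2 = introgression pop2 -> pop1
def train_introgression_classifier(X, y):
    clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    clf.fit(X, y)
    return clf

# probs = train_introgression_classifier(X_train, y_train).predict_proba(X_windows)
```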


Missing data raise major issues for quantitative analysis in large databases. Because of these issues, computational inference produces biased results, data are further degraded, error rates increase, and imputation becomes more difficult to accomplish. Predicting disguised missing data in large data sets is another major problem in real-time operation. Machine learning (ML) techniques can be combined with measurement classification to improve the accuracy of predicted values, and they address the various challenges posed by missing data. Recent work predicts misclassification using a supervised ML approach, estimating an output for an unseen input from a limited number of parameters in a data set; as the number of parameters grows, accuracy declines. This article presents COBACO, a new and effective supervised machine learning technique. Several strategies are described for classifying predictive techniques for missing data analysis within efficient supervised machine learning. The proposed predictive technique, COBACO, generated more precise and accurate results than the other predictive approaches. Experimental results obtained using both real and synthetic data sets show that the proposed approach offers valuable and promising insight into the problem of predicting missing information.
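A generic sketch of supervised imputation, predicting missing entries from the observed columns; this illustrates the general idea only and is not the paper's COBACO method, whose details are not given here:

```python
# Fill NaNs by iteratively regressing each incomplete feature on the others.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def impute_missing(X: np.ndarray) -> np.ndarray:
    """Return a copy of X with missing values predicted from observed columns."""
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=0),
        max_iter=10,
        random_state=0,
    )
    return imputer.fit_transform(X)
```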


2021 ◽  
Author(s):  
Sang Min Nam ◽  
Thomas A Peterson ◽  
Kyoung Yul Seo ◽  
Hyun Wook Han ◽  
Jee In Kang

BACKGROUND In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. OBJECTIVE Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis. METHODS An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network. RESULTS The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (P<.05) and indirect (P≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors. Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking had a significant risk in females, and weight gain was associated with depression involving diabetes. CONCLUSIONS XGBoost and network analysis were useful to discover depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.
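A hedged sketch of the workflow described above: fit XGBoost, rank features by importance, and refit a sparse model on the top-ranked features. Hyperparameters are placeholders (the study tuned them with TPOT) and survey weighting is omitted:

```python
# Sparse-model construction via XGBoost feature importance; illustrative settings only.
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def sparse_xgb(X, y, n_keep=18):
    """Fit a full model, keep the n_keep most important features, and refit a sparse model."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    full = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
    full.fit(X_tr, y_tr)
    top = np.argsort(full.feature_importances_)[::-1][:n_keep]   # top-ranked feature indices
    sparse = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
    sparse.fit(X_tr[:, top], y_tr)
    auc = roc_auc_score(y_te, sparse.predict_proba(X_te[:, top])[:, 1])
    return sparse, top, auc
```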


2021 ◽  
Vol 2021 ◽  
pp. 1-18
Author(s):  
Aurelle Tchagna Kouanou ◽  
Thomas Mih Attia ◽  
Cyrille Feudjio ◽  
Anges Fleurio Djeumo ◽  
Adèle Ngo Mouelas ◽  
...  

Background and Objective. To mitigate the spread of the virus responsible for COVID-19, known as SARS-CoV-2, there is an urgent need for massive population testing. Due to the constant shortage of reagents for PCR (polymerase chain reaction) tests, the reference tests for COVID-19, several medical centers have opted for immunological tests that look for the presence of antibodies produced against this virus. However, these tests have a high rate of false positives (positive but actually negative results) and false negatives (negative but actually positive results) and are therefore not always reliable. In this paper, we propose a solution based on data analysis and machine learning to detect COVID-19 infections. Methods. Our analysis and machine learning algorithm is based on two of the most cited clinical datasets in the literature: one from San Raffaele Hospital, Milan, Italy, and the other from Hospital Israelita Albert Einstein, São Paulo, Brazil. The datasets were processed to select the features that most influence the target, and it turned out that almost all of them are blood parameters. Exploratory Data Analysis (EDA) methods were applied to the datasets, and a comparative study of supervised machine learning models was carried out, after which the support vector machine (SVM) was selected as the one with the best performance. Results. SVM, being the best performer, is used as our proposed supervised machine learning algorithm. An accuracy of 99.29%, sensitivity of 92.79%, and specificity of 100% were obtained with the dataset from Kaggle (https://www.kaggle.com/einsteindata4u/covid19) after applying optimization to the SVM. The same procedure was performed with the dataset taken from San Raffaele Hospital (https://zenodo.org/record/3886927#.YIluB5AzbMV). Once more, the SVM presented the best performance among the machine learning algorithms, with 92.86% accuracy, 93.55% sensitivity, and 90.91% specificity. Conclusion. The obtained results, when compared with others from the literature based on these same datasets, are superior, leading us to conclude that our proposed solution is reliable for COVID-19 diagnosis.
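A minimal sketch of the kind of SVM pipeline described above, with accuracy, sensitivity, and specificity derived from the confusion matrix; the preprocessing and SVM parameters are illustrative, not the paper's tuned configuration:

```python
# Scaled blood parameters fed to an RBF SVM, evaluated on a held-out split.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def evaluate_svm(X, y):
    """Train an SVM on blood-parameter features and report binary classification metrics."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
    model.fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```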


2021 ◽  
Author(s):  
Marc Raphael ◽  
Michael Robitaille ◽  
Jeff Byers ◽  
Joseph Christodoulides

Abstract: Machine learning algorithms hold the promise of greatly improving live cell image analysis by (1) analyzing far more imagery than can be achieved by traditional manual approaches and (2) eliminating the subjectivity of researchers and diagnosticians selecting the cells or cell features to be included in the analyzed data set. Currently, however, even the most sophisticated model-based or machine learning algorithms require user supervision, meaning the subjectivity problem is not removed but rather incorporated into the algorithm’s initial training steps and then repeatedly applied to the imagery. To address this roadblock, we have developed a self-supervised machine learning algorithm that recursively trains itself directly from the live cell imagery data, thus providing objective segmentation and quantification. The approach incorporates an optical flow component to self-label cell and background pixels for training, followed by the extraction of additional feature vectors for the automated generation of a cell/background classification model. Because it is self-trained, the software has no user-adjustable parameters and does not require curated training imagery. The algorithm was applied to automatically segment cells from their background for a variety of cell types and five commonly used imaging modalities: fluorescence, phase contrast, differential interference contrast (DIC), transmitted light, and interference reflection microscopy (IRM). The approach is broadly applicable in that it enables completely automated cell segmentation for long-term live cell phenotyping applications, regardless of the input imagery’s optical modality, magnification, or cell type.
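A conceptual sketch of the self-labeling idea (not the authors' released software): dense optical flow between consecutive frames labels moving (cell) versus static (background) pixels, and a pixel classifier is then trained on simple intensity features:

```python
# Self-label pixels from motion, then train a per-pixel cell/background classifier.
# Frames are assumed to be single-channel (grayscale) images; features are illustrative.
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_label(frame_a: np.ndarray, frame_b: np.ndarray, flow_thresh: float = 0.5):
    """Label pixels as cell (1) where optical-flow magnitude is high, background (0) otherwise."""
    flow = cv2.calcOpticalFlowFarneback(frame_a, frame_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    return (magnitude > flow_thresh).astype(np.uint8)

def train_pixel_classifier(frame: np.ndarray, labels: np.ndarray):
    """Fit a classifier on simple per-pixel features (raw intensity and blurred intensity)."""
    blur = cv2.GaussianBlur(frame, (5, 5), 0)
    features = np.stack([frame.ravel(), blur.ravel()], axis=1).astype(np.float32)
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(features, labels.ravel())
```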

