Simpler and Faster Development of Tumor Phylogeny Pipelines

2021 ◽  
Author(s):  
Sarwan Ali ◽  
Simone Ciccolella ◽  
Lorenzo Lucarella ◽  
Gianluca Della Vedova ◽  
Murray D Patterson

In recent years there has been an increasing number of single-cell sequencing (SCS) studies, producing a considerable number of new datasets. This has particularly affected the field of cancer analysis, where a growing number of papers use this sequencing technique, which captures detailed information about the specific genetic mutations in each individually sampled cell. As the amount of information increases, more sophisticated and faster tools are needed to analyze the samples. To this end we developed *plastic*, an easy-to-use and quick-to-adapt pipeline that integrates three different steps: (1) simplifying the input data; (2) inferring tumor phylogenies; and (3) comparing the phylogenies. We created a pipeline submodule for each of these steps and developed new in-memory data structures that allow easy and transparent sharing of information across the tools implementing them. While we use existing open-source tools for these steps, we extended the tool used for simplifying the input data with two machine learning procedures, which greatly reduce the running time without affecting the quality of the downstream analysis. Moreover, we added the capability to produce plots for quickly visualizing results.
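As a rough illustration of the pipeline structure described above, here is a minimal Python sketch chaining the three steps through in-memory data structures; the module layout and function names (`simplify`, `infer`, `compare`) are hypothetical, since the abstract does not specify plastic's actual API:

```python
# Hypothetical sketch of a three-step tumor-phylogeny pipeline in the
# spirit of plastic; function names and logic are illustrative only.
import pandas as pd

def simplify(scs_matrix: pd.DataFrame) -> pd.DataFrame:
    """Step 1: reduce the input cells-by-mutations matrix so the
    downstream inference runs faster (toy reduction shown)."""
    return scs_matrix.loc[:, scs_matrix.sum(axis=0) >= 2]

def infer(scs_matrix: pd.DataFrame) -> dict:
    """Step 2: infer a tumor phylogeny; a real pipeline would invoke an
    external open-source tool here."""
    return {"tree": "placeholder-phylogeny", "n_mutations": scs_matrix.shape[1]}

def compare(tree_a: dict, tree_b: dict) -> float:
    """Step 3: compare two phylogenies and return a similarity score."""
    return float(tree_a["n_mutations"] == tree_b["n_mutations"])

# In-memory data structures flow directly between steps, so nothing is
# written to disk and re-parsed in between.
raw = pd.DataFrame([[1, 0, 1], [1, 1, 0], [0, 1, 1]])
tree = infer(simplify(raw))
print(compare(tree, tree))
```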

2021 ◽  
Author(s):  
S. H. Al Gharbi ◽  
A. A. Al-Majed ◽  
A. Abdulraheem ◽  
S. Patil ◽  
S. M. Elkatatny

Due to the high demand for energy, oil and gas companies have started to drill wells in remote areas and unconventional environments. This has raised the complexity of drilling operations, which were already challenging and complex. To adapt, drilling companies expanded their use of the real-time operation center (RTOC) concept, in which real-time drilling data are transmitted from remote sites to companies' headquarters. In the RTOC, groups of subject matter experts monitor the drilling live and provide real-time advice to improve operations. With the increase in drilling operations, processing the volume of generated data is beyond human capability, limiting the RTOC's impact on certain components of drilling operations. To overcome this limitation, artificial intelligence and machine learning (AI/ML) technologies were introduced to monitor and analyze the real-time drilling data, discover hidden patterns, and provide fast decision-support responses. AI/ML technologies are data-driven, and the quality of their output depends on the quality of the input data: good input data yields good output, while poor input data yields poor output. Unfortunately, due to the harsh environments of drilling sites and the transmission setups, not all drilling data are of good quality, which negatively affects the AI/ML results. The objective of this paper is to utilize AI/ML technologies to improve the quality of real-time drilling data. The paper fed a large real-time drilling dataset, consisting of over 150,000 raw data points, into Artificial Neural Network (ANN), Support Vector Machine (SVM) and Decision Tree (DT) models. The models were trained to distinguish valid from invalid data points. Confusion matrices were used to evaluate the different AI/ML models, including different internal architectures. Despite its slower training, the ANN achieved the best result, with an accuracy of 78%, compared to 73% and 41% for the DT and SVM, respectively. The paper concludes by presenting a process for using AI technology to improve real-time drilling data quality. To the authors' knowledge, based on literature in the public domain, this paper is among the first to compare multiple AI/ML techniques for quality improvement of real-time drilling data, and it provides a guide for improving the quality of real-time drilling data.
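A hedged sketch of the kind of comparison the paper describes, using scikit-learn stand-ins for the ANN, SVM, and DT models; the synthetic features and labels are assumptions, not the paper's actual 150,000-point dataset:

```python
# Sketch: comparing ANN, SVM and DT classifiers for flagging valid vs.
# invalid real-time drilling data points, evaluated via confusion matrices.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 6))   # stand-ins for e.g. hook load, RPM, torque
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "ANN": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500),
    "SVM": SVC(kernel="rbf"),
    "DT": DecisionTreeClassifier(max_depth=8),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    print(name, accuracy_score(y_te, pred))
    print(confusion_matrix(y_te, pred))   # per-model evaluation, as in the paper
```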


2021 ◽  
Author(s):  
Andrew McDonald ◽  

Decades of subsurface exploration and characterisation have led to the collation and storage of large volumes of well-related data. The amount of data gathered daily continues to grow rapidly as technology and recording methods improve. With the increasing adoption of machine learning techniques in the subsurface domain, it is essential that the quality of the input data is carefully considered when working with these tools. If the input data is of poor quality, the impact on the precision and accuracy of the prediction can be significant. Consequently, this can impact key decisions about the future of a well or a field. This study focuses on well log data, which can be highly multi-dimensional, diverse and stored in a variety of file formats. Well log data exhibits key characteristics of Big Data: volume, variety, velocity, veracity and value. Well data can include numeric values, text values, waveform data, image arrays, maps and volumes, all of which can be indexed by time or depth in a regular or irregular way. A significant portion of time can be spent gathering data and quality checking it prior to carrying out petrophysical interpretations and applying machine learning models. Well log data can be affected by numerous issues causing a degradation in data quality. These include missing data, ranging from single data points to entire curves; noisy data from tool-related issues; borehole washout; processing issues; incorrect environmental corrections; and mislabelled data. Having vast quantities of data does not mean it can all be passed into a machine learning algorithm with the expectation that the resultant prediction is fit for purpose. It is essential that the most important and relevant data is passed into the model through appropriate feature selection techniques. Not only does this improve the quality of the prediction, but it also reduces computational time and can provide a better understanding of how the models reach their conclusion. This paper reviews data quality issues typically faced by petrophysicists when working with well log data and deploying machine learning models. First, an overview of machine learning and Big Data is covered in relation to petrophysical applications. Second, data quality issues commonly faced with well log data are discussed. Third, methods are suggested for dealing with data issues prior to modelling. Finally, multiple case studies are discussed covering the impacts of data quality on predictive capability.
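As a small illustration of the quality checks listed above, a pandas sketch that flags missing samples and out-of-range values in a single log curve; the curve name (GR) and the valid gamma-ray range are assumptions:

```python
# Sketch: basic quality flags for a well-log curve prior to modelling.
# Column names and the valid range are illustrative assumptions.
import numpy as np
import pandas as pd

log = pd.DataFrame({
    "DEPTH": np.arange(1000.0, 1005.0, 0.5),
    "GR": [45.0, np.nan, 50.0, 250.0, -10.0, 55.0, np.nan, np.nan, 60.0, 58.0],
})

log["missing"] = log["GR"].isna()                 # single points or whole gaps
log["out_of_range"] = log["GR"].notna() & ~log["GR"].between(0.0, 200.0)
print(log[log["missing"] | log["out_of_range"]])  # rows needing attention
```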


Author(s):  
Masaru Ide

We propose anomaly detection to refine the input data for predictive machine learning systems. If outliers such as spike noise are mixed into the input data during training, the quality of the trained model deteriorates. Removing such outliers is expected to improve the service quality of machine learning systems such as autonomous vehicles and ship navigation. Conventional anomaly detection methods generally require the support of domain experts and do not cope well with unstable, random environments. We propose a new anomaly detection method that is highly stable and capable of handling random environments without expert support. The proposed method focuses on the pairwise correlation between two input time series: their rates of change are calculated and summarized on a quadrant chart for further analysis. An experiment using an open time-series dataset shows that the proposed method successfully detects anomalies, and the detected data points are easily illustrated in a human-interpretable way.
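A minimal sketch of the pairwise idea, under the assumption that two correlated series should usually move in the same direction: the rates of change place each time step in a quadrant, and isolated large opposite-direction movements stand out. The synthetic series and the threshold are illustrative only, not the paper's method in detail:

```python
# Sketch: pairwise change-rate anomaly detection summarized on a quadrant
# chart. Correlated series mostly land in quadrants I and III (both rising
# or both falling); large excursions elsewhere are suspects.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 500)
a = np.sin(t) + rng.normal(scale=0.02, size=t.size)
b = np.sin(t) + rng.normal(scale=0.02, size=t.size)
b[250] += 1.5                        # inject a spike-like outlier into one series

da, db = np.diff(a), np.diff(b)      # rates of change of each series
mag = np.hypot(da, db)               # movement size on the (da, db) plane
suspects = np.where(mag > 5.0 * mag.mean())[0]
for i in suspects:
    quad = ("I" if da[i] > 0 else "II") if db[i] > 0 else ("IV" if da[i] > 0 else "III")
    print(f"step {i + 1}: quadrant {quad}, magnitude {mag[i]:.2f}")
```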


2019 ◽  
Vol 252 ◽  
pp. 09001
Author(s):  
Grzegorz Kłosowski ◽  
Tomasz Rymarczyk ◽  
Edward Kozłowski

This article presents an original approach to improving tomographic reconstructions by denoising the input data, which in turn improves the output images. The algorithms used in the research are based on autoencoders and Elastic Net, both drawn from artificial intelligence and machine learning. By reducing unnecessary features and removing mutually correlated input variables generated by the tomography electrodes, good-quality reconstructions of tomographic images were obtained. The simulation experiments showed that the presented methods can be effective in improving the quality of reconstructed tomographic images.
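A hedged sketch of the Elastic Net side of this approach: its combined L1/L2 penalty zeroes unnecessary features and stabilises mutually correlated inputs. The synthetic measurement matrix below is an assumption, not tomography data:

```python
# Sketch: Elastic Net to suppress correlated, redundant inputs. The L1
# term zeroes duplicate features; the L2 term stabilises the rest.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 8))
# Append near-duplicates of the first four columns to mimic mutually
# correlated electrode readings.
X = np.hstack([base, base[:, :4] + rng.normal(scale=0.05, size=(200, 4))])
y = base @ rng.normal(size=8) + rng.normal(scale=0.1, size=200)

model = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)
print("retained inputs:", np.flatnonzero(model.coef_))  # redundant copies shrink
```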


Author(s):  
Andrew McDonald ◽  

Decades of subsurface exploration and characterization have led to the collation and storage of large volumes of well-related data. The amount of data gathered daily continues to grow rapidly as technology and recording methods improve. With the increasing adoption of machine-learning techniques in the subsurface domain, it is essential that the quality of the input data is carefully considered when working with these tools. If the input data are of poor quality, the impact on the precision and accuracy of the prediction can be significant. Consequently, this can impact key decisions about the future of a well or a field. This study focuses on well-log data, which can be highly multidimensional, diverse, and stored in a variety of file formats. Well-log data exhibit key characteristics of big data: volume, variety, velocity, veracity, and value. Well data can include numeric values, text values, waveform data, image arrays, maps, and volumes, all of which can be indexed by time or depth in a regular or irregular way. A significant portion of time can be spent gathering data and quality checking it prior to carrying out petrophysical interpretations and applying machine-learning models. Well-log data can be affected by numerous issues causing a degradation in data quality, including missing data ranging from single data points to entire curves, noisy data from tool-related issues, borehole washout, processing issues, incorrect environmental corrections, and mislabeled data. Having vast quantities of data does not mean it can all be passed into a machine-learning algorithm with the expectation that the resultant prediction is fit for purpose. It is essential that the most important and relevant data are passed into the model through appropriate feature selection techniques. Not only does this improve the quality of the prediction, but it also reduces computational time and can provide a better understanding of how the models reach their conclusion. This paper reviews data quality issues typically faced by petrophysicists when working with well-log data and deploying machine-learning models. This is achieved by first providing an overview of machine learning and big data within the petrophysical domain, followed by a review of common well-log data issues, their impact on machine-learning algorithms, and methods for mitigating their influence.
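One common way to pass only the most relevant curves into a model, as the review recommends, is a ranking-based feature selection step; here is a sketch using scikit-learn's mutual-information score on synthetic stand-in curves (the curve names and the target are assumptions):

```python
# Sketch: ranking well-log curves by mutual information before modelling.
# Curve names and the synthetic porosity target are illustrative only.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(3)
n = 1000
logs = pd.DataFrame({
    "GR": rng.normal(75, 20, n),
    "RHOB": rng.normal(2.4, 0.15, n),
    "NPHI": rng.normal(0.25, 0.05, n),
    "CALI": rng.normal(8.5, 0.3, n),   # borehole size, often less relevant
})
porosity = 0.4 - 0.12 * logs["RHOB"] + 0.5 * logs["NPHI"] + rng.normal(0, 0.01, n)

scores = mutual_info_regression(logs, porosity, random_state=0)
print(pd.Series(scores, index=logs.columns).sort_values(ascending=False))
```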


Author(s):  
Feidu Akmel ◽  
Ermiyas Birihanu ◽  
Bahir Siraj

Software systems are software products or applications that support business domains such as manufacturing, aviation, health care, and insurance. Software quality is a means of measuring how software is designed and how well it conforms to that design. Among the variables used to assess software quality are correctness, product quality, scalability, completeness, and the absence of bugs. However, because quality standards differ from one organization to another, it is better to apply software metrics to measure the quality of software. Attributes gathered from source code through software metrics can serve as input for a software defect predictor. Software defects are errors introduced by software developers and stakeholders. Finally, this study surveys the application of machine learning to software defect data gathered from previous research works.
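A minimal sketch of the workflow surveyed here: metrics extracted from source code feed a defect classifier. The metric columns and labels below are synthetic assumptions, not data from the cited studies:

```python
# Sketch: predicting defect-prone modules from software metrics.
# LOC, cyclomatic complexity and coupling are stand-ins for metrics
# mined from a real code base; labels follow a toy rule.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n = 800
loc = rng.integers(20, 2000, n)
complexity = rng.integers(1, 40, n)
coupling = rng.integers(0, 15, n)
X = np.column_stack([loc, complexity, coupling])
# Toy rule: large, complex or highly coupled modules are defect-prone.
y = ((complexity > 20) & (loc > 500) | (coupling > 10)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```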


2020 ◽  
Author(s):  
Saeed Nosratabadi ◽  
Amir Mosavi ◽  
Puhong Duan ◽  
Pedram Ghamisi ◽  
Ferdinand Filip ◽  
...  

This paper provides a state-of-the-art investigation of advances in data science in emerging economic applications. The analysis was performed on novel data science methods in four individual classes: deep learning models, hybrid deep learning models, hybrid machine learning, and ensemble models. Application domains include a wide and diverse range of economics research, from the stock market, marketing, and e-commerce to corporate banking and cryptocurrency. The PRISMA method, a systematic literature review methodology, was used to ensure the quality of the survey. The findings reveal that the trends follow the advancement of hybrid models, which, based on the accuracy metric, outperform other learning algorithms. It is further expected that the trends will converge toward the advancement of sophisticated hybrid deep learning models.


2020 ◽  
Vol 20 (9) ◽  
pp. 720-730
Author(s):  
Iker Montes-Bageneta ◽  
Urtzi Akesolo ◽  
Sara López ◽  
Maria Merino ◽  
Eneritz Anakabe ◽  
...  

Background: The generation of hazardous organic waste in teaching and research laboratories poses a big problem that universities have to manage. Aims: Computational modelling may help us to detect the most important factors governing this process in order to optimize it. Methods: In this work, we report on the experimental measurement of waste generation in the chemical education laboratories within our department. We measured the waste generated in the teaching laboratories of the Organic Chemistry Department II (UPV/EHU) in the second semester of the 2017/2018 academic year. Likewise, to identify the anthropogenic and social factors related to waste generation, a questionnaire was used. We focused on all students of the Experimentation in Organic Chemistry (EOC) and Organic Chemistry II (OC2) subjects. This helped us gauge their prior knowledge about waste, their awareness of the need to separate organic waste, and their correct use of the containers. These results, together with the volumetric data, were analyzed with statistical analysis software. We obtained two Perturbation-Theory Machine Learning (PTML) models including chemical, operational, and academic factors. The analyzed dataset included 6050 cases of laboratory practices vs. reference practices. Results: These models predict the values of acetone waste with R² = 0.88 and of non-halogenated waste with R² = 0.91. Conclusion: This work opens a new gate to the implementation of more sustainable techniques and a circular economy, with the aim of improving the quality of university education processes.
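For a sense of the kind of fit reported, here is a plain regression sketch predicting waste volume from operational factors and reporting R²; this is not the paper's PTML formulation, and every variable below is an assumption:

```python
# Sketch: regressing waste volume on operational/academic factors and
# reporting R^2. A stand-in illustration, not the actual PTML models.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
n = 500
students = rng.integers(8, 30, n)       # group size per session
reactions = rng.integers(1, 6, n)       # reactions per session
solvent_ml = rng.normal(150, 30, n)     # solvent used per practice
X = np.column_stack([students, reactions, solvent_ml])
waste_l = (0.01 * students + 0.05 * reactions + 0.002 * solvent_ml
           + rng.normal(0, 0.05, n))

model = LinearRegression().fit(X, waste_l)
print("R2:", r2_score(waste_l, model.predict(X)))
```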


2017 ◽  
Vol 924 (6) ◽  
pp. 2-5
Author(s):  
V.N. Puchkov ◽  
R.S. Musalimov ◽  
D.S. Zavarnov

This work analyzes the description of rural settlement boundaries in the Republic of Bashkortostan, drawing on the experience of other sub-federal units of the Russian Federation. A range of weak points in the collected input data was identified. Of the 54 municipal districts of the Republic of Bashkortostan (818 rural settlements), 44 districts showed nonconformity of the submitted data to regulatory requirements. The main reason for this is the low quality of input materials such as base maps at scale 1

