data preparation
Recently Published Documents





2022 ◽  
Vol 21 (4) ◽  
pp. 346-363
Hubert Anysz

The use of data mining and machine learning tools is becoming increasingly common. Their usefulness is most noticeable with large datasets, where the information sought, or new relationships, are extracted from information noise. The development of these tools means that datasets with far fewer records, usually associated with specific phenomena, are now being explored as well. This specificity most often makes it impossible to increase the number of cases, which would otherwise facilitate the search for dependencies in the phenomena under study. The paper discusses the features of applying the selected tools to a small set of data. It presents methods of data preparation and methods for calculating the performance of the tools, taking into account the specifics of databases with a small number of records. The author proposes selected techniques that helped to break deadlocks in calculations, i.e., cases in which the results were much worse than expected. The need to apply methods that improve the accuracy of forecasts and of classification arose from the small amount of analysed data. This paper is not a review of popular methods of machine learning and data mining; nevertheless, the collected and presented material will help the reader to shorten the path to obtaining satisfactory results when using the described computational methods.

2022 ◽  
Vol 5 (1) ◽  
Anne A. H. de Hond ◽  
Artuur M. Leeuwenberg ◽  
Lotty Hooft ◽  
Ilse M. J. Kant ◽  
Steven W. J. Nijman ◽  

Abstract: While the opportunities of ML and AI in healthcare are promising, the growth of complex data-driven prediction models requires careful quality and applicability assessment before they are applied and disseminated in daily practice. This scoping review aimed to identify actionable guidance for those closely involved in AI-based prediction model (AIPM) development, evaluation and implementation including software engineers, data scientists, and healthcare professionals and to identify potential gaps in this guidance. We performed a scoping review of the relevant literature providing guidance or quality criteria regarding the development, evaluation, and implementation of AIPMs using a comprehensive multi-stage screening strategy. PubMed, Web of Science, and the ACM Digital Library were searched, and AI experts were consulted. Topics were extracted from the identified literature and summarized across the six phases at the core of this review: (1) data preparation, (2) AIPM development, (3) AIPM validation, (4) software development, (5) AIPM impact assessment, and (6) AIPM implementation into daily healthcare practice. From 2683 unique hits, 72 relevant guidance documents were identified. Substantial guidance was found for data preparation, AIPM development and AIPM validation (phases 1–3), while later phases clearly have received less attention (software development, impact assessment and implementation) in the scientific literature. The six phases of the AIPM development, evaluation and implementation cycle provide a framework for responsible introduction of AI-based prediction models in healthcare. Additional domain and technology specific research may be necessary and more practical experience with implementing AIPMs is needed to support further guidance.

Mingyang Wang ◽  
Haosheng Ye ◽  
Xueliang Wang ◽  
Zhuyong Li ◽  
Jie Sheng ◽  

Abstract: The development of high temperature superconducting (HTS) conductors is leading to diverse structural designs of HTS cables. (RE)Ba2Cu3Ox (REBCO) tapes wound in a spiral geometry have become a popular compact HTS cable structure, which is at a critical stage of engineering production and application. However, the winding quality of REBCO tapes in spiral HTS cables is unstable because of the different winding methods in use, such as manual winding, device-assisted winding, or automatic winding. Although automatic winding will be the first choice for practical applications of spiral HTS cables, its winding quality is not yet monitored effectively. In this paper, we first discuss the possible influence of the winding quality on the critical current performance of spiral HTS cables. Then, an artificial intelligence (AI) based method is implemented to build a detection model for the winding quality. From image data preparation to AI detection and postprocessing, the detection model provides final results that show the winding intervals as a binary image. Judged by intuitive analysis and by the evaluation metrics, acceptable detection results are obtained for both erroneous and correct winding conditions, with better performance on the correct one. The identification of the winding intervals will help to determine the monitoring strategy for spiral HTS cable fabrication.

2022 ◽  
Vol 10 (1) ◽  
Meri Fitriani ◽  
Gigih Forda Nama ◽  
Mardiana Mardiana

Abstract - The UPT Perpustakaan Universitas Lampung (the library technical service unit of the University of Lampung) is a unit that operates in the library field. It offers two services, distinguished by their type of interaction: technical services and user services. The library currently holds 142,776 printed books. This study aims to find association rule patterns with data mining techniques, using RapidMiner 9.1 software to apply the Apriori algorithm. The research method is the Cross Industry Standard Process for Data Mining (CRISP-DM), with the stages business understanding, data understanding, data preparation, modelling, evaluation and deployment. The data used in this study are book-loan transactions from 2014 to 2017, with a total of 170,115 loan records. Association rule modelling with the Apriori algorithm, using a support value of 0.3 and a confidence value of 0.3, yielded the rule that borrowers of the book "Metodologi pengajaran bahasa" will also borrow "English for tourism : panduan berprofesi di dunia pariwisata", with support 1 and confidence 1. It is recommended that book purchases follow the patterns in the appended association results. Keywords: UPT Perpustakaan Universitas Lampung, Book Loan Data, Data Mining, Association Rule, CRISP-DM.
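The support and confidence thresholds mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' RapidMiner workflow, and it is not the full Apriori algorithm: it simply enumerates every pair of items by brute force, without Apriori's level-wise pruning. The `loans` transactions and the helper names (`support`, `confidence`, `pair_rules`) are hypothetical and exist only for illustration.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


def pair_rules(transactions, min_support, min_confidence):
    """Brute-force search for one-to-one rules (lhs -> rhs) above both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support(transactions, {lhs, rhs})
            c = confidence(transactions, {lhs}, {rhs})
            if s >= min_support and c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


# Hypothetical loan transactions: each set is the titles borrowed together.
loans = [
    frozenset({"A", "B"}),
    frozenset({"A", "B", "C"}),
    frozenset({"A", "C"}),
    frozenset({"B"}),
]

for lhs, rhs, s, c in pair_rules(loans, min_support=0.3, min_confidence=0.3):
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

A rule with confidence 1, such as `C -> A` in this toy data, means that every transaction containing the left-hand title also contains the right-hand one.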

2022 ◽  
Vol 6 ◽  
pp. 781-791
John Paul Miranda ◽  

Purpose – The dataset was collected to examine and identify possible key topics within these texts. Method – Data preparation such as data cleaning, transformation, tokenization, removal of stop words from both English and Filipino, and word stemming was applied to the dataset before feeding it to sentiment analysis and the LDA model. Results – The most frequently occurring word within the dataset is "development" and there are three (3) likely topics from the speeches of Philippine presidents: economic development, enhancement of public services, and addressing challenges. Conclusion – The dataset was able to provide valuable insights contained in official documents. The study showed that presidents have used their annual address to express their visions for the country. It also showed that the presidents from 1935 to 2016 faced the same problems during their terms. Recommendations – Future researchers may collect other speeches made by presidents during their terms and combine them with the dataset used in this study to further investigate these important texts by subjecting them to the same methodology used in this study. The dataset may be requested from the authors and it is recommended for further analysis, for example, to determine how the speeches of the president reflect the preamble or foundations of the Philippine constitution.
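The preparation steps named in this abstract (tokenization, stop-word removal, stemming) can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample of English and Filipino words, `crude_stem` is a naive suffix stripper rather than the stemmer the author used, the `[a-z]+` tokenizer ignores accented and hyphenated words, and the `speech` text is invented.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (English and Filipino).
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "for", "is",
              "ang", "ng", "sa", "at", "mga"}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prepare(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [crude_stem(w) for w in tokenize(text) if w not in STOP_WORDS]


# Invented sample sentence, not taken from the dataset.
speech = "The development of public services and the developments in the economy"
tokens = prepare(speech)
counts = Counter(tokens)
print(counts.most_common(1))
```

The resulting token list is the kind of input that a word-frequency count or a topic model such as LDA would consume.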

2022 ◽  
pp. 27-54
Richard V. McCarthy ◽  
Mary M. McCarthy ◽  
Wendy Ceccucci

Osval Antonio Montesinos López ◽  
Abelardo Montesinos López ◽  
Jose Crossa

Abstract: This data preparation chapter is of paramount importance for implementing statistical machine learning methods for genomic selection. We present the basic linear mixed model and explain how to decide when to use fixed or random effects, which give rise to best linear unbiased estimates (BLUE or BLUEs) and best linear unbiased predictors (BLUP or BLUPs), respectively. R code for fitting the linear mixed model is given in small examples. We emphasize tools for computing BLUEs and BLUPs for many linear combinations of interest in genomic-enabled prediction and plant breeding. We present tools for cleaning and imputing marker data, computing minor and major allele frequencies, recoding markers, and computing the frequency of heterozygous genotypes and the frequency of NAs, as well as three methods for computing the genomic relationship matrix. In addition, scaling and data compression of inputs are important in statistical machine learning. For a more extensive description of linear mixed models, see Chap. 10.1007/978-3-030-89010-0_5.
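A few of the marker summaries listed in this abstract can be sketched in a handful of lines. The chapter itself gives R code; the Python below is not taken from it. It assumes, purely for illustration, that each genotype is coded as 0, 1 or 2 copies of a counted allele (so 1 is the heterozygous class) and that missing calls (NAs) are stored as `None`. The function name `marker_summary` and the sample `marker` vector are hypothetical.

```python
def marker_summary(genotypes):
    """Summarise one marker coded 0/1/2 (allele copies), with None for missing."""
    observed = [g for g in genotypes if g is not None]
    n_missing = len(genotypes) - len(observed)
    if not observed:
        # Nothing was genotyped, so no frequency can be estimated.
        return {"na_freq": 1.0, "het_freq": None, "maf": None}
    # Each individual carries two alleles, hence the factor of 2.
    p = sum(observed) / (2 * len(observed))
    return {
        "na_freq": n_missing / len(genotypes),
        "het_freq": sum(1 for g in observed if g == 1) / len(observed),
        "maf": min(p, 1 - p),  # minor allele frequency
    }


# Invented genotype calls for eight individuals at a single marker.
marker = [0, 1, 2, 2, None, 1, 2, 2]
summary = marker_summary(marker)
print(summary)
```

Summaries like these are commonly used to filter markers, for example by dropping those with a high share of missing calls or a very low minor allele frequency, before any relationship matrix is computed.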

Oksana Andriivna Tatarinova ◽  
Dmytro Vasylovych Breslavsky

The paper presents the formulation of a two-dimensional problem of creep theory for the case of finite strains. A description of the foundations of the calculation method is presented. The method is based on the generalized Lagrange-Euler (ALE) approach, in which the boundary value problem in the current configuration of the solid is solved using FEM. A triangular element is used in the numerical modeling. At each stage of the creep calculations in the current configuration, the initial value problem is solved numerically using the finite difference method. Preprocessing data preparation is carried out in the in-house RD program, in which the two-dimensional model is surrounded by a mesh of special elements. This feature implements the ALE algorithm for the motion of material elements along the model. Examples of preprocessing, as well as of mesh rebuilding in the case of finite element transition, are given. Creep calculations are performed in the developed program, which is based on the FEM Creep software package, for the case of finite strains. A regular mesh is used for the calculations, which allows the use of an efficient algorithm for the transition between current configurations. The numerical results for the creep of specimens made from aluminum alloys are compared with experimental results and with calculated ones obtained by integrating the constitutive equations. It was concluded that for a material with a ductile type of fracture, the presented method and software make it possible to obtain results very close to the experimental ones using only the creep rate equation. Creep simulations of a material with a mixed brittle-ductile fracture type require an additional equation for the damage variable.

I. G. Fattakhov ◽  
L. S. Kuleshova ◽  
R. N. Bakhtizin ◽  
V. V. Mukhametshin ◽  

The purpose of the work is to substantiate and formulate the principles of data generation with multiple results of hydraulic fracturing (HF) modeling. Qualitative data for assessment, intercomparison and subsequent statistical analysis are characterized by a single numerical value for each considered hydraulic fracturing parameter. For a number of hydraulic fracturing technologies, uncertainty may arise due to obtaining several values for the parameter under consideration. The scientific novelty of the work lies in the substantiation of a new approach for evaluating the obtained data series during hydraulic fracturing modeling. A number of data can be obtained both during the formation and modeling of several hydraulic fractures, and for one fracture when calculating in different modules of the simulator. As a result, an integration technique was developed that allows forming a uniform data array regardless of the number of elements in the hydraulic fracturing modeling results. Keywords: hydraulic fracturing; acid-proppant hydraulic fracturing; hydraulic fracturing of layered rocks; hydraulic fracturing modeling; pseudo-three-dimensional fracture model; data preparation; statistical analysis.

AI & Society ◽  
2021 ◽  
Jan Kaiser ◽  
German Terrazas ◽  
Duncan McFarlane ◽  
Lavindra de Silva

Abstract: Machine learning (ML) is increasingly used to enhance production systems and meet the requirements of a rapidly evolving manufacturing environment. Compared to larger companies, however, small- and medium-sized enterprises (SMEs) lack resources, available data and skills, which impedes the potential adoption of analytics solutions. This paper proposes a preliminary yet general approach to identify low-cost analytics solutions for manufacturing SMEs, with particular emphasis on ML. The initial studies seem to suggest that, contrary to what is usually thought at first glance, SMEs seldom need digital solutions that use advanced ML algorithms which require extensive data preparation, laborious parameter tuning and a comprehensive understanding of the underlying problem. If an analytics solution does require learning capabilities, a ‘simple solution’, which we will characterise in this paper, should be sufficient.
