Using genetic algorithm for generating optimal data sets to automatic testing the program code

Author(s):  
K E Serdyukov ◽  
T V Avdeenko

In the present paper we propose an approach to automatic generation of test data sets based on the application of a genetic algorithm. We describe an original procedure for computing the weights of code operations, which are summed to form the fitness function. The ultimate objective of this fitness function is maximization of code coverage by the generated test data set. The idea behind the genetic algorithm approach is to first target the most complex branches of the program code in the fitness function. Once a branch has been covered, its weight is reset to zero in order to drive the search toward the remaining uncovered code and thereby ensure maximum code coverage. By tuning the algorithm, the automatic test data generation can be made to find parts of the program code that are most distant from each other, and thus a higher level of code coverage is attained. We give a detailed example illustrating the operation and advantages of the proposed approach and outline further improvements of the method.
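
The weighted-branch idea can be illustrated with a minimal sketch. The branch weights, the instrumentation stub and the mutation-only reproduction below are illustrative assumptions, not the authors' implementation; in the paper the weights are derived from the complexity of the code operations in each branch.

```python
import random

# Hypothetical weights assigned to program branches (assumption for illustration).
branch_weights = {"b1": 5.0, "b2": 12.0, "b3": 3.0, "b4": 8.0}

def covered_branches(test_input):
    """Placeholder for running the instrumented program under test and
    returning the set of branches exercised by the input (toy rule here)."""
    covered = {"b1"} if test_input > 0 else {"b3"}
    if test_input % 2 == 0:
        covered.add("b2")
    if test_input > 100:
        covered.add("b4")
    return covered

def fitness(test_input, weights):
    # Fitness is the sum of the weights of the branches covered by the input.
    return sum(weights[b] for b in covered_branches(test_input))

def evolve(weights, generations=50, pop_size=20):
    population = [random.randint(-1000, 1000) for _ in range(pop_size)]
    test_set = []
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, weights), reverse=True)
        best = population[0]
        if fitness(best, weights) > 0:
            test_set.append(best)
            # Reset the weights of newly covered branches to zero so the
            # search is pushed toward still-uncovered parts of the code.
            for b in covered_branches(best):
                weights[b] = 0.0
        # Simple mutation-only reproduction, kept short for the sketch.
        population = [best] + [p + random.randint(-50, 50) for p in population[:pop_size - 1]]
    return test_set

print(evolve(dict(branch_weights)))
```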

2021 ◽  
Author(s):  
David Cotton ◽  

Introduction
HYDROCOASTAL is a two-year project funded by ESA, with the objective to maximise exploitation of SAR and SARin altimeter measurements in the coastal zone and inland waters, by evaluating and implementing new approaches to process SAR and SARin data from CryoSat-2, and SAR altimeter data from Sentinel-3A and Sentinel-3B. Optical data from the Sentinel-2 MSI and Sentinel-3 OLCI instruments will also be used in generating River Discharge products.
New SAR and SARin processing algorithms for the coastal zone and inland waters will be developed, implemented and evaluated through an initial Test Data Set for selected regions. From the results of this evaluation a processing scheme will be implemented to generate global coastal zone and river discharge data sets.
A series of case studies will assess these products in terms of their scientific impacts.
All the produced data sets will be available on request to external researchers, and full descriptions of the processing algorithms will be provided.
Objectives
The scientific objectives of HYDROCOASTAL are to enhance our understanding of interactions between inland waters and the coastal zone, between the coastal zone and the open ocean, and the small-scale processes that govern these interactions. The project also aims to improve our capability to characterize the variation at different time scales of inland water storage, exchanges with the ocean, and the impact on regional sea-level changes.
The technical objectives are to develop and evaluate new SAR and SARin altimetry processing techniques in support of the scientific objectives, including stack processing, filtering and retracking. An improved Wet Troposphere Correction will also be developed and evaluated.
Project Outline
There are four tasks in the project:
- Scientific Review and Requirements Consolidation: review the current state of the art in SAR and SARin altimeter data processing as applied to the coastal zone and to inland waters.
- Implementation and Validation: new processing algorithms will be implemented to generate a Test Data Set, which will be validated against models, in-situ data and other satellite data sets. Selected algorithms will then be used to generate global coastal zone and river discharge data sets.
- Impacts Assessment: the impact of these global products will be assessed in a series of case studies.
- Outreach and Roadmap: outreach material will be prepared and distributed to engage the wider scientific community and to provide recommendations for the development of future missions and future research.
Presentation
The presentation will provide an overview of the project, present the different SAR altimeter processing algorithms that are being evaluated in the first phase of the project, and show early results from the evaluation of the initial test data set.


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data were associated with the number of ApoEϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
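
A hedged sketch of the data valuation step is given below. The synthetic data, the plain Monte Carlo approximation of data Shapley values and the 25th-percentile cutoff are illustrative assumptions, and a Random Forest stands in for the XGBoost model used in the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the volumetric MRI feature matrix and MCI-to-AD labels.
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

def data_shapley(X, y, X_val, y_val, n_perm=5, seed=0):
    """Monte Carlo approximation of data Shapley values: the average marginal
    contribution of each training subject to validation accuracy over random
    permutations of the training set (a simplified stand-in for LR data Shapley)."""
    rng = np.random.default_rng(seed)
    n, values = len(y), np.zeros(len(y))
    for _ in range(n_perm):
        perm, prev = rng.permutation(n), 0.5  # 0.5 = accuracy of an uninformed model
        for size in range(1, n + 1):
            idx = perm[:size]
            if len(np.unique(y[idx])) < 2:
                continue  # cannot fit a classifier yet; contribution counted as zero
            score = LogisticRegression(max_iter=1000).fit(X[idx], y[idx]).score(X_val, y_val)
            values[perm[size - 1]] += score - prev
            prev = score
    return values / n_perm

values = data_shapley(X_train, y_train, X_val, y_val)

# Exclude the lowest-valued (presumably noisy) subjects, then train the final model.
keep = values > np.percentile(values, 25)
clf = RandomForestClassifier(random_state=0).fit(X_train[keep], y_train[keep])
print("test accuracy after data valuation:", clf.score(X_test, y_test))
```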


Author(s):  
Gihong Kim ◽  
Bonghee Hong

The testing of RFID information services requires a test data set of business events comprising object, aggregation, quantity and transaction events. To generate business events, we need to address the performance issues in creating a large volume of event data. This paper proposes a new model for the tag life cycle and a fast generation algorithm for this model. We present the results of experiments with the generation algorithm, showing that it outperforms previous methods.
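
As a rough illustration only, the sketch below generates synthetic business events by walking each tag through a life cycle once rather than drawing independent events. The event fields, life-cycle stages and event-type assignment are assumptions for the sketch and are not the tag life cycle model or generation algorithm of the paper.

```python
import random
import uuid
from datetime import datetime, timedelta

EVENT_TYPES = ("ObjectEvent", "AggregationEvent", "QuantityEvent", "TransactionEvent")
STAGES = ("commission", "pack", "ship", "receive", "sell")  # illustrative life cycle

def generate_tag_events(n_tags=1000, start=datetime(2024, 1, 1)):
    """Generate events per tag by following its life cycle (illustrative only)."""
    events = []
    for _ in range(n_tags):
        epc = f"urn:epc:id:sgtin:{uuid.uuid4().hex[:12]}"  # hypothetical tag identifier
        t = start + timedelta(seconds=random.randint(0, 86400))
        for stage, etype in zip(STAGES, random.choices(EVENT_TYPES, k=len(STAGES))):
            t += timedelta(minutes=random.randint(1, 120))
            events.append({"type": etype, "epc": epc, "bizStep": stage,
                           "eventTime": t.isoformat()})
    return events

print(len(generate_tag_events(10)), "events generated")
```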


2013 ◽  
Vol 411-414 ◽  
pp. 1884-1893
Author(s):  
Yong Chun Cao ◽  
Ya Bin Shao ◽  
Shuang Liang Tian ◽  
Zheng Qi Cai

Because many clustering algorithms based on GAs suffer from degeneracy and easily fall into local optima, a novel dynamic genetic algorithm for clustering problems (DGA) is proposed. The algorithm adopts variable-length coding to represent individuals and performs the parallel crossover operation within subpopulations of individuals of the same length, which allows DGA-based clustering to explore the search space more effectively and to automatically obtain the proper number of clusters and the proper partition from a given data set. The algorithm uses a dynamic crossover probability and an adaptive mutation probability, which prevent the dynamic clustering algorithm from getting stuck in a local optimal solution. The clustering results of experiments on three artificial data sets and two real-life data sets show that the DGA algorithm delivers better performance and higher accuracy on clustering problems.
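
A minimal sketch of the variable-length coding and the same-length-subpopulation crossover is shown below. The fitness function, the probability schedule, the selection scheme and the mutation rule are illustrative assumptions and not the operators of the paper.

```python
import random
import numpy as np

def fitness(centres, data):
    """Negative total distance of each point to its nearest centre (higher is better)."""
    d = np.linalg.norm(data[:, None, :] - np.asarray(centres)[None, :, :], axis=2)
    return -d.min(axis=1).sum()

def crossover(a, b):
    # Parallel crossover is applied only between individuals of equal length.
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def dga_clustering(data, pop_size=30, k_range=(2, 6), generations=60):
    rng = np.random.default_rng(0)
    # Variable-length coding: an individual is a list of k cluster centres.
    pop = [[data[rng.integers(len(data))]
            for _ in range(rng.integers(k_range[0], k_range[1], endpoint=True))]
           for _ in range(pop_size)]
    for g in range(generations):
        pc = 0.9 - 0.5 * g / generations  # dynamic crossover probability (assumed schedule)
        children = []
        for k in range(k_range[0], k_range[1] + 1):
            sub = [ind for ind in pop if len(ind) == k]
            for a, b in zip(sub[::2], sub[1::2]):
                if random.random() < pc:
                    children.extend(crossover(a, b))
        # Keep the best pop_size individuals among parents and children.
        pop = sorted(pop + children, key=lambda ind: fitness(ind, data), reverse=True)[:pop_size]
        # Adaptive mutation: perturb one centre of each below-average individual.
        scores = [fitness(ind, data) for ind in pop]
        avg = sum(scores) / len(scores)
        for ind, s in zip(pop, scores):
            if s < avg:
                ind[random.randrange(len(ind))] = data[rng.integers(len(data))]
    return np.asarray(max(pop, key=lambda ind: fitness(ind, data)))

data = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
print("number of centres found:", dga_clustering(data).shape[0])
```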


Author(s):  
Yu Shi ◽  
Rolf D. Reitz

In a previous study (Shi, Y., and Reitz, R. D., 2008, “Assessment of Optimization Methodologies to Study the Effects of Bowl Geometry, Spray Targeting and Swirl Ratio for a Heavy-Duty Diesel Engine Operated at High-Load,” SAE Paper No. 2008-01-0949), the nondominated sorting genetic algorithm II (NSGA II) (Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T., 2002, “A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II,” IEEE Trans. Evol. Comput., 6, pp. 182–197) performed better than other popular multiobjective genetic algorithms (MOGAs) in engine optimization that sought optimal combinations of the piston bowl geometry, spray targeting, and swirl ratio. NSGA II is further studied in this paper using different niching strategies that are applied to the objective space and design space, which diversify the optimal objectives and design parameters, respectively. Convergence and diversity metrics are defined to assess the performance of NSGA II using the different niching strategies. It was found that the use of design niching achieved more diversified results with respect to design parameters, as expected. Regression was then conducted on the design data sets that were obtained from the optimizations with the two niching strategies. Four regression methods, including K-nearest neighbors (KN), kriging (KR), neural networks (NN), and radial basis functions (RBF), were compared. The results showed that the data set obtained from optimization with objective niching provided a better-fitted learning space for the regression methods. KN and KR outperformed the other two methods with respect to prediction accuracy. Furthermore, a log transformation of the objective space improved the prediction accuracy for the KN, KR, and NN methods, but not for the RBF method. The results indicate that it is appropriate to use a regression tool to partly replace the actual CFD evaluation tool in engine optimization designs using the genetic algorithm. This hybrid mode saves computational resources (processors) without losing optimal accuracy. A design of experiment (DoE) method (the optimal Latin hypercube method) was also used to generate a data set for the regression processes. However, the predicted results were much less reliable than the results that were learned using the dynamically increasing data sets from the NSGA II generations. Applying the dynamic learning strategy during the optimization processes allows computationally expensive CFD evaluations to be partly replaced by evaluations using the regression techniques. The present study demonstrates the feasibility of applying the hybrid mode to engine optimization problems, and the conclusions can also extend to other optimization studies (numerical or experimental) that feature time-consuming evaluations and have highly nonlinear objective spaces.
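
The hybrid mode, in which a regression surrogate trained on the accumulating generations partly replaces the expensive evaluator, can be sketched as below. The toy single-objective problem, the stand-in objective function and the pre-screening fraction are assumptions for illustration; K-nearest-neighbor regression is used because it was among the better-performing methods in the study.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def cfd_evaluate(x):
    """Stand-in for one expensive CFD evaluation of a design point (toy objective)."""
    return np.sum((x - 0.3) ** 2) + 0.1 * np.sin(10 * x).sum()

rng = np.random.default_rng(0)
X_seen, y_seen = [], []

for generation in range(20):
    candidates = rng.random((40, 5))  # designs proposed by the GA in this generation
    if len(y_seen) >= 50:
        # Hybrid mode: rank candidates with the surrogate trained on all previously
        # evaluated designs, and send only the most promising fraction to the
        # expensive evaluator (dynamically increasing learning data set).
        surrogate = KNeighborsRegressor(n_neighbors=5).fit(np.vstack(X_seen), y_seen)
        predicted = surrogate.predict(candidates)
        candidates = candidates[np.argsort(predicted)[:10]]
    for x in candidates:
        X_seen.append(x)
        y_seen.append(cfd_evaluate(x))

print("evaluations spent:", len(y_seen), "best objective:", min(y_seen))
```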


2013 ◽  
Vol 709 ◽  
pp. 616-619
Author(s):  
Jing Chen

This paper proposes a genetic algorithm-based method to generate test cases. The method derives the information needed for test case generation from state machine diagrams. Its distinguishing feature is achieving automation with fewer generated test cases. For the automatic generation of test data based on path coverage, the goal is to build a function that can reliably assess the generated test data and guide the genetic algorithm towards the target parameter values.
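
The paper does not spell out its fitness function; one common way to build such a function for path coverage is branch distance, sketched below under that assumption with a hypothetical path condition.

```python
# Branch distance for the hypothetical path condition "a > b and b == c":
# the smaller the distance, the closer an input is to driving execution
# down the target path.
def branch_distance(a, b, c):
    d_gt = 0.0 if a > b else (b - a) + 1.0  # distance for the predicate a > b
    d_eq = abs(b - c)                       # distance for the predicate b == c
    return d_gt + d_eq

def fitness(individual):
    a, b, c = individual
    # The genetic algorithm maximizes fitness, so the distance is inverted.
    return 1.0 / (1.0 + branch_distance(a, b, c))

print(fitness((5, 2, 2)), fitness((1, 9, 0)))  # the first input satisfies the path
```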


2021 ◽  
Vol 79 (1) ◽  
Author(s):  
Romana Haneef ◽  
Sofiane Kab ◽  
Rok Hrzic ◽  
Sonsoles Fuentes ◽  
Sandrine Fosse-Edorh ◽  
...  

Abstract Background The use of machine learning techniques is increasing in healthcare, which makes it possible to estimate and predict health outcomes from large administrative data sets more efficiently. The main objective of this study was to develop a generic machine learning (ML) algorithm to estimate the incidence of diabetes based on the number of reimbursements over the last 2 years. Methods We selected a final data set from a population-based epidemiological cohort (i.e., CONSTANCES) linked with the French National Health Database (i.e., SNDS). To develop this algorithm, we adopted a supervised ML approach. The following steps were performed: i. selection of the final data set, ii. target definition, iii. coding of variables for a given window of time, iv. splitting the final data into training and test data sets, v. variable selection, vi. model training, vii. validation of the model with the test data set, and viii. selection of the model. We used the area under the receiver operating characteristic curve (AUC) to select the best algorithm. Results The final data set used to develop the algorithm included 44,659 participants from CONSTANCES. Out of the 3468 variables from the SNDS linked to the CONSTANCES cohort that were coded, 23 variables were selected to train different algorithms. The final algorithm to estimate the incidence of diabetes was a Linear Discriminant Analysis model based on the number of reimbursements of selected variables related to biological tests, drugs, medical acts and hospitalization without a procedure over the last 2 years. This algorithm has a sensitivity of 62%, a specificity of 67% and an accuracy of 67% [95% CI: 0.66–0.68]. Conclusions Supervised ML is an innovative tool for the development of new methods to exploit large health administrative databases. In the context of the InfAct project, we have developed and applied for the first time a generic ML algorithm to estimate the incidence of diabetes for public health surveillance. The ML algorithm we have developed has moderate performance. The next step is to apply this algorithm to the SNDS to estimate the incidence of type 2 diabetes cases. More research is needed to apply various ML techniques to estimate the incidence of various health conditions.
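
A minimal sketch of the training and AUC-based model selection steps is shown below. The synthetic imbalanced data stands in for the 23 reimbursement-related variables, and the two candidate models are only examples; this is not the study's SNDS pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for reimbursement counts over the last 2 years
# (the study selected 23 variables; incident diabetes is the rare positive class).
X, y = make_classification(n_samples=5000, n_features=23, weights=[0.95, 0.05], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

# Train candidate algorithms and keep the one with the highest AUC on the test data set.
candidates = {
    "LDA": LinearDiscriminantAnalysis(),
    "LogReg": LogisticRegression(max_iter=1000),
}
aucs = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

best = max(aucs, key=aucs.get)
print(aucs, "selected:", best)
```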


Genes ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 778 ◽  
Author(s):  
Liu ◽  
Liu ◽  
Pan ◽  
Li ◽  
Yang ◽  
...  

For cancer diagnosis, many DNA methylation markers have been identified. However, few studies have tried to identify DNA methylation markers that diagnose diverse cancer types simultaneously, i.e., pan-cancers. In this study, we tried to identify DNA methylation markers to differentiate cancer samples from the respective normal samples in pan-cancers. We collected whole-genome methylation data of 27 cancer types containing 10,140 cancer samples and 3386 normal samples, and divided all samples into five data sets, including one training data set, one validation data set and three test data sets. We applied machine learning to identify DNA methylation markers and, specifically, constructed diagnostic prediction models by deep learning. We identified two categories of markers: 12 CpG markers and 13 promoter markers. Three of the 12 CpG markers and four of the 13 promoter markers are located at cancer-related genes. With the CpG markers, our model achieved an average sensitivity and specificity on the test data sets of 92.8% and 90.1%, respectively. For the promoter markers, the average sensitivity and specificity on the test data sets were 89.8% and 81.1%, respectively. Furthermore, in cell-free DNA methylation data of 163 prostate cancer samples, the CpG markers achieved a sensitivity of 100%, and the promoter markers achieved 92%. For both marker types, the specificity for normal whole blood was 100%. To conclude, we identified methylation markers to diagnose pan-cancers, which might be applied to liquid biopsy of cancers.
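
The reported performance figures are sensitivity and specificity on held-out test sets; the short sketch below shows how these two quantities are computed from a confusion matrix, using hypothetical predictions rather than the study's models.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predictions of a methylation-marker model on a test data set:
# 1 = cancer sample, 0 = normal sample.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # fraction of cancer samples correctly detected
specificity = tn / (tn + fp)  # fraction of normal samples correctly rejected
print(f"sensitivity={sensitivity:.1%}, specificity={specificity:.1%}")
```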


Author(s):  
Tushar ◽ 
Shibendu Shekhar Roy ◽  
Dilip Kumar Pratihar

Clustering is a powerful tool of data mining. A clustering method analyzes the pattern of a data set and groups the data into several clusters based on the similarity among the data points. Clusters may be either crisp or fuzzy in nature. The present chapter deals with clustering of some data sets using the Fuzzy C-Means (FCM) algorithm and the Entropy-based Fuzzy Clustering (EFC) algorithm. In the FCM algorithm, the nature and quality of the clusters depend on the pre-defined number of clusters, the level of cluster fuzziness and a threshold value utilized for obtaining the number of outliers (if any). On the other hand, the quality of the clusters obtained by the EFC algorithm depends on a constant used to establish the relationship between the distance and similarity of two data points, a threshold value of similarity and another threshold value used for determining the number of outliers. The clusters should ideally be distinct and at the same time compact in nature. Moreover, the number of outliers should be as small as possible. Thus, the above problem may be posed as an optimization problem, which will be solved using a Genetic Algorithm (GA). The best set of multi-dimensional clusters will be mapped into 2-D for visualization using a Self-Organizing Map (SOM).
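
A minimal sketch of the FCM step and of posing the parameter choice as an optimization problem is given below. A Xie-Beni-style compactness-to-separation index and a brute-force search over the number of clusters and the fuzziness level stand in for the chapter's GA-driven search; the EFC algorithm and SOM mapping are not reproduced.

```python
import numpy as np

def fuzzy_c_means(data, c, m=2.0, iters=100, seed=0):
    """Minimal Fuzzy C-Means: returns cluster centres and the membership matrix."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(data), c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(iters):
        um = u ** m
        centres = (um.T @ data) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2) + 1e-10
        u = 1.0 / (dist ** (2 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
    return centres, u

def compactness_to_separation(data, centres, u, m=2.0):
    """Xie-Beni-style index: lower means compact, well-separated clusters."""
    um = u ** m
    dist = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
    compact = (um * dist ** 2).sum()
    sep = min(np.linalg.norm(a - b) ** 2
              for i, a in enumerate(centres) for b in centres[i + 1:])
    return compact / (len(data) * sep)

data = np.vstack([np.random.randn(60, 2) + off for off in ([0, 0], [6, 0], [3, 6])])

# The chapter poses the choice of parameters as an optimization problem solved by a GA;
# a brute-force loop over a small (clusters, fuzziness) grid stands in for that search.
best = min(((c, m) for c in range(2, 6) for m in (1.5, 2.0, 2.5)),
           key=lambda cm: compactness_to_separation(
               data, *fuzzy_c_means(data, cm[0], cm[1]), m=cm[1]))
print("best (number of clusters, fuzziness level):", best)
```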

