Minimum Database Determination and Preprocessing for Machine Learning

Author(s):  
Angel Fernando Kuri-Morales

The exploitation of large databases implies the investment of expensive resources, both in terms of storage and of processing time. Correctly assessing the data requires that preprocessing steps be taken before analysis; in particular, categorical data must be transformed by adequately encoding every instance of the categorical variables. The encoding must preserve the actual patterns in the data while avoiding the introduction of spurious ones. The authors discuss CESAMO, an algorithm which makes it possible to statistically identify such pattern-preserving codes. The resulting database is more economical and may encompass mixed (numerical and categorical) databases. They thus obtain an optimal transformed representation that is considerably more compact without impairing its informational content. To certify the equivalence of the original data set (FD) and the reduced one (RD), they apply an algorithm which relies on a multivariate regression algorithm (AA). Through the combined application of CESAMO and AA, the equivalent behavior of both FD and RD may be guaranteed with a high degree of statistical certainty.

2016, Vol 13 (4), pp. 1-18
Author(s):  
Angel Fernando Kuri-Morales

The exploitation of large databases frequently implies the investment of large and usually expensive resources, both in terms of storage and of the processing time required. It is possible to obtain equivalent reduced data sets in which the statistical information of the original data is preserved while redundant constituents are dispensed with; the physical embodiment of the relevant features of the database is therefore more economical. The author proposes a method to obtain an optimal transformed representation of the original data which is, in general, considerably more compact than the original without impairing its informational content. To certify the equivalence of the original data set (FD) and the reduced one (RD), the author applies an algorithm which relies on a Genetic Algorithm (GA) and a multivariate regression algorithm (AA). Through the combined application of GA and AA, the equivalent behavior of both FD and RD may be guaranteed with a high degree of statistical certainty.


1998, Vol 18 (7), pp. 4337-4346
Author(s):  
Vincent Colot ◽  
Vicki Haedens ◽  
Jean-Luc Rossignol

ABSTRACT Upon insertion, transposable elements can disrupt or alter gene function in various ways. Transposons moving through a cut-and-paste mechanism are, in addition, often mutagenic when excising, because repair of the empty site seldom restores the original sequence. The characterization of numerous excision events in many eukaryotes indicates that transposon excision from a given site can generate a high degree of DNA sequence and phenotypic variation. Whether such variation is generated randomly remains largely to be determined. To this end, we have exploited a well-characterized system of genetic instability in the fungus Ascobolus immersus to perform an extensive study of excision events. We show that this system, which produces many phenotypically and genetically distinct derivatives, results from the excision of a novel Ds-like transposon, Ascot-1, from the spore color gene b2. A unique set of 48 molecularly distinct excision products was readily identified from a representative sample of excision derivatives. Products varied in their frequency of occurrence over 4 orders of magnitude, yet most showed small palindromic nucleotide additions. Based on these and other observations, compelling evidence was obtained for intermediate hairpin formation during the excision reaction and for strong biases in the subsequent processing steps at the empty site. Factors likely to be involved in these biases suggest new parallels between the excision reaction performed by transposons of the hAT family and V(D)J recombination. An evaluation of the contribution of small palindromic nucleotide additions produced by transposon excision to the spectrum of spontaneous mutations is also presented.


2019, Vol 109 (01-02), pp. 24-29
pp. 24-29
Author(s):  
E. Abele ◽  
T. Scherer ◽  
F. Geßner ◽  
M. Weigold

Additive manufacturing processes are characterized by a high degree of design freedom, which enables the production of complex components. In view of the high manufacturing costs, the process reliability of downstream machining steps (such as thread production) is of great importance. This article presents the results of a series of investigations dealing with different approaches to thread production in steel components manufactured by selective laser melting.


2018, Vol 19 (12), pp. 3780
Author(s):  
Dingxuan He ◽  
Andrew Gichira ◽  
Zhizhong Li ◽  
John Nzei ◽  
Youhao Guo ◽  
...  

The order Nymphaeales, consisting of three families with a record of eight genera, has gained significant interest from botanists, probably due to its position as a basal angiosperm. The phylogenetic relationships within the order have been well studied; however, a few controversial nodes still remain in the Nymphaeaceae. The position of the genus Nuphar and the monophyly of the family Nymphaeaceae remain uncertain. This study adds to the increasing number of completely sequenced plastid genomes of the Nymphaeales and applies a large chloroplast gene data set to reconstruct the intergeneric relationships within the Nymphaeaceae. Five complete chloroplast genomes were newly generated, including the first for the monotypic genus Euryale. Using a set of 66 protein-coding genes from the chloroplast genomes of 17 taxa, the phylogenetic position of Nuphar was determined and a monophyletic Nymphaeaceae was obtained with convincing statistical support from both partitioned and unpartitioned data schemes. Although genomic comparative analyses revealed a high degree of synteny among the chloroplast genomes of these ancient angiosperms, key minor variations were evident, particularly in the contraction/expansion of the inverted-repeat regions and in RNA-editing events. Genome structure, gene content, and gene arrangement were highly conserved among the chloroplast genomes. The intergeneric relationships defined in this study are congruent with those inferred using morphological data.


2018, Vol 35 (8), pp. 1508-1518
Author(s):  
Rosembergue Pereira Souza ◽  
Luiz Fernando Rust da Costa Carmo ◽  
Luci Pirmez

Purpose – The purpose of this paper is to present a procedure for finding unusual patterns in accredited tests using a rapid processing method for analyzing video records. The procedure uses the temporal differencing technique for object tracking and considers only frames not identified as statistically redundant.

Design/methodology/approach – An accreditation organization is responsible for accrediting facilities to undertake testing and calibration activities. Periodically, such organizations evaluate accredited testing facilities. These evaluations could use video records and photographs of the tests performed by the facility to judge their conformity to technical requirements. To validate the proposed procedure, a real-world data set with video records from accredited testing facilities in the field of vehicle safety in Brazil was used. The processing time of the proposed procedure was compared with the time needed to process the video records in a traditional fashion.

Findings – With an appropriate threshold value, the proposed procedure could successfully identify video records of fraudulent services. Processing time was faster than when a traditional method was employed.

Originality/value – Manually evaluating video records is time-consuming and tedious. This paper proposes a procedure to rapidly find unusual patterns in videos of accredited tests with a minimum of manual effort.
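The frame-filtering step described above can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' implementation: a frame is kept only if its mean absolute pixel difference from the last retained frame exceeds a threshold, and all other frames are treated as statistically redundant and skipped.

```python
def filter_redundant_frames(frames, threshold=5.0):
    """Return indices of frames worth analyzing.

    A frame is kept only when its mean absolute pixel difference from
    the last retained frame exceeds `threshold`; intermediate frames
    are treated as statistically redundant and skipped.
    """
    kept, last = [], None
    for idx, frame in enumerate(frames):
        if last is None or sum(abs(p - q) for p, q in zip(frame, last)) / len(frame) > threshold:
            kept.append(idx)
            last = frame
    return kept

# Synthetic "video": flat grayscale frames with one abrupt scene change.
frames = [[0] * 16] * 5 + [[200] * 16] * 5
kept = filter_redundant_frames(frames)  # → [0, 5]
```

As in the paper, the threshold is the critical tuning knob: set too low, redundant frames slip through and processing slows down; set too high, genuine events may be discarded.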


2011, pp. 24-32
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables, yet data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for the clustering, analysis and visualization of mixed (continuous/binary) data. The weights and the prototypes are learned simultaneously, ensuring an optimized clustering of the data: the higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is thus combined with a weighting process over the different variables, computing weights which influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
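The effect of such variable weights can be illustrated with a minimal weighted-distance sketch (the weights and prototypes below are hypothetical, not the paper's learning rule): the higher a variable's weight, the more it dominates the assignment of an observation to its best-matching prototype.

```python
import math

def weighted_distance(x, prototype, weights):
    # Each variable's contribution to the distance is scaled by its
    # weight, so informative (high-weight) variables dominate.
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, prototype)))

def best_matching_unit(x, prototypes, weights):
    # Index of the prototype with the smallest weighted distance to x.
    return min(range(len(prototypes)),
               key=lambda k: weighted_distance(x, prototypes[k], weights))

prototypes = [[0.0, 0.0], [1.0, 1.0]]
x = [1.0, 0.1]  # far from prototype 0 on variable 0, close on variable 1

# With variable 1 weighted heavily, x is assigned to prototype 0;
# with equal weights, variable 0 pulls it to prototype 1.
skewed = best_matching_unit(x, prototypes, [0.01, 1.0])   # → 0
uniform = best_matching_unit(x, prototypes, [1.0, 1.0])   # → 1
```

In the actual method the weights are not fixed by hand but learned jointly with the prototypes during training.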


2014, Vol 14 (1), pp. 351
Author(s):  
Jennifer Martínez Ferrero ◽  
Beatriz Cuadrado Ballesteros ◽  
Marco Antonio Figueiredo Milani Filho

According to Dechow and Dichev (2002) and Lin and Wu (2014), a high degree of earnings management (EM) is associated with poor information quality. It is therefore possible to assume that the financial data of companies that manage earnings can present patterns different from those of companies with a low degree of EM. The aim of this exploratory study is to test whether a financial data set (operating expenses) of companies with a high degree of EM presents bias. For this analysis, we used the Kothari model and the modified Jones model ("Dechow model" hereafter) to estimate the degree of EM, and we used the logarithmic distribution of digits predicted by Benford's Law to detect abnormal patterns in number sets. The sample was composed of 845 international listed non-financial companies for the year 2010. To analyze the discrepancies between the actual and expected frequencies of the significant digits, two statistics were calculated: the Z-test and Pearson's chi-square test. The results show that, with a confidence level of 90%, the companies with a high degree of EM according to the Kothari model presented a distribution similar to that predicted by Benford's Law, suggesting that, in a preliminary analysis, their financial data are free from bias. On the other hand, the data set of the organizations that manage earnings according to the Dechow model presented abnormal patterns. Benford's Law has been successfully used to detect manipulated data. These results offer insights into the interactions between EM and patterns of financial data, and stimulate new comparative studies on the accuracy of models to estimate EM.

Keywords: Earnings management (EM). Financial Reporting Quality (FRQ). Benford's Law.
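The digit test applied in the study can be sketched as follows: Benford's Law gives the expected first-digit probabilities P(d) = log10(1 + 1/d), and a Pearson chi-square statistic over the observed leading-digit counts is compared with the critical value of a chi-square distribution with 8 degrees of freedom (15.51 at the 5% level). The sample below is illustrative, not the paper's data, and the Z-test half of the analysis is omitted.

```python
import math
from collections import Counter

def benford_expected():
    # Benford's Law: P(d) = log10(1 + 1/d) for leading digits 1..9.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def chi_square_stat(values):
    # Pearson chi-square of the observed leading-digit counts against
    # the Benford expectation (8 degrees of freedom).
    counts = Counter(int(str(abs(v)).lstrip("0.")[0]) for v in values if v)
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in benford_expected().items())

# Every value in 100..199 has leading digit 1, so the statistic lands
# far above the 5% critical value and the Benford hypothesis is rejected.
stat = chi_square_stat(list(range(100, 200)))
```

By contrast, a classic Benford-conforming sequence such as the powers of 2 yields a statistic well below the critical value.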


Genetika, 2014, Vol 46 (2), pp. 545-559
Author(s):  
Mirjana Jankulovska ◽  
Sonja Ivanovska ◽  
Ana Marjanovic-Jeromela ◽  
Snjezana Bolaric ◽  
Ljupcho Jankuloski ◽  
...  

In this study, the use of different multivariate approaches to classify rapeseed genotypes based on quantitative traits is presented. Tree regression analysis, principal component analysis (PCA) and two-way cluster analysis were applied in order to describe and understand the extent of genetic variability in spring rapeseed genotype-by-trait data. The traits which highly influenced seed and oil yield in rapeseed were successfully identified by the tree regression analysis. The principal predictor for both response variables was the number of pods per plant (NP). NP and 1000-seed weight could help in the selection of high-yielding genotypes, and high values for both traits together with oil content could lead to high oil-yielding genotypes. These traits may serve as indirect selection criteria and can lead to improvement of seed and oil yield in rapeseed. The quantitative traits that explained most of the variability in the studied germplasm were classified using principal component analysis. In this data set, five PCs were identified, of which the first three explained 63% of the total variance. This helped in facilitating the choice of variables on which the clustering of the genotypes could be based. The two-way cluster analysis simultaneously clustered genotypes and quantitative traits; the final number of clusters was determined using a bootstrapping technique. This approach provided a clear overview of the variability of the analyzed genotypes, since genotypes with similar performance regarding the studied traits can easily be detected on the heatmap. The genotypes grouped in clusters 1 and 8 had high values for seed and oil yield and a relatively short vegetative growth period, while those in cluster 9 combined moderate to low values for vegetative growth duration with moderate to high seed and oil yield. These genotypes should be further exploited and implemented in the rapeseed breeding program. The combined application of these multivariate methods can assist in deciding how, and based on which traits, to select genotypes, especially in early generations at the beginning of a breeding program.
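The PCA step reported above (five PCs, the first three explaining 63% of the variance) can be sketched with a small synthetic example; the trait matrix below is hypothetical, not the study's measurements. PCA on standardized traits amounts to an eigen-decomposition of their correlation matrix, and the normalized eigenvalues give the proportion of variance each PC explains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trait matrix: 30 genotypes x 5 quantitative traits, with
# two strongly correlated traits so the first PC captures their shared
# variance.
base = rng.normal(size=(30, 1))
traits = np.hstack([
    base + 0.1 * rng.normal(size=(30, 1)),
    base + 0.1 * rng.normal(size=(30, 1)),
    rng.normal(size=(30, 3)),
])

# PCA on standardized traits = eigen-decomposition of the correlation
# matrix; the eigenvalues, sorted in descending order, give the
# variance explained by each principal component.
eigvals = np.linalg.eigvalsh(np.corrcoef(traits, rowvar=False))[::-1]
explained = eigvals / eigvals.sum()
```

Here the correlated pair loads the first PC, which accounts for a disproportionate share of the total variance; the same computation on the study's data yields the 63% figure for the first three PCs.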


Author(s):  
Mekides Assefa Abebe ◽  
Jon Yngve Hardeberg ◽  
Gunnar Vartdal

In recent years, smartphone-based colour imaging systems have been increasingly used for neonatal jaundice detection. These systems estimate bilirubin concentration levels from newborns' skin colour images and correlate them with total serum bilirubin (TSB) and transcutaneous bilirubinometry (TcB) measurements. However, the colour reproduction capacity of smartphone cameras is known to be influenced by various factors, including technological and acquisition-process variability. To make an accurate bilirubin estimation irrespective of the type of smartphone and the illumination conditions used to capture the newborns' skin images, an inclusive and complete model, or data set, which can represent all possible real-world acquisition scenarios needs to be utilized. Owing to the challenges of generating such a model or data set, some solutions tend towards the application of a reduced data set (designed for reference conditions and devices only) together with colour correction systems (for the transformation of other smartphones' skin images to the reference space). Such approaches make the bilirubin estimation methods highly dependent on the accuracy of the employed colour correction systems and on their capability to reduce device-to-device colour reproduction variability. However, state-of-the-art methods with similar methodologies have only been evaluated and validated on a single smartphone camera. The vulnerability of these systems to an incorrect jaundice diagnosis can only be shown with a thorough investigation of colour reproduction variability across an extended number of smartphones and illumination conditions. Accordingly, this work presents and discusses the results of such a broad investigation, covering seven smartphone cameras, ten light sources, and three different colour correction approaches. The overall results show statistically significant colour differences among devices even after colour correction is applied, and indicate that further analysis of the clinical significance of such differences is required for skin-colour-based jaundice diagnosis.


2018, pp. 1773-1791
Author(s):  
Prateek Pandey ◽  
Shishir Kumar ◽  
Sandeep Shrivastava

In recent years, there has been growing interest in time series forecasting, and a number of time series forecasting methods have been proposed by various researchers. A common trend among these methods, however, is that they all underperform on data sets that exhibit uneven ups and downs (turbulences). In this paper, a new method based on fuzzy time series (henceforth FTS) that forecasts on the basis of the turbulences in the data set is proposed. The results show that turbulence-based fuzzy time series forecasting is effective, especially when the available data indicate a high degree of instability. A few benchmark FTS methods are identified from the literature, their limitations and gaps are discussed, and it is observed that the proposed method successfully overcomes their deficiencies to produce better results. In order to validate the proposed model, a performance comparison with various conventional time series models is also presented.

