Machine Learning and data mining tools applied for databases of low number of records

2022 ◽  
Vol 21 (4) ◽  
pp. 346-363
Author(s):  
Hubert Anysz

The use of data mining and machine learning tools is becoming increasingly common. Their usefulness is most noticeable with large datasets, where the sought information or new relationships are extracted from information noise. As these tools develop, datasets with far fewer records, usually associated with specific phenomena, are also being explored. This specificity most often makes it impossible to increase the number of cases, which would otherwise facilitate the search for dependences in the phenomena under study. The paper discusses the features of applying the selected tools to a small set of data. It attempts to present methods of data preparation and methods for calculating the performance of the tools, taking into account the specifics of databases with a small number of records. The author proposes selected techniques that helped to break deadlocks in calculations, i.e., situations in which the results were much worse than expected. The need for methods that improve the accuracy of forecasts and of classification arose from the small amount of analysed data. This paper is not a review of popular methods of machine learning and data mining; nevertheless, the collected and presented material will help the reader to shorten the path to obtaining satisfactory results when using the described computational methods.
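One common way to estimate performance when records are scarce is leave-one-out cross-validation, in which every record is held out once and predicted from the remaining ones. The abstract does not say which evaluation scheme the paper uses, so the following is only a minimal sketch under that assumption; the dataset, the nearest-neighbour predictor, and the names `predict_1nn` and `leave_one_out_mae` are hypothetical.

```python
from statistics import mean


def predict_1nn(train, x):
    """Return the target of the training record whose feature is closest to x."""
    nearest = min(train, key=lambda rec: abs(rec[0] - x))
    return nearest[1]


def leave_one_out_mae(records, predict):
    """Mean absolute error when each record is held out once
    and predicted from all the other records."""
    errors = []
    for i, (x, y) in enumerate(records):
        train = records[:i] + records[i + 1:]  # every record except the i-th
        errors.append(abs(predict(train, x) - y))
    return mean(errors)


# Hypothetical small dataset of (feature, target) pairs, for illustration only.
records = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
loo_mae = leave_one_out_mae(records, predict_1nn)
print(f"leave-one-out MAE: {loo_mae:.2f}")
```

Because every record serves as a test case exactly once, no data are permanently set aside for testing, which matters most when the number of records cannot be increased.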

2020 ◽  
Vol 10 (19) ◽  
pp. 6683
Author(s):  
Andrea Murari ◽  
Emmanuele Peluso ◽  
Michele Lungaroni ◽  
Riccardo Rossi ◽  
Michela Gelfusa ◽  
...  

The inadequacies of basic physics models for disruption prediction have induced the community to increasingly rely on data mining tools. In the last decade, it has been shown how machine learning predictors can achieve a much better performance than those obtained with manually identified thresholds or empirical descriptions of the plasma stability limits. The main criticisms of these techniques focus therefore on two different but interrelated issues: poor “physics fidelity” and limited interpretability. Insufficient “physics fidelity” refers to the fact that the mathematical models of most data mining tools do not reflect the physics of the underlying phenomena. Moreover, they implement a black box approach to learning, which results in very poor interpretability of their outputs. To overcome or at least mitigate these limitations, a general methodology has been devised and tested, with the objective of combining the predictive capability of machine learning tools with the expression of the operational boundary in terms of traditional equations more suited to understanding the underlying physics. The proposed approach relies on the application of machine learning classifiers (such as Support Vector Machines or Classification Trees) and Symbolic Regression via Genetic Programming directly to experimental databases. The results are very encouraging. The obtained equations of the boundary between the safe and disruptive regions of the operational space present almost the same performance as the machine learning classifiers, based on completely independent learning techniques. Moreover, these models possess significantly better predictive power than traditional representations, such as the Hugill or the beta limit. More importantly, they are realistic and intuitive mathematical formulas, which are well suited to supporting theoretical understanding and to benchmarking empirical models. They can also be deployed easily and efficiently in real-time feedback systems.


Author(s):  
Soodeh Hosseini ◽  
Saman Rafiee Sardo

With the growth of data mining and machine learning approaches in recent years, many efforts have been made to generalize these sciences so that researchers from any field can easily utilize them. One of the most important of these efforts is the development of data mining tools that try to hide the complexities from researchers so that they can achieve a professional output with any level of knowledge. This paper reviews and compares data mining and machine learning tools, including WEKA, KNIME, Keel, Orange, Azure, IBM SPSS Modeler, R and Scikit-Learn, to show what approach each of these tools has taken to the complexities and problems that arise in different scenarios of generalizing data mining and machine learning. In addition, for a more detailed review, this paper examines the challenge of network intrusion detection in two tools: KNIME, with its graphical interface, and Scikit-Learn, with its coding environment.


2019 ◽  
Vol 55 (2) ◽  
pp. 621-651 ◽  
Author(s):  
Amit Bubna ◽  
Sanjiv R. Das ◽  
Nagpurnanand Prabhala

Although venture capitalists (VCs) can choose from thousands of potential syndicate partners, many co-syndicate with small groups of preferred partners. We term these groups “VC communities.” We apply computational methods from the physical sciences to 3 decades of syndication data to identify these communities. We find that communities comprise VCs that are similar in age, connectedness, and functional style but undifferentiated in spatial location. Machine-learning tools classify communities into 3 groups roughly ordered by their age and reach. Community VC financing is associated with faster maturation and greater innovation, especially for early-stage firms without an innovation history.


Author(s):  
Zdravko Pecar ◽  
Ivan Bratko

The aim of this research was to study the performance of 58 Slovenian administrative districts (state government offices at local level), to identify the factors that affect the performance, and how these effects interact. The main idea was to analyze the available statistical data relevant to the performance of the administrative districts with machine learning tools for data mining, and to extract from available data clear relations between various parameters of administrative districts and their performance. The authors introduced the concept of basic unit of administrative service, which enables the measurement of an administrative district’s performance. The main data mining tool used in this study was the method of regression tree induction. This method can handle numeric and discrete data, and has the benefit of providing clear insight into the relations between the parameters in the system, thereby facilitating the interpretation of the results of data mining. The authors investigated various relations between the parameters in their domain, for example, how the performance of an administrative district depends on the trends in the number of applications, employees’ level of professional qualification, etc. In the chapter, they report on a variety of (occasionally surprising) findings extracted from the data, and discuss how these findings can be used to improve decisions in managing administrative districts.
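The core step of regression tree induction is choosing, for a numeric attribute, the threshold that best separates the records into two groups with similar target values. The sketch below illustrates that single step by minimising the summed squared error of the two groups; it is not the authors' implementation, and the data, attribute names, and the functions `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the split x <= threshold
    that minimises the summed squared error of the two groups."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: number of employees and a performance score per district.
employees = [3, 4, 5, 10, 11, 12]
performance = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
threshold, total = best_split(employees, performance)
print(f"split at employees <= {threshold}, total SSE = {total:.3f}")
```

A full tree applies the same search recursively to each resulting group and across all attributes; the resulting thresholds are what make the model readable as explicit rules.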


2017 ◽  
Vol 27 (09n10) ◽  
pp. 1579-1589 ◽  
Author(s):  
Reinier Morejón ◽  
Marx Viana ◽  
Carlos Lucena

Data mining is a hot topic that attracts researchers from different areas, such as databases, machine learning, and agent-oriented software engineering. As a consequence of the growth of data volume, there is an increasing need to obtain knowledge from large datasets that are very difficult to handle and process with traditional methods. Software agents can play a significant role by performing data mining processes more efficiently. For instance, they can perform selection, extraction, preprocessing, and integration of data, as well as parallel, distributed, or multisource mining. This paper proposes a framework based on multiagent systems to apply data mining techniques to health datasets. The usage scenarios are datasets for hypothyroidism and diabetes, and we run two different mining processes in parallel on each database.


Drug Safety ◽  
2003 ◽  
Vol 26 (5) ◽  
pp. 363-364 ◽  
Author(s):  
David E Lilienfeld ◽  
Savian Nicholas ◽  
Daniel J Macneil ◽  
Olga Kurjatkin ◽  
Thomas Gelardin

Nowadays, Data Mining is used everywhere to extract information from data and, in turn, to acquire knowledge for decision making. Data Mining analyzes patterns, which are used to extract information and knowledge for making decisions. Many open source and licensed tools, such as Weka, RapidMiner, KNIME, and Orange, are available for Data Mining and predictive analysis. This paper discusses the different tools available for Data Mining and Machine Learning, followed by a description of these tools and their pros and cons. The article provides details of the techniques supported by Data Mining and Machine Learning tools, such as classification, regression, characterization, discretization, clustering, visualization and feature selection. It will help readers make decisions efficiently and suggests which tool is suitable for their requirements.


An intrusion detection system is software that monitors a single computer or a network of computers for malicious activities aimed at stealing or censoring information or corrupting network protocols. Most techniques used in present-day intrusion detection systems are unable to deal with the dynamic and complex nature of cyber attacks on computer networks. Efficient adaptive methods, such as various machine learning techniques, can nevertheless achieve higher detection rates, lower false alarm rates, and reasonable computation and communication costs. Data mining can provide frequent pattern mining, classification, clustering, and mini data streams. This survey paper describes a focused literature survey of machine learning and data mining methods for cyber analytics in support of intrusion detection. Based on the number of citations or the relevance of an emerging method, papers representing each method were identified, read, and summarized. Because data are so important in machine learning and data mining approaches, some well-known cyber data sets used in machine learning and data mining are described, and some recommendations on when to use a given method are provided.

