Using "Blackbox" Algorithms Such as TreeNET and Random Forests for Data-Mining and for Finding Meaningful Patterns, Relationships and Outliers in Complex Ecological Data

2009 ◽  
pp. 65-84 ◽  
Author(s):  
Erica Craig ◽  
Falk Huettmann

The use of machine-learning algorithms capable of rapidly completing intensive computations may be an answer to processing the sheer volumes of highly complex data available to researchers in the field of ecology. In spite of this, less effective, simple linear, and highly labor-intensive techniques such as stepwise multiple regression remain widespread in the ecological community. Herein we describe the use of data-mining algorithms such as TreeNet and Random Forests (Salford Systems), which can rapidly and accurately identify meaningful patterns and relationships in subsets of data that carry various degrees of outliers and uncertainty. We use satellite data from a wintering Golden Eagle as an example application; judged by the consistency of the results, the resultant models are robust in spite of 30% faulty presence data. The authors believe that the implications of these findings are potentially far-reaching and that linking computational software with wildlife ecology and conservation management in an interdisciplinary framework can not only be a powerful tool but is crucial toward achieving sustainability.
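The robustness claim can be illustrated with a small sketch (synthetic data and a made-up habitat rule, not the study's eagle telemetry): a Random Forest is trained on presence/absence labels of which 30% have been deliberately flipped, then scored against the uncorrupted labels on held-out points.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical sketch: two environmental covariates and a simple habitat
# rule stand in for the study's real telemetry and GIS layers.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(800, 2))              # e.g. elevation, canopy cover
y_true = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # "true" presence/absence
flip = rng.random(800) < 0.30                     # corrupt 30% of the labels
y_noisy = np.where(flip, 1 - y_true, y_true)

# Train on the noisy labels, evaluate against clean labels on a held-out half.
model = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=0)
model.fit(X[:400], y_noisy[:400])
acc = (model.predict(X[400:]) == y_true[400:]).mean()
```

Because each tree sees a different bootstrap sample and the ensemble averages them, the 30% label noise is largely voted away, echoing the abstract's finding of robust models despite faulty presence data.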

BioResources ◽  
2021 ◽  
Vol 16 (3) ◽  
pp. 4891-4904
Author(s):  
Selahattin Bardak ◽  
Timucin Bardak ◽  
Hüseyin Peker ◽  
Eser Sözen ◽  
Yildiz Çabuk

Wood materials have been used for centuries in many products, such as furniture, stairs, windows, and doors. Various methods are used to adapt wood to ambient conditions; impregnation is a widely used method of wood preservation, and in terms of efficiency it is critical to optimize the impregnation parameters. Data-mining techniques can reduce cost and operational challenges in the wood industry through accurate prediction. In this study, three data-mining algorithms were applied to predict the bending strength of impregnated wood materials (Pinus sylvestris L. and Millettia laurentii). Models were created from real experimental data to examine the relationship between bending strength, diffusion time, vacuum duration, and wood type, based on decision tree (DT), random forest (RF), and Gaussian process (GP) algorithms. The highest bending strength was achieved with wenge (Millettia laurentii) wood under a 10-bar vacuum and a diffusion time of 25 min. The results showed that all three algorithms are suitable for predicting bending strength. The goodness of fit for the testing phase was 0.994, 0.986, and 0.989 for the DT, RF, and GP algorithms, respectively. Moreover, the importance of each attribute was determined for the algorithms.
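The DT / RF / GP comparison can be sketched in scikit-learn on synthetic data (the study's experimental measurements are not available here; the feature names and the linear strength rule below are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical data: predict bending strength from diffusion time (min),
# vacuum pressure (bar) and wood type (0 = Scots pine, 1 = wenge).
rng = np.random.default_rng(1)
n = 120
diffusion = rng.uniform(5, 25, n)
vacuum = rng.uniform(2, 10, n)
wood = rng.integers(0, 2, n)
strength = 60 + 1.2 * diffusion + 2.0 * vacuum + 15 * wood + rng.normal(0, 2, n)
X = np.column_stack([diffusion, vacuum, wood])

scores = {}
for name, model in [("DT", DecisionTreeRegressor(random_state=0)),
                    ("RF", RandomForestRegressor(n_estimators=100, random_state=0)),
                    ("GP", GaussianProcessRegressor(normalize_y=True))]:
    model.fit(X, strength)
    scores[name] = model.score(X, strength)  # goodness of fit (R^2)
```

As in the abstract, all three model families fit this kind of tabular process data closely; on real data one would score a held-out test set rather than the training set.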


2021 ◽  
Vol 297 ◽  
pp. 01032
Author(s):  
Harish Kumar ◽  
Anshal Prasad ◽  
Ninad Rane ◽  
Nilay Tamane ◽  
Anjali Yeole

Phishing is a common attack on credulous people that tricks them into disclosing their personal information. It is a type of cyber-crime in which fraudulent sites lure victims into giving up sensitive data. This paper deals with methods for detecting phishing websites by analyzing various features of URLs with machine-learning techniques. The experiments discussed here detect phishing websites based on lexical features, host properties, and page-importance properties. We consider various data-mining algorithms for evaluating these features in order to better understand the structure of URLs that spread phishing. To protect end users from visiting these sites, we can try to identify phishing URLs by analyzing their lexical and host-based features. A particular challenge in this domain is that criminals constantly devise new strategies to counter our defense measures; to succeed in this contest, we need machine-learning algorithms that continually adapt to new examples and features of phishing URLs.
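Lexical features of the kind the paper analyzes can be extracted with the standard library alone. The specific feature set below is an assumption (the abstract does not enumerate the paper's features); these are common choices in the phishing-detection literature:

```python
from urllib.parse import urlparse

def lexical_features(url: str) -> dict:
    """Extract simple lexical/host features often used to flag phishing URLs."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),                     # stacked subdomains
        "num_hyphens": host.count("-"),                  # brand-imitating hosts
        "has_at_symbol": "@" in url,                     # hides the real host
        "has_ip_host": host.replace(".", "").isdigit(),  # raw-IP hosts
        "num_params": parsed.query.count("="),
    }

# A typical (fabricated) phishing-style URL for illustration:
feats = lexical_features("http://secure-paypal.com.verify-account.ru/login?user=a@b")
```

A classifier would then be trained on these feature vectors, with labels from a curated corpus of known phishing and benign URLs.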


2009 ◽  
pp. 2000-2009
Author(s):  
J. J. Dolado ◽  
D. Rodríguez ◽  
J. Riquelme ◽  
F. Ferrer-Troyano ◽  
J. J. Cuadrado

One of the problems found in generic project databases, where the data is collected from different organizations, is the large disparity among instances. In this chapter, we characterize the database by selecting both attributes and instances so that project managers can gain a better global view of the data they manage. To achieve that, we first use data-mining algorithms to create clusters; from each cluster, instances are selected to obtain a final subset of the database. The result of the process is a smaller database that maintains the prediction capability, has fewer instances and attributes than the original, and yet allows us to produce better predictions.
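The cluster-then-select step can be sketched as follows (a minimal sketch with synthetic data; the chapter's exact clustering algorithm and selection rule are not specified, so plain K-means and a nearest-to-centroid rule are assumed here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical project database: 200 projects described by 5 attributes.
rng = np.random.default_rng(0)
projects = rng.normal(size=(200, 5))

# Step 1: cluster the database.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(projects)

# Step 2: keep one representative instance per cluster
# (the member closest to its cluster centroid).
reps = []
for k in range(10):
    members = np.where(kmeans.labels_ == k)[0]
    dists = np.linalg.norm(projects[members] - kmeans.cluster_centers_[k], axis=1)
    reps.append(members[np.argmin(dists)])

reduced = projects[reps]   # compact subset standing in for the full database
```

The reduced set preserves the overall structure of the database with far fewer instances, which is the property the chapter exploits for prediction.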


2015 ◽  
Vol 813-814 ◽  
pp. 1104-1113 ◽  
Author(s):  
A. Sumesh ◽  
Dinu Thomas Thekkuden ◽  
Binoy B. Nair ◽  
K. Rameshkumar ◽  
K. Mohandas

The quality of a weld depends on the welding parameters and the ambient conditions. Improper selection of welding process parameters is one of the main causes of weld defects. In this work, arc sound signals were captured during the welding of carbon steel plates, and statistical features of the sound signals were extracted during the welding process. Data-mining algorithms such as Naive Bayes, support vector machines, and neural networks were used to classify the weld conditions according to the features of the sound signal. Two weld conditions were considered in this study: good welds and defective welds (lack of fusion and burn-through). The classification efficiencies of the machine-learning algorithms were compared, and the neural network produced the best classification efficiency among the algorithms considered.
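The feature-extraction-plus-classification pipeline can be sketched on synthetic signals (the study's actual arc-sound recordings and feature list are unavailable, so a noisy sine wave and four common statistics are assumed; one of the paper's classifiers, Naive Bayes, is used):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def features(signal):
    """Simple statistical features of one sound window."""
    mu, sd = signal.mean(), signal.std()
    return [mu, sd, np.abs(signal).max(),
            ((signal - mu) ** 3).mean() / sd ** 3]   # skewness

# Synthetic "arc sound": good welds are cleaner, defects add broadband noise.
rng = np.random.default_rng(3)
X, y = [], []
for label, noise in [(0, 0.1), (1, 0.8)]:            # 0 = good, 1 = defect
    for _ in range(50):
        t = np.linspace(0, 1, 400)
        sig = np.sin(2 * np.pi * 60 * t) + rng.normal(0, noise, 400)
        X.append(features(sig))
        y.append(label)

clf = GaussianNB().fit(X, y)
acc = clf.score(X, y)
```

In the study the same pattern would be repeated for SVM and neural-network classifiers and their efficiencies compared on held-out data.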


Blood ◽  
2010 ◽  
Vol 116 (12) ◽  
pp. 2127-2133 ◽  
Author(s):  
Jessica A. Reese ◽  
Xiaoning Li ◽  
Manfred Hauben ◽  
Richard H. Aster ◽  
Daniel W. Bougie ◽  
...  

Abstract Drug-induced immune thrombocytopenia (DITP) is often suspected in patients with acute thrombocytopenia unexplained by other causes, but documenting that a drug is the cause of thrombocytopenia can be challenging. To provide a resource for diagnosis of DITP and for drug safety surveillance, we analyzed 3 distinct methods for identifying drugs that may cause thrombocytopenia. (1) Published case reports of DITP have described 253 drugs suspected of causing thrombocytopenia; using defined clinical criteria, 87 (34%) were identified with evidence that the drug caused thrombocytopenia. (2) Serum samples from patients with suspected DITP were tested for 202 drugs; drug-dependent, platelet-reactive antibodies were identified for 67 drugs (33%). (3) The Food and Drug Administration's Adverse Event Reporting System database was searched for drugs associated with thrombocytopenia by use of data mining algorithms; 1444 drugs had at least 1 report associated with thrombocytopenia, and 573 (40%) drugs demonstrated a statistically distinctive reporting association with thrombocytopenia. Among 1468 drugs suspected of causing thrombocytopenia, 102 were evaluated by all 3 methods, and 23 of these 102 drugs had evidence for an association with thrombocytopenia by all 3 methods. Multiple methods, each with a distinct perspective, can contribute to the identification of drugs that can cause thrombocytopenia.
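The abstract does not name the specific data-mining algorithm applied to the Adverse Event Reporting System; a common disproportionality statistic for this task is the reporting odds ratio (ROR), sketched here with hypothetical report counts:

```python
import math

def reporting_odds_ratio(a, b, c, d):
    """ROR from a 2x2 table of spontaneous reports:
       a = target drug & thrombocytopenia,  b = target drug & other events,
       c = other drugs & thrombocytopenia,  d = other drugs & other events.
       Returns the ROR and its 95% confidence interval."""
    ror = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(ror) - 1.96 * se)
    hi = math.exp(math.log(ror) + 1.96 * se)
    return ror, lo, hi

# Hypothetical counts for one drug, for illustration only.
ror, lo, hi = reporting_odds_ratio(30, 970, 200, 98800)
signal = lo > 1.0   # common screening rule: lower CI bound above 1
```

A "statistically distinctive reporting association," as described in the abstract, corresponds to a signal flag like the one above surviving whatever threshold the surveillance method sets.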


2020 ◽  
Vol 8 (6) ◽  
pp. 1973-1979

The performance of data-mining algorithms becomes a major concern as data volumes grow. Cluster analysis is an active and challenging research direction within data mining for complex data samples. DBSCAN is a density-based clustering algorithm with several advantages in numerous applications. However, DBSCAN has quadratic time complexity, making it impractical for realistic applications, particularly with large, complex data samples. Therefore, this paper recommends a hybrid approach that reduces the time complexity by exploring the core properties of DBSCAN in an initial stage using a genetic K-means partitioning algorithm. Experiments showed that the proposed hybrid approach obtains results competitive with the standard approach while drastically reducing computation time.
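The hybrid idea can be sketched as follows, under assumptions: the paper's genetic K-means variant is replaced by ordinary K-means, and the data is synthetic. The point of the partitioning stage is that DBSCAN's neighborhood queries then touch only one partition at a time instead of the whole dataset:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Four well-separated synthetic blobs of 150 points each.
rng = np.random.default_rng(7)
blobs = np.vstack([rng.normal(loc, 0.3, size=(150, 2))
                   for loc in ([0, 0], [5, 5], [0, 5], [5, 0])])

# Stage 1: coarse K-means partition (stand-in for genetic K-means).
parts = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(blobs)

# Stage 2: DBSCAN inside each partition; relabel clusters globally.
labels = np.full(len(blobs), -1)
next_id = 0
for p in range(4):
    idx = np.where(parts == p)[0]
    sub = DBSCAN(eps=0.5, min_samples=5).fit_predict(blobs[idx])
    for s in np.unique(sub):
        if s != -1:                  # -1 stays as DBSCAN's noise label
            labels[idx[sub == s]] = next_id
            next_id += 1
```

Each DBSCAN call now scans roughly n/k points, so the quadratic cost applies per partition rather than globally, which is where the paper's speedup comes from.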


Passer ◽  
2019 ◽  
Vol 3 (1) ◽  
pp. 174-179
Author(s):  
Noor Bahjat ◽  
Snwr Jamak

Cancer is a common disease that threatens the life of one in every three people. This dangerous disease urgently requires early detection and diagnosis. Recent progress in data-mining methods, such as classification, has demonstrated the value of applying machine-learning algorithms to large datasets. This paper mainly aims to utilise data-mining techniques to classify cancer datasets into blood cancer and non-blood cancer, based on pre-defined information and post-defined information obtained from blood tests and CT scans. The research was conducted using the WEKA data-mining tool with 10-fold cross-validation to evaluate and compare different classification algorithms, extract meaningful information from the dataset, and accurately identify the most suitable predictive model. The most suitable classifier, with the best ability to predict the cancerous dataset, was the multilayer perceptron, with an accuracy of 99.3967%.
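The paper's evaluation protocol (a multilayer perceptron under 10-fold cross-validation in WEKA) can be reproduced in outline with scikit-learn; since the paper's dataset is unavailable, a public cancer dataset stands in here:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Multilayer perceptron with 10-fold cross-validation, mirroring the
# WEKA protocol described in the abstract (stand-in data, assumed sizes).
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
scores = cross_val_score(clf, *load_breast_cancer(return_X_y=True), cv=10)
mean_acc = scores.mean()
```

Averaging the 10 fold accuracies gives the single headline figure of the kind the paper reports.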



Author(s):  
Balaji Rajagopalan ◽  
Ravi Krovi

Data mining is the process of sifting through the mass of organizational (internal and external) data to identify patterns critical for decision support. Successful implementation of the data mining effort requires a careful assessment of the various tools and algorithms available. The basic premise of this study is that machine-learning algorithms, which are assumption free, should outperform their traditional counterparts when mining business databases. The objective of this study is to test this proposition by investigating the performance of the algorithms for several scenarios. The scenarios are based on simulations designed to reflect the extent to which typical statistical assumptions are violated in the business domain. The results of the computational experiments support the proposition that machine learning algorithms generally outperform their statistical counterparts under certain conditions. These can be used as prescriptive guidelines for the applicability of data mining techniques.
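A minimal version of the kind of simulation the study describes: when the linearity assumption behind a classical model is violated (here, an XOR-style interaction), an assumption-free learner wins. This is an illustrative sketch, not the study's actual simulation design:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Labels depend on a pure interaction of the two features, so no single
# linear boundary can separate the classes.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # XOR of the two signs

acc_linear = LogisticRegression().fit(X, y).score(X, y)  # near chance level
acc_tree = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
```

The tree recovers the interaction that the linear model cannot express, which is the pattern the study's computational experiments generalize across scenarios.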

