Construction of a Century Solar Chromosphere Data Set for Solar Activity Related Research

2017 ◽  
Vol 3 (2) ◽  
pp. 5-8
Author(s):  
Lin Ganghua ◽  
Wang Xiao Fan ◽  
Yang Xiao ◽  
...  

This article introduces our ongoing project “Construction of a Century Solar Chromosphere Data Set for Solar Activity Related Research”. Solar activity is the major source of the space weather that affects human lives. Serious space weather consequences include, for instance, the interruption of space communication and navigation, compromised safety of astronauts and satellites, and damage to power grids. Solar activity research therefore has both scientific and social impact. The main database is built from digitized and standardized film data obtained by several observatories around the world and covers a timespan of more than 100 years. After careful calibration, we will develop feature extraction and data mining tools and provide them, together with the comprehensive database, to the astronomical community. Our final goal is to address several physical issues: filament behavior over solar cycles, the abnormal behavior of solar cycle 24, large-scale solar eruptions, and sympathetic remote brightenings. Significant progress is expected in data mining algorithms and software development, which will benefit the scientific analysis and eventually advance our understanding of solar cycles.


2019 ◽  
Vol 109 (4) ◽  
pp. 1451-1468 ◽  
Author(s):  
Clara E. Yoon ◽  
Karianne J. Bergen ◽  
Kexin Rong ◽  
Hashem Elezabi ◽  
William L. Ellsworth ◽  
...  

Abstract Seismic networks have continuously recorded ground motion for up to decades. A blind, uninformed search for similar waveforms within these continuous data can detect small earthquakes missing from earthquake catalogs, yet doing so with naive approaches is computationally infeasible. We present results from an improved version of the Fingerprint And Similarity Thresholding (FAST) algorithm, an unsupervised data-mining approach to earthquake detection, now available as open-source software. We use FAST to search for small earthquakes in 6–11 yr of continuous data from 27 channels over an 11-station local seismic network near the Diablo Canyon nuclear power plant in central California. FAST detected 4554 earthquakes in this data set, with a 7.5% false detection rate: 4134 of the detected events were previously cataloged earthquakes located across California, and 420 were new local earthquake detections with magnitudes −0.3 ≤ ML ≤ 2.4, of which 224 events were located near the seismic network. Although seismicity rates are low, this study confirms that nearby faults are active. This example shows how seismology can leverage recent advances in data-mining algorithms, along with improved computing power, to extract useful additional earthquake information from long-duration continuous data sets.
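The fingerprint-and-similarity-thresholding idea the abstract describes can be sketched as follows. This is a hedged toy illustration, not the released FAST code: the real algorithm uses wavelet-based fingerprints and locality-sensitive hashing, whereas here windows are binarized by the sign of successive differences and compared pairwise with a Jaccard-similarity threshold.

```python
# Toy sketch of fingerprint-and-similarity thresholding (illustrative only):
# non-overlapping windows of a signal are reduced to compact binary
# fingerprints, and window pairs whose fingerprints exceed a similarity
# threshold are flagged as candidate repeating events.

def fingerprint(window):
    """Binarize a window by the sign of successive differences."""
    return tuple(1 if b > a else 0 for a, b in zip(window, window[1:]))

def jaccard(fp1, fp2):
    """Jaccard similarity between two equal-length binary fingerprints."""
    inter = sum(1 for a, b in zip(fp1, fp2) if a == b == 1)
    union = sum(1 for a, b in zip(fp1, fp2) if a == 1 or b == 1)
    return inter / union if union else 0.0

def find_similar_pairs(signal, width, threshold):
    """Return index pairs of non-overlapping windows with similar fingerprints."""
    windows = [signal[i:i + width]
               for i in range(0, len(signal) - width + 1, width)]
    fps = [fingerprint(w) for w in windows]
    return [(i, j)
            for i in range(len(fps))
            for j in range(i + 1, len(fps))
            if jaccard(fps[i], fps[j]) >= threshold]
```

A repeated waveform buried between dissimilar segments is recovered as a high-similarity window pair, which is the core of catalog-free event detection; FAST's contribution is making this pairwise search tractable via hashing rather than the quadratic loop shown here.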


2013 ◽  
Vol 6 (1) ◽  
pp. 237-241
Author(s):  
Shaina Dhingra ◽  
Rimple Gilhotra ◽  
Ravishanker Ravishanker

With the increasing demand for IT and the subsequent growth of this sector, high-dimensional data came into existence. Data mining plays an important role in analyzing such data and extracting useful information; the key information extracted from a huge pool of data is valuable for decision makers. Clustering, one of the techniques of data mining, is among the most widely used methods of analyzing data. In this paper, the Kohonen SOM, K-Means, and HAC approaches are discussed. These three methods are then used to analyze an academic data set of the faculty members of a particular university. Finally, a comparative analysis of these algorithms is performed against parameters such as the number of clusters, error rate, and accessing rate. This work also presents new and improved results on large-scale datasets.
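Of the three methods compared, K-Means is the simplest to state, and a minimal sketch clarifies what the comparison measures. This is a generic textbook K-Means, not the specific configuration used in the paper; the data and parameter choices below are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as its cluster mean, for a fixed iteration budget."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl))
                     if cl else centroids[i]    # keep old centroid if empty
                     for i, cl in enumerate(clusters)]
    return centroids, clusters
```

SOM and HAC differ chiefly in how they form groups (a trained neuron grid versus iterative pairwise merging), which is why the paper's parameters, number of clusters, error rate, and run time, give a meaningful basis for comparison.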


2020 ◽  
Author(s):  
Isha Sood ◽  
Varsha Sharma

Essentially, data mining concerns the computation of data and the identification of patterns and trends in the information so that we can make decisions or judgments. Data mining concepts have been in use for years, but with the emergence of big data they have become even more common. In particular, the scalable mining of such large data sets is a difficult problem that has attracted several recent efforts. A few of these recent works use the MapReduce methodology to construct data mining models across the data set. In this article, we examine current approaches to large-scale data mining and compare their output to the MapReduce model. Based on our research, a system for data mining that combines MapReduce and sampling is implemented and discussed.
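The MapReduce-plus-sampling combination mentioned at the end can be sketched in a few lines. This is an in-memory simulation under illustrative assumptions, not the authors' distributed system: a mapper emits key-value pairs, a shuffle step groups them by key, a reducer aggregates each group, and sampling shrinks the input before the map phase.

```python
import random
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, collecting emitted (key, value) pairs."""
    return [kv for rec in records for kv in mapper(rec)]

def shuffle(pairs):
    """Group values by key, as the MapReduce shuffle stage does."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    """Aggregate each key's value list with the reducer."""
    return {k: reducer(vs) for k, vs in groups.items()}

def sampled_mapreduce(records, mapper, reducer, fraction, seed=0):
    """Run map/shuffle/reduce on a random sample of the input."""
    rng = random.Random(seed)
    sample = [r for r in records if rng.random() < fraction]
    return reduce_phase(shuffle(map_phase(sample, mapper)), reducer)
```

For example, counting item occurrences across transaction baskets uses `mapper = lambda basket: [(item, 1) for item in basket]` and `reducer = sum`; with `fraction < 1.0` the counts become estimates, which is the accuracy/cost trade-off sampling introduces.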


Author(s):  
Barak Chizi ◽  
Lior Rokach ◽  
Oded Maimon

Dimensionality (i.e., the number of data set attributes or groups of attributes) constitutes a serious obstacle to the efficiency of most data mining algorithms (Maimon and Last, 2000). The main reason for this is that data mining algorithms are computationally intensive. This obstacle is sometimes known as the “curse of dimensionality” (Bellman, 1961). The objective of feature selection is to identify the important features in the data set and discard every other feature as irrelevant or redundant information. Since feature selection reduces the dimensionality of the data, it allows data mining algorithms to operate faster and more effectively. In some cases the performance of the data mining method improves as a result of feature selection, mainly because of the more compact, easily interpreted representation of the target concept. There are three main approaches to feature selection: wrapper, filter, and embedded. The filter approach (Kohavi, 1995; Kohavi and John, 1996) operates independently of the data mining method employed subsequently: undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. A sub-category of filter methods, referred to as rankers, comprises methods that employ some criterion to score each feature and provide a ranking; from this ordering, several feature subsets can be chosen by manually setting a cut-off point. The wrapper approach (Kohavi, 1995; Kohavi and John, 1996) uses an inducer as a black box, along with a statistical re-sampling technique such as cross-validation, to select the best feature subset according to some predictive measure. The embedded approach (see, for instance, Guyon and Elisseeff, 2003) is similar to the wrapper approach in that the features are selected for a specific inducer, but it selects the features during the process of learning.
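A ranker, as described above, can be sketched concretely. This is a minimal filter-style example under an assumed scoring criterion (absolute Pearson correlation with the class); it scores each feature independently of any learner, which is exactly what distinguishes the filter approach from wrappers.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(rows, labels):
    """Filter-style ranker: score every feature by |correlation with the
    class| and return feature indices sorted from most to least relevant."""
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        scores.append((abs(pearson(col, labels)), j))
    return [j for _, j in sorted(scores, reverse=True)]

def select_top_k(rows, labels, k):
    """Keep only the k highest-ranked features (the manual cut-off point)."""
    keep = set(rank_features(rows, labels)[:k])
    return [[v for j, v in enumerate(r) if j in keep] for r in rows]
```

A wrapper would instead retrain the target inducer on candidate subsets and keep whichever subset cross-validates best, which is more faithful to the final model but far more expensive than this single pass over the features.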


2018 ◽  
Vol 7 (4.36) ◽  
pp. 845 ◽  
Author(s):  
K. Kavitha ◽  
K. Rohini ◽  
G. Suseendran

Data mining is the process during which knowledge is extracted from interesting patterns recognized in large amounts of data. It is one of the knowledge-exploration areas widely used in the field of computer science. Data mining is an inter-disciplinary area with great impact on various other fields, such as data analytics in business organizations, medical forecasting and diagnosis, market analysis, and statistical analysis and forecasting. It takes multiple forms, such as text mining, web mining, visual mining, spatial mining, knowledge mining, and distributed mining. In general, the process of data mining involves many tasks, beginning with pre-processing; the actual mining starts after the pre-processing is done. This work deals with the analysis and comparison of various data mining algorithms, particularly Meta classifiers, based on performance and accuracy. The work lies in the medical domain, using lung function test report data along with smoking data; this medical data set was created from raw data obtained from a hospital. In this paper, we have analyzed the performance of Meta classifiers for classifying these records. Initially, the performances of Meta and Rule classifiers were analyzed, and it was observed that the Meta classifiers are more efficient than the Rule classifiers in the Weka tool. The implementation then continued with a performance comparison between different types of classification algorithms, among which the Meta classifiers showed comparatively higher accuracy. The four Meta classifier algorithms widely explored in the Weka tool, namely Bagging, Attribute Selected Classifier, LogitBoost, and Classification via Regression, are used to classify this medical dataset, and the results obtained have been evaluated and compared to recognize the best among the classifiers.
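Of the four Meta classifiers named, Bagging is the easiest to illustrate: train several copies of a weak base learner on bootstrap resamples and combine them by majority vote. The sketch below is a generic illustration, not Weka's implementation; the base learner here is an assumed one-feature decision stump.

```python
import random
from collections import Counter

def train_stump(rows, labels):
    """Pick the (feature, threshold) split with the fewest training errors;
    each side of the split predicts its majority label."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            pl = Counter(left).most_common(1)[0][0] if left else labels[0]
            pr = Counter(right).most_common(1)[0][0] if right else labels[0]
            errs = sum(1 for r, y in zip(rows, labels)
                       if (pl if r[j] <= t else pr) != y)
            if best is None or errs < best[0]:
                best = (errs, j, t, pl, pr)
    _, j, t, pl, pr = best
    return lambda r: pl if r[j] <= t else pr

def bagging(rows, labels, n_models=11, seed=0):
    """Train stumps on bootstrap resamples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        models.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    return lambda r: Counter(m(r) for m in models).most_common(1)[0][0]
```

LogitBoost and Classification via Regression follow the same Meta pattern of wrapping base learners, but reweight or transform the problem between rounds instead of voting over independent resamples.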


2020 ◽  
Vol 63 (9) ◽  
pp. 3019-3035
Author(s):  
Courtney E. Walters ◽  
Rachana Nitin ◽  
Katherine Margulis ◽  
Olivia Boorom ◽  
Daniel E. Gustavson ◽  
...  

Purpose Data mining algorithms using electronic health records (EHRs) are useful in large-scale population-wide studies to classify etiology and comorbidities (Casey et al., 2016). Here, we apply this approach to developmental language disorder (DLD), a prevalent communication disorder whose risk factors and epidemiology remain largely undiscovered. Method We first created a reliable system for manually identifying DLD in EHRs based on speech-language pathologist (SLP) diagnostic expertise. We then developed and validated an automated algorithmic procedure, called the Automated Phenotyping Tool for identifying DLD cases in health systems data (APT-DLD), that assigns a DLD status to patients within EHRs on the basis of ICD (International Statistical Classification of Diseases and Related Health Problems) codes. APT-DLD was validated in a discovery sample (N = 973) using expert SLP manual phenotype coding as a gold-standard comparison and then applied and further validated in a replication sample of N = 13,652 EHRs. Results In the discovery sample, the APT-DLD algorithm correctly classified 98% (concordance) of DLD cases relative to manually coded records in the training set, indicating that APT-DLD successfully mimics a comprehensive chart review. The output of APT-DLD was also validated in relation to independently conducted SLP clinician coding in a subset of records, with a positive predictive value of 95% of cases correctly classified as DLD. We also applied APT-DLD to the replication sample, where it achieved a positive predictive value of 90% in relation to SLP clinician classification of DLD. Conclusions APT-DLD is a reliable, valid, and scalable tool for identifying DLD cohorts in EHRs. This new method has promising public health implications for future large-scale epidemiological investigations of DLD and may inform EHR data mining algorithms for other communication disorders. Supplemental Material https://doi.org/10.23641/asha.12753578
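The general shape of ICD-code-based phenotyping, and of the positive-predictive-value check reported above, can be sketched briefly. This is a hypothetical illustration: the inclusion/exclusion code sets and any exclusion logic are placeholders, not the actual APT-DLD rules.

```python
# Hypothetical sketch of code-based phenotyping: a record is flagged as a
# case when its ICD codes hit an inclusion set and miss an exclusion set,
# and the classifier is scored against gold-standard labels by PPV.
# The code sets below are placeholders, not the APT-DLD specification.

INCLUDE = {"F80.1", "F80.2"}   # placeholder inclusion codes
EXCLUDE = {"F84.0"}            # placeholder exclusion codes

def classify(record_codes):
    """Flag a record as a case from its set of ICD codes."""
    codes = set(record_codes)
    return bool(codes & INCLUDE) and not (codes & EXCLUDE)

def positive_predictive_value(records, gold):
    """Fraction of algorithm-flagged cases that the gold standard confirms."""
    flagged = [i for i, r in enumerate(records) if classify(r)]
    if not flagged:
        return 0.0
    return sum(gold[i] for i in flagged) / len(flagged)
```

The reported 95% and 90% figures are exactly this statistic, computed with SLP clinician coding as `gold` on the discovery and replication samples respectively.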


2017 ◽  
Vol 9 (1) ◽  
pp. 50-58
Author(s):  
Ali Bayır ◽  
Sebnem Ozdemir ◽  
Sevinç Gülseçen

Political elections can be viewed as systems shaped by political tendencies and by voters' perceptions and preferences. The outputs of those systems are formed by specific attributes of individuals such as age, gender, occupation, educational status, socio-economic status, religious belief, etc. Those attributes can form a data set containing hidden information and undiscovered patterns that can be revealed using data mining methods and techniques. The main purpose of this study is to characterize voting tendencies in politics using data mining methods. To that end, survey results prepared and collected before the 2011 elections in Turkey by the KONDA Research and Consultancy Company were used as the raw data set. After preprocessing the data, models were generated via data mining algorithms such as Gini, the C4.5 decision tree, Naive Bayes, and Random Forest. Owing to their increasing popularity and flexibility in the analysis process, the R language and the RStudio environment were used.


2018 ◽  
Vol 37 (2) ◽  
pp. 87-102 ◽  
Author(s):  
Li Zhao ◽  
Chao Min

With the advent of modern cognitive computing technologies, fashion informatics researchers contribute to the academic and professional discussion about how a large-scale data set is able to reshape the fashion industry. Data-mining-based social network analysis is a promising area of fashion informatics to investigate relations and information flow among fashion units. By adopting this pragmatic approach, we provide dynamic network visualizations of the case of Paris Fashion Week. Three time periods were researched to monitor the formulation and mobilization of social media users’ discussions of the event. Initial textual data on social media were crawled, converted, calculated, and visualized by Python and Gephi. The most influential nodes (hashtags) that function as junctions and the distinct hashtag communities were identified and represented visually as graphs. The relations between the contextual clusters and the role of junctions in linking these clusters were investigated and interpreted.


2018 ◽  
Vol 7 (3.4) ◽  
pp. 13
Author(s):  
Gourav Bathla ◽  
Himanshu Aggarwal ◽  
Rinkle Rani

Data mining is one of the most researched fields in computer science. Several studies have been carried out to extract and analyse important information from raw data. Traditional data mining algorithms, such as classification, clustering, and statistical analysis, can process small-scale data with great efficiency and accuracy. Social networking interactions, business transactions, and other communications result in Big data: data at a scale beyond the competency of traditional data mining techniques. It has been observed that traditional data mining algorithms are not capable of storing and processing data at this scale, and where some algorithms are capable, their response time is very high. Big data contains hidden information which, if analysed in an intelligent manner, can be highly beneficial for business organizations. In this paper, we analyse the advancement from traditional data mining algorithms to Big data mining algorithms. Applications of traditional data mining algorithms can be incorporated straightforwardly into Big data mining algorithms. Several studies have compared traditional data mining with Big data mining, but very few have analysed the most important algorithms within one research work, which is the core motive of our paper: readers can easily observe the differences between these algorithms, along with their pros and cons. Mathematical concepts are applied in data mining algorithms; real examples include means and Euclidean distance calculation in K-means, vector operations and margins in SVM, and Bayes' theorem and conditional probability in the Naïve Bayes algorithm. Classification and clustering are the most important applications of data mining. In this paper, the K-means, SVM, and Naïve Bayes algorithms are analysed in detail to observe their accuracy and response time from both conceptual and empirical perspectives. Big data technologies such as Hadoop and MapReduce are used to implement the Big data mining algorithms. Performance evaluation metrics such as speedup, scaleup, and response time are used to compare traditional mining with Big data mining.
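The two mathematical building blocks the abstract names, Euclidean distance in K-means and Bayes' theorem in Naïve Bayes, are small enough to state directly; the sketch below is a generic illustration with made-up numbers, not the paper's implementation.

```python
from math import sqrt

def euclidean(p, q):
    """Distance used by k-means when assigning a point to its nearest mean."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def bayes_posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(class | x) = P(x | class) * P(class) / P(x),
    the quantity Naive Bayes maximizes over classes."""
    return likelihood * prior / evidence
```

Naïve Bayes gets its name from additionally assuming the features are conditionally independent given the class, so the likelihood factors into a product of per-feature terms.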

