Massive Data Sets
Recently Published Documents

Total documents: 99 (five years: 19)
H-index: 15 (five years: 2)

2022 ◽  
pp. 41-67
Author(s):  
Vo Ngoc Phu ◽  
Vo Thi Ngoc Tran

Machine learning (ML), neural networks (NNs), evolutionary algorithms (EAs), and fuzzy systems (FSs) have been prominent areas of computer science for many years and have been applied across many different domains, contributing substantially to the growth of large corporations and organizations. These corporations and organizations generate enormous volumes of information, and the resulting massive data sets (MDSs) pose challenges for many commercial applications and research efforts. Consequently, many ML, NN, EA, and FS algorithms have been developed to handle such big data sets (BDSs) successfully. To support this work, this chapter surveys the NN algorithms applicable to large-scale data sets (LSDSs). Finally, the authors present a novel NN model for BDSs in both a sequential environment (SE) and a distributed network environment (DNE).
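The chapter's specific NN model is not detailed in this abstract, but the sequential-versus-distributed distinction it draws can be illustrated with a minimal data-parallel sketch: shard the data across workers, let each worker compute a local gradient, and average at a master. Everything below (the linear single-layer model, shard counts, learning rate) is an invented stand-in, not the authors' algorithm.

```python
# Hypothetical sketch: data-parallel training of a tiny one-layer model.
# This only illustrates the generic distributed pattern the chapter refers to.
import numpy as np

def gradient(w, X, y):
    """Least-squares gradient for a linear single-layer model."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train_distributed(X, y, n_workers=4, lr=0.1, steps=100):
    # Shard the data set across workers (simulated here in-process).
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Each worker computes a gradient on its shard; the master averages.
        grads = [gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
        w -= lr * np.mean(grads, axis=0)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.arange(5.0)
y = X @ w_true + 0.1 * rng.normal(size=1000)
print(train_distributed(X, y))  # approaches w_true
```

In a sequential environment the same update would simply run over the whole data set on one node; the distributed version trades communication per step for parallel gradient work.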


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 30-31
Author(s):  
Kathryn Kemper

Abstract Genomic selection has been implemented successfully in many livestock industries for genetic improvement. However, genomic selection provides limited insight into the genetic mechanisms underlying variation in complex traits. In contrast, human genetics has a focus on understanding genetic architecture and the origins of quantitative trait variation. This presentation will discuss a number of examples from human genetics which can inform our understanding of the nature of variation in complex traits. So-called ‘monogenic’ conditions, for example, are proving to have more complex genetic architecture than naïve expectations might suggest. Massive data sets of millions of people are also enabling longstanding questions to be addressed. Traits such as height, for example, are affected by a very large but finite number of loci. We can reconcile seemingly disparate heritability estimates from different experimental designs by accounting for assortative mating. The presentation will provide a brief update on current approaches to genomic prediction in human genetics and discuss the implications of these findings for understanding and predicting complex traits in livestock.


2020 ◽  
Vol 10 (6) ◽  
pp. 1343-1358
Author(s):  
Ernesto Iadanza ◽  
Rachele Fabbri ◽  
Džana Bašić-ČiČak ◽  
Amedeo Amedei ◽  
Jasminka Hasic Telalovic

Abstract This article aims to provide a thorough overview of the use of Artificial Intelligence (AI) techniques in studying the gut microbiota and its role in the diagnosis and treatment of some important diseases. The association between the microbiota and disease, together with its clinical relevance, remains difficult to interpret. Advances in AI techniques, such as Machine Learning (ML) and Deep Learning (DL), can help clinicians process and interpret these massive data sets. Two research groups, based in two different European cities, Florence and Sarajevo, were involved in this scoping review. The papers included in the review describe the use of ML or DL methods applied to the study of the human gut microbiota. In total, 1109 papers were considered; after elimination, a final set of 16 articles was included in the scoping review. The reviewed papers applied a variety of AI techniques: eleven evaluated only ML algorithms (ranging from one to eight algorithms applied to a single dataset), while the remaining five examined both ML and DL algorithms. The most frequently applied ML algorithm was Random Forest, which also exhibited the best performance.
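As a concrete illustration of the review's most common setup, here is a minimal sketch of a Random Forest classifier applied to microbial relative-abundance features with scikit-learn. The data are synthetic and the feature and label semantics are assumptions for illustration, not taken from any reviewed paper.

```python
# A minimal, hypothetical sketch: Random Forest on relative-abundance features.
# Synthetic data; column/label meanings are invented for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_samples, n_taxa = 200, 50
# Rows: subjects; columns: relative abundances of bacterial taxa (sum to 1).
X = rng.dirichlet(np.ones(n_taxa), size=n_samples)
y = rng.integers(0, 2, size=n_samples)  # 0 = healthy, 1 = disease (synthetic)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # cross-validated accuracy
print(scores.mean())
```

Random Forests are a natural fit here because abundance tables are wide, sparse, and compositional, and tree ensembles handle such features without heavy preprocessing.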


Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Abstract Dealing with the sheer size and complexity of today's massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck in such modern distributed computing environments is that some of the worker nodes may run slowly. These nodes, known as stragglers, can significantly slow down the computation, since the slowest node may dictate the overall computation time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, data set size, accuracy, computational load (or data redundancy), and straggler tolerance in this framework.
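The paper's encoding scheme is more general than shown here, but the core redundancy idea can be sketched with plain replication: store each data shard on r workers, and the exact full gradient is recoverable whenever every shard has at least one non-straggling replica. The shard counts, straggler rate, and least-squares objective below are illustrative assumptions, not the paper's construction.

```python
# A hypothetical toy of the redundancy idea behind encoded optimization:
# each shard is replicated on r workers, so the exact full gradient can be
# recovered as long as every shard has at least one fast (non-straggler) copy.
import numpy as np

rng = np.random.default_rng(1)
n_shards, r = 8, 2                       # redundancy factor r
X = rng.normal(size=(800, 3)); w = rng.normal(size=3)
y = X @ np.ones(3)
shards = list(zip(np.array_split(X, n_shards), np.array_split(y, n_shards)))

def shard_grad(i):
    Xs, ys = shards[i]
    return 2.0 * Xs.T @ (Xs @ w - ys) / len(ys)

# Worker j stores shards j, j+1, ..., j+r-1 (mod n_shards).
workers = [set((j + k) % n_shards for k in range(r)) for j in range(n_shards)]
fast = [j for j in range(n_shards) if rng.random() > 0.3]  # ~30% stragglers

covered = set().union(*(workers[j] for j in fast)) if fast else set()
if covered == set(range(n_shards)):
    # One responsive replica per shard suffices; equal shard sizes make the
    # mean of shard gradients equal to the full-data gradient.
    grad = np.mean([shard_grad(i) for i in range(n_shards)], axis=0)
    print("recovered exact gradient:", grad)
else:
    print("too many stragglers; missing shards:", set(range(n_shards)) - covered)
```

Raising r increases the computational load per worker but lets the master tolerate more stragglers, which is exactly the trade-off the paper characterizes.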


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Sathyaraj R ◽  
Ramanathan L ◽  
Lavanya K ◽  
Balasubramanian V ◽  
Saira Banu J

Purpose
Innovation in big data is advancing so rapidly that conventional software tools face serious problems in managing it. Moreover, the occurrence of imbalanced data in massive data sets is a major constraint for the research community.

Design/methodology/approach
The purpose of the paper is to introduce a big data classification technique using the MapReduce framework based on an optimization algorithm. Classification is enabled by the MapReduce framework, which utilizes the proposed optimization algorithm, named the chicken-based bacterial foraging (CBF) algorithm, generated by integrating the bacterial foraging optimization (BFO) algorithm with the cat swarm optimization (CSO) algorithm. The proposed model executes in two stages: a training phase and a testing phase. In the training phase, big data produced from different distributed sources is processed in parallel by the mappers, which perform preprocessing and feature selection based on the proposed CBF algorithm. The preprocessing step eliminates redundant and inconsistent data, while the feature selection step extracts the significant features from the preprocessed data to improve classification accuracy. The selected features are fed into the reducer for classification using a deep belief network (DBN) classifier, trained with the proposed CBF algorithm, so that the data are classified into various classes; at the end of training, the individual reducers output the trained models. Incremental data are thus handled effectively on the basis of the trained models. In the testing phase, incremental data are split into subsets and fed into different mappers for classification, each mapper holding a trained model from the training phase. After classification, the outputs of the mappers are fused and fed into the reducer for the final classification.

Findings
The maximum accuracy and Jaccard coefficient are obtained on the epileptic seizure recognition database. The proposed CBF-DBN achieves a maximal accuracy of 91.129%, compared with 82.894%, 86.184% and 86.512% for the existing neural network (NN), DBN and naive Bayes classifier-term frequency-inverse document frequency (NBC-TFIDF), respectively. The proposed CBF-DBN achieves a maximal Jaccard coefficient of 88.928%, compared with 75.891%, 79.850% and 81.103% for the existing NN, DBN and NBC-TFIDF, respectively.

Originality/value
This paper proposes a big data classification method for categorizing massive data sets under the constraints of huge data volumes. Classification is performed on the MapReduce framework in training and testing phases, so that the data are handled in parallel. In the training phase, the big data is partitioned into subsets and fed into the mappers, where feature selection extracts the significant features; the reducers then classify the data using these features with a DBN classifier trained by the proposed CBF algorithm, yielding the trained models. In the testing phase, the incremental data are split into subsets and fed into the mappers for classification using the trained models from the training phase; the classified results from each mapper are fused and fed into the reducer for the final classification of the big data. A structural sketch of this two-phase flow is given below.
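The sketch below shows only the mapper/reducer data flow described above. The paper's CBF optimizer and DBN classifier are replaced with simple stand-ins (a variance-based feature selector and logistic regression), so this is a shape of the pipeline under stated assumptions, not the proposed algorithm itself.

```python
# Structural sketch of the two-phase MapReduce flow; CBF and DBN are
# stood in for by VarianceThreshold and LogisticRegression.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression

def mapper(X_part, y_part):
    """Training mapper: preprocess and select features on one partition."""
    selector = VarianceThreshold(threshold=0.1).fit(X_part)
    return selector, X_part[:, selector.get_support()], y_part

def reducer(mapped):
    """Training reducer: fit one classifier per mapper's selected features."""
    return [(sel, LogisticRegression(max_iter=500).fit(Xs, ys))
            for sel, Xs, ys in mapped]

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 20)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
parts = zip(np.array_split(X, 3), np.array_split(y, 3))
models = reducer([mapper(Xp, yp) for Xp, yp in parts])

# Testing phase: each mapper applies a trained model to incremental data,
# and the reducer fuses the per-mapper predictions by majority vote.
X_new = rng.normal(size=(10, 20))
votes = np.array([m.predict(sel.transform(X_new)) for sel, m in models])
print((votes.mean(axis=0) > 0.5).astype(int))
```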


2020 ◽  
Vol 3 (2) ◽  
pp. 42
Author(s):  
Xiuzhang Yang ◽  
Shuai Wu ◽  
Huan Xia ◽  
Yuanbo Li ◽  
Xin Li

With the advent of the big data era and the development of smart campuses, universities are gradually moving toward digitalization, networking, and informatization. The campus card is an important part of smart campus construction, and the massive data it generates can indirectly reflect students' living conditions at school. How to quickly and accurately extract the information users need from these massive campus card data sets has become an urgent problem. This paper proposes a data mining algorithm based on K-Means clustering and time series analysis. It analyzes the card consumption data of students at one college to mine students' daily consumption habits in depth and to characterize specific consumption behaviors accurately. The proposed algorithm provides a practical reference for the construction of smart campuses in universities and has both theoretical and practical value.
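A minimal sketch of this kind of pipeline follows, assuming synthetic daily-spending records and invented time-series features (the paper's actual features are not specified in this abstract).

```python
# Hypothetical sketch: cluster students by simple time-series features of
# their campus-card spending. Data and feature choices are invented.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n_students, n_days = 300, 30
# Daily spending per student (synthetic stand-in for card records).
spend = rng.gamma(shape=2.0, scale=10.0, size=(n_students, n_days))

# Per-student features: mean daily spend, variability, and weekend share.
weekend = np.arange(n_days) % 7 >= 5
features = np.column_stack([
    spend.mean(axis=1),
    spend.std(axis=1),
    spend[:, weekend].sum(axis=1) / spend.sum(axis=1),
])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))  # students per consumption-behavior cluster
```

Each resulting cluster can then be inspected (for example, by its mean feature vector) to interpret it as a consumption-behavior profile.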


2020 ◽  
Vol 18 (1) ◽  
pp. 2-35
Author(s):  
Miodrag M. Lovric

The Jeffreys-Lindley paradox is the most frequently cited divergence between the frequentist and Bayesian approaches to statistical inference. It is embedded in the very foundations of statistics and divides frequentist and Bayesian inference in an irreconcilable way. This paradox is the Gordian knot of statistical inference and Data Science in the Zettabyte Era. If statistical science is to be ready for the revolution demanded by the challenges of massive data set analysis, the first step is to finally resolve this anomaly. For more than sixty years, the Jeffreys-Lindley paradox has been under active discussion and debate; many solutions have been proposed, none entirely satisfactory. The paradox and its extent have frequently been misunderstood by statisticians and non-statisticians alike. This paper aims to reassess the paradox, shed new light on it, and indicate how often it occurs in practice when dealing with big data.
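A standard numerical illustration of the paradox (not taken from this paper): fix the test statistic at z = 1.96, so the two-sided p-value stays at about 0.05, and let the sample size n grow. For H0: theta = 0 versus H1: theta ~ N(0, tau^2) with unit sampling variance, the Bayes factor in favor of H0 is BF01 = sqrt(1 + n*tau^2) * exp(-z^2 * n*tau^2 / (2 * (1 + n*tau^2))), which diverges with n; the Bayesian evidence increasingly favors the very null hypothesis the frequentist test keeps rejecting at the 5% level.

```python
# Jeffreys-Lindley paradox, numerically: fixed z (hence fixed ~0.05 p-value),
# growing n, Bayes factor BF01 for the point null against a N(0, tau^2) prior.
import math

z, tau2 = 1.96, 1.0
for n in [10, 100, 10_000, 1_000_000]:
    s = n * tau2
    bf01 = math.sqrt(1 + s) * math.exp(-0.5 * z**2 * s / (1 + s))
    print(f"n = {n:>9,}   p ~ 0.05   BF01 = {bf01:8.1f}")
```

For small n the Bayes factor mildly favors H1, but by n = 10,000 it already favors H0 by more than an order of magnitude, which is exactly the divergence the paradox names.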

