Influence Maximization Algorithm Based on Reverse Reachable Set

Mapping Intimacies ◽

10.20944/preprints202102.0213.v1 ◽

2021 ◽

Author(s):

Gengxin Sun ◽

Chih-Cheng Chen

Keyword(s):

Time Complexity ◽

Large Scale ◽

Real Data ◽

Influence Maximization ◽

Reachable Set ◽

Data Sets ◽

Influence Propagation ◽

Independent Cascade Model ◽

Propagation Function ◽

Maximization Algorithms

Most existing influence maximization algorithms are not suitable for large-scale social networks because of their high time complexity or limited influence propagation range. Therefore, a D-RIS influence maximization algorithm is proposed, based on the independent cascade model and combined with reverse reachable set sampling. Under the premise that the influence propagation function satisfies monotonicity and submodularity, the D-RIS algorithm uses an automatic debugging method to determine the critical value of the number of reverse reachable sets, which not only obtains a better influence propagation range but also greatly reduces the time complexity. Experimental results on two real data sets, Slashdot and Epinions, show that the D-RIS algorithm is close to the CELF algorithm and higher than the RIS, HighDegree, LIR, and pBmH algorithms in influence propagation range. At the same time, it is significantly better than the CELF and RIS algorithms in running time, which indicates that the D-RIS algorithm is more suitable for large-scale social networks.


Influence Maximization Algorithm Based on Reverse Reachable Set

Mathematical Problems in Engineering ◽

10.1155/2021/5535843 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Gengxin Sun ◽

Chih-Cheng Chen

Keyword(s):

Time Complexity ◽

Large Scale ◽

Cost Effective ◽

Population Based ◽

Influence Maximization ◽

Reachable Set ◽

Forward Algorithm ◽

Influence Propagation ◽

Propagation Function ◽

Maximization Algorithms

Most existing influence maximization algorithms are not suitable for large-scale social networks because of their high time complexity or limited influence propagation range. Therefore, a D-RIS (dynamic-reverse reachable set) influence maximization algorithm is proposed, based on the independent cascade model and combined with reverse reachable set sampling. Under the premise that the influence propagation function satisfies monotonicity and submodularity, the D-RIS algorithm uses an automatic debugging method to determine the critical value of the number of reverse reachable sets, which not only obtains a better influence propagation range but also greatly reduces the time complexity. Experimental results on two real datasets, Slashdot and Epinions, show that the D-RIS algorithm is close to the CELF (cost-effective lazy-forward) algorithm and higher than the RIS, HighDegree, LIR, and pBmH (population-based metaheuristics) algorithms in influence propagation range. At the same time, it is significantly better than the CELF and RIS algorithms in running time, which indicates that the D-RIS algorithm is more suitable for large-scale social networks.
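
The abstract builds on reverse reachable (RR) set sampling under the independent cascade model. The following is a minimal sketch of plain RIS-style sampling followed by greedy seed selection, not the D-RIS algorithm itself: the abstract does not describe how D-RIS tunes the number of RR sets, so the sample count is simply fixed here. The function names, the toy graph, and the fixed count of 200 samples are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_reverse_adjacency(edges):
    """Map each node to the (source, probability) pairs of its incoming edges."""
    reverse = defaultdict(list)
    for source, target, prob in edges:
        reverse[target].append((source, prob))
    return reverse


def sample_rr_set(nodes, reverse, rng):
    """Sample one reverse reachable (RR) set under the independent cascade model.

    Start from a uniformly random root and walk edges backwards, keeping each
    incoming edge with its propagation probability.
    """
    root = rng.choice(nodes)
    rr_set = {root}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for source, prob in reverse.get(node, ()):
            if source not in rr_set and rng.random() < prob:
                rr_set.add(source)
                frontier.append(source)
    return rr_set


def select_seeds(rr_sets, k):
    """Greedy maximum coverage: pick up to k nodes that hit the most RR sets."""
    membership = defaultdict(set)
    for index, rr_set in enumerate(rr_sets):
        for node in rr_set:
            membership[node].add(index)
    covered = set()
    seeds = []
    for _ in range(k):
        best = max(membership, key=lambda n: len(membership[n] - covered), default=None)
        if best is None or not (membership[best] - covered):
            break  # nothing left to gain
        seeds.append(best)
        covered |= membership[best]
    return seeds, len(covered) / len(rr_sets)


# Hypothetical toy graph: node 0 influences every other node with probability 1.
edges = [(0, target, 1.0) for target in range(1, 6)]
nodes = list(range(6))
rng = random.Random(0)
reverse = build_reverse_adjacency(edges)
rr_sets = [sample_rr_set(nodes, reverse, rng) for _ in range(200)]  # fixed count (assumption)
seeds, coverage = select_seeds(rr_sets, k=1)
print(seeds, coverage)
```

In this scheme the fraction of RR sets covered by the seed set, multiplied by the number of nodes, serves as an estimate of the expected spread. The number of samples controls the trade-off between running time and estimate quality, which is the quantity the abstract says D-RIS determines automatically.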


Self-Adaptive K-Means Based on a Covering Algorithm

Complexity ◽

10.1155/2018/7698274 ◽

2018 ◽

Vol 2018 ◽

pp. 1-16 ◽

Cited By ~ 1

Author(s):

Yiwen Zhang ◽

Yuanyuan Zhou ◽

Xing Guo ◽

Jintao Wu ◽

Qiang He ◽

...

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Real Data ◽

Second Phase ◽

Data Sets ◽

Number Of Clusters ◽

Large Scale Data ◽

Long Time ◽

Two Phases ◽

Selection Of

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms the existing algorithms in accuracy and efficiency under both sequential and parallel conditions.
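
The second phase described above, the Lloyd iteration, can be sketched as follows. This is a minimal illustration under stated assumptions: the covering algorithm of the first phase is not specified in the abstract, so the initial centers are passed in by hand as a stand-in for its output. The names `lloyd` and `squared_distance` and the sample points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, max_iter=100):
    """Run Lloyd iterations from the given initial centers.

    Returns the final centers and the cluster label of every point.
    """
    centers = [tuple(map(float, c)) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
                )
            else:
                new_centers.append(old)  # keep the center of an empty cluster in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Stand-in for the first phase: these initial centers are chosen by hand (assumption).
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
initial_centers = [(1, 1), (9, 1)]
centers, labels = lloyd(points, initial_centers)
print(centers, labels)
```

Because the number of centers passed in fixes k, any initializer that returns a list of centers, including one that chooses k from the data, can be plugged in ahead of this loop.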


Large-Scale Data Classification Based on Ball Vector Machine

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.312.771 ◽

2013 ◽

Vol 312 ◽

pp. 771-776

Author(s):

Min Juan Zheng ◽

Guo Jian Cheng ◽

Fei Zhao

Keyword(s):

Quadratic Programming ◽

Programming Problem ◽

Time Complexity ◽

Large Scale ◽

Space Complexity ◽

Quadratic Programming Problem ◽

Support Vector ◽

Data Sets ◽

Standard Support Vector Machine ◽

Large Scale Data

The quadratic programming problem in the standard support vector machine (SVM) algorithm has high time and space complexity when solving large-scale problems, which becomes a bottleneck in SVM applications. The Ball Vector Machine (BVM) converts the quadratic programming problem of the traditional SVM into the minimum enclosing ball (MEB) problem. It obtains the solution of the quadratic program indirectly by solving the MEB problem, which significantly reduces the time and space complexity. The experiments show that, on five large-scale and high-dimensional data sets, the BVM and the standard SVM have comparable accuracy, but the BVM is faster and requires less space than the standard SVM.
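
To make the minimum enclosing ball problem concrete, here is a minimal sketch of an approximate MEB in ordinary Euclidean space, using the simple Badoiu-Clarkson update that repeatedly nudges the center toward the farthest point. It is not the BVM procedure from the paper, which the abstract does not detail; the function name, the iteration count, and the two sample points are assumptions.

```python
import math


def approximate_meb(points, iterations=1000):
    """Approximate the minimum enclosing ball with Badoiu-Clarkson updates.

    At step i the center moves a fraction 1 / (i + 1) of the way toward the
    point currently farthest from it.
    """
    center = list(map(float, points[0]))
    for i in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(center, p))
        center = [c + (q - c) / (i + 1) for c, q in zip(center, farthest)]
    # The radius is whatever is needed to enclose every point from this center.
    radius = max(math.dist(center, p) for p in points)
    return center, radius


# Two hypothetical points: the exact MEB has center (1, 0) and radius 1.
points = [(0.0, 0.0), (2.0, 0.0)]
center, radius = approximate_meb(points)
print(center, radius)
```

Each step only needs the farthest point from the current center, so memory use stays small and no quadratic program is solved, which is the kind of saving the abstract attributes to the MEB reformulation.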


Scalable influence maximization for independent cascade model in large-scale social networks

Data Mining and Knowledge Discovery ◽

10.1007/s10618-012-0262-1 ◽

2012 ◽

Vol 25 (3) ◽

pp. 545-576 ◽

Cited By ~ 144

Author(s):

Chi Wang ◽

Wei Chen ◽

Yajun Wang

Keyword(s):

Social Networks ◽

Large Scale ◽

Influence Maximization ◽

Cascade Model ◽

Independent Cascade Model ◽

Independent Cascade


A systematic comparison of chloroplast genome assembly tools

10.1101/665869 ◽

2019 ◽

Cited By ~ 3

Author(s):

Jan A Freudenthal ◽

Simon Pfaff ◽

Niklas Terhoeven ◽

Arthur Korte ◽

Markus J Ankenbrand ◽

...

Keyword(s):

Chloroplast Genome ◽

Large Scale ◽

Real Data ◽

Data Sets ◽

Sequencing Data ◽

Complete Chloroplast Genome ◽

Plastid Genomes ◽

Chloroplast Genomes ◽

Intracellular Organelles ◽

Large Scale Screening

Background: Chloroplasts are intracellular organelles that enable plants to conduct photosynthesis. They arose through the symbiotic integration of a prokaryotic cell into a eukaryotic host cell and still contain their own genomes with distinct genomic information. Plastid genomes accommodate essential genes and are regularly utilized in biotechnology or phylogenetics. Different assemblers that are able to assess the plastid genome have been developed. These assemblers often use data of whole genome sequencing experiments, which usually contain reads from the complete chloroplast genome. Results: The performance of different assembly tools has never been systematically compared. Here we present a benchmark of seven chloroplast assembly tools, capable of succeeding in more than 60% of known real data sets. Our results show significant differences between the tested assemblers in terms of generating whole chloroplast genome sequences and computational requirements. The examination of 105 data sets from species with unknown plastid genomes leads to the assembly of 20 novel chloroplast genomes. Conclusions: We create docker images for each tested tool that are freely available for the scientific community and ensure reproducibility of the analyses. These containers allow the analysis and screening of data sets for chloroplast genomes using standard computational infrastructure. Thus, large scale screening for chloroplasts within genomic sequencing data is feasible.


Machine Learning Based Teaching Quality Evaluation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.271-273.1451 ◽

2011 ◽

Vol 271-273 ◽

pp. 1451-1454

Author(s):

Gang Zhang ◽

Jian Yin ◽

Liang Lun Cheng ◽

Chun Ru Wang

Keyword(s):

Machine Learning ◽

Large Scale ◽

Quality Evaluation ◽

College Teaching ◽

Real Data ◽

Teaching Quality ◽

Data Sets ◽

Stable Model ◽

Learning Framework ◽

Artificial Neural Network (ANN)

Teaching quality is a key metric in evaluating the effect and ability of college teaching. In much of the previous literature, evaluation of this metric depends solely on the subjective judgment of a few experts based on their experience, which leads to false, biased, or unstable results. Moreover, purely human-based evaluation is expensive and difficult to extend to large scale. With the application of information technology, much information in college teaching is recorded and stored electronically, which provides the basis for computer-aided analysis. In this paper, we perform teaching quality evaluation within a machine learning framework, focusing on learning and modeling electronic information associated with quality of teaching, to obtain a stable model that describes the substantial principles of teaching quality. An Artificial Neural Network (ANN) is selected as the main model in this work. Experimental results on real data sets consisting of 4 subjects / 8 semesters show the effectiveness of the proposed method.


A systematic comparison of chloroplast genome assembly tools

Genome Biology ◽

10.1186/s13059-020-02153-6 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Jan A. Freudenthal ◽

Simon Pfaff ◽

Niklas Terhoeven ◽

Arthur Korte ◽

Markus J. Ankenbrand ◽

...

Keyword(s):

Chloroplast Genome ◽

Large Scale ◽

Real Data ◽

Data Sets ◽

Sequencing Data ◽

Complete Chloroplast Genome ◽

Plastid Genomes ◽

Chloroplast Genomes ◽

Intracellular Organelles ◽

Large Scale Screening

Background: Chloroplasts are intracellular organelles that enable plants to conduct photosynthesis. They arose through the symbiotic integration of a prokaryotic cell into a eukaryotic host cell and still contain their own genomes with distinct genomic information. Plastid genomes accommodate essential genes and are regularly utilized in biotechnology or phylogenetics. Different assemblers that are able to assess the plastid genome have been developed. These assemblers often use data of whole genome sequencing experiments, which usually contain reads from the complete chloroplast genome. Results: The performance of different assembly tools has never been systematically compared. Here, we present a benchmark of seven chloroplast assembly tools, capable of succeeding in more than 60% of known real data sets. Our results show significant differences between the tested assemblers in terms of generating whole chloroplast genome sequences and computational requirements. The examination of 105 data sets from species with unknown plastid genomes leads to the assembly of 20 novel chloroplast genomes. Conclusions: We create docker images for each tested tool that are freely available for the scientific community and ensure reproducibility of the analyses. These containers allow the analysis and screening of data sets for chloroplast genomes using standard computational infrastructure. Thus, large scale screening for chloroplasts within genomic sequencing data is feasible.


Generic inference of inflation models by local non-Gaussianity

Proceedings of the International Astronomical Union ◽

10.1017/s1743921314010667 ◽

2014 ◽

Vol 10 (S306) ◽

pp. 51-53

Author(s):

Sebastian Dorn ◽

Erandy Ramirez ◽

Kerstin E. Kunze ◽

Stefan Hofmann ◽

Torsten A. Enßlin

Keyword(s):

Large Scale ◽

Real Data ◽

Analytic Method ◽

Sampling Techniques ◽

Data Sets ◽

Higher Order Statistics ◽

Detectable Amount ◽

Microwave Background ◽

Saddle Point Approximation ◽

Inflationary Parameters

The presence of multiple fields during inflation might seed a detectable amount of non-Gaussianity in the curvature perturbations, which in turn becomes observable in present data sets like the cosmic microwave background (CMB) or the large scale structure (LSS). In this proceeding we present a fully analytic method to infer inflationary parameters from observations by exploiting higher-order statistics of the curvature perturbations. To keep this analyticity, and thereby to dispense with numerically expensive sampling techniques, a saddle-point approximation is introduced whose precision has been validated for a numerical toy example. Applied to real data, this approach might make it possible to discriminate among the still viable models of inflation.


Cluster-Based Prediction for Batteries in Data Centers

Energies ◽

10.3390/en13051085 ◽

2020 ◽

Vol 13 (5) ◽

pp. 1085

Author(s):

Syed Naeem Haider ◽

Qianchuan Zhao ◽

Xueliang Li

Keyword(s):

Data Center ◽

Large Scale ◽

Data Centers ◽

Moving Average ◽

Arima Model ◽

Real Life ◽

Real Data ◽

Data Sets ◽

Multiple Time ◽

Battery Management

Prediction of a battery’s health in data centers plays a significant role in Battery Management Systems (BMS). Data centers use thousands of batteries, and their lifespan decreases over time. Predicting a battery’s degradation status before the first failure is encountered during its discharge cycle is critical, but it is also a very difficult task in real life. Therefore, a framework to improve the accuracy of the Auto-Regressive Integrated Moving Average (ARIMA) model for forecasting a battery’s health with clustered predictors is proposed. Clustering approaches, such as Dynamic Time Warping (DTW) or k-shape-based, are beneficial for finding patterns in data sets with multiple time series. The large number of batteries in a data center is used to cluster the voltage patterns, which are further utilized to improve the accuracy of the ARIMA model. Our proposed work shows that the forecasting accuracy of the ARIMA model is significantly improved by applying the results of the clustered predictor for batteries in a real data center. This paper presents the actual historical data of 40 batteries of a large-scale data center for one whole year to validate the effectiveness of the proposed methodology.
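
The abstract names Dynamic Time Warping as one way to compare time series before clustering them. A minimal sketch of the classic DTW distance is given below, assuming an absolute-difference local cost and no warping window; the paper's actual clustering and ARIMA pipeline is not reproduced, and the voltage values are invented for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the cheapest alignment of the first i items of `a`
    with the first j items of `b`.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the best of: advance in `a`, advance in `b`, or advance in both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical voltage traces: the second is the first with one reading repeated.
battery_a = [1, 2, 3]
battery_b = [1, 2, 2, 3]
print(dtw_distance(battery_a, battery_b))
```

Unlike a point-by-point comparison, DTW tolerates series of different lengths and local shifts in time, which is why it suits grouping traces that follow the same pattern at slightly different rates.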


A Fast Lasso-Based Method for Inferring Pairwise Interactions

10.1101/2021.01.28.428698 ◽

2021 ◽

Author(s):

Kieran Elmes ◽

Astra Heywood ◽

Zhiyi Huang ◽

Alex Gavryushkin

Keyword(s):

Large Scale ◽

Association Studies ◽

Bacterial Species ◽

Real Data ◽

Epistatic Effect ◽

Resistance Testing ◽

Data Sets ◽

Epistatic Interactions ◽

Interaction Detection ◽

Pairwise Interactions

Large-scale genotype-phenotype screens provide a wealth of data for identifying molecular alterations associated with a phenotype. Epistatic effects play an important role in such association studies. For example, siRNA perturbation screens can be used to identify pairwise gene-silencing effects. In bacteria, epistasis has practical consequences in determining antimicrobial resistance as the genetic background of a strain plays an important role in determining resistance. Existing computational tools which account for epistasis do not scale to human exome-wide screens and struggle with genetically diverse bacterial species such as Pseudomonas aeruginosa. Combining earlier work in interaction detection with recent advances in integer compression, we present a method for epistatic interaction detection on sparse (human) exome-scale data, and an R implementation in the package Pint. Our method takes advantage of sparsity in the input data and recent progress in integer compression to perform lasso-penalised linear regression on all pairwise combinations of the input, estimating up to 200 million potential effects, including epistatic interactions. Hence the human exome is within the reach of our method, assuming one parameter per gene and one parameter per epistatic effect for every pair of genes. We demonstrate Pint on both simulated and real data sets, including antibiotic resistance testing and siRNA perturbation screens.
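
The parameter count and the role of sparsity in the abstract can be illustrated with a short sketch. It is not the Pint implementation: it only shows how, for sparse binary inputs, an interaction column can be formed by intersecting the row indices of two features, and how many parameters arise with one per gene plus one per pair. The function names, the toy columns, and the figure of 20,000 genes are assumptions.

```python
from itertools import combinations


def count_parameters(num_genes):
    """One parameter per gene plus one per unordered pair of genes."""
    return num_genes + num_genes * (num_genes - 1) // 2


def pairwise_interactions(columns):
    """Build interaction columns for sparse binary data.

    `columns` maps a feature name to the row indices where it is 1.
    The product of two binary columns is 1 exactly on the rows where both
    are 1, so each interaction column is an intersection of index sets.
    """
    index_sets = {name: set(rows) for name, rows in columns.items()}
    interactions = {}
    for a, b in combinations(sorted(index_sets), 2):
        shared = index_sets[a] & index_sets[b]
        if shared:  # all-zero interaction columns carry no information
            interactions[(a, b)] = sorted(shared)
    return interactions


# Hypothetical sparse input: three features observed over five samples (rows 0..4).
columns = {"g1": [0, 1, 3], "g2": [1, 3, 4], "g3": [2]}
print(pairwise_interactions(columns))
print(count_parameters(20_000))  # assumes roughly 20,000 genes
```

Storing only the nonzero row indices keeps each interaction column as small as the overlap of its two parents, and pairs that never co-occur can be dropped before any regression is run.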
