Mining spatial–temporal motion pattern for vessel recognition

2018 ◽  
Vol 14 (5) ◽  
pp. 155014771877956
Author(s):  
Lu Sun ◽  
Wei Zhou ◽  
Jian Guan ◽  
You He

Approaches to vessel recognition are mostly accomplished by sensing targets and extracting target features, without taking advantage of spatial and temporal motion features. With maritime situation management systems widely applied, vessels’ spatial and temporal state information can be obtained from many kinds of distributed sensors; such information is easy to accumulate over long periods but is often left forgotten in databases. In order to extract valuable information from large-scale stored trajectories for unknown vessel recognition, a spatial and temporal constrained trajectory similarity model and a mining algorithm based on this similarity are proposed in this article by searching for trajectories with similar motion features. Based on the idea of finding matching points between trajectories, baseline matching points are first defined to provide a time reference for trajectories recorded at different times; almost-matching points are then obtained by applying the spatial and temporal constraints, and the similarity of pairwise almost-matching points is defined, from which the spatial and temporal similarity of trajectories is derived. By searching for matching points across trajectories, similar motion patterns are extracted. Experiments on real data sets show that the proposed algorithm is useful for mining similar moving behavior from historic trajectories, that the extracted motion feature strengthens as trajectory length increases, and that the support for vessels of unknown property is larger than that of other models.
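The abstract does not give the exact similarity formula; the following minimal Python sketch only illustrates the idea of almost-matching points under joint spatial and temporal constraints. The function name stc_similarity and the thresholds eps_s and eps_t are illustrative assumptions, not the authors' implementation.

```python
import math

def stc_similarity(traj_a, traj_b, eps_s=500.0, eps_t=60.0):
    """Spatial-temporal constrained similarity between two trajectories.

    Each trajectory is a list of (t, x, y) samples. A pair of points is an
    "almost matching point" when their time gap is below eps_t (seconds) and
    their distance is below eps_s (metres). The similarity is the fraction of
    points in traj_a that find such a match, weighted by spatial closeness.
    """
    if not traj_a or not traj_b:
        return 0.0
    score = 0.0
    for (ta, xa, ya) in traj_a:
        best = 0.0
        for (tb, xb, yb) in traj_b:
            if abs(ta - tb) > eps_t:
                continue  # violates the temporal constraint
            d = math.hypot(xa - xb, ya - yb)
            if d <= eps_s:
                best = max(best, 1.0 - d / eps_s)
        score += best
    return score / len(traj_a)
```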

Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-16 ◽  
Author(s):  
Yiwen Zhang ◽  
Yuanyuan Zhou ◽  
Xing Guo ◽  
Jintao Wu ◽  
Qiang He ◽  
...  

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of K-means. The first phase executes the CA, which self-organizes and recognizes the number of clusters k based on the similarities in the data; it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm, which can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms existing algorithms in both accuracy and efficiency under both sequential and parallel conditions.
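The covering algorithm itself is not specified in the abstract; the sketch below substitutes a simplified greedy, radius-based covering pass to self-adapt k and seed the centers, then hands those centers to a standard Lloyd iteration via scikit-learn. The radius parameter and helper names are assumptions, not the paper's formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def covering_init(X, radius):
    """Simplified covering pass: greedily pick centers so that every point
    lies within `radius` of some center. The number of centers found is the
    self-adapted k; an illustrative stand-in for the paper's CA."""
    centers = []
    uncovered = np.ones(len(X), dtype=bool)
    while uncovered.any():
        idx = np.flatnonzero(uncovered)[0]      # first still-uncovered point
        center = X[idx]
        centers.append(center)
        dist = np.linalg.norm(X - center, axis=1)
        uncovered &= dist > radius              # mark newly covered points
    return np.array(centers)

def c_k_means(X, radius):
    centers = covering_init(X, radius)
    # Lloyd iteration seeded with the covering centers, k = len(centers)
    km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
    return km.labels_, km.cluster_centers_
```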


1990 ◽  
Vol 5 ◽  
pp. 262-272
Author(s):  
William Miller

Paleontologists have lavished much time and energy on the description and explanation of large-scale patterns in the fossil record (e.g., mass extinctions, histories of monophyletic taxa, deployment of major biogeographic units), while paying comparatively little attention to biologic patterns preserved only in local stratigraphic sequences. Interpretation of the large-scale patterns will always be seen as the chief justification for the science of paleontology, but solving problems framed by long time spans and large areas is rife with tenuous inference, and the patterns are prone to varied interpretation by different investigators using virtually the same data sets (as in the controversy over the ultimate cause of the terminal Cretaceous extinctions). In other words, the large-scale patterns in the history of life are the true philosophical property of paleontology, but there will always be serious problems in attempting to resolve processes that transpired over millions to hundreds of millions of years and encompassed vast areas of seafloor or landscape. By contrast, less spectacular and more commonplace changes in local habitats (often related to larger-scale events and cycles) and the attendant biologic responses are closer to our direct experience of the living world and should be easier to interpret unequivocally. These small-scale responses are reflected in the fossil record at the scale of local outcrops.


Author(s):  
Vikram Sorathia

In recent years, our sensing capability has increased manifold. Developments in sensor technology, telecommunication, computer networking, and distributed computing have created strong grounds for building sensor networks that are now reaching global scales (Balazinska et al., 2007). As data sources increase, the task of processing and analysis has gone beyond the capabilities of conventional desktop data processing tools. For quite a long time, data were assumed to be available on a single user's desktop, and handling, processing, and analysis were carried out single-handedly. With the proliferation of streaming data sources and near real-time applications, it has become important to provide automated identification and attribution of data sets derived from such diverse sources. For the sharing and reuse of such diverse data sets, information about the source of the data, its ownership, time-stamps, accuracy-related details, and the processes and transformations applied to it has become essential. The data that provide such information about a given data set are known as metadata. The need to create and handle metadata as an integral part of large-scale systems is now recognized. Considering the information requirements of the scientific and research community, efforts towards building global data commons have come into existence (Onsrud & Campbell, 2007). A special type of service is required that can address issues such as the explication of licensing and intellectual property rights, standards-based automated generation of metadata, data provenance, archival, and peer review. While each of these topics is being addressed as an individual research area, the present article focuses only on data provenance.
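As a small illustration of the provenance fields listed above (source, ownership, time-stamps, accuracy, transformations), here is a hypothetical record structure in Python; the field names and values are assumptions for illustration, not any standard metadata schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata for a derived data set (illustrative only)."""
    source: str                     # originating sensor or upstream data set
    owner: str                      # responsible party / licence holder
    created: datetime               # time-stamp of creation
    accuracy: str                   # accuracy-related details
    transformations: list = field(default_factory=list)  # processing steps applied

record = ProvenanceRecord(
    source="buoy-42/sea-surface-temperature",
    owner="Example Observatory",
    created=datetime.now(timezone.utc),
    accuracy="±0.1 °C",
    transformations=["quality-control filter", "hourly aggregation"],
)
```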


2019 ◽  
Author(s):  
Jan A Freudenthal ◽  
Simon Pfaff ◽  
Niklas Terhoeven ◽  
Arthur Korte ◽  
Markus J Ankenbrand ◽  
...  

Abstract. Background: Chloroplasts are intracellular organelles that enable plants to conduct photosynthesis. They arose through the symbiotic integration of a prokaryotic cell into a eukaryotic host cell and still contain their own genomes with distinct genomic information. Plastid genomes accommodate essential genes and are regularly utilized in biotechnology or phylogenetics. Different assemblers that are able to assess the plastid genome have been developed. These assemblers often use data from whole genome sequencing experiments, which usually contain reads from the complete chloroplast genome. Results: The performance of different assembly tools has never been systematically compared. Here we present a benchmark of seven chloroplast assembly tools, capable of succeeding in more than 60% of known real data sets. Our results show significant differences between the tested assemblers in terms of generating whole chloroplast genome sequences and computational requirements. The examination of 105 data sets from species with unknown plastid genomes leads to the assembly of 20 novel chloroplast genomes. Conclusions: We create Docker images for each tested tool that are freely available for the scientific community and ensure reproducibility of the analyses. These containers allow the analysis and screening of data sets for chloroplast genomes using standard computational infrastructure. Thus, large-scale screening for chloroplasts within genomic sequencing data is feasible.
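The benchmark's actual image names and assembler command lines are not given in the abstract; the following Python sketch only illustrates how such containers could be driven for large-scale screening. The image name, mount layout, and flags are hypothetical.

```python
import subprocess
from pathlib import Path

# Hypothetical container name; the benchmark's real images and command lines
# are not specified in the abstract.
IMAGE = "example/chloroplast-assembler:latest"

def assemble(reads_dir: Path, out_dir: Path) -> int:
    """Run a containerized plastid assembler on one whole-genome read set."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{reads_dir.resolve()}:/data:ro",   # mount reads read-only
        "-v", f"{out_dir.resolve()}:/out",          # mount output directory
        IMAGE, "--reads", "/data", "--out", "/out", # hypothetical tool flags
    ]
    return subprocess.run(cmd).returncode

# Screen many sequencing data sets for chloroplast genomes.
for reads in Path("datasets").iterdir():
    assemble(reads, Path("results") / reads.name)
```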


2011 ◽  
Vol 271-273 ◽  
pp. 1451-1454
Author(s):  
Gang Zhang ◽  
Jian Yin ◽  
Liang Lun Cheng ◽  
Chun Ru Wang

Teaching quality is a key metric in evaluating college teaching effect and ability. In much of the previous literature, evaluation of this metric depends merely on the subjective judgment of a few experts based on their experience, which leads to false, biased, or unstable results. Moreover, purely human-based evaluation is expensive and difficult to extend to large scale. With the application of information technology, much information in college teaching is recorded and stored electronically, which forms the basis of computer-aided analysis. In this paper, we perform teaching quality evaluation within a machine learning framework, focusing on learning from and modeling the electronic information associated with teaching quality, to obtain a stable model that describes the substantial principles of teaching quality. An Artificial Neural Network (ANN) is selected as the main model in this work. Experimental results on real data sets consisting of 4 subjects over 8 semesters show the effectiveness of the proposed method.
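A minimal sketch of the kind of ANN-based evaluation described, using scikit-learn's MLPRegressor on placeholder data; the features, targets, and network size are assumptions, not the paper's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

# Placeholder data standing in for electronic teaching records; in practice X
# would hold per-course features and y the expert-assigned quality scores.
X, y = make_regression(n_samples=400, n_features=12, noise=5.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
print("test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```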


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Jan A. Freudenthal ◽  
Simon Pfaff ◽  
Niklas Terhoeven ◽  
Arthur Korte ◽  
Markus J. Ankenbrand ◽  
...  

Abstract. Background: Chloroplasts are intracellular organelles that enable plants to conduct photosynthesis. They arose through the symbiotic integration of a prokaryotic cell into a eukaryotic host cell and still contain their own genomes with distinct genomic information. Plastid genomes accommodate essential genes and are regularly utilized in biotechnology or phylogenetics. Different assemblers that are able to assess the plastid genome have been developed. These assemblers often use data from whole genome sequencing experiments, which usually contain reads from the complete chloroplast genome. Results: The performance of different assembly tools has never been systematically compared. Here, we present a benchmark of seven chloroplast assembly tools, capable of succeeding in more than 60% of known real data sets. Our results show significant differences between the tested assemblers in terms of generating whole chloroplast genome sequences and computational requirements. The examination of 105 data sets from species with unknown plastid genomes leads to the assembly of 20 novel chloroplast genomes. Conclusions: We create Docker images for each tested tool that are freely available for the scientific community and ensure reproducibility of the analyses. These containers allow the analysis and screening of data sets for chloroplast genomes using standard computational infrastructure. Thus, large-scale screening for chloroplasts within genomic sequencing data is feasible.


2014 ◽  
Vol 10 (S306) ◽  
pp. 51-53
Author(s):  
Sebastian Dorn ◽  
Erandy Ramirez ◽  
Kerstin E. Kunze ◽  
Stefan Hofmann ◽  
Torsten A. Enßlin

Abstract: The presence of multiple fields during inflation might seed a detectable amount of non-Gaussianity in the curvature perturbations, which in turn becomes observable in present data sets such as the cosmic microwave background (CMB) or the large scale structure (LSS). Within this proceeding we present a fully analytic method to infer inflationary parameters from observations by exploiting higher-order statistics of the curvature perturbations. To keep this analyticity, and thereby to dispense with numerically expensive sampling techniques, a saddle-point approximation is introduced whose precision has been validated for a numerical toy example. Applied to real data, this approach might enable discrimination among the still-viable models of inflation.
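The abstract does not reproduce the expansion it uses; for reference, the generic saddle-point (Laplace) approximation that keeps such posterior integrals analytic reads as follows (notation here is generic, not the authors'):

```latex
% Generic saddle-point (Laplace) approximation of a posterior integral:
% expand the exponent around its minimum x_0, where f'(x_0) = 0.
f(x) \approx f(x_0) + \tfrac{1}{2} f''(x_0)\,(x - x_0)^2 ,
\qquad
\int \mathrm{d}x \; e^{-f(x)} \;\approx\; e^{-f(x_0)} \sqrt{\frac{2\pi}{f''(x_0)}} .
```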


Genes ◽  
2020 ◽  
Vol 11 (8) ◽  
pp. 892 ◽  
Author(s):  
Faisal Ramzan ◽  
Mehmet Gültas ◽  
Hendrik Bertram ◽  
David Cavero ◽  
Armin Otto Schmitt

Genome-wide association studies (GWAS) are a well-established methodology to identify genomic variants and genes that are responsible for traits of interest in all branches of the life sciences. Despite the long time this methodology has had to mature, the reliable detection of genotype–phenotype associations is still a challenge for many quantitative traits, mainly because of the large number of genomic loci with weak individual effects on the trait under investigation. Thus, it can be hypothesized that many genomic variants with a small, but real, effect remain unnoticed in many GWAS approaches. Here, we propose a two-step procedure to address this problem. In the first step, cubic splines are fitted to the test statistic values, and genomic regions with spline peaks higher than expected by chance are considered quantitative trait loci (QTL). The SNPs in these QTL are then prioritized with respect to the strength of their association with the phenotype using a Random Forests approach. As a case study, we apply our procedure to real data sets and find a trustworthy number of partially novel genomic variants and genes involved in various egg quality traits.
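A minimal sketch of the two-step idea on simulated inputs: a cubic spline smoothed over per-SNP test statistics flags candidate QTL regions, and a Random Forest then ranks the SNPs inside them. The threshold, data shapes, and variable names are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.ensemble import RandomForestRegressor

# Simulated stand-ins: SNP positions, -log10 p-values from a single-SNP scan,
# a genotype matrix, and a phenotype vector.
rng = np.random.default_rng(0)
pos = np.arange(2000)
neglog_p = rng.exponential(0.5, size=pos.size)
neglog_p[900:950] += 2.0                      # a region of elevated association

# Step 1: smooth the test statistics with a cubic spline and flag regions
# whose spline value exceeds a (here arbitrary) chance threshold as QTL.
spline = UnivariateSpline(pos, neglog_p, k=3, s=len(pos))
threshold = np.percentile(spline(pos), 99)
qtl_snps = np.flatnonzero(spline(pos) > threshold)

# Step 2: prioritize the SNPs inside the QTL with a Random Forest, ranking
# them by feature importance for the phenotype.
genotypes = rng.integers(0, 3, size=(500, pos.size)).astype(float)
phenotype = genotypes[:, 920] * 0.8 + rng.normal(size=500)
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(genotypes[:, qtl_snps], phenotype)
ranking = qtl_snps[np.argsort(rf.feature_importances_)[::-1]]
```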


Energies ◽  
2020 ◽  
Vol 13 (5) ◽  
pp. 1085
Author(s):  
Syed Naeem Haider ◽  
Qianchuan Zhao ◽  
Xueliang Li

Prediction of a battery’s health in data centers plays a significant role in Battery Management Systems (BMS). Data centers use thousands of batteries, whose lifespan ultimately decreases over time. Predicting a battery’s degradation status is critical, even before the first failure is encountered during its discharge cycle, yet this turns out to be a very difficult task in real life. Therefore, a framework that improves the accuracy of the Auto-Regressive Integrated Moving Average (ARIMA) model for forecasting battery health with clustered predictors is proposed. Clustering approaches, such as Dynamic Time Warping (DTW)-based or k-shape-based clustering, are beneficial for finding patterns in data sets with multiple time series. The large number of batteries in a data center is exploited to cluster their voltage patterns, which are then used to improve the accuracy of the ARIMA model. Our proposed work shows that the forecasting accuracy of the ARIMA model is significantly improved by applying the clustered predictor to batteries in a real data center. This paper presents actual historical data of 40 batteries from a large-scale data center over one whole year to validate the effectiveness of the proposed methodology.
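A minimal sketch of the pipeline on simulated voltage logs: k-shape clustering groups the batteries' voltage patterns, and an ARIMA model is then fitted per cluster. The coupling shown here (one ARIMA per cluster-mean trajectory) is an assumption, as the abstract does not detail how the clustered predictor feeds the forecast.

```python
import numpy as np
from tslearn.clustering import KShape
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from statsmodels.tsa.arima.model import ARIMA

# Simulated stand-in for one year of daily voltage readings from 40 batteries
# (shape: n_batteries x n_timesteps); real data-center logs would go here.
rng = np.random.default_rng(0)
voltages = 12.6 - np.cumsum(rng.normal(0.001, 0.01, size=(40, 365)), axis=1)

# Cluster the voltage patterns (k-shape works on z-normalized series).
scaled = TimeSeriesScalerMeanVariance().fit_transform(voltages)
labels = KShape(n_clusters=3, random_state=0).fit_predict(scaled)

# Fit one ARIMA model per cluster on the cluster's mean voltage trajectory
# and produce a 30-day health forecast.
forecasts = {}
for c in np.unique(labels):
    cluster_mean = voltages[labels == c].mean(axis=0)
    fit = ARIMA(cluster_mean, order=(2, 1, 1)).fit()
    forecasts[c] = fit.forecast(steps=30)
```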


2021 ◽  
Author(s):  
Kieran Elmes ◽  
Astra Heywood ◽  
Zhiyi Huang ◽  
Alex Gavryushkin

Abstract: Large-scale genotype-phenotype screens provide a wealth of data for identifying molecular alterations associated with a phenotype. Epistatic effects play an important role in such association studies. For example, siRNA perturbation screens can be used to identify pairwise gene-silencing effects. In bacteria, epistasis has practical consequences for antimicrobial resistance, as the genetic background of a strain plays an important role in determining resistance. Existing computational tools that account for epistasis do not scale to human exome-wide screens and struggle with genetically diverse bacterial species such as Pseudomonas aeruginosa. Combining earlier work in interaction detection with recent advances in integer compression, we present a method for epistatic interaction detection on sparse (human) exome-scale data, and an R implementation in the package Pint. Our method takes advantage of sparsity in the input data and recent progress in integer compression to perform lasso-penalised linear regression on all pairwise combinations of the input, estimating up to 200 million potential effects, including epistatic interactions. Hence the human exome is within reach of our method, assuming one parameter per gene and one parameter per epistatic effect for every pair of genes. We demonstrate Pint on both simulated and real data sets, including antibiotic resistance testing and siRNA perturbation screens.
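Pint itself is an R package; as a language-neutral illustration of the underlying idea, the Python sketch below runs lasso-penalised regression over all main effects and pairwise interaction columns on simulated data. It shows the model only, not Pint's sparse-matrix and integer-compression machinery that makes exome-scale input feasible; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

# Simulated binary perturbation matrix (samples x genes) with one main effect
# and one pairwise epistatic effect hidden in the phenotype.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 50)).astype(float)
y = 1.5 * X[:, 3] - 2.0 * X[:, 7] * X[:, 19] + rng.normal(scale=0.3, size=300)

# Expand to main effects plus all pairwise interaction columns, then fit a
# lasso so that only a sparse subset of effects survives.
pairs = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Z = pairs.fit_transform(X)
model = Lasso(alpha=0.05, max_iter=10000).fit(Z, y)

# Report the non-zero effects and which input columns they involve.
names = pairs.get_feature_names_out([f"g{i}" for i in range(X.shape[1])])
for name, coef in zip(names, model.coef_):
    if abs(coef) > 1e-3:
        print(name, round(coef, 2))
```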

