A Metropolis Sampling Method for Drawing Representative Samples from Large Databases

Author(s):  
Hong Guo ◽  
Wen-Chi Hou ◽  
Feng Yan ◽  
Qiang Zhu
2020 ◽  
Author(s):  
Jasper Foets ◽  
Carlos E. Wetzel ◽  
Núria Martinez-Carreras ◽  
Adriaan J. Teuling ◽  
Jean-François Iffly ◽  
...  

Abstract. Diatoms, microscopic, single-celled algae, are present in almost all habitats containing water (e.g. streams, lakes, soil, rocks) and form one of the most common and diverse algal groups in both freshwater and marine ecosystems. In the terrestrial environment, their diversified species distributions are mainly controlled by physiographical factors and anthropic disturbances. This makes them useful tracers in catchment hydrology. In their use as a hydrological tracer, diatoms are generally sampled in streams by means of an automated sampling method, and as a result many samples are collected to cover a whole storm run-off event. As diatom analysis is labour intensive, a trade-off has to be made between the number of sites and the number of samples per site. A potential way to reduce the number of samples is to use a time-integrated mass-flux sampler. Here, we explored the potential for the Phillips sampler to provide a representative sample of the diatom assemblage of a whole storm run-off event. We addressed this by comparing the diatom community composition of the Phillips sampler to the composite community collected by the automatic samplers for three events. Our results indicate that during two events the Phillips sampler collected representative samples, whereas significantly different communities were collected during the third event. However, the sediment data for this event, which was sampled with automatic samplers, were very noisy, meaning that we could not verify whether the Phillips sampler collected representative communities. Nevertheless, we believe that this sampler could not only be applied in hydrological tracing using terrestrial diatoms, but may also be a useful tool in water quality assessment.


2013 ◽  
Vol 18 (4) ◽  
pp. 389-406 ◽  
Author(s):  
Jennifer Earl

Social movement scholars are increasingly interested in Internet activism but have struggled to find robust methods for identifying cases, particularly representative samples of online protest content, given that no population list exists. This article reviews early approaches to this problem, focusing on three recent case sampling designs that attempt to address it. The first approach purposively samples from an organizationally based sampling frame. The second approach randomly samples from an SMO-based sampling frame. The third approach mimics user routines to identify populations of "reachable" websites on a given topic, which are then randomly sampled. For each approach, I examine the sampling frame and sampling method to understand how cases were selected, outline the assumptions built into the overall sampling design, and discuss an exemplary research project employing each design. Comparisons of findings from these exemplar studies indicate that sampling designs are extremely consequential. I close by recommending best practices.


Author(s):  
Wen-Chi Hou ◽  
Hong Guo ◽  
Feng Yan ◽  
Qiang Zhu

Sampling has been used in areas like selectivity estimation (Hou & Ozsoyoglu, 1991; Haas & Swami, 1992; Jermaine, 2003; Lipton, Naughton, & Schneider, 1990; Wu, Agrawal, & Abbadi, 2001), OLAP (Acharya, Gibbons, & Poosala, 2000), clustering (Agrawal, Gehrke, Gunopulos, & Raghavan, 1998; Palmer & Faloutsos, 2000), and spatial data mining (Xu, Ester, Kriegel, & Sander, 1998). Due to its importance, sampling has been incorporated into modern database systems.


2019 ◽  
Vol 1 (2) ◽  
pp. 24
Author(s):  
Muhammad Jufriansyah ◽  
Gustami Harahap ◽  
Mitra Musika Lubis

In North Sumatra, strawberries are one of the horticultural crops grown, especially in Karo Regency. The purpose of this study was to determine which factors influence the income of pick-your-own strawberry agrotourism farmers, to determine the price of the pick-your-own strawberry agrotourism business, and to determine whether the business is feasible. Sampling was based on the Central Limit Theorem: the total population of strawberry farmers in Karo Regency is 60, and in this study 30 farmers were taken as a representative sample of the entire population. The data collected are primary and secondary data. The analytical methods used are multiple linear regression with SPSS 21 software, break-even point (BEP) analysis, and business feasibility analysis using the R/C ratio. (1) The multiple linear regression analysis shows that the factors with a positive effect on strawberry farmers' income are sales and expenditure volume of RT. (2) The BEP analysis shows that at a sales volume of 478.80 kg and a selling price of Rp. 52,760/kg, the sales result is Rp. 38,304,239; at this level of sales the pick-your-own strawberry agrotourism business breaks even. (3) The feasibility analysis of the pick-your-own strawberry agrotourism business in Karo Regency gives R/C > 1, so the business is economically feasible.


2021 ◽  
Author(s):  
Laura H. Tung ◽  
Carl Kingsford

Abstract. Despite numerous RNA-seq samples available at large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools. In sequence-based approaches for representative set selection (e.g. a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples), because of the huge number of available RNA-seq samples and the large number of k-mers/sequences in each sample, computing the full similarity matrix between all samples using k-mers/sequences for the entire set of RNA-seq samples in a large database (e.g. the SRA) has memory and runtime challenges, making direct representative set selection infeasible with limited computing resources. Therefore, we developed a novel computational method called “hierarchical representative set selection” to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks the representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. We demonstrate that hierarchical representative set selection can achieve performance close to that of direct representative set selection, while largely reducing the runtime and memory requirements of computing the full similarity matrix (up to 8.4X runtime reduction and 4.7X memory reduction for 10000 samples that could be practically run with direct subset selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases such as the SRA.
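
The divide-and-conquer idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the greedy farthest-point rule in `select_representatives`, the 2-D points standing in for RNA-seq samples, and the parameters `k` and `chunk_size` are all assumptions made for illustration. The sketch only shows how selecting within chunks and then re-selecting from the merged picks avoids comparing every sample with every other sample at once.

```python
import random
from math import dist


def select_representatives(points, k):
    """Greedy farthest-point selection (an assumed stand-in for direct selection)."""
    if len(points) <= k:
        return list(points)
    chosen = [points[0]]
    # Distance from every point to its nearest already-chosen representative.
    nearest = [dist(p, chosen[0]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[idx])
        nearest = [min(d, dist(p, points[idx])) for d, p in zip(nearest, points)]
    return chosen


def hierarchical_select(points, k, chunk_size):
    """Select within chunks, merge the picks, and repeat until one chunk remains."""
    if chunk_size <= k:
        raise ValueError("chunk_size must exceed k so each level shrinks the set")
    current = list(points)
    while len(current) > chunk_size:
        merged = []
        for start in range(0, len(current), chunk_size):
            chunk = current[start:start + chunk_size]
            merged.extend(select_representatives(chunk, k))
        current = merged
    return select_representatives(current, k)


# Hypothetical data: 1000 random 2-D points in place of real samples.
rng = random.Random(0)
samples = [(rng.random(), rng.random()) for _ in range(1000)]
reps = hierarchical_select(samples, k=10, chunk_size=100)
```

Each level only ever computes distances inside a chunk of at most `chunk_size` points, which is the source of the memory saving; the trade-off is that the final set may differ from what a single direct selection over all points would return.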


2020 ◽  
Vol 63 (6) ◽  
pp. 1947-1957
Author(s):  
Alexandra Hollo ◽  
Johanna L. Staubitz ◽  
Jason C. Chow

Purpose: Although sampling teachers' child-directed speech in school settings is needed to understand the influence of linguistic input on child outcomes, empirical guidance for measurement procedures needed to obtain representative samples is lacking. To optimize resources needed to transcribe, code, and analyze classroom samples, this exploratory study assessed the minimum number and duration of samples needed for a reliable analysis of conventional and researcher-developed measures of teacher talk in elementary classrooms. Method: This study applied fully crossed, Person (teacher) × Session (samples obtained on 3 separate occasions) generalizability studies to analyze an extant data set of three 10-min language samples provided by 28 general and special education teachers recorded during large-group instruction across the school year. Subsequently, a series of decision studies estimated the number and duration of sessions needed to obtain the criterion g coefficient (g > .70). Results: The most stable variables were total number of words and mazes, requiring only a single 10-min sample, two 6-min samples, or three 3-min samples to reach criterion. No measured variables related to content or complexity were adequately stable regardless of number and duration of samples. Conclusions: Generalizability studies confirmed that a large proportion of variance was attributable to individuals rather than the sampling occasion when analyzing the amount and fluency of spontaneous teacher talk. In general, conventionally reported outcomes were more stable than researcher-developed codes, which suggests some categories of teacher talk are more context dependent than others and thus require more intensive data collection to measure reliably.
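
The decision-study step described in this abstract can be sketched in Python. For a fully crossed Person × Session design, the standard relative g coefficient is the person variance divided by the person variance plus the residual variance averaged over sessions. The variance components below are hypothetical values chosen for illustration, not estimates from the study, and the function names are likewise assumptions.

```python
def g_coefficient(var_person, var_residual, n_sessions):
    """Relative g coefficient for a crossed Person x Session design."""
    return var_person / (var_person + var_residual / n_sessions)


def min_sessions_for_criterion(var_person, var_residual, criterion=0.70, max_sessions=50):
    """Smallest number of sessions whose g coefficient exceeds the criterion."""
    for n in range(1, max_sessions + 1):
        if g_coefficient(var_person, var_residual, n) > criterion:
            return n
    return None  # criterion not reached within max_sessions


# Hypothetical variance components (NOT taken from the study).
var_person, var_residual = 4.0, 3.0
g_one = g_coefficient(var_person, var_residual, 1)   # 4 / 7, about 0.571
needed = min_sessions_for_criterion(var_person, var_residual)  # 2 sessions
```

With these illustrative values a single session falls short of the .70 criterion, while averaging over two sessions gives 4 / 5.5, about 0.727, which passes. A variable whose variance is mostly residual would need many more sessions, or might never reach criterion within a practical number.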


2007 ◽  
Vol 23 (4) ◽  
pp. 248-257 ◽  
Author(s):  
Matthias R. Mehl ◽  
Shannon E. Holleran

Abstract. In this article, the authors provide an empirical analysis of the obtrusiveness of and participants' compliance with a relatively new psychological ambulatory assessment method, called the electronically activated recorder or EAR. The EAR is a modified portable audio-recorder that periodically records snippets of ambient sounds from participants' daily environments. In tracking moment-to-moment ambient sounds, the EAR yields an acoustic log of a person's day as it unfolds. As a naturalistic observation sampling method, it provides an observer's account of daily life and is optimized for the assessment of audible aspects of participants' naturally-occurring social behaviors and interactions. Measures of self-reported and behaviorally-assessed EAR obtrusiveness and compliance were analyzed in two samples. After an initial 2-h period of relative obtrusiveness, participants habituated to wearing the EAR and perceived it as fairly unobtrusive both in a short-term (2 days, N = 96) and a longer-term (10-11 days, N = 11) monitoring. Compliance with the method was high both during the short-term and longer-term monitoring. Somewhat reduced compliance was identified over the weekend; this effect appears to be specific to student populations. Important privacy and data confidentiality considerations around the EAR method are discussed.

