A Method for Developing Trustworthiness and Preserving Richness of Qualitative Data During Team-Based Analysis of Large Data Sets

This article outlines a three-phase, team-based approach used to analyze qualitative data from a nation-wide needs assessment of access to Veteran Health Administration services for rural-dwelling veterans. The method described here was used to develop the trustworthiness of findings from analysis of a large qualitative data set, without the use of analytic software. In Phase 1, we used templates to summarize content from 205 individual semistructured interviews. During Phase 2, a matrix display was constructed for each of 10 project sites to synthesize and display template content by participant, domain, and category. In the final phase, the summary tabulation technique was developed by a member of our team to facilitate trustworthy observations regarding patterns and variation in the large volume of qualitative data produced by the interviews. This accessible and efficient team-based strategy was feasible within the constraints of our project while preserving the richness of qualitative data.

Download Full-text

A Team-based Approach to Open Coding: Considerations for Creating Intercoder Consensus

Field Methods ◽

10.1177/1525822x19838237 ◽

2019 ◽

Vol 31 (2) ◽

pp. 116-130 ◽

Cited By ~ 1

Author(s):

M. Ariel Cascio ◽

Eunlye Lee ◽

Nicole Vaudrin ◽

Darcy A. Freedman

Keyword(s):

Qualitative Data ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Analytical Strategy ◽

Analytic Process ◽

Semistructured Interviews ◽

Inductive Methods ◽

Open Coding ◽

Applied Fields

In this article, we discuss methodological opportunities related to using a team-based approach for iterative-inductive analysis of qualitative data involving detailed open coding of semistructured interviews and focus groups. Iterative-inductive methods generate rich thematic analyses useful in sociology, anthropology, public health, and many other applied fields. A team-based approach to analyzing qualitative data increases confidence in dependability and trustworthiness, facilitates analysis of large data sets, and supports collaborative and participatory research by including diverse stakeholders in the analytic process. However, it can be difficult to reach consensus when coding with multiple coders. We report on one approach for creating consensus when open coding within an iterative-inductive analytical strategy. The strategy described may be used in a variety of settings to foster efficient and credible analysis of larger qualitative data sets, particularly useful in applied research settings where rapid results are often required.

Download Full-text

Generation of geometric interpolations of building types with deep variational autoencoders

Design Science ◽

10.1017/dsj.2020.31 ◽

2020 ◽

Vol 6 ◽

Author(s):

Jaime de Miguel Rodríguez ◽

Maria Eugenia Villafañe ◽

Luka Piškorec ◽

Fernando Sancho Caparrini

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Large Data ◽

Learning Model ◽

Large Data Sets ◽

Data Sets ◽

Connectivity Map ◽

Data Set ◽

3D Objects ◽

Machine Learning Model

Abstract This work presents a methodology for the generation of novel 3D objects resembling wireframes of building types. These result from the reconstruction of interpolated locations within the learnt distribution of variational autoencoders (VAEs), a deep generative machine learning model based on neural networks. The data set used features a scheme for geometry representation based on a ‘connectivity map’ that is especially suited to express the wireframe objects that compose it. Additionally, the input samples are generated through ‘parametric augmentation’, a strategy proposed in this study that creates coherent variations among data by enabling a set of parameters to alter representative features on a given building type. In the experiments that are described in this paper, more than 150 k input samples belonging to two building types have been processed during the training of a VAE model. The main contribution of this paper has been to explore parametric augmentation for the generation of large data sets of 3D geometries, showcasing its problems and limitations in the context of neural networks and VAEs. Results show that the generation of interpolated hybrid geometries is a challenging task. Despite the difficulty of the endeavour, promising advances are presented.

Download Full-text

Effects of genotype and lactation number on health and reproductive problems in dairy cows

Proceedings of the British Society of Animal Science ◽

10.1017/s1752756200595842 ◽

1997 ◽

Vol 1997 ◽

pp. 143-143

Author(s):

B.L. Nielsen ◽

R.F. Veerkamp ◽

J.E. Pryce ◽

G. Simm ◽

J.D. Oldham

Keyword(s):

Dairy Cows ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Variation Analysis ◽

Genetic Line ◽

Data Set ◽

Health Events ◽

Use Of Data ◽

Low Incidence

High producing dairy cows have been found to be more susceptible to disease (Jones et al., 1994; Göhn et al., 1995) raising concerns about the welfare of the modern dairy cow. Genotype and number of lactations may affect various health problems differently, and their relative importance may vary. The categorical nature and low incidence of health events necessitates large data-sets, but the use of data collected across herds may introduce unwanted variation. Analysis of a comprehensive data-set from a single herd was carried out to investigate the effects of genetic line and lactation number on the incidence of various health and reproductive problems.

Download Full-text

Extreme Learning Machine with sigmoid activation function on large data

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1433.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 3523-3526

Keyword(s):

Efficient Algorithm ◽

Large Data ◽

Activation Function ◽

Large Data Sets ◽

Data Sets ◽

Data Set ◽

Learning Machine ◽

Sigmoid Activation Function ◽

State Of Art ◽

Better Than

This paper describes an efficient algorithm for classification in large data set. While many algorithms exist for classification, they are not suitable for larger contents and different data sets. For working with large data sets various ELM algorithms are available in literature. However the existing algorithms using fixed activation function and it may lead deficiency in working with large data. In this paper, we proposed novel ELM comply with sigmoid activation function. The experimental evaluations demonstrate the our ELM-S algorithm is performing better than ELM,SVM and other state of art algorithms on large data sets.

Download Full-text

Scalable Non-Parametric Methods for Large Data Sets

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch260 ◽

2011 ◽

pp. 1708-1713

Author(s):

V. Suresh Babu ◽

P. Viswanath ◽

Narasimha M. Murty

Keyword(s):

Nearest Neighbor ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Parametric Methods ◽

Clustering Method ◽

Data Set ◽

Computational Burden ◽

Set Size ◽

Non Parametric

Non-parametric methods like the nearest neighbor classifier (NNC) and the Parzen-Window based density estimation (Duda, Hart & Stork, 2000) are more general than parametric methods because they do not make any assumptions regarding the probability distribution form. Further, they show good performance in practice with large data sets. These methods, either explicitly or implicitly estimates the probability density at a given point in a feature space by counting the number of points that fall in a small region around the given point. Popular classifiers which use this approach are the NNC and its variants like the k-nearest neighbor classifier (k-NNC) (Duda, Hart & Stock, 2000). Whereas the DBSCAN is a popular density based clustering method (Han & Kamber, 2001) which uses this approach. These methods show good performance, especially with larger data sets. Asymptotic error rate of NNC is less than twice the Bayes error (Cover & Hart, 1967) and DBSCAN can find arbitrary shaped clusters along with noisy outlier detection (Ester, Kriegel & Xu, 1996). The most prominent difficulty in applying the non-parametric methods for large data sets is its computational burden. The space and classification time complexities of NNC and k-NNC are O(n) where n is the training set size. The time complexity of DBSCAN is O(n2). So, these methods are not scalable for large data sets. Some of the remedies to reduce this burden are as follows. (1) Reduce the training set size by some editing techniques in order to eliminate some of the training patterns which are redundant in some sense (Dasarathy, 1991). For example, the condensed NNC (Hart, 1968) is of this type. (2) Use only a few selected prototypes from the data set. For example, Leaders-subleaders method and l-DBSCAN method are of this type (Vijaya, Murthy & Subramanian, 2004 and Viswanath & Rajwala, 2006). These two remedies can reduce the computational burden, but this can also result in a poor performance of the method. Using enriched prototypes can improve the performance as done in (Asharaf & Murthy, 2003) where the prototypes are derived using adaptive rough fuzzy set theory and as in (Suresh Babu & Viswanath, 2007) where the prototypes are used along with their relative weights. Using a few selected prototypes can reduce the computational burden. Prototypes can be derived by employing a clustering method like the leaders method (Spath, 1980), the k-means method (Jain, Dubes, & Chen, 1987), etc., which can find a partition of the data set where each block (cluster) of the partition is represented by a prototype called leader, centroid, etc. But these prototypes can not be used to estimate the probability density, since the density information present in the data set is lost while deriving the prototypes. The chapter proposes to use a modified leader clustering method called the counted-leader method which along with deriving the leaders preserves the crucial density information in the form of a count which can be used in estimating the densities. The chapter presents a fast and efficient nearest prototype based classifier called the counted k-nearest leader classifier (ck-NLC) which is on-par with the conventional k-NNC, but is considerably faster than the k-NNC. The chapter also presents a density based clustering method called l-DBSCAN which is shown to be a faster and scalable version of DBSCAN (Viswanath & Rajwala, 2006). Formally, under some assumptions, it is shown that the number of leaders is upper-bounded by a constant which is independent of the data set size and the distribution from which the data set is drawn.

Download Full-text

Big Qual: Defining and Debating Qualitative Inquiry for Large Data Sets

International Journal of Qualitative Methods ◽

10.1177/1609406919880692 ◽

2019 ◽

Vol 18 ◽

pp. 160940691988069 ◽

Cited By ~ 1

Author(s):

Rebecca L. Brower ◽

Tamara Bertrand Jones ◽

La’Tara Osborne-Lampkin ◽

Shouping Hu ◽

Toby J. Park-Gaghan

Keyword(s):

Qualitative Data ◽

Large Data ◽

Qualitative Inquiry ◽

Large Data Sets ◽

Computer Assisted ◽

Data Sets ◽

Sampling Strategies ◽

Analysis Software ◽

Technological Advances ◽

Qualitative Data Analysis Software

Big qualitative data (Big Qual), or research involving large qualitative data sets, has introduced many newly evolving conventions that have begun to change the fundamental nature of some qualitative research. In this methodological essay, we first distinguish big data from big qual. We define big qual as data sets containing either primary or secondary qualitative data from at least 100 participants analyzed by teams of researchers, often funded by a government agency or private foundation, conducted either as a stand-alone project or in conjunction with a large quantitative study. We then present a broad debate about the extent to which big qual may be transforming some forms of qualitative inquiry. We present three questions, which examine the extent to which large qualitative data sets offer both constraints and opportunities for innovation related to funded research, sampling strategies, team-based analysis, and computer-assisted qualitative data analysis software (CAQDAS). The debate is framed by four related trends to which we attribute the rise of big qual: the rise of big quantitative data, the growing legitimacy of qualitative and mixed methods work in the research community, technological advances in CAQDAS, and the willingness of government and private foundations to fund large qualitative projects.

Download Full-text

Estimating Intersection Control Delay Using Large Data Sets of Travel Time from a Global Positioning System

Transportation Research Record Journal of the Transportation Research Board ◽

10.1177/0361198105191700103 ◽

2005 ◽

Vol 1917 (1) ◽

pp. 18-27

Author(s):

Brian Hoeschen ◽

Darcy Bullock ◽

Mark Schlappi

Keyword(s):

Travel Time ◽

Traffic Engineering ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Data Set ◽

Control Delay ◽

Diverse Data ◽

Intersection Control ◽

Better Than

Historically, stopped delay was used to characterize the operation of intersection movements because it was relatively easy to measure. During the past decade, the traffic engineering community has moved away from using stopped delay and now uses control delay. That measurement is more precise but quite difficult to extract from large data sets if strict definitions are used to derive the data. This paper evaluates two procedures for estimating control delay. The first is based on a historical approximation that control delay is 30% larger than stopped delay. The second is new and based on segment delay. The procedures are applied to a diverse data set collected in Phoenix, Arizona, and compared with control delay calculated by using the formal definition. The new approximation was observed to be better than the historical stopped delay procedure; it provided an accurate prediction of control delay. Because it is an approximation, this methodology would be most appropriately applied to large data sets collected from travel time studies for ranking and prioritizing intersections for further analysis.

Download Full-text

GAUSSPY+: A fully automated Gaussian decomposition package for emission line spectra

Astronomy and Astrophysics ◽

10.1051/0004-6361/201935519 ◽

2019 ◽

Vol 628 ◽

pp. A78 ◽

Cited By ~ 15

Author(s):

M. Riener ◽

J. Kainulainen ◽

J. D. Henshaw ◽

J. H. Orkisz ◽

C. E. Murray ◽

...

Keyword(s):

Emission Line ◽

Velocity Structure ◽

Galactic Plane ◽

Spatial Coherence ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Test Field ◽

Data Set ◽

Gaussian Decomposition

Our understanding of the dynamics of the interstellar medium is informed by the study of the detailed velocity structure of emission line observations. One approach to study the velocity structure is to decompose the spectra into individual velocity components; this leads to a description of the data set that is significantly reduced in complexity. However, this decomposition requires full automation lest it become prohibitive for large data sets, such as Galactic plane surveys. We developed GAUSSPY+, a fully automated Gaussian decomposition package that can be applied to emission line data sets, especially large surveys of HI and isotopologues of CO. We built our package upon the existing GAUSSPY algorithm and significantly improved its performance for noisy data. New functionalities of GAUSSPY+ include: (i) automated preparatory steps, such as an accurate noise estimation, which can also be used as stand-alone applications; (ii) an improved fitting routine; (iii) an automated spatial refitting routine that can add spatial coherence to the decomposition results by refitting spectra based on neighbouring fit solutions. We thoroughly tested the performance of GAUSSPY+ on synthetic spectra and a test field from the Galactic Ring Survey. We found that GAUSSPY+ can deal with cases of complex emission and even low to moderate signal-to-noise values.

Download Full-text

Data segmentation based on the local intrinsic dimension

Scientific Reports ◽

10.1038/s41598-020-72222-0 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Michele Allegra ◽

Elena Facco ◽

Francesco Denti ◽

Alessandro Laio ◽

Antonietta Mira

Keyword(s):

High Dimensional Data ◽

Large Data ◽

Large Data Sets ◽

High Dimensional ◽

Data Sets ◽

Imaging Data ◽

Unsupervised Segmentation ◽

Real World Data ◽

Data Set ◽

Intrinsic Dimension

Abstract One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.

Download Full-text

Machine Learning Methods for Demand Estimation

The American Economic Review ◽

10.1257/aer.p20151021 ◽

2015 ◽

Vol 105 (5) ◽

pp. 481-485 ◽

Cited By ~ 44

Author(s):

Patrick Bajari ◽

Denis Nekipelov ◽

Stephen P. Ryan ◽

Miaoyu Yang

Keyword(s):

Large Data ◽

Demand Estimation ◽

Large Data Sets ◽

Linear Functions ◽

Data Sets ◽

Science Literature ◽

Data Set ◽

Out Of Sample ◽

Out Of Sample Prediction ◽

Scanner Panel

We survey and apply several techniques from the statistical and computer science literature to the problem of demand estimation. To improve out-of-sample prediction accuracy, we propose a method of combining the underlying models via linear regression. Our method is robust to a large number of regressors; scales easily to very large data sets; combines model selection and estimation; and can flexibly approximate arbitrary non-linear functions. We illustrate our method using a standard scanner panel data set and find that our estimates are considerably more accurate in out-of-sample predictions of demand than some commonly used alternatives.

Download Full-text