Hybrid analysis of textual data

Purpose – The purpose of this paper is to provide insightful evidence of phenomena in organization and management theory. Textual data sets consist of two different elements, namely qualitative and quantitative aspects. Researchers often combine methods to harness both aspects. However, they frequently do this in a comparative, convergent, or sequential way. Design/methodology/approach – The paper illustrates and discusses a hybrid textual data analysis approach employing the qualitative software application GABEK-WinRelan in a case study of an Austrian retail bank. Findings – The paper argues that a hybrid analysis method, fully intertwining qualitative and quantitative analysis simultaneously on the same textual data set, can deliver new insight into more facets of a data set. Originality/value – A hybrid approach is not a universally applicable solution to approaching research and management problems. Rather, this paper aims at triggering and intensifying scientific discussion about stronger integration of qualitative and quantitative data and analysis methods in management research.

Comparative study on textual data set using fuzzy clustering algorithms

Kybernetes ◽

10.1108/k-11-2015-0301 ◽

2016 ◽

Vol 45 (8) ◽

pp. 1232-1242 ◽

Cited By ~ 3

Author(s):

Rjiba Sadika ◽

Moez Soltani ◽

Saloua Benammou

Keyword(s):

Comparative Study ◽

Clustering Algorithms ◽

Fuzzy Model ◽

Data Sets ◽

Data Set ◽

Content Type ◽

Fuzzy C Means ◽

Textual Data ◽

Tunisian Revolution ◽

Fuzzy C Means Algorithm

Purpose – The purpose of this paper is to apply the Takagi-Sugeno (T-S) fuzzy model techniques in order to treat and classify textual data sets with and without noise. A comparative study is done in order to select the most accurate T-S algorithm for textual data sets. Design/methodology/approach – From a survey about what has been termed the “Tunisian Revolution,” the authors collect a textual data set from a questionnaire targeted at students. Four clustering algorithms are mainly applied: the Gath-Geva (G-G) algorithm, the modified G-G algorithm, the fuzzy c-means algorithm and the kernel fuzzy c-means algorithm. The authors examine the performances of the four clustering algorithms and select the most reliable one to cluster textual data. Findings – The proposed methodology was to cluster textual data based on the T-S fuzzy model. On the one hand, the results obtained using the T-S models are in the form of numerical relationships between selected keywords and the rest of the words constituting a text. Consequently, the authors can interpret these results not only qualitatively but also quantitatively. On the other hand, the proposed method is applied for clustering text while taking noise into account. Originality/value – The originality comes from the fact that the authors validate some economic results based on textual data, even if they have not been written by experts in the linguistic fields. In addition, the results obtained in this study are easy and simple for analysts to interpret.
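Of the algorithms named in this abstract, fuzzy c-means is the simplest to illustrate. The following is a minimal pure-Python sketch of the standard fuzzy c-means update rules, not the authors' T-S model identification or the Gath-Geva variants. The function name, parameters and the toy input vectors in the usage note are hypothetical.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Cluster `points` (equal-length lists of floats) into `c` fuzzy clusters.

    Returns (centers, memberships); memberships[i][j] is the degree to which
    point i belongs to cluster j, and every row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, normalised so that each row sums to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Centers are means of the points weighted by membership ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )

        # Memberships are recomputed from distances to the new centers.
        new_u = []
        for p in points:
            dist = [
                max(sum((a - b) ** 2 for a, b in zip(p, ctr)) ** 0.5, 1e-12)
                for ctr in centers
            ]
            new_u.append(
                [1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0))
                           for k in range(c))
                 for j in range(c)]
            )

        shift = max(abs(new_u[i][j] - u[i][j])
                    for i in range(n) for j in range(c))
        u = new_u
        if shift < tol:
            break
    return centers, u
```

Called as `fuzzy_c_means(vectors, c=2)` on, say, keyword-frequency vectors built from questionnaire answers (an assumed representation), it returns graded memberships rather than hard labels, which is what allows a quantitative reading of how strongly each text relates to each cluster.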

Building holistic and agile collection development and assessment

Performance Measurement and Metrics ◽

10.1108/pmm-12-2014-0041 ◽

2015 ◽

Vol 16 (1) ◽

pp. 62-85 ◽

Cited By ~ 8

Author(s):

Cheri Jeanette Duncan ◽

Genya Morgan O'Gara

Keyword(s):

Quantitative Data ◽

Holistic Approach ◽

Assessment Model ◽

Data Sets ◽

Collection Development ◽

Content Type ◽

Qualitative And Quantitative ◽

Institutional Goals ◽

Scholarly Output ◽

Use Of Data

Purpose – The purpose of this paper is to examine the development of a flexible collections assessment rubric composed of a suite of tools for more consistently and effectively evaluating and expressing a holistic value of library collections to a variety of constituents, from administrators to faculty and students, with particular emphasis on the use of data already being collected at libraries to “take the temperature” of how responsive collections are in supporting institutional goals. Design/methodology/approach – Using a literature review, internal and external conversations, several collections pilot projects, and a variety of other investigative mechanisms, this paper explores methods for creating a more flexible, holistic collection development and assessment model using both qualitative and quantitative data. Findings – The products of scholarship that academic libraries include in their collections are expanding exponentially and range from journals and monographs in all formats, to databases, data sets, digital text and images, streaming media, visualizations and animations. Content is also being shared in new ways and on a variety of platforms. Yet the framework for evaluating this new landscape of scholarly output is in its infancy. So, how do libraries develop and assess collections in a consistent, holistic, yet agile, manner? Libraries must employ a variety of mechanisms to ensure this goal, while remaining flexible in adapting to the shifting collections environment. Originality/value – Insofar as the authors are aware, this is the first paper to examine an agile, holistic approach to collections using both qualitative and quantitative data.

A systematic review of machine learning-based missing value imputation techniques

Data Technologies and Applications ◽

10.1108/dta-12-2020-0298 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Tressy Thomas ◽

Enayat Rajabi

Keyword(s):

Machine Learning ◽

Selection Process ◽

Evaluation Metrics ◽

Correct Prediction ◽

Data Sets ◽

Data Set ◽

Missing Value ◽

Content Type ◽

Missing Value Imputation ◽

Literature Reviews

Purpose – The primary aim of this study is to review the studies from different dimensions including type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding about how well the proposed framework is evaluated and what type and ratio of missingness are addressed in the proposals. The review questions in this study are: (1) What are the ML-based imputation methods studied and proposed during 2010–2020? (2) How are the experimentation setup, characteristics of data sets and missingness employed in these studies? (3) What metrics were used for the evaluation of the imputation methods? Design/methodology/approach – The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage were not exactly an MVI technique relevant to this study. The literature reviews are first scanned in the title for relevancy, and 306 literature reviews were identified as appropriate. Upon reviewing the abstract text, 151 literature reviews that are not eligible for this study are dropped. This resulted in 155 research papers suitable for full-text review. From this, 117 papers are used in assessment of the review questions. Findings – This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced the data sets from publicly available data set repositories. A common approach is that the complete data set is set as the baseline to evaluate the effectiveness of imputation on the test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experimentations, while missing datatype and mechanism pertain to the capability of imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge. Originality/value – It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with the missingness based on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputations based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, are popular across various domains.
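The evaluation protocol described in the findings (hide known cells of a complete data set, impute them, then score against the true values with RMSE) can be sketched as follows. This is a minimal pure-Python illustration with a simple kNN imputer, not any specific method from the reviewed papers. The helper names `hide`, `knn_impute` and `rmse` are hypothetical.

```python
import math


def hide(rows, cells):
    """Return a copy of `rows` with the given (row, col) cells set to None."""
    masked = [list(r) for r in rows]
    for i, j in cells:
        masked[i][j] = None
    return masked


def knn_impute(rows, k=2):
    """Fill each None with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over columns observed in
    both rows; rows that lack the target column are skipped.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for j, other in enumerate(rows):
                if j == i or other[col] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                d = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((d, other[col]))
            if candidates:
                candidates.sort(key=lambda t: t[0])
                nearest = candidates[:k]
                filled[i][col] = sum(v for _, v in nearest) / len(nearest)
    return filled


def rmse(truth, estimate, cells):
    """Root mean square error over the artificially hidden cells only."""
    return math.sqrt(
        sum((truth[i][j] - estimate[i][j]) ** 2 for i, j in cells) / len(cells)
    )
```

A run would take a complete table, call `hide` with a chosen set of cells, pass the result to `knn_impute`, and report `rmse(complete, filled, cells)`. Varying how many cells are hidden corresponds to varying the missingness ratio.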

Fast and accurate detection of surface defect based on improved YOLOv4

Assembly Automation ◽

10.1108/aa-04-2021-0044 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Jiawei Lian ◽

Junhong He ◽

Yun Niu ◽

Tianze Wang

Keyword(s):

Feature Extraction ◽

Real Time ◽

Surface Defect ◽

Steel Ingot ◽

Industrial Applications ◽

Data Sets ◽

Data Set ◽

Processing Technologies ◽

Content Type ◽

Public Data

Purpose – Current popular image processing technologies based on convolutional neural networks are computationally heavy, costly to store and inaccurate at detecting tiny defects, which conflicts with the real-time performance, high accuracy and limited computing and storage resources that industrial applications require. Therefore, an improved YOLOv4, named YOLOv4-Defect, is proposed to solve these problems. Design/methodology/approach – On the one hand, this study performs multi-dimensional compression processing on the feature extraction network of YOLOv4 to simplify the model and improves the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with a more detailed receptive field is added to optimize the model structure, which can improve the detection performance for tiny defects. Findings – The effectiveness of the method is verified on the public data sets NEU-CLS and DAGM 2007 and on a steel ingot data set collected in the actual industrial field. The experimental results demonstrate that the proposed YOLOv4-Defect method can greatly improve recognition efficiency and accuracy and reduce the size and computation consumption of the model. Originality/value – This paper proposes an improved YOLOv4, named YOLOv4-Defect, for the detection of surface defects, which is suited to application in various industrial scenarios with limited storage and computing resources and meets the requirements of real-time performance and high precision.

An application of data envelopment analysis for Korean banks with negative data

Benchmarking An International Journal ◽

10.1108/bij-02-2016-0023 ◽

2017 ◽

Vol 24 (4) ◽

pp. 1052-1064 ◽

Cited By ~ 9

Author(s):

Yong Joo Lee ◽

Seong-Jong Joo ◽

Hong Gyun Park

Keyword(s):

Data Envelopment Analysis ◽

Data Sets ◽

Data Envelopment ◽

Negative Data ◽

Ownership Type ◽

Data Set ◽

Translation Invariant ◽

Content Type ◽

Dea Models ◽

Regional Banks

Purpose – The purpose of this paper is to measure the comparative efficiency of 18 Korean commercial banks in the presence of negative observations and examine performance differences among them by grouping them according to their market conditions. Design/methodology/approach – The authors employ two data envelopment analysis (DEA) models that can handle negative data: a Banker, Charnes, and Cooper (BCC) model and a modified slacks-based measure of efficiency (MSBM) model. The BCC model is proven to be translation invariant for inputs or outputs depending on output or input orientation. Meanwhile, the MSBM model is unit invariant in addition to translation invariant. The authors compare results from both models and choose one for interpreting results. Findings – Most Korean banks recovered from the worst performance in 2011 and showed similar performance in recent years. Among the three groups (national banks, regional banks, and special banks), the special banks demonstrated superb performance across models and years. In particular, the performance difference between the special banks and the regional banks was statistically significant. The authors concluded that the high performance of the special banks was due to their nationwide market access and ownership type. Practical implications – This study demonstrates how to analyze and measure the efficiency of entities when variables contain negative observations using a data set for Korean banks. The authors have tried two major DEA models that are able to handle negative data and proposed a practical direction for future studies. Originality/value – Although there are research papers for measuring the performance of banks in Korea, all of the papers on the topic have studied efficiency or productivity using positive data sets. However, variables such as net incomes and growth rates frequently include negative observations in bank data sets. This is the first paper to investigate the efficiency of bank operations in the presence of negative data in Korea.

A scalable eigenspace-based fuzzy c-means for topic detection

Data Technologies and Applications ◽

10.1108/dta-11-2020-0262 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Hendri Murfi

Keyword(s):

Representation Learning ◽

Detection Methods ◽

Data Sets ◽

Topic Detection ◽

Data Set ◽

Content Type ◽

Running Time ◽

Fuzzy C Means ◽

Coherence Score ◽

Value Decomposition

Purpose – The aim of this research is to develop an eigenspace-based fuzzy c-means method for scalable topic detection. Design/methodology/approach – The eigenspace-based fuzzy c-means (EFCM) combines representation learning and clustering. The textual data are transformed into a lower-dimensional eigenspace using truncated singular value decomposition. Fuzzy c-means is performed on the eigenspace to identify the centroids of each cluster. The topics are provided by transforming the centroids back into the nonnegative subspace of the original space. In this paper, we extend the EFCM method for scalability by using two approaches, i.e. single-pass and online. We call the developed topic detection methods oEFCM and spEFCM. Findings – Our simulation shows that both the oEFCM and spEFCM methods provide faster running times than EFCM for data sets that do not fit in memory. However, there is a decrease in the average coherence score. For both data sets that fit and do not fit into memory, the oEFCM method provides a tradeoff between running time and coherence score, which is better than spEFCM. Originality/value – This research produces a scalable topic detection method. Besides this scalability, the developed method also provides a faster running time for the data set that fits in memory.
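The three steps of the base EFCM pipeline as the abstract describes them (truncated SVD, fuzzy c-means in the eigenspace, centroids mapped back and restricted to nonnegative values) can be sketched with NumPy. This is an illustration under assumptions, not the author's implementation. It does not cover the single-pass or online variants, and clipping negatives to zero is one assumed way to obtain the nonnegative topics. The function names and parameters are hypothetical.

```python
import numpy as np


def fcm(Z, c, m=2.0, iters=200, tol=1e-8, seed=0):
    """Standard fuzzy c-means on the rows of Z; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((Z.shape[0], c)) + 1e-9
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ Z) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        new_U = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_U - U).max() < tol
        U = new_U
        if converged:
            break
    return centers, U


def eigenspace_topics(X, n_topics, rank, top=3):
    """X is a documents-by-words matrix of nonnegative weights."""
    # Step 1: truncated SVD keeps the `rank` leading singular triplets.
    U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U_svd[:, :rank] * s[:rank]  # document coordinates in the eigenspace

    # Step 2: fuzzy c-means on the low-dimensional coordinates.
    centers, memberships = fcm(Z, n_topics)

    # Step 3: map centroids back to word space; clip negatives (assumption).
    topics = np.maximum(centers @ Vt[:rank], 0.0)
    top_words = np.argsort(-topics, axis=1)[:, :top]
    return topics, top_words, memberships
```

Each row of `topics` is a nonnegative weight vector over the vocabulary, and `top_words` holds the column indices of the highest-weighted words per topic, which can be looked up in a vocabulary list to read off the topic.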

A systematical approach to classification problems with feature space heterogeneity

Kybernetes ◽

10.1108/k-06-2018-0313 ◽

2019 ◽

Vol 48 (9) ◽

pp. 2006-2029

Author(s):

Hongshan Xiao ◽

Yu Wang

Keyword(s):

Factor Analysis ◽

Meta Analysis ◽

Feature Space ◽

Classification Performance ◽

Classification Algorithm ◽

Significant Feature ◽

Data Sets ◽

Data Set ◽

Classification Techniques ◽

Content Type

Purpose – Feature space heterogeneity exists widely in various application fields of classification techniques, such as customs inspection decision, credit scoring and medical diagnosis. This paper aims to study the relationship between feature space heterogeneity and classification performance. Design/methodology/approach – A measurement is first developed for measuring and identifying any significant heterogeneity that exists in the feature space of a data set. The main idea of this measurement is derived from meta-analysis. For a data set with significant feature space heterogeneity, a classification algorithm based on factor analysis and clustering is proposed to learn the data patterns, which, in turn, are used for data classification. Findings – The proposed approach has two main advantages over the previous methods. The first advantage lies in feature transform using orthogonal factor analysis, which results in new features without redundancy and irrelevance. The second advantage rests on sample partitioning to capture the feature space heterogeneity reflected by differences of factor scores. The validity and effectiveness of the proposed approach are verified on a number of benchmarking data sets. Research limitations/implications – Measurement should be used to guide the heterogeneity elimination process, which is an interesting topic for future research. In addition, developing a classification algorithm that enables scalable and incremental learning for large data sets with significant feature space heterogeneity is also an important issue. Practical implications – Measuring and eliminating the feature space heterogeneity possibly existing in the data are important for accurate classification. This study provides a systematical approach to feature space heterogeneity measurement and elimination for better classification performance, which is favorable for applications of classification techniques in real-world problems. Originality/value – A measurement based on meta-analysis for measuring and identifying any significant feature space heterogeneity in a classification problem is developed, and an ensemble classification framework is proposed to deal with the feature space heterogeneity and improve the classification accuracy.

US academic libraries' staffing and expenditure trends (1996–2016)

Library Management ◽

10.1108/lm-12-2019-0093 ◽

2020 ◽

Vol 41 (4/5) ◽

pp. 247-268 ◽

Cited By ~ 1

Author(s):

Starr Hoffman ◽

Samantha Godbey

Keyword(s):

Academic Libraries ◽

The United States ◽

Data Sets ◽

Size Category ◽

Data Set ◽

Content Type ◽

Education Statistics ◽

Trends Over Time ◽

Doctoral Institutions ◽

Over Time

Purpose – This paper explores trends over time in library staffing and staffing expenditures among two- and four-year colleges and universities in the United States. Design/methodology/approach – Researchers merged and analyzed data from 1996 to 2016 from the National Center for Education Statistics for over 3,500 libraries at postsecondary institutions. This study is primarily descriptive in nature and addresses the research questions: How do staffing trends in academic libraries over this period of time relate to Carnegie classification and institution size? How do trends in library staffing expenditures over this period of time correspond to these same variables? Findings – Across all institutions, on average, total library staff decreased from 1998 to 2012. Numbers of librarians declined at master’s and doctoral institutions between 1998 and 2016. Numbers of students per librarian increased over time in each Carnegie and size category. Average inflation-adjusted staffing expenditures have remained steady for master's, baccalaureate and associate's institutions. Salaries as a percent of library budget decreased only among doctoral institutions and institutions with 20,000 or more students. Originality/value – This is a valuable study of trends over time, which has been difficult without downloading and merging separate data sets from multiple government sources. As a result, few studies have taken such an approach to this data. Consequently, institutions and libraries are making decisions about resource allocation based on only a fraction of the available data. Academic libraries can use this study and the resulting data set to benchmark key staffing characteristics.

A classification approach based on variable precision rough sets and cluster validity index function

Engineering Computations ◽

10.1108/ec-11-2012-0297 ◽

2014 ◽

Vol 31 (8) ◽

pp. 1778-1789

Author(s):

Hongkang Lin

Keyword(s):

Optimal Number ◽

Data Sets ◽

Cluster Validity ◽

Cluster Validity Index ◽

Index Method ◽

Data Set ◽

Content Type ◽

The Individual ◽

Variable Precision Rough Sets ◽

Optimal Number Of Clusters

Purpose – The clustering/classification method proposed in this study, designated as the PFV-index method, provides the means to solve the following problems for a data set characterized by imprecision and uncertainty: first, discretizing the continuous values of all the individual attributes within a data set; second, evaluating the optimality of the discretization results; third, determining the optimal number of clusters per attribute; and fourth, improving the classification accuracy (CA) of data sets characterized by uncertainty. The paper aims to discuss these issues. Design/methodology/approach – The proposed method for the solution of the clustering/classifying problem, designated as the PFV-index method, combines a particle swarm optimization algorithm, the fuzzy C-means method, variable precision rough sets theory, and a new cluster validity index function. Findings – This method can cluster the values of the individual attributes within the data set and achieve both the optimal number of clusters and the optimal CA. Originality/value – The validity of the proposed approach is investigated by comparing the classification results obtained for UCI data sets with those obtained by the supervised BPNN and decision-tree classification methods.

Biodiversity of Active and Inactive Bacteria in the Gut Flora of Wood-Feeding Huhu Beetle Larvae (Prionoplus reticularis)

Applied and Environmental Microbiology ◽

10.1128/aem.05609-11 ◽

2011 ◽

Vol 77 (19) ◽

pp. 7000-7006 ◽

Cited By ~ 63

Author(s):

Nicola M. Reid ◽

Sarah L. Addison ◽

Lucy J. Macdonald ◽

Gareth Lloyd-Jones

Keyword(s):

Genomic Dna ◽

Data Sets ◽

Data Set ◽

Content Type ◽

Dna And Rna ◽

New Information ◽

Related Organism ◽

The Family ◽

Insect Symbionts ◽

Derived Data

ABSTRACT Huhu grubs (Prionoplus reticularis) are wood-feeding beetle larvae endemic to New Zealand and belonging to the family Cerambycidae. Compared to the wood-feeding lower termites, very little is known about the diversity and activity of microorganisms associated with xylophagous cerambycid larvae. To address this, we used pyrosequencing to evaluate the diversity of metabolically active and inactive bacteria in the huhu larval gut. Our estimate, that the gut harbors at least 1,800 phylotypes, is based on 33,420 sequences amplified from genomic DNA and reverse-transcribed RNA. Analysis of genomic DNA- and RNA-derived data sets revealed that 71% of all phylotypes (representing 95% of all sequences) were metabolically active. Rare phylotypes contributed considerably to the richness of the community and were also largely metabolically active, indicating their participation in digestive processes in the gut. The dominant families in the active community (RNA data set) included Acidobacteriaceae (24.3%), Xanthomonadaceae (16.7%), Acetobacteraceae (15.8%), Burkholderiaceae (8.7%), and Enterobacteriaceae (4.1%). The most abundant phylotype comprised 14% of the active community and affiliated with Dyella ginsengisoli (Gammaproteobacteria), suggesting that a Dyella-related organism is a likely symbiont. This study provides new information on the diversity and activity of gut-associated microorganisms that are essential for the digestion of the nutritionally poor diet consumed by wood-feeding larvae. Many huhu gut phylotypes affiliated with insect symbionts or with bacteria present in acidic environments or associated with fungi.
