Dataset for file fragment classification of textual file formats

2019 ◽  
Vol 12 (1) ◽  
Author(s):  
Fatemeh Mansouri Hanis ◽  
Mehdi Teimouri

Abstract Objectives Classification of textual file formats is a topic of interest in network forensics. Only a few datasets of files with textual formats are publicly available, and there is no public dataset of file fragments of textual file formats. A major research challenge in file fragment classification of textual file formats is therefore comparing the performance of developed methods over the same datasets. Data description In this study, we present a dataset that contains file fragments of five textual file formats: the binary file format for Word 97–Word 2003, the Microsoft Word open XML format, the portable document format, the rich text file, and the standard text document. The dataset contains file fragments in three languages: English, Persian, and Chinese. For each pair of file format and language, 1500 file fragments are provided, so the dataset contains 22,500 file fragments in total.
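
A minimal sketch of how such a fragment dataset can be produced from a directory of source files. The directory layout (root/<format>/<language>), the 4096-byte fragment size, and the choice to drop the header fragment are assumptions for illustration, not details taken from the abstract:

```python
from pathlib import Path

FRAGMENT_SIZE = 4096  # bytes; assumed fragment size, not specified in the abstract

def fragment_file(path: Path, fragment_size: int = FRAGMENT_SIZE):
    """Split one file into fixed-size byte fragments, dropping the first fragment
    (which carries the format's magic bytes/header) and any short trailing piece."""
    data = path.read_bytes()
    fragments = [data[i:i + fragment_size] for i in range(0, len(data), fragment_size)]
    return [f for f in fragments[1:] if len(f) == fragment_size]

def build_dataset(root: str):
    """Walk an assumed root/<format>/<language>/ tree and collect
    (fragment_bytes, format_label, language_label) samples."""
    samples = []
    for fmt_dir in Path(root).iterdir():
        if not fmt_dir.is_dir():
            continue
        for lang_dir in fmt_dir.iterdir():
            if not lang_dir.is_dir():
                continue
            for file_path in lang_dir.glob("*"):
                for frag in fragment_file(file_path):
                    samples.append((frag, fmt_dir.name, lang_dir.name))
    return samples
```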

2019 ◽  
Vol 12 (1) ◽  
Author(s):  
Atieh Khodadadi ◽  
Mehdi Teimouri

Abstract Objectives File fragment classification of audio file formats is a topic of interest in network forensics. Only a few datasets of files with audio formats are publicly available, and there is no public dataset of file fragments of audio file formats. A major research challenge in file fragment classification of audio file formats is therefore comparing the performance of developed methods over the same datasets. Data description In this study, we present a dataset that contains file fragments of 20 audio file formats: AMR, AMR-WB, AAC, AIFF, CVSD, FLAC, GSM-FR, iLBC, Microsoft ADPCM, MP3, PCM, WMA, A-Law, µ-Law, G.726, G.729, Microsoft GSM, OGG Vorbis, OPUS, and SPEEX. For each format, the dataset contains file fragments of audio files with different compression settings. For each pair of file format and compression setting, 210 file fragments are provided. In total, the dataset contains 20,160 file fragments.
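
As an illustration of the kind of features commonly used as a baseline when classifying such fragments (this is a generic sketch, not the dataset authors' method), byte-value histograms and byte entropy can be computed per fragment; heavily compressed audio formats tend toward high entropy while uncompressed PCM/AIFF audio is usually lower:

```python
import numpy as np

def byte_histogram(fragment: bytes) -> np.ndarray:
    """Normalized 256-bin byte-value histogram of one fragment."""
    counts = np.bincount(np.frombuffer(fragment, dtype=np.uint8), minlength=256)
    return counts / max(len(fragment), 1)

def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte distribution (bits per byte)."""
    p = byte_histogram(fragment)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```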


2014 ◽  
Vol 940 ◽  
pp. 433-436 ◽  
Author(s):  
Ying Zhang ◽  
Xin Shi

Based on a detailed analysis of the STL file format, the VC++ 6.0 programming language was used to extract the information stored in both ASCII and binary STL files. OpenGL triangle-drawing technology was then used for the graphical representation of the STL file, together with rendering functions such as material, coordinate transformation, and lighting, finally realizing the loading and three-dimensional display of both ASCII and binary STL file formats.
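
To make the file layout concrete, here is a minimal Python sketch of reading both STL variants (a binary STL is an 80-byte header, a little-endian uint32 triangle count, then 50 bytes per triangle). This only illustrates the format itself, not the paper's VC++/OpenGL implementation:

```python
import struct

def read_binary_stl(path):
    """Read a binary STL: 80-byte header, uint32 triangle count, then per
    triangle 12 float32 values (normal + three vertices) and a 2-byte attribute."""
    triangles = []
    with open(path, "rb") as f:
        header = f.read(80)
        (count,) = struct.unpack("<I", f.read(4))
        for _ in range(count):
            record = struct.unpack("<12fH", f.read(50))
            normal = record[0:3]
            vertices = [record[3:6], record[6:9], record[9:12]]
            triangles.append((normal, vertices))
    return header, triangles

def read_ascii_stl(path):
    """Collect vertices from 'vertex x y z' lines of an ASCII STL and
    group every three vertices into one triangle."""
    vertices = []
    with open(path, "r", errors="ignore") as f:
        for line in f:
            parts = line.split()
            if parts and parts[0].lower() == "vertex":
                vertices.append(tuple(float(v) for v in parts[1:4]))
    return [vertices[i:i + 3] for i in range(0, len(vertices), 3)]
```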


Author(s):  
Chaoqing Wang ◽  
Junlong Cheng ◽  
Yuefei Wang ◽  
Yurong Qian

A vehicle make and model recognition (VMMR) system is a common requirement in the field of intelligent transportation systems (ITS). However, it is a challenging task because of the subtle differences between vehicle categories. In this paper, we propose a hierarchical scheme for VMMR. Specifically, the scheme consists of (1) a feature extraction framework called weighted mask hierarchical bilinear pooling (WMHBP), based on hierarchical bilinear pooling (HBP), which weakens the influence of invalid background regions by generating a weighted mask while extracting features from discriminative regions to form a more robust feature descriptor; (2) a hierarchical loss function that learns the appearance differences between vehicle brands and enhances vehicle recognition accuracy; and (3) the collection of vehicle images from the Internet, classified with hierarchical labels, to augment the data, addressing the problems of insufficient data and low image resolution and improving the model's generalization ability and robustness. We evaluate the proposed framework for accuracy and real-time performance; the experimental results show a recognition accuracy of 95.1% and 107 frames per second (FPS) on the Stanford Cars public dataset, which demonstrates the superiority of the method and its suitability for ITS.
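
For readers unfamiliar with the pooling operation WMHBP builds on, below is a minimal NumPy sketch of classic bilinear pooling of two feature maps. It is not the authors' implementation: the weighted mask and the hierarchical loss are omitted, and the feature maps are random stand-ins:

```python
import numpy as np

def bilinear_pool(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Bilinear pooling of two conv feature maps of shape (channels, height*width):
    outer product averaged over spatial positions, then signed sqrt and L2 norm."""
    assert feat_a.shape[1] == feat_b.shape[1], "spatial sizes must match"
    pooled = feat_a @ feat_b.T / feat_a.shape[1]      # (C_a, C_b) interaction matrix
    vec = pooled.flatten()
    vec = np.sign(vec) * np.sqrt(np.abs(vec))          # signed square root
    return vec / (np.linalg.norm(vec) + 1e-12)         # L2 normalization

# Example with random stand-in activations from two layers of a backbone network.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 14 * 14))
y = rng.standard_normal((256, 14 * 14))
descriptor = bilinear_pool(x, y)   # 256 * 256 = 65,536-dimensional descriptor
```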


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Julia Caffrey-Hill ◽  
Nathan Clark ◽  
Brent Davis ◽  
William Helman

The Portable Document Format (PDF) is one of the most common document file types in academia, both in the library and the classroom. Unfortunately, PDF poses unique barriers to accessibility, particularly for the visually impaired. Ensuring that all people can read PDF content can be complex and expensive. There are alternative formats that can be made accessible with a lower level of effort, providing a better experience for both the end reader and the document author. This article serves as a call to arms for higher education to migrate away from PDF and to urge the tech community to develop new file formats that lend themselves to enhanced accessibility on a limited budget.


Author(s):  
J. Boehm ◽  
K. Liu ◽  
C. Alis

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore attractive to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not natively supported by existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method by loading billions of points into a commodity hardware compute cluster, and we discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
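
A minimal PySpark sketch of this map-based ingestion idea, not the authors' implementation. It assumes the third-party laspy library is installed on every worker and that the (hypothetical) LAS file paths are reachable from all nodes:

```python
from pyspark.sql import SparkSession

def read_las_points(path):
    """Run a single-threaded point cloud reader on one worker and
    emit (x, y, z) tuples for the file it was handed."""
    import laspy  # imported inside the function so it resolves on the workers
    las = laspy.read(path)
    return list(zip(las.x, las.y, las.z))

spark = SparkSession.builder.appName("pointcloud-ingest").getOrCreate()
paths = ["/data/tiles/tile_{:04d}.las".format(i) for i in range(100)]  # hypothetical paths

points = (spark.sparkContext
          .parallelize(paths, numSlices=len(paths))  # one partition per input file
          .flatMap(read_las_points))                 # map-style distributed ingestion

print(points.count())
```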


2021 ◽  
Vol 297 ◽  
pp. 01071
Author(s):  
Sifi Fatima-Zahrae ◽  
Sabbar Wafae ◽  
El Mzabi Amal

Sentiment classification is one of the hottest research areas among Natural Language Processing (NLP) topics. It aims to detect the sentiment polarity of a given opinion and classify it, which requires extracting a large number of aspects. However, aspect extraction takes considerable human effort and time. To reduce this effort, the Latent Dirichlet Allocation (LDA) method has recently been applied to this issue. In this paper, an efficient preprocessing method for sentiment classification is presented and used to analyze users' comments on the Twitter social network. For this purpose, different text preprocessing techniques were applied to the dataset to obtain an acceptable standard text. Latent Dirichlet Allocation was then applied to the data obtained after this fast and accurate preprocessing phase. Several sentiment analysis methods were implemented, and their results were compared and evaluated. The experimental results show that combining the preprocessing method of this paper with Latent Dirichlet Allocation yields acceptable results compared to other baseline methods.
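
A minimal sketch of the general pipeline (not the authors' exact preprocessing or settings): simple tweet cleaning followed by scikit-learn's LatentDirichletAllocation to surface aspect-like topics. The tweets below are placeholders:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def preprocess(tweet: str) -> str:
    """Lowercase, strip URLs/mentions/hashtag marks, and drop non-letters."""
    tweet = tweet.lower()
    tweet = re.sub(r"http\S+|@\w+|#", " ", tweet)
    tweet = re.sub(r"[^a-z\s]", " ", tweet)
    return re.sub(r"\s+", " ", tweet).strip()

tweets = ["Loving the new update! http://t.co/x", "@support the app keeps crashing #fail"]
clean = [preprocess(t) for t in tweets]

vectorizer = CountVectorizer(stop_words="english", min_df=1)
X = vectorizer.fit_transform(clean)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures, usable as aspect features
```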


2018 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Roland Erwin Suri ◽  
Mohamed El-Saad

Purpose Changes in file format specifications challenge the long-term preservation of digital documents. Digital archives thus often focus on specific file formats that are well suited for long-term preservation, such as the PDF/A format. Since only a few customers submit PDF/A files, digital archives may consider converting submitted files to the PDF/A format. The paper aims to discuss these issues. Design/methodology/approach The authors evaluated three software tools for batch conversion of common file formats to PDF/A-1b: LuraTech PDF Compressor, Adobe Acrobat XI Pro, and 3-Heights™ Document Converter by PDF Tools. The test set consisted of 80 files, with 10 files each of the eight file types JPEG, MS PowerPoint, PDF, PNG, MS Word, MS Excel, MSG, and “web page.” Findings Batch processing was sometimes hindered by stops that required manual interference. Depending on the software tool, three to four of these stops occurred during batch processing of the 80 test files. Furthermore, the conversion tools sometimes failed to produce output files even for supported file formats: three (Adobe Pro) to seven (LuraTech and 3-Heights™) PDF/A-1b files were not produced. Since Adobe Pro does not convert e-mails, a total of 213 PDF/A-1b files were produced. The faithfulness of each conversion was investigated by comparing the visual appearance of the input document with that of the produced PDF/A-1b document on a computer screen. Meticulous visual inspection revealed that the conversion to PDF/A-1b impaired the information content in 24 of the 213 converted files (11 percent). These reproducibility errors included loss of links, loss of other document content (unreadable characters, missing text, missing document parts), updated fields (reflecting the time and folder of conversion), vector graphics issues, and spelling errors. Originality/value These results indicate that large-scale batch conversions of heterogeneous files to PDF/A-1b cause complex issues that need to be addressed for each individual file. Even with considerable effort, some information loss seems unavoidable if large numbers of files from heterogeneous sources are migrated to the PDF/A-1b format.


2018 ◽  
pp. 218-233
Author(s):  
Mayank Yuvaraj

When planning an institutional repository, digital library collection, or digital preservation service, it is inevitable to draft file format policies in order to ensure long-term digital preservation, accessibility, and compatibility. Sincere efforts have been made to encourage the adoption of standard formats, yet digital preservation policies vary from library to library. Against this background, the present paper aims to give the digital preservation community a common understanding of the file formats commonly used in digital libraries and institutional repositories. The paper discusses both open and proprietary file formats for several media.


Author(s):  
Adam Piotr Idczak

It is estimated that approximately 80% of all data gathered by companies are text documents. This article is devoted to one of the most common problems in text mining, i.e., text classification in sentiment analysis, which focuses on determining a document's sentiment. The lack of a defined structure in text makes this problem more challenging and has led to the development of various techniques for determining a document's sentiment. In this paper, a comparative analysis of two sentiment classification methods, the naive Bayes classifier and logistic regression, was conducted. The analysed texts are written in Polish and come from banks. Classification was conducted by means of the bag-of-n-grams approach, in which a text document is represented as a set of terms and each term consists of n words. The results show that logistic regression performed better.
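
A minimal scikit-learn sketch of the comparison described above, assuming a labelled corpus of bank-related sentences; the texts, labels, and n-gram range here are placeholders, not the paper's data or settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["obsługa klienta była świetna", "bardzo długi czas oczekiwania na przelew",
         "polecam ten bank każdemu", "ukryte opłaty i fatalny kontakt"]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative (placeholder labels)

# Bag-of-n-grams: each document becomes counts of word n-grams (unigrams and bigrams here).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5,
                                          random_state=0, stratify=labels)

for name, model in [("naive Bayes", MultinomialNB()),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```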


2021 ◽  
Vol 21 (3) ◽  
pp. 3-10
Author(s):  
Petr ŠALOUN ◽  
Barbora CIGÁNKOVÁ ◽  
David ANDREŠIČ ◽  
Lenka KRHUTOVÁ ◽  
...  

For a long time, both professionals and the lay public showed little interest in informal carers, yet these people deal with many common issues in their everyday lives. As the population ages, we can observe a change in this attitude, and thanks to advances in computer science we can offer informal carers effective assistance and support by providing necessary information and connecting them with both the professional and the lay community. In this work we describe a project called "Research and development of support networks and information systems for informal carers for persons after stroke", which produces an information system visible to the public as a web portal. It does not provide just a simple set of information: using artificial intelligence, text document classification, and crowdsourcing to further improve its accuracy, it also provides effective visualization of and navigation over content created largely by the community itself and personalized to the phase of the informal carer's care-taking timeline. This can be beneficial for informal carers, as it allows them to find content specific to their current situation. This work describes our approach to the classification of text documents and its improvement through crowdsourcing. Its goal is to test a text document classifier based on document similarity measured by the N-grams method and to design an evaluation and crowdsourcing-based classification improvement mechanism. The crowdsourcing interface was created using the CMS WordPress. In addition to data collection, the purpose of the interface is to evaluate classification accuracy, which extends the classifier's test data set and thus makes the classification more successful.
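
A minimal sketch of classification by n-gram document similarity in the spirit of the approach described above. The exact similarity measure and n-gram granularity used in the project are not specified here, so cosine similarity over word trigram counts and a 1-nearest-neighbour decision are assumed for illustration:

```python
from collections import Counter

def ngrams(text: str, n: int = 3) -> Counter:
    """Count word n-grams (trigrams by default) of a document."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

def classify(document: str, labelled_docs: dict) -> str:
    """Assign the label of the most similar labelled document (1-nearest neighbour).
    labelled_docs maps a label to one representative text for that label."""
    doc_grams = ngrams(document)
    return max(labelled_docs, key=lambda label: cosine(doc_grams, ngrams(labelled_docs[label])))
```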

