A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing

Deepfake and manipulated digital photos and videos are increasingly used in a myriad of cybercrimes. Ransomware, the dissemination of fake news, and digital kidnapping-related crimes are the most recurrent, with tampered multimedia content as the primary disseminating vehicle. Digital forensic analysis tools are widely used in criminal investigations to automate the identification of digital evidence in seized electronic equipment. The number of files to be processed and the complexity of the crimes under analysis have highlighted the need to employ efficient digital forensics techniques grounded on state-of-the-art technologies. Machine Learning (ML) researchers have been challenged to apply techniques and methods to improve the automatic detection of manipulated multimedia content. However, such methods have not yet been widely incorporated into digital forensic tools, mostly due to the lack of realistic and well-structured datasets of photos and videos. The diversity and richness of the datasets are crucial to benchmark the ML models and to evaluate their suitability for real-world digital forensics applications. An example is the development of third-party modules for the widely used Autopsy digital forensic application. This paper presents a dataset obtained by extracting a set of simple features from genuine and manipulated photos and videos, which are part of state-of-the-art existing datasets. The resulting dataset is balanced, and each entry comprises a label and a vector of numeric values corresponding to the features extracted through a Discrete Fourier Transform (DFT). The dataset is available in a GitHub repository, and the total number of photos and video frames is 40,588 and 12,400, respectively. The dataset was validated and benchmarked with deep learning Convolutional Neural Networks (CNN) and Support Vector Machines (SVM) methods; however, many other existing methods can be applied. In general, the results show a better F1-score for CNN than for SVM, for both photos and videos. CNN achieved an F1-score of 0.9968 and 0.8415 for photos and videos, respectively. For SVM, the results obtained with 5-fold cross-validation are 0.9953 and 0.7955 for photos and videos, respectively. A set of methods written in Python is available to researchers, namely to preprocess and extract the features from the original photo and video files and to build the training and testing sets. Additional methods are also available to convert the original PKL files into CSV and TXT, which gives ML researchers more flexibility to use the dataset with existing ML frameworks and tools.
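
As a rough illustration of the PKL-to-CSV conversion mentioned in this abstract, the sketch below assumes (hypothetically) that a PKL file holds a list of `(label, features)` pairs. The actual layout of the published files may differ, and `pkl_to_csv` is an illustrative name, not one of the authors' methods.

```python
import csv
import io
import pickle


def pkl_to_csv(pkl_stream, csv_stream):
    """Write pickled (label, features) entries as CSV rows.

    Assumption: the pickle holds a list of (label, feature_vector) pairs.
    Returns the number of rows written.
    """
    entries = pickle.load(pkl_stream)
    writer = csv.writer(csv_stream, lineterminator="\n")
    for label, features in entries:
        # One row per entry: the label first, then the numeric features.
        writer.writerow([label, *features])
    return len(entries)


# Tiny in-memory example with made-up values (not taken from the dataset).
sample = [(0, [0.5, 1.25]), (1, [2.0, 3.5])]
pkl_buffer = io.BytesIO(pickle.dumps(sample))
csv_buffer = io.StringIO()
rows_written = pkl_to_csv(pkl_buffer, csv_buffer)
csv_text = csv_buffer.getvalue()
```

Because `pickle.load` can execute arbitrary code while deserializing, it should only be used on PKL files from a trusted source.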

Exposing Manipulated Photos and Videos in Digital Forensics Analysis

Journal of Imaging ◽

10.3390/jimaging7070102 ◽

2021 ◽

Vol 7 (7) ◽

pp. 102

Author(s):

Sara Ferreira ◽

Mário Antunes ◽

Manuel E. Correia

Keyword(s):

Machine Learning ◽

Digital Forensics ◽

Machine Learning Techniques ◽

Support Vector ◽

Learning Approaches ◽

Video Frames ◽

Learning Techniques ◽

Compound Method ◽

Criminal Investigators ◽

Simple Features

Tampered multimedia content is being increasingly used in a broad range of cybercrime activities. The spread of fake news, misinformation, digital kidnapping, and ransomware-related crimes are amongst the most recurrent crimes in which manipulated digital photos and videos are the perpetrating and disseminating medium. Criminal investigators have been challenged to apply machine learning techniques to automatically distinguish between fake and genuine seized photos and videos. Despite the pertinent need for manual validation, easy-to-use platforms for digital forensics are essential to automate and facilitate the detection of tampered content and to help criminal investigators with their work. This paper presents a machine learning method based on Support Vector Machines (SVM) to distinguish between genuine and fake multimedia files, namely digital photos and videos, which may indicate the presence of deepfake content. The method was implemented in Python and integrated as new modules in the widely used digital forensics application Autopsy. The implemented approach extracts a set of simple features resulting from the application of a Discrete Fourier Transform (DFT) to digital photos and video frames. The model was evaluated with a large dataset of classified multimedia files containing both legitimate and fake photos and frames extracted from videos. Regarding deepfake detection in videos, the Celeb-DFv1 dataset was used, featuring 590 original videos collected from YouTube and covering different subjects. The results obtained with 5-fold cross-validation outperformed those of the SVM-based methods documented in the literature, achieving an average F1-score of 99.53%, 79.55%, and 89.10% for photos, videos, and a mixture of both types of content, respectively. A benchmark against state-of-the-art methods was also carried out by comparing the proposed SVM method with deep learning approaches, namely Convolutional Neural Networks (CNN). Although CNN outperformed the proposed DFT-SVM compound method, the competitiveness of the results attained by DFT-SVM and its substantially reduced processing time make it appropriate to be implemented and embedded into Autopsy modules, predicting the level of fakeness calculated for each analyzed multimedia file.
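
The abstract does not define the "simple features" obtained from the DFT. As a minimal sketch, assuming the features are a radially averaged magnitude spectrum (an assumption, not the authors' confirmed method), a pure-Python version for a very small grayscale image could look like this. The names `dft2` and `radial_profile` are hypothetical.

```python
import cmath
import math


def dft2(image):
    """Naive 2D DFT of a small grayscale image (list of equal-length rows)."""
    rows, cols = len(image), len(image[0])
    spectrum = []
    for u in range(rows):
        out_row = []
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * math.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            out_row.append(total)
        spectrum.append(out_row)
    return spectrum


def radial_profile(spectrum):
    """Average the DFT magnitude over rings of equal integer frequency radius.

    Assumption: this 1D profile stands in for the "simple features"; the
    exact feature definition is not given in the abstract.
    """
    rows, cols = len(spectrum), len(spectrum[0])
    sums, counts = {}, {}
    for u in range(rows):
        for v in range(cols):
            # Map indices to signed frequencies so radius 0 is the DC term.
            fu = u if u <= rows // 2 else u - rows
            fv = v if v <= cols // 2 else v - cols
            radius = int(round(math.hypot(fu, fv)))
            sums[radius] = sums.get(radius, 0.0) + abs(spectrum[u][v])
            counts[radius] = counts.get(radius, 0) + 1
    return [sums[r] / counts[r] for r in sorted(sums)]


# A constant 4x4 image has all its energy in the DC (radius 0) component.
flat = [[1.0] * 4 for _ in range(4)]
features = radial_profile(dft2(flat))
```

The naive transform costs O(N^4) for an N x N image, so it is only suitable for illustration; a real pipeline would use an FFT routine.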

MAPPING 2D TO 3D FORENSIC FACIAL RECOGNITION VIA BIO-INSPIRED ACTIVE APPEARANCE MODEL

Jurnal Teknologi ◽

10.11113/jt.v78.6939 ◽

2015 ◽

Vol 78 (2-2) ◽

Author(s):

Siti Zaharah Abd. Rahman ◽

Siti Norul Huda Sheikh Abdullah ◽

Lim Eng Hao ◽

Mohammed Hasan Abdulameer ◽

Nazri Ahmad Zamani ◽

...

Keyword(s):

Forensic Analysis ◽

Support Vector ◽

Appearance Model ◽

Active Appearance Model ◽

Video Footage ◽

Digital Forensic ◽

Facial Information ◽

Active Appearance ◽

2D Model ◽

2D To 3D

This research aims to solve the problems faced by digital forensic analysts in identifying a suspect captured on CCTV. Identifying a suspect from CCTV video footage is a very challenging task, as it involves tedious rounds of processing to match the facial information in the footage to a set of suspect images. The biggest problem in digital forensic analysis is that the 2D model extracted from CCTV video does not provide enough information to carry out the identification process. Problems occur when a suspect in the video is not facing the camera: the extracted image shows the suspect from the side and is difficult to match with a portrait image in the database. Many other factors, such as low-quality video, also make it difficult to extract facial information from a video. Through 2D to 3D image model mapping, incomplete partial face information can be matched more efficiently with 3D data by rotating it to the matching position. The first step of the methodology is data collection, with data obtained through a video recorder. The video is then converted into images, which are used to develop 2D and 3D Active Appearance Models (AAM). The AAM is used as input for the learning and testing process involving three classifiers: Random Forest, Support Vector Machine (SVM), and Neural Networks. The experimental results show that the 3D model is more suitable for face recognition, as its recognition rate is higher than that of the 2D model.

State of the Art Survey of Deep Learning and Machine Learning Models for Smart Cities and Urban Sustainability

10.31219/osf.io/gmuzk ◽

2020 ◽

Author(s):

Saeed Nosratabadi ◽

Amir Mosavi ◽

Ramin Keivani ◽

Sina Faizollahzadeh Ardabili ◽

Farshid Aram

Keyword(s):

Machine Learning ◽

Deep Learning ◽

State Of The Art ◽

Smart Cities ◽

Model Development ◽

Urban Sustainability ◽

Urban Transport ◽

Support Vector ◽

Neuro Fuzzy ◽

Vector Machines

Deep learning (DL) and machine learning (ML) methods have recently contributed to the advancement of models in various aspects of prediction, planning, and uncertainty analysis of smart cities and urban development. This paper presents the state of the art of DL and ML methods used in this realm. Through a novel taxonomy, the advances in model development and new application domains in urban sustainability and smart cities are presented. Findings reveal that five DL and ML methods have been most applied to address the different aspects of smart cities: artificial neural networks; support vector machines; decision trees; ensembles, Bayesians, hybrids, and neuro-fuzzy; and deep learning. The findings also show that energy, health, and urban transport are the main smart-city domains in which DL and ML methods have been applied to address problems.

A Novel Stacked Ensemble for Hate Speech Recognition

Applied Sciences ◽

10.3390/app112411684 ◽

2021 ◽

Vol 11 (24) ◽

pp. 11684

Author(s):

Mona Khalifa A. Aljero ◽

Nazife Dimililer

Keyword(s):

Machine Learning ◽

Hate Speech ◽

State Of The Art ◽

Majority Voting ◽

Support Vector ◽

Timely Manner ◽

Machine Learning Classifiers ◽

Significant Challenge ◽

Final Output ◽

Harmful Content

Detecting harmful content or hate speech on social media is a significant challenge due to the high throughput and large volume of content production on these platforms. Identifying hate speech in a timely manner is crucial in preventing its dissemination. We propose a novel stacked ensemble approach for detecting hate speech in English tweets. The proposed architecture employs an ensemble of three classifiers, namely support vector machine (SVM), logistic regression (LR), and XGBoost classifier (XGB), trained using word2vec and universal encoding features. The meta classifier, LR, combines the outputs of the three base classifiers and the features employed by the base classifiers to produce the final output. It is shown that the proposed architecture improves the performance of the widely used single classifiers as well as the standard stacking and classifier ensemble using majority voting. We also present results on the use of various combinations of machine learning classifiers as base classifiers. The experimental results from the proposed architecture indicated an improvement in the performance on all four datasets compared with the standard stacking, base classifiers, and majority voting. Furthermore, on three of these datasets, the proposed architecture outperformed all state-of-the-art systems.
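
To make the difference between majority voting and the stacking scheme described above concrete, here is a minimal sketch. It assumes the base classifiers emit hard 0/1 labels (the abstract does not say whether labels or probabilities are used), and all values are made up for illustration. The function names are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among base-classifier predictions."""
    return Counter(predictions).most_common(1)[0][0]


def stacked_input(base_outputs, features):
    """Build the meta-classifier input for one sample.

    As described in the abstract, the meta classifier sees both the base
    classifiers' outputs and the features the base classifiers used.
    """
    return list(base_outputs) + list(features)


# Hypothetical outputs of three base classifiers (SVM, LR, XGB) for one tweet.
base_outputs = [1, 0, 1]
features = [0.12, -0.40]  # made-up embedding values

vote = majority_vote(base_outputs)
meta_input = stacked_input(base_outputs, features)
```

Majority voting stops at `vote`, whereas the stacked design passes `meta_input` to a trained meta classifier (LR in the paper), which can learn when to trust each base classifier.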

Digital Forensic Analysis of Cybercrimes

International Journal of Information Security and Privacy ◽

10.4018/ijisp.2017040103 ◽

2017 ◽

Vol 11 (2) ◽

pp. 25-37 ◽

Cited By ~ 2

Author(s):

Regner Sabillon ◽

Jordi Serra-Ruiz ◽

Victor Cavaller ◽

Jeimy J. Cano

Keyword(s):

Best Practices ◽

International Collaboration ◽

Forensic Analysis ◽

Lessons Learned ◽

Digital Ecosystems ◽

Cyber Forensics ◽

Digital Forensic ◽

Chain Of Custody ◽

Technological Ecosystems ◽

Forensic Tools

This paper reviews the existing methodologies and best practices for the phases of digital investigations, such as collecting, evaluating, and preserving digital forensic evidence and the chain of custody of cybercrimes. Because cybercriminals are adopting new strategies to launch cyberattacks within modified and ever-changing digital ecosystems, this article proposes that digital investigations must continually readapt to tackle cybercrimes and prosecute cybercriminals, working in international collaboration networks and sharing prevention knowledge and lessons learned. The authors also introduce a compact cyber forensics model for diverse technological ecosystems called Cyber Forensics Model in Digital Ecosystems (CFMDE). Knowledge transfer, international collaboration, best practices, and the adoption of new digital forensic tools, methodologies, and techniques will be paramount to obtain digital evidence, enforce organizational cybersecurity policies, mitigate security threats, fight anti-forensics practices, and indict cybercriminals. The global digital forensics community ought to constantly update current practices to deal with cybercriminality and to foresee how to prepare for new technological environments where change is constant.

Linear Support Vector Machines for Prediction of Student Performance in School-Based Education

Mathematical Problems in Engineering ◽

10.1155/2020/4761468 ◽

2020 ◽

Vol 2020 ◽

pp. 1-7

Author(s):

Nalindren Naicker ◽

Timothy Adeliyi ◽

Jeanette Wing

Keyword(s):

Machine Learning ◽

Support Vector Machines ◽

Student Performance ◽

State Of The Art ◽

Learning Algorithms ◽

The State ◽

Machine Learning Algorithms ◽

Superior Performance ◽

Support Vector ◽

Vector Machines

Educational Data Mining (EDM) is a rich research field in computer science. Tools and techniques in EDM are useful for predicting student performance, which gives practitioners useful insights to develop appropriate intervention strategies to improve pass rates and increase retention. The performance of state-of-the-art machine learning classifiers depends heavily on the task at hand. Support vector machines have been used extensively in classification problems; however, the extant literature shows a gap in the application of linear support vector machines as a predictor of student performance. The aim of this study was to compare the performance of linear support vector machines with that of state-of-the-art classical machine learning algorithms in order to determine the algorithm that would improve prediction of student performance. In this quantitative study, an experimental research design was used. Experiments were set up using feature selection on a publicly available dataset of 1000 alphanumeric student records. Linear support vector machines, benchmarked against ten categorical machine learning algorithms, showed superior performance in predicting student performance. The results of this research showed that features such as race, gender, and lunch influence performance in mathematics, whilst access to lunch was the primary factor influencing reading and writing performance.
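
For readers unfamiliar with linear support vector machines, the following is a minimal sketch of one way to train such a classifier: plain subgradient descent on the L2-regularized hinge loss. This is a stand-in technique chosen for brevity, not necessarily the solver used in the study, and the toy data and hyperparameters are invented for illustration.

```python
def train_linear_svm(samples, labels, epochs=50, lr=0.1, reg=0.01):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularizer shrinks the weights on every step.
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                # Inside the margin or misclassified: move toward y * x.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Made-up, linearly separable toy data with labels in {+1, -1}.
samples = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
labels = [1, 1, -1, -1]

w, b = train_linear_svm(samples, labels)
train_predictions = [predict(w, b, x) for x in samples]
```

The decision function is a single weighted sum, so the learned weights can be read directly as feature influences, which is one reason linear models are attractive for tabular student records.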

Systematic Memory Forensic Analysis of Ransomware using Digital Forensic Tools

International Journal of Natural Computing Research ◽

10.4018/ijncr.2020040105 ◽

2020 ◽

Vol 9 (2) ◽

pp. 61-81

Author(s):

Paul Joseph ◽

Jasmine Norman

Keyword(s):

Forensic Analysis ◽

Vital Role ◽

Financial Loss ◽

Cyber Forensics ◽

Digital Forensic ◽

Memory Forensics ◽

Forensic Tools

Cybercrimes caused great financial loss in 2018, as powerful obfuscated malware known as ransomware remained a persistent threat to governments and organizations. Advanced malware capable of encrypting systems with sophisticated, obscure keys left organizations paying the ransom that hackers demanded. Since every individual is vulnerable to this assault, cyber forensics plays a vital role either in educating society or in combating the attacks. Cyber forensics is divided into many subdomains, and memory forensics is the one that leads in curbing these types of attacks. This article gives insight into the importance of memory forensics, provides an extensive analysis of how ransomware works, identifies its workflow, and presents ways to overcome the attack. Furthermore, this article implements user-defined rules by integrating them into the powerful search tool YARA to detect and prevent ransomware attacks.

Multi-Hazard Exposure Mapping Using Machine Learning Techniques: A Case Study from Iran

Remote Sensing ◽

10.3390/rs11161943 ◽

2019 ◽

Vol 11 (16) ◽

pp. 1943 ◽

Cited By ~ 15

Author(s):

Omid Rahmati ◽

Saleh Yousefi ◽

Zahra Kalantari ◽

Evelyn Uuemaa ◽

Teimur Teimurian ◽

...

Keyword(s):

Machine Learning ◽

State Of The Art ◽

Characteristic Curve ◽

Machine Learning Techniques ◽

Support Vector ◽

Mountainous Area ◽

Data Set ◽

Boosted Regression Tree ◽

Hazard Exposure ◽

Learning Techniques

Mountainous areas are highly prone to a variety of nature-triggered disasters, which often cause disabling harm, death, destruction, and damage. In this work, an attempt was made to develop an accurate multi-hazard exposure map for a mountainous area (Asara watershed, Iran), based on state-of-the-art machine learning techniques. Hazard modeling for avalanches, rockfalls, and floods was performed using three state-of-the-art models: support vector machine (SVM), boosted regression tree (BRT), and generalized additive model (GAM). Topo-hydrological and geo-environmental factors were used as predictors in the models. A flood dataset (n = 133 flood events) was applied, which had been prepared using Sentinel-1-based processing and ground-based information. In addition, snow avalanche (n = 58) and rockfall (n = 101) data sets were used. The data set of each hazard type was randomly divided into two groups: training (70%) and validation (30%). Model performance was evaluated by the true skill score (TSS) and the area under the receiver operating characteristic curve (AUC) criteria. Using an exposure map, the multi-hazard map was converted into a multi-hazard exposure map. According to both validation methods, the SVM model showed the highest accuracy for avalanches (AUC = 92.4%, TSS = 0.72) and rockfalls (AUC = 93.7%, TSS = 0.81), while BRT demonstrated the best performance for flood hazards (AUC = 94.2%, TSS = 0.80). Overall, multi-hazard exposure modeling revealed that valleys and areas close to the Chalous Road, one of the most important roads in Iran, were associated with high and very high levels of risk. The proposed multi-hazard exposure framework can be helpful in supporting decision making on mountain social-ecological systems facing multiple hazards.
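
The two validation criteria named above can be computed as in the sketch below. It uses the usual definition of the true skill statistic (sensitivity + specificity - 1) and the pairwise-ranking form of AUC. The confusion counts and scores are invented for illustration and are not taken from the study.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, ranging from -1 to 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up validation results for one hazard model.
tss = true_skill_statistic(tp=9, fn=1, tn=8, fp=2)
area = auc(scores_pos=[0.9, 0.8, 0.4], scores_neg=[0.7, 0.3, 0.2])
```

TSS depends on a chosen classification threshold, while AUC summarizes ranking quality over all thresholds, which is why the two are reported side by side.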

Unsupervised Action Proposals Using Support Vector Classifiers for Online Video Processing

Sensors ◽

10.3390/s20102953 ◽

2020 ◽

Vol 20 (10) ◽

pp. 2953

Author(s):

Marcos Baptista Ríos ◽

Roberto Javier López-Sastre ◽

Francisco Javier Acevedo-Rodríguez ◽

Pilar Martín-Martín ◽

Saturnino Maldonado-Bascón

Keyword(s):

Video Processing ◽

State Of The Art ◽

Learning To Rank ◽

Support Vector ◽

New Approach ◽

Support Vector Classifier ◽

Current State ◽

Video Frames ◽

Unsupervised Approach ◽

Video Sensor

In this work, we introduce an intelligent video sensor for the problem of Action Proposals (AP). AP consists of localizing temporal segments in untrimmed videos that are likely to contain actions. Solving this problem can accelerate several video action understanding tasks, such as detection, retrieval, or indexing. All previous AP approaches are supervised and offline, i.e., they need both the temporal annotations of the datasets during training and access to the whole video to effectively cast the proposals. We propose here a new approach which, unlike the rest of the state-of-the-art models, is unsupervised. This implies that we do not allow it to see any labeled data during learning nor to work with any features pre-trained on the dataset used. Moreover, our approach also operates in an online manner, which can be beneficial for many real-world applications where the video has to be processed as soon as it arrives at the sensor, e.g., robotics or video monitoring. The core of our method is based on a Support Vector Classifier (SVC) module which produces candidate segments for AP by distinguishing between sets of contiguous video frames. We further propose a mechanism to refine and filter those candidate segments. This filter optimizes a learning-to-rank formulation over the dynamics of the segments. An extensive experimental evaluation is conducted on the Thumos’14 and ActivityNet datasets, and, to the best of our knowledge, this work is the first unsupervised approach on these main AP benchmarks. Finally, we also provide a thorough comparison to the current state-of-the-art supervised AP approaches. We achieve 41% and 59% of the performance of the best supervised model on ActivityNet and Thumos’14, respectively, confirming our unsupervised solution as a viable option to tackle the AP problem. The code to reproduce all our results will be publicly released upon acceptance of the paper.

Single-Cell Phenotype Classification Using Deep Convolutional Neural Networks

Journal of Biomolecular Screening ◽

10.1177/1087057116631284 ◽

2016 ◽

Vol 21 (9) ◽

pp. 998-1003 ◽

Cited By ~ 42

Author(s):

Oliver Dürr ◽

Beate Sick

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Deep Learning ◽

Single Cell ◽

Convolutional Neural Networks ◽

State Of The Art ◽

Misclassification Rate ◽

Support Vector ◽

Learning Methods ◽

Phenotype Classification

Deep learning methods are currently outperforming traditional state-of-the-art computer vision algorithms in diverse applications and recently even surpassed human performance in object recognition. Here we demonstrate the potential of deep learning methods for high-content screening-based phenotype classification. We trained a deep learning classifier in the form of convolutional neural networks with approximately 40,000 publicly available single-cell images from samples treated with compounds from four classes known to lead to different phenotypes. The input data consisted of multichannel images. The construction of appropriate feature definitions was part of the training and carried out by the convolutional network, without the need for expert knowledge or handcrafted features. We compare our results against the recent state-of-the-art pipeline in which predefined features are extracted from each cell using specialized software and then fed into various machine learning algorithms (support vector machine, Fisher linear discriminant, random forest) for classification. The performance of all classification approaches is evaluated on an untouched test image set with known phenotype classes. Compared to the best reference machine learning algorithm, the misclassification rate is reduced from 8.9% to 6.6%.
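
The reported drop in misclassification rate can be read either in absolute or in relative terms. The short calculation below uses only the two rates quoted in the abstract; the variable names are illustrative.

```python
baseline_error = 0.089  # best reference algorithm, as quoted in the abstract
cnn_error = 0.066       # convolutional network, as quoted in the abstract

# Absolute reduction in percentage points (expressed as a fraction).
absolute_reduction = baseline_error - cnn_error

# Relative reduction: share of the baseline's errors that were removed.
relative_reduction = absolute_reduction / baseline_error
```

The absolute reduction is 2.3 percentage points, which corresponds to removing roughly a quarter (about 26%) of the errors made by the best reference algorithm.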
