A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing



Data Descriptor
Sara Ferreira 1,*, Mário Antunes 2,3,* and Manuel E. Correia 1,3


Citation: Ferreira, S.; Antunes, M.; Correia, M.E. A Dataset of Photos and Videos for Digital Forensics Analysis Using Machine Learning Processing. Data 2021, 6, 87. https://doi.org/10.3390/data6080087
Academic Editor: Joaquín Torres-Sospedra
Received: 7 July 2021
Accepted: 3 August 2021
Published: 5 August 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1 Department of Computer Science, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal; mdcorrei@fc.up.pt
2 Computer Science and Communication Research Centre (CIIC), School of Technology and Management, Polytechnic of Leiria, 2411-901 Leiria, Portugal
3 INESC TEC, CRACS, 4200-465 Porto, Portugal
* Correspondence: sara.ferreira@fc.up.pt (S.F.); mario.antunes@ipleiria.pt (M.A.)
Abstract: Deepfake and manipulated digital photos and videos are increasingly used in a myriad of cybercrimes. Ransomware, the dissemination of fake news, and digital kidnapping-related crimes are the most recurrent, with tampered multimedia content as the primary vehicle of dissemination. Digital forensic analysis tools are widely used in criminal investigations to automate the identification of digital evidence in seized electronic equipment. The number of files to be processed and the complexity of the crimes under analysis have highlighted the need for efficient digital forensics techniques grounded in state-of-the-art technologies. Machine Learning (ML) researchers have been challenged to apply techniques and methods to improve the automatic detection of manipulated multimedia content. However, such methods have not yet been widely incorporated into digital forensic tools, mostly due to the lack of realistic and well-structured datasets of photos and videos. The diversity and richness of the datasets are crucial for benchmarking ML models and evaluating their suitability for real-world digital forensics applications, such as the development of third-party modules for the widely used Autopsy digital forensic application. This paper presents a dataset obtained by extracting a set of simple features from genuine and manipulated photos and videos that are part of existing state-of-the-art datasets. The resulting dataset is balanced, and each entry comprises a label and a vector of numeric values corresponding to the features extracted through a Discrete Fourier Transform (DFT). The dataset is available in a GitHub repository and comprises 40,588 photos and 12,400 video frames. The dataset was validated and benchmarked with deep learning Convolutional Neural Networks (CNN) and Support Vector Machines (SVM); however, many other methods can be applied. Overall, CNN achieved a better F1-score than SVM for both photo and video processing: 0.9968 for photos and 0.8415 for videos. SVM, evaluated with 5-fold cross-validation, achieved 0.9953 and 0.7955 for photos and videos, respectively. A set of methods written in Python is available to researchers, namely to preprocess and extract the features from the original photo and video files and to build the training and testing sets. Additional methods are available to convert the original PKL files into CSV and TXT, which gives ML researchers more flexibility to use the dataset with existing ML frameworks and tools.
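To make the idea of a DFT-derived feature vector concrete, the following is a minimal sketch and not the authors' published code. It assumes a grayscale image held in a NumPy array and reduces its 2-D spectrum to a fixed-length vector by averaging the log-magnitude over concentric rings. The function name `dft_features`, the ring averaging, and the default of 50 bins are all illustrative assumptions.

```python
import numpy as np


def dft_features(gray, n_bins=50):
    """Return a fixed-length feature vector from a 2-D grayscale image.

    Assumption: features are the ring-averaged log-magnitude of the
    centred 2-D DFT. This is an illustrative choice, not necessarily
    the pipeline used to build the dataset.
    """
    gray = np.asarray(gray, dtype=np.float64)

    # 2-D DFT with the zero-frequency component moved to the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    log_mag = np.log1p(np.abs(spectrum))

    # Distance of every frequency sample from the spectrum centre.
    h, w = gray.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)

    # Assign each sample to one of n_bins equal-width rings.
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bins + 1)
    ring = np.digitize(radius.ravel(), edges) - 1

    # Mean log-magnitude per ring; empty rings are reported as 0.
    sums = np.bincount(ring, weights=log_mag.ravel(), minlength=n_bins)
    counts = np.bincount(ring, minlength=n_bins)
    return sums / np.maximum(counts, 1)


# One dataset entry as described in the abstract: a label plus a numeric vector.
# The label value 0 is a hypothetical encoding chosen for this demo.
demo_image = np.random.default_rng(0).random((64, 64))
entry = (0, dft_features(demo_image))
```

Averaging over rings makes the vector length independent of the image resolution, which is one plausible reason to prefer it over the raw spectrum when photos and video frames of different sizes are mixed.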
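The PKL-to-CSV conversion mentioned in the abstract can be sketched with the Python standard library alone. This is not the repository's own converter: the helper name `pkl_to_csv` and the assumed pickle layout, a list of `(label, feature_vector)` pairs, are hypothetical, so the real files should be inspected before adapting it.

```python
import csv
import pickle


def pkl_to_csv(pkl_path, csv_path):
    """Convert a pickled list of (label, features) pairs to a CSV file.

    Assumed layout (hypothetical): [(label, [f0, f1, ...]), ...].
    Returns the number of data rows written.
    """
    # Only unpickle files from a source you trust: pickle can run code.
    with open(pkl_path, "rb") as fh:
        entries = pickle.load(fh)

    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        # Header: the label column followed by one column per feature.
        n_features = len(entries[0][1]) if entries else 0
        writer.writerow(["label"] + [f"f{i}" for i in range(n_features)])
        for label, features in entries:
            writer.writerow([label, *map(float, features)])
    return len(entries)
```

A call such as `pkl_to_csv("photos.pkl", "photos.csv")` (file names chosen only for illustration) would produce a table that most ML tools can load directly.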
