Perception of audio deepfakes in Spanish and Japanese: effects of language, speaking style, and voice familiarity

Abstract

Deepfakes pose significant challenges to forensic phonetics, undermining citizen security and trust in digital media. Understanding the human ability to distinguish synthetic audio from authentic audio is therefore crucial to addressing this growing threat. Using PsychoPy, we conducted a perceptual experiment in which participants classified real and fake audio samples. The test presented Spanish and Japanese stimuli to native Spanish speakers in order to examine the impact of language knowledge on performance; previous studies have explored this variable, and we aim to compare their results with our findings. Additionally, this study evaluates how speaking style (interviews vs. text reading) and familiarity with the speaker's voice affect performance. The experiment includes 80 voice samples (M = 10.15 s), 50% real and 50% fake. For the real interview samples, we selected 10 Spanish stimuli from VoxCeleb-ESP and 10 Japanese stimuli from EACELEB. For the 20 real text-reading samples, 10 Spanish and 10 Japanese stimuli were sourced from LibriVox and YouTube audiobooks. These 40 real stimuli (interviews and text reading) were then cloned using Eleven Labs to generate their synthetic counterparts.
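The balanced design described above (2 languages × 2 speaking styles × 10 real samples per cell, each paired with a cloned counterpart, for 80 trials total) can be sketched in plain Python. This is an illustrative reconstruction, not the authors' actual experiment code; all function and field names here are hypothetical.

```python
import itertools
import random

def build_stimulus_list(seed=0):
    """Sketch of the 80-trial stimulus set: 2 languages x 2 styles x
    10 real samples, each with a synthetic (cloned) counterpart."""
    stimuli = []
    for language, style in itertools.product(
        ("Spanish", "Japanese"), ("interview", "text_reading")
    ):
        for i in range(10):  # 10 real samples per language/style cell
            real = {"language": language, "style": style,
                    "speaker": i, "authentic": True}
            fake = dict(real, authentic=False)  # cloned counterpart
            stimuli.extend([real, fake])
    random.Random(seed).shuffle(stimuli)  # randomize presentation order
    return stimuli

trials = build_stimulus_list()
```

In an actual PsychoPy session, each entry in `trials` would drive one classification trial (play the audio, record a real/fake response); the shuffle ensures real samples and their clones are not presented back to back in a fixed order.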

Date
Location
Málaga, Spain