Speaker identification of artificially cloned voices versus the voices of identical twins: a perceptual test.

Abstract

Most previous perceptual studies of voice deepfakes have been online tests in which listeners decide whether a voice is real or fake. In this study, we designed a multiple forced choice (same/different) listening experiment in Praat using 12 different voice samples, each around 3-5 seconds long, extracted from semi-directed spontaneous conversations of two pairs of twins. Two different voice samples (a and b) were used per twin member (A and B), and each voice was also artificially cloned. The experiment comprised 20 stimuli in 3 types of pairings: 8 same-speaker pairings, 8 different-speaker (intra-twin) pairings, and 4 pairings of a speaker with his voice deepfake.
Thirty listeners took part in the perceptual experiment under controlled conditions (same headphones, same computer, and a silent room). The listeners were college students and native speakers of Standard Peninsular Spanish. They were not informed that the stimuli included twin voices and deepfakes. The test lasted approximately 10 minutes, and reaction times were measured.
The objective of this investigation was to determine whether identification results (hits, misses, false alarms, and correct rejections), as well as reaction times, depend on the type of stimulus pairing: same-speaker, different-speaker, or real-deepfake. Our hypotheses are: (1) listeners will perform worse when listening to deepfakes than when listening to real voices, even if the real voices are from identical twins, who represent an extreme case of similarity between human speakers; (2) reaction times will be longer when listening to deepfakes.
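
As an illustration of how the four identification outcomes named above could be tallied from same/different responses, a minimal Python sketch follows. It is not the analysis used in the study. It assumes that "same" is the target response, so that same-speaker pairings are signal trials and different-speaker (intra-twin) pairings are noise trials. It also assumes that real-deepfake pairings are counted separately, since how they are scored is not stated here. The function names, the outcome labels, and the example trials are hypothetical.

```python
from collections import Counter


def classify(pair_type, response):
    """Map one same/different trial to an outcome label.

    Assumption (not from the study): 'same' is the target response.
    pair_type is 'same', 'different', or 'deepfake';
    response is 'same' or 'different'.
    """
    if pair_type == "same":
        return "hit" if response == "same" else "miss"
    if pair_type == "different":
        return "false_alarm" if response == "same" else "correct_rejection"
    # Real-deepfake pairings are kept apart from the four classic outcomes.
    return "deepfake_judged_same" if response == "same" else "deepfake_judged_different"


def summarize(trials):
    """Count outcomes over an iterable of (pair_type, response) tuples."""
    return Counter(classify(pair_type, response) for pair_type, response in trials)


# Hypothetical responses from one listener, for illustration only.
example_trials = [
    ("same", "same"),
    ("same", "different"),
    ("different", "different"),
    ("different", "same"),
    ("deepfake", "same"),
]
outcome_counts = summarize(example_trials)
```

Keeping the real-deepfake pairings in their own category lets them be compared against the twin pairings without committing to which response counts as correct for a cloned voice.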

Date
Location
Faro (Portugal)