Voice deepfakes and the voices of identical twins: a preliminary acoustic analysis

Abstract

Until very recently, if one were asked which voices are most likely to sound alike, the answer would probably have been: siblings’ voices, twins’ voices, or the voices of close relatives. That answer still holds for ‘real’ (human) voices, but it needs to be nuanced if we consider the disruptive arrival of artificial intelligence in our lives. A deepfake, or artificially cloned voice, seems like the closest possible match to a person’s voice. But how similar can it really be? Is this similarity comparable to the similarity found between the voices of identical twins?
In this preliminary study, we chose two pairs of identical (i.e. monozygotic, MZ) male twins from the Twin Corpus (Standard Peninsular Spanish). Two main criteria were established in order to select the two most similar-sounding twin pairs from the corpus: (i) similar age and (ii) similar Euclidean distance (ED) between each speaker and his twin. EDs were based on the perceptual assessment of their voice quality using a simplified version of the Vocal Profile Analysis.
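As a minimal sketch of the second selection criterion, the snippet below computes a Euclidean distance between two perceptual rating vectors. The number of settings, the rating scale, and the values themselves are hypothetical placeholders, not the simplified Vocal Profile Analysis protocol or the ratings used in this study.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equally long rating vectors."""
    if len(a) != len(b):
        raise ValueError("rating vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical perceptual ratings on four illustrative voice-quality settings
# (assumed values, for illustration only).
twin_a = [2, 0, 3, 1]
twin_b = [1, 0, 3, 3]

distance = euclidean_distance(twin_a, twin_b)
print(f"ED between twin A and twin B: {distance:.3f}")
```

A smaller distance means the two speakers received more similar perceptual ratings; pairs with comparable distances can then be treated as comparably similar-sounding.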
We acoustically analyzed 12 voice samples, each around 3-5 seconds long, extracted from semi-directed spontaneous conversations. For each of the four speakers there were two real samples plus one further sample of his voice artificially cloned with a commercial AI voice cloning tool.
A total of 21 acoustic parameters related to voice quality were analyzed using VOXplot. Preliminary results show parameter-dependent as well as twin-dependent differences. For instance, the glottal-to-noise excitation ratio (GNE) and the long-term average spectrum (LTAS) show large intra-speaker, intra-twin and real-fake similarity in one twin pair, but a larger difference for the fake sample of one particular speaker in the other twin pair. This highlights the importance of analyzing more twin pairs in a larger study.
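The three comparison types mentioned above (intra-speaker, intra-twin and real-fake) can be illustrated for a single parameter as follows. This is only a sketch under stated assumptions: the GNE values are invented placeholders, and the use of absolute differences between means is an illustrative choice, not necessarily the comparison method applied in this study.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def compare(real_speaker, real_co_twin, fake_speaker):
    """Absolute differences for one acoustic parameter.

    real_speaker: values from the speaker's two real samples
    real_co_twin: values from the co-twin's two real samples
    fake_speaker: value from the speaker's cloned (fake) sample
    """
    return {
        "intra_speaker": abs(real_speaker[0] - real_speaker[1]),
        "intra_twin": abs(mean(real_speaker) - mean(real_co_twin)),
        "real_fake": abs(mean(real_speaker) - fake_speaker),
    }


# Hypothetical GNE values (assumed, for illustration only).
result = compare(
    real_speaker=[0.90, 0.88],
    real_co_twin=[0.86, 0.87],
    fake_speaker=0.80,
)
for name, diff in result.items():
    print(f"{name}: {diff:.3f}")
```

In this made-up example the real-fake difference exceeds both the intra-speaker and the intra-twin difference, which is the kind of pattern described for one speaker in the second twin pair.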

Date
Location
Faro (Portugal)