PI: Eugenia San Segundo (UNED)
Team members: Victoria Marrero (UNED), Jonathan Delgado (ULL), Jorge Gurlekian, Pedro Univaso and Humberto Torres (CONICET).
Dates: 2022 - 2025
Funder: Spanish Ministry of Science and Innovation
TITLE OF THE PROJECT IN SPANISH (ACRONYM): ¿Qué hace humana a una voz? Hacia una mejor comprensión de las características fonéticas que permiten distinguir voces reales de deepfakes (HowDIs: How Deepfake Is your voice)
TITLE OF THE PROJECT (ACRONYM): How deepfake is your voice? Understanding the linguistic foundations of deepfakes (HowDIs)
This project aims to bridge the gap between the latest research in telecommunications engineering, aimed at designing automatic systems to prevent spoofing attacks, and the linguistic-phonetic knowledge that can help distinguish real voices from fake ones, thereby helping to detect highly realistic fake videos and audios (the so-called "deepfakes"). Although the linguistic side has not yet been sufficiently explored in this context, we consider that it has great potential for addressing one of the most important security challenges faced worldwide today: deepfakes, a term for videos or audios that, without being real, appear so thanks to extreme and sophisticated manipulation carried out with artificial intelligence techniques. This manipulation generally occurs in the facial and sound domains; in other words, what is most frequently transformed are people's faces and voices.

In the field of voice in particular (studied by both phoneticians and speech engineers), the rapid development of deep neural networks poses a real threat: deepfakes may produce audio of such extreme realism that it will not be possible to distinguish it from real voices. If these technological advances fall into the hands of criminals, such as fraudsters, there is a serious risk of major security breaches, given that access to bank accounts through voice authentication is increasingly common. Such situations are known as spoofing attacks. Beyond these risks, however, research on deepfakes has a broader impact on society, since the manipulation of voices (of politicians, for example) to create fake news is increasingly frequent. This is leading to an unprecedented loss of confidence in the media, with clear repercussions on the decisions citizens make in political life, as shown by some recent examples (e.g. fake news in the Brexit campaign or in the last presidential elections in the US).

In this project we propose that the fight against voice deepfakes should be led by interdisciplinary research groups, beyond the engineering teams that are currently designing anti-spoofing systems. Forensic phoneticians have extensive experience in comparing very similar voices (such as the voices of twins), analyzing a wide variety of acoustic and perceptual vocal parameters, and the characterization of voices in Spanish can be approached from multiple angles. Previous studies have shown that human voices share a common denominator: high intra-speaker variation, which stems from emotional and pragmatic-contextual factors, factors related to the speaker's health, and so on. This type of variation is difficult to control voluntarily; it can therefore be said to constitute part of what makes us human (one simple way of quantifying it is sketched below). In short, this project aims to answer the question of what makes a voice human as a first step towards being able to distinguish when we are dealing with a deepfake.
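By way of illustration only (the project description does not prescribe any particular toolkit or parameter set), the following minimal Python sketch shows one crude proxy for intra-speaker variation: the coefficient of variation of mean fundamental frequency (f0) across several recordings of the same speaker. The file names are hypothetical placeholders, and the choice of librosa's pYIN pitch tracker is an assumption made for the example, not the project's method.

    # Minimal sketch (not the project's actual pipeline): estimate one simple
    # proxy for intra-speaker variation, the coefficient of variation of mean
    # fundamental frequency (f0) across recordings of the same speaker.
    import numpy as np
    import librosa

    def mean_f0(path, sr=16000):
        """Return the mean f0 (Hz) over the voiced frames of one recording."""
        y, sr = librosa.load(path, sr=sr)
        f0, voiced_flag, _ = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
        )
        return np.mean(f0[voiced_flag])  # keep only voiced frames

    # Hypothetical recordings of one speaker in different speaking situations
    recordings = ["spk01_neutral.wav", "spk01_emotional.wav", "spk01_with_cold.wav"]

    means = [mean_f0(p) for p in recordings]
    # Coefficient of variation of mean f0 across situations: one crude
    # indicator of how much a voice varies from one context to another.
    cv = np.std(means) / np.mean(means)
    print(f"Intra-speaker f0 coefficient of variation: {cv:.3f}")

A high value would point to a voice whose pitch behaviour shifts markedly across situations, the kind of hard-to-control natural variability that, on the project's hypothesis, helps separate human voices from deepfakes; a full analysis would of course combine many more acoustic and perceptual parameters than f0 alone.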