IBM security researchers have demonstrated how hackers can exploit generative artificial intelligence (Gen AI) and deep fake audio technology to hijack and manipulate live conversations.
The researchers devised a method they dubbed “audio-jacking,” which lets threat actors intercept a speaker's audio and seamlessly replace snippets of the authentic voice with deep fake replicas. Unlike traditional deep fake attacks that fabricate an entire voice, this technique operates in real time, dynamically modifying the conversation based on its context.
“We discovered a way to intercept a live conversation and replace keywords based on the context,” explained the researchers. “Rather than creating a fake voice for the entire conversation, which is relatively easy to detect, our method allows for subtle alterations that blend seamlessly into the dialogue.”
Hackers could mount such an attack through several vectors, including malware installed on victims' devices or compromised Voice over IP (VoIP) services. Sophisticated social engineering tactics could also be used to initiate conversations between targeted individuals in the first place.
To demonstrate the attack, the researchers developed a proof-of-concept (PoC) that acts as a man-in-the-middle, monitoring a live conversation. Using speech-to-text conversion and a language model's understanding of context, the program dynamically rewrites sentences whenever specific keywords are mentioned.
In the PoC scenario, the researchers instructed the system to modify any sentence related to bank accounts, but, according to the team, the technique could be adapted to target other financial information, including accounts on mobile applications and digital payment services.
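To make the reported pipeline concrete, the following is a minimal Python sketch of how such a relay might be wired together. It is not the researchers' implementation (their code is not included in the report): the stub functions speech_to_text, rewrite_with_llm, and text_to_speech, the trigger keyword, and the account numbers are all hypothetical stand-ins for the streaming speech-to-text engine, language model, and voice-cloning text-to-speech a real PoC would use.

```python
import re

# Minimal sketch of the reported "audio-jacking" relay. All names and values
# below are hypothetical stand-ins; a real PoC would wire in a streaming
# speech-to-text engine, an LLM, and a voice-cloning text-to-speech model.

ATTACKER_ACCOUNT = "4000 1234 5678 9010"  # hypothetical destination account

def speech_to_text(audio_chunk: bytes) -> str:
    """Stub: stands in for a streaming speech-to-text engine."""
    return audio_chunk.decode("utf-8", errors="ignore")

def rewrite_with_llm(sentence: str) -> str:
    """Stub: stands in for asking an LLM to rewrite the sentence in context.
    Here, a regex simply swaps anything resembling an account number."""
    return re.sub(r"\b(?:\d[ -]?){10,}\b", ATTACKER_ACCOUNT, sentence)

def text_to_speech(sentence: str) -> bytes:
    """Stub: stands in for synthesizing the rewritten sentence in a deep
    fake replica of the speaker's voice."""
    return sentence.encode("utf-8")

# The relay only intervenes when a trigger keyword is heard.
TRIGGER = re.compile(r"\bbank account\b", re.IGNORECASE)

def relay(audio_chunk: bytes) -> bytes:
    """Man-in-the-middle relay: pass audio through untouched unless a
    trigger keyword appears, then forward a deep fake replacement."""
    sentence = speech_to_text(audio_chunk)
    if TRIGGER.search(sentence):
        return text_to_speech(rewrite_with_llm(sentence))
    return audio_chunk  # most of the conversation is relayed unmodified

if __name__ == "__main__":
    chunk = b"Please wire the funds to my bank account 1111 2222 3333 4444"
    print(relay(chunk).decode())
    # -> "Please wire the funds to my bank account 4000 1234 5678 9010"
```

Note that such a relay would leave most of the audio untouched and swap in synthesized speech only when a trigger fires, which is what lets the alterations, as the researchers put it, “blend seamlessly into the dialogue.”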