Adobe VoCo: Photoshop for Audio

This review of Adobe VoCo was written by IGM lecturer Cody Van De Mark.


Recently Adobe announced an experimental technology called #VoCo, “Photoshop for audio” [link]. The idea behind #VoCo is that it extracts the dialogue from an audio clip as editable text. Editing that text then changes the audio, so the clip says the new phrase in the speaker’s own voice. At first it sounded too good to be true, but Adobe’s engineers never cease to amaze. In the live demo of #VoCo, Adobe edited the text and showed the audio clip updating to say the new phrase. Astonishingly, it sounded just like the original speaker’s voice. Even after multiple changes to the clip, the audio still sounded authentic, with the speaker saying entirely new phrases that never appeared in the original recording.

For the games and media industry, this is incredibly exciting news. Numerous games and media experiences use voice recording [link], and in many products it is a crucial part of the experience. Currently, any change to the audio requires re-recording with the original voice actor. Such changes may come from alterations to the script, poorly recorded audio, background noise, misspoken words, stuttering or other speech disfluencies. Adobe’s technology may ease that work by allowing editors to correct phrases without bringing back the original voice actor or re-recording the clip. In mere seconds, an editor could remove disfluencies and change spoken phrases.

We don’t know the current limitations of #VoCo, but technology like this has exciting potential. Though it is probably beyond what is possible today, it is wonderful to imagine a world where we could create compelling procedural dialogue audio in games. Many games use procedural dialogue to create randomized quests and player goals that keep the player engaged. If the technology advances to the point where we can procedurally generate matching, compelling voice audio, it could add a lot of immersion to games.
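
As a rough illustration of how quickly procedural dialogue multiplies, the sketch below fills a quest template from small word lists and counts the distinct lines it can produce. The template, the word lists, and the function names are all invented for this example; they are not taken from any particular game or from #VoCo.

```python
import itertools
import math
import random

# Hypothetical quest template and word lists, made up for illustration only.
TEMPLATE = "{greeting}, traveler. {verb} the {adjective} {target} near the {place}."
SLOTS = {
    "greeting": ["Hello", "Well met", "Greetings", "Hail"],
    "verb": ["Find", "Defeat", "Rescue", "Escort", "Deliver"],
    "adjective": ["lost", "cursed", "ancient", "stolen", "wounded"],
    "target": ["merchant", "relic", "wolf", "scout", "caravan", "map"],
    "place": ["mill", "bridge", "ruins", "harbor", "mine"],
}


def count_lines(slots):
    """Number of distinct lines: the product of the option counts per slot."""
    return math.prod(len(options) for options in slots.values())


def all_lines(template, slots):
    """Yield every line the template can produce, one per slot combination."""
    names = list(slots)
    for combo in itertools.product(*(slots[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))


def random_line(template, slots, rng):
    """Pick one option per slot at random, as a game might do at runtime."""
    chosen = {name: rng.choice(options) for name, options in slots.items()}
    return template.format(**chosen)


if __name__ == "__main__":
    print(random_line(TEMPLATE, SLOTS, random.Random()))
    print(count_lines(SLOTS))  # 4 * 5 * 5 * 6 * 5 = 3000
```

Even this tiny template, with five slots of four to six options each, yields 3,000 distinct lines. Adding a sixth slot with ten options would bring the total to 30,000, which is why the number of possible lines climbs into the millions so easily.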

Voice acting is widely regarded as adding an extra layer of immersion to a game. Traditionally it has not paired well with procedural dialogue, because a voice actor would need to record every possible combination of dialogue. Since procedural dialogue can create millions of combinations, it is nigh impossible to record an audio clip for every phrase. Recording that many clips would also mean the game would need to store millions of additional files, which would drastically increase its installation size. Procedural voice generation could eventually solve that problem and create better player experiences. For the time being, we can be excited that Adobe has pushed the technology further and eased the difficulty of voice recording. Hopefully this is a sign of things to come in audio manipulation.