Lip sync

"Q:How can you tell when a politician is lying? A:His lips are moving."

Well, I guess that makes SecondLife a very honest place. Although SecondLife now has Voice Chat, avatars' lips do not move. Instead, we see green waves above their heads to indicate who is speaking. And although this helps pick out the speaker in a crowd, it really is not useful for machinima.

I've been looking at the code to see how lipsync might be implemented. Avatars use morphs for the mouth shapes of emotes (LLEmote, derived from LLMotion), but these are triggered with fade-in and fade-out times that are set in advance. For lipsync, the duration is not known at the time the viseme starts, and the transitions are much shorter than those used for emotes.

Lost you there? OK, let's break it down.

You can consider three types of lipsync. The most elaborate provides accurate mouth shapes and jaw movements (visemes) based on the sounds being spoken (phones). For text-to-speech (TTS) this is pretty straightforward, since most TTS systems have an intermediate representation between the text and the speech: the phone sequence with timing information. It is a fairly simple task to translate the phones to visemes, which then get implemented as facial morphs (LLVisualParam). For live speech, the audio has to be decoded into a phone sequence, which is a fairly complicated task, but it can be done.
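
To make the phone-to-viseme step concrete, here's a rough, self-contained C++ sketch of that translation. The phone and viseme labels, the timing structs, and the merging of repeated visemes are my own illustration, not names taken from any particular TTS engine or from the viewer code.

<pre>
// Hypothetical sketch: mapping a timed phone sequence (as a TTS engine or
// speech decoder might produce it) onto viseme names. The labels below are
// illustrative placeholders.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct TimedPhone
{
    std::string phone;    // e.g. "AA", "M", "F"
    double start;         // seconds from the start of the utterance
    double duration;
};

struct TimedViseme
{
    std::string viseme;   // name of the facial morph to drive
    double start;
    double duration;
};

// Many-to-one mapping: several phones share one mouth shape.
static const std::map<std::string, std::string> kPhoneToViseme = {
    {"AA", "viseme_open"},   {"AE", "viseme_open"},
    {"M",  "viseme_closed"}, {"B",  "viseme_closed"}, {"P", "viseme_closed"},
    {"F",  "viseme_dental"}, {"V",  "viseme_dental"},
    {"OW", "viseme_round"},  {"UW", "viseme_round"},
};

std::vector<TimedViseme> phonesToVisemes(const std::vector<TimedPhone>& phones)
{
    std::vector<TimedViseme> out;
    for (const TimedPhone& p : phones)
    {
        auto it = kPhoneToViseme.find(p.phone);
        const std::string& v =
            (it != kPhoneToViseme.end()) ? it->second : "viseme_closed";
        // Merge consecutive identical visemes so the morph isn't re-triggered.
        if (!out.empty() && out.back().viseme == v)
            out.back().duration = p.start + p.duration - out.back().start;
        else
            out.push_back({v, p.start, p.duration});
    }
    return out;
}

int main()
{
    std::vector<TimedPhone> phones = {
        {"M", 0.00, 0.08}, {"AA", 0.08, 0.15}, {"M", 0.23, 0.08}
    };
    for (const TimedViseme& v : phonesToVisemes(phones))
        std::cout << v.viseme << " @ " << v.start << "s for "
                  << v.duration << "s\n";
    return 0;
}
</pre>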

A simpler form of lipsync just looks at the relative energy of the speech signal and uses only two or three visemes to represent the mouth movement. This still requires decoding of the audio stream, but it's basically just a refinement of automatic gain control.
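
As a rough picture of what that energy measurement could look like, here's a small sketch that computes a smoothed RMS envelope over frames of 16-bit PCM samples, much like an AGC envelope follower. The class name and smoothing constant are assumptions of mine, not anything from the viewer.

<pre>
// Sketch of the energy measurement: RMS over a short frame of mono PCM,
// smoothed with a one-pole filter the way an AGC envelope follower would be.
#include <cmath>
#include <cstddef>
#include <cstdint>

class SpeechEnergy
{
public:
    // alpha controls how quickly the envelope follows the signal.
    explicit SpeechEnergy(float alpha = 0.3f) : mAlpha(alpha), mEnvelope(0.f) {}

    // Feed one frame of 16-bit mono PCM; returns the smoothed energy in [0,1].
    float update(const int16_t* samples, size_t count)
    {
        double sum = 0.0;
        for (size_t i = 0; i < count; ++i)
        {
            double s = samples[i] / 32768.0;           // normalize to [-1,1]
            sum += s * s;
        }
        float rms = count ? static_cast<float>(std::sqrt(sum / count)) : 0.f;
        mEnvelope += mAlpha * (rms - mEnvelope);       // one-pole smoothing
        return mEnvelope;
    }

private:
    float mAlpha;
    float mEnvelope;
};
</pre>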

The crudest form of lipsync just loops a babble animation while the speaker is speaking. This is what you'll find used in a lot of Japanese animated shows on TV, because it doesn't really matter which language is used for the sound track.

The choice among these three forms depends primarily on the level of realism expected, which in turn depends on the level of realism of the animated characters. For SecondLife, energy-based lipsync with three visemes is probably appropriate: "Express_Closed_Mouth" for silence or low energy, and two weightings of "Express_Open_Mouth" for quiet and loud sounds.
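
A possible mapping from that smoothed energy to the two morphs named above is sketched below. The thresholds and weights are placeholders I've picked for illustration, not values from the viewer.

<pre>
// Illustrative mapping from smoothed energy to a (morph name, weight) pair.
// The thresholds are assumed values, not anything measured or from the code.
#include <string>
#include <utility>

std::pair<std::string, float> energyToViseme(float energy)
{
    const float kQuiet = 0.05f;   // assumed boundary between silence and quiet speech
    const float kLoud  = 0.25f;   // assumed boundary between quiet and loud speech

    if (energy < kQuiet)
        return { "Express_Closed_Mouth", 1.0f };   // silence or low energy
    else if (energy < kLoud)
        return { "Express_Open_Mouth", 0.4f };     // quiet speech: partly open
    else
        return { "Express_Open_Mouth", 1.0f };     // loud speech: fully open
}
</pre>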

The visual params for lipsync could be implemented with a looping LLMotion that has zero fade-in and fade-out and is activated and deactivated as the visemes change, or the visual params could be set directly depending on the state of the speech.
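
Here's a minimal sketch of the second approach, writing the weights directly each frame. The setMorphWeight callback stands in for whatever the viewer actually exposes for setting a visual param's weight (something along the lines of LLCharacter::setVisualParamWeight, but treat that as an assumption), and it reuses energyToViseme from the sketch above.

<pre>
// Minimal sketch of driving the visual params directly, with no fade-in or
// fade-out: each frame we just write the current target weights.
#include <functional>
#include <string>
#include <utility>

// Defined in the earlier sketch.
std::pair<std::string, float> energyToViseme(float energy);

// Stand-in for whatever call actually writes a visual param weight.
using SetMorphWeight = std::function<void(const std::string& name, float weight)>;

// Called every frame with the latest smoothed energy for one avatar.
void driveLipsync(float energy, const SetMorphWeight& setMorphWeight)
{
    std::pair<std::string, float> viseme = energyToViseme(energy);
    if (viseme.first == "Express_Closed_Mouth")
    {
        setMorphWeight("Express_Closed_Mouth", viseme.second);
        setMorphWeight("Express_Open_Mouth", 0.0f);
    }
    else
    {
        setMorphWeight("Express_Closed_Mouth", 0.0f);
        setMorphWeight("Express_Open_Mouth", viseme.second);
    }
}
</pre>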

Because lipsync would use the same visual params as the emotes, we would either have to disable emotes during lipsync or blend the two together. Emotes usually blend to "Express_Closed_Mouth" but it wouldn't be very difficult to blend to the visual params for speech instead.
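
One simple way to express that blend, assuming the emote's current fade weight is available, is a linear crossfade like the one below. This is my own formulation, not the viewer's blending code.

<pre>
// Hypothetical blend of an emote-driven weight with a speech-driven weight.
// emote_strength is the emote's current fade weight: 1 means the emote fully
// owns the morph, 0 means lipsync owns it, and values in between crossfade.
float blendMorphWeight(float speech_weight, float emote_weight, float emote_strength)
{
    return emote_strength * emote_weight + (1.0f - emote_strength) * speech_weight;
}
</pre>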

We might want to go further though and apply the emotes for longer durations during the speech so that we could have smiling speech or pouting speech.

Unlike gestures, which are sent from the server, lipsync would happen entirely on the client; this is the only way to ensure synchronization. There would have to be some way of associating each audio channel with the correct avatar. The visemes would be computed from the streamed audio as the buffered audio is being sent to the sound system.
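
Building on the earlier sketches (and assuming they live in the same file), the routing might look something like this: a tap in the audio path, keyed by a speaker ID that is assumed to come from the voice subsystem, feeds a per-avatar energy tracker as each buffer goes out to the sound system.

<pre>
// Sketch of tying the streamed voice audio to the right avatar on the client.
// SpeechEnergy is the envelope tracker from the earlier sketch; the speaker ID
// and the point where this tap would hook in are assumptions.
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

class LipsyncRouter
{
public:
    // Called as each buffer of decoded voice audio is handed to the sound
    // system; speaker_id identifies which avatar the stream belongs to.
    void onVoiceBuffer(const std::string& speaker_id,
                       const int16_t* samples, size_t count)
    {
        float energy = mTrackers[speaker_id].update(samples, count);
        // ...look up the avatar for speaker_id and drive its visemes with
        // this energy value (see driveLipsync in the earlier sketch)...
        (void)energy;
    }

private:
    std::map<std::string, SpeechEnergy> mTrackers;  // one tracker per speaker
};
</pre>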

The Agent's own speech would have to be decoded directly with as little buffering as possible so that it appears realistic to the viewer when the camera is positioned to view the Agent's face.