Lip sync

"Q:How can you tell when a politician is lying? A:His lips are moving."

Well, I guess that makes SecondLife a very honest place. Although SecondLife now has Voice Chat, avatars' lips do not move. Instead, we see green waves above their heads to indicate who is speaking. And, although this helps pick out the speaker in a crowd, it really is not useful for machinima.

Introduction

I've been looking at the code to see how lip sync might be implemented. Avatars use morphs for the mouth shapes used for emotes (LLEmote, derived from LLMotion), but these are triggered and have fade-in and fade-out times. For lip sync, the duration is not known at the time the viseme starts, and the transition time is much shorter than those used for emotes.

Lost you there? OK, let's break it down.

Gestures

One of the first things you discover when you enter SecondLife is that you can make your avatar dance by typing in the chat bar. Gestures can also make your avatar smile, frown, and otherwise contort his face. So you'd think we'd have the raw material for creating lip sync by using gestures. But...we don't.

See, when you type a gesture name into the chat bar, that is sent to the Linden Lab server somewhere across the world. The server sends a message to every user whose avatar is in the vicinity of your avatar telling them that your avatar wants to dance or whatever. Their viewers (and your viewer too) check to see whether they have that particular animation cached nearby. If not, the viewer asks the server to look up the animation in its database and send it to them. When the viewers get the animation they (and you) can finally see you dance. And if the gesture includes sound, you might notice that the sound plays out of sync with the animation if either has to be fetched from the server.

So you've probably figured that bouncing bits across the world networks is no way to get fine synchronization between speech and mouth movement. But, well, the pieces are there. It's just a matter of putting them together the right way.

Morphs

Avatars are nothing but skin and bones. No, really. An avatar actually does have a skeleton with bones. Virtual bones, anyway. And the bones are covered with skin. Well, triangulated skin anyway, called meshes.

Avatars can move their body parts in two ways. They can move their bones, and have the corresponding skin follow along stiffly, or they can stretch and distort their skin. When you change your avatar's appearance you're really stretching its skin and changing the length of its bones. These are adjustments to your avatar's visual parameters. The stretching and contorting uses a 3D construction called morphs. Some morphs change your avatar's appearance and other morphs make him move.

The difference between appearance morphs and animation morphs is just a practical one. For animation morphs, you can be sure that every avatar has them set to the same values. So after the animation is done, you can set each morph back to its default value. If you tried using appearance morphs for animation, you wouldn't know where to start or finish for each particular avatar. Yeah, in theory, you could do some relative calculation, but 3D is hard enough already.

Now unfortunately for us doing lip sync, most of the morphs that come in the Linden avatar definition file are meant for expressing emotions: surprise, embarrassed, shrug, kiss, bored, repulsed, disdain, tooth smile. And although a tooth smile kind of looks like someone saying "see," it's a stretch. (Yeah, I like puns.) But, it's all we've got.

Phonemes

So far, we've been just looking. Let's stop and listen for a moment.

When we write or type, we use letters, and as we all know, words are made of letters. But when we talk, they're not. The little pieces of a spoken word that take the place of letters are called phonemes. Sometimes there is a one-to-one correspondence between letters and phonemes. In US English, that's true for V, I think, but not much else. That's why [photi] sounds like "fish." But linguists (that's what they call people who make a living dreaming about photi and [colorless green ideas] while furiously sleeping) make up new alphabets to name phonemes. SH is a phoneme that sounds like "shh." Well, of course it would. And AX is that unstressed "a" sound called a schwa. At least to linguists who work on speech recognition. Most everybody else turns a [little "e" upside down]. Anybody else just get an image of [Dale Jr. in a bad wreck]? Wrong crowd, I guess.

Visemes

Now let's go back to looking.

When you say a word starting with the letter "f", it has a distinctive sound, and it has a distinctive appearance: your teeth touch your bottom lip. But when you say the letter "v", it has a different sound, yet it looks just like an "f": your teeth touch your bottom lip. You might have guessed that the photi fans have a name for these things that look alike. Yep, they call them visemes. Something like "visible phonemes" but shorter.

When someone is lipreading, they are translating visemes into phonemes. And since several phonemes can share the same look, you can see why lipreading is so difficult. Did he say he's "fairy wed" or "very wet"?

Now lip sync, then, is like lipreading in reverse. We need to map phonemes to visemes.
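Here's a rough sketch of what that mapping might look like. The phoneme symbols follow the speech-recognition style used above (SH, AX), but the table itself and the viseme labels are made up for illustration; they are not a standard viseme set and not anything taken from the viewer.

```python
# Hypothetical phoneme-to-viseme table. Several phonemes share one viseme,
# which is exactly what makes lipreading ambiguous.
PHONEME_TO_VISEME = {
    "F": "teeth_on_lip", "V": "teeth_on_lip",
    "T": "tongue_behind_teeth", "D": "tongue_behind_teeth",
    "P": "lips_closed", "B": "lips_closed", "M": "lips_closed",
    "W": "lips_rounded",
    "R": "lips_slightly_rounded",
    "EH": "mouth_half_open",
    "IY": "lips_spread",
}


def phonemes_to_visemes(phonemes, default="neutral"):
    """Translate a phoneme sequence into a viseme sequence, one for one."""
    return [PHONEME_TO_VISEME.get(p, default) for p in phonemes]


fairy_wed = phonemes_to_visemes("F EH R IY W EH D".split())
very_wet = phonemes_to_visemes("V EH R IY W EH T".split())
print(fairy_wed == very_wet)  # the two phrases look the same on the lips
```

Going from phonemes to visemes is a simple lookup because it only throws information away. Going the other way, as a lipreader must, means guessing which of several phonemes was meant.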

Verisimilitude

You can consider three types of lip sync. The most elaborate provides accurate mouth shapes and jaw movements (visemes) based on the sound that is spoken (phonemes). If the voice is created synthetically, also known as text-to-speech (TTS), this is pretty straightforward, since most TTS systems produce an intermediate representation between the text and the speech: the phoneme sequence with timing information. It is a fairly simple task to translate the phonemes to visemes, which then get implemented as facial morphs.

For live speech, it isn't so easy. Most of the lip sync that you see in movies like Shrek and The Incredibles is hand crafted for accuracy and uses a large set of 3D morphing targets as visemes.

The audio can be automatically decoded into a phoneme sequence, which is a fairly complicated task, but it can be done using speech recognition technology. I won't be shy about giving a shameless plug for [the IBM ViaVoice Toolkit for Animation], which is my own project.

Similitude

A simpler form of lip sync just looks at the relative energy of the speech signal and uses a smaller set of visemes to represent the mouth movement. This still requires decoding of the audio stream, but it's easier than phonetic decoding. You can see an example of this technology in [the IBM ViaVoice Toolkit for Animation] as well. For those technically inclined: it's basically just a refinement of automatic gain control.
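To make the "refinement of automatic gain control" idea a bit more concrete, here is a minimal sketch. It is not the ViaVoice Toolkit's algorithm, just one plausible way to do it: measure the RMS energy of each short frame and divide by a slowly decaying peak so that quiet and loud speakers both use the full range of mouth opening. The frame size, decay rate, and floor are assumed values.

```python
import math


def mouth_openness(samples, frame_size=160, decay=0.95, floor=1e-4):
    """Return one mouth-opening value in [0, 1] per full frame of samples."""
    peak = floor  # running peak energy, never allowed below the floor
    openness = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        # The peak follows loud frames immediately and decays slowly otherwise.
        peak = max(rms, peak * decay, floor)
        openness.append(rms / peak)
    return openness


# One silent frame, one loud frame, one quieter frame.
audio = [0.0] * 160 + [0.5] * 160 + [0.1] * 160
print(mouth_openness(audio))
```

The silent frame gives a closed mouth, the loud frame opens it fully, and the quieter frame lands somewhere in between.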

Tude

The crudest form of lip sync just loops a babble animation while the speaker is speaking. That is, it just repeats the same sequence of frames over and over while the character is talking. The visuals are not actually synchronized to the audio; they just start and stop at the same time. This is what you'll find used for anime and a lot of Japanese animated shows on TV because it doesn't really matter which language is used for the sound track. The characters don't have to be reanimated for each language.
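A babble loop is about as simple as animation code gets. In this sketch the loop values are arbitrary mouth-opening weights I made up, and the only input is a per-frame flag saying whether the character is speaking.

```python
# Arbitrary mouth-opening weights for one pass through the babble loop.
BABBLE_LOOP = [0.3, 0.8, 0.5, 0.9, 0.2]


def babble_track(speaking_flags, loop=BABBLE_LOOP):
    """Return one mouth weight per frame: looped while speaking, closed otherwise."""
    weights = []
    position = 0
    for speaking in speaking_flags:
        if speaking:
            weights.append(loop[position % len(loop)])
            position += 1
        else:
            weights.append(0.0)
            position = 0  # restart the loop the next time speech begins
    return weights


print(babble_track([False, True, True, True, False, True]))
# [0.0, 0.3, 0.8, 0.5, 0.0, 0.3]
```

Notice that nothing about the audio itself is used, only the fact that someone is talking.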

Lip sync for SecondLife

Unlike gestures, which are sent from the server, lip sync must happen entirely on the client. This is the only way to ensure synchronization.

The choice of which of the three forms of lip sync to use depends on the level of reality expected, which in turn depends on the level of reality of the animated characters. For SecondLife, the energy-based lip sync is probably appropriate. We don't need to implement realistic visemes, so the lack of a nice set of usable morphs is not a problem, but...there's another problem.

Voice chat

SecondLife has a voice chat feature that lets avatars speak, so we have the audio stream, but unfortunately, we can't get to it. The audio for voice chat never enters the SecondLife viewer. Instead it is processed by a parallel task called SLVoice, written for Linden Lab by Vivox. SLVoice is not currently open source, but Linden Lab has expressed a desire to make it so in the future.

Voice visualization

But the viewer does get some information from SLVoice, in the form of ParticipantPropertiesEvent messages. These messages provide a measure of the speech energy, but it is averaged over several phonemes, so they cannot provide enough detail for energy-based lip sync. They are used to generate the green waves above the avatar's head indicating how loud someone is speaking.

Oohs and Aahs

So we can only generate babble loops with the information provided.

We can use the "Express_Closed_Mouth" morph for silence and the "Express_Open_Mouth" morph for loud sounds. Morphing allows us to blend between the morph targets, so we can get any level of mouth opening by weighting these two targets.
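As a sketch, the weighting could be a plain linear blend. The morph names come from the avatar definition, but the 0-to-1 weight range and the function itself are assumptions for illustration, not how the viewer actually stores visual params.

```python
def mouth_weights(loudness):
    """Split a loudness value between the closed-mouth and open-mouth morphs."""
    t = min(max(loudness, 0.0), 1.0)  # clamp to the assumed 0..1 range
    return {
        "Express_Closed_Mouth": 1.0 - t,
        "Express_Open_Mouth": t,
    }


print(mouth_weights(0.25))
# {'Express_Closed_Mouth': 0.75, 'Express_Open_Mouth': 0.25}
```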

The "Express_Open_Mouth" morph is the viseme for the "aah" sound. It turns out that the "Express_Kiss" morph looks similar to the viseme for the "ooh" sound. So we can get a variety of different mouth shapes by blending the three morphs. By using different length loops for "ooh" and "aah", we effectively create a loop whose length is the least common multiple of the two loop lengths. (And you never thought you'd ever hear about least common multiples after high school.)
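For the skeptics, here is the least common multiple trick worked through with two made-up loops, four frames of "ooh" and six frames of "aah". The loop values are arbitrary; the point is only that the pair of weights doesn't repeat until frame 12.

```python
from math import gcd

# Hypothetical babble loops of different lengths.
OOH_LOOP = [0.0, 0.5, 1.0, 0.5]             # 4 frames
AAH_LOOP = [0.2, 0.8, 0.4, 1.0, 0.6, 0.0]   # 6 frames


def combined_period(a, b):
    """Least common multiple of two loop lengths."""
    return a * b // gcd(a, b)


def babble_frame(i):
    """Return the (ooh, aah) weights for frame i, each loop wrapping on its own."""
    return OOH_LOOP[i % len(OOH_LOOP)], AAH_LOOP[i % len(AAH_LOOP)]


period = combined_period(len(OOH_LOOP), len(AAH_LOOP))
print(period)  # 12
frames = [babble_frame(i) for i in range(2 * period)]
print(frames[:period] == frames[period:])  # True
```

Picking loop lengths with no common factors (say 5 and 7) stretches the combined loop to the full product of the two lengths.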

Unfortunately, there is a problem using the "Express_Kiss" morph. It not only purses the lips, it also closes the eyes and lowers the eyebrows. This gives the avatar a nervous appearance if the morph is changed too quickly, and it gives a tired appearance if done too much. But in moderation, these eye features actually add some expressiveness to the face.

Energetic Oohs and Aahs

In the future we hope to have access to the audio stream allowing us to compute precise energy values. We can map these to the "aah" morph, while using a babble loop for the "ooh" morph to give us some variety.

The "ooh" sounds, created by pursing the lips, have a more bass quality to them than "aah" sounds. This is a reflection of their audio spectral features, called formants. It should be possible to make a simple analysis of the speech spectrum to get a real estimate of when to increase the "ooh" morph amount, rather than just babbling it. This could provide realism better than simple energy-based lip sync, though still below phonetic lip sync.
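One very crude way to do that kind of analysis is to ask how much of a frame's spectral energy sits below some cutoff frequency. The sketch below uses a plain (slow) discrete Fourier transform so it needs no libraries. The cutoff value is a placeholder I picked for the example, not a tuned formant boundary, and a real implementation would want an FFT and some smoothing.

```python
import cmath
import math


def low_band_ratio(frame, sample_rate, cutoff_hz=500.0):
    """Return the fraction of spectral energy below cutoff_hz (0.0 for silence)."""
    n = len(frame)
    low = 0.0
    total = 0.0
    for k in range(1, n // 2):  # skip the DC bin, stop below Nyquist
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n < cutoff_hz:
            low += power
    return low / total if total > 0 else 0.0


def tone(freq_hz, sample_rate=8000, n=80):
    """Generate a pure sine tone for testing."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


print(low_band_ratio(tone(300), 8000) > 0.99)   # bassy tone: mostly low band
print(low_band_ratio(tone(2000), 8000) < 0.01)  # bright tone: almost none
```

A higher ratio would nudge the "ooh" morph up, and a lower one would leave the mouth shape to the "aah" morph.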

Says who

Right now, the audio streams for all avatars other than your own are combined together. In order to do good quality energy-based lip sync, we would need a way of identifying the audio with the correct avatar.

Smile when you say that

Because lip sync uses the same visual params as the emotes, we either have to disable emotes during lip sync or blend the two together. Emotes usually blend to "Express_Closed_Mouth" but it might not be too difficult to blend to the visual params for speech instead.
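The blending option might look something like this sketch: as an emote fades out, it blends toward whatever the speech visual params currently are instead of toward a closed mouth. The function and the emote morph name are hypothetical, and the linear fade is an assumption; the viewer's real emote code may do this quite differently.

```python
def blend_emote(emote_weights, target_weights, fade):
    """Blend morph weights: fade=1.0 is pure emote, fade=0.0 is pure target."""
    names = set(emote_weights) | set(target_weights)
    return {
        name: fade * emote_weights.get(name, 0.0)
        + (1.0 - fade) * target_weights.get(name, 0.0)
        for name in names
    }


# "Some_Emote_Morph" is a placeholder name, not a real morph.
emote = {"Some_Emote_Morph": 1.0}
speech = {"Express_Open_Mouth": 0.6}
print(blend_emote(emote, speech, 0.5))
```

Halfway through the fade, the emote is at half strength and the mouth is already half way to its speech position, so there's no snap to a closed mouth in between.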