Lip sync

From Second Life Wiki
Revision as of 13:41, 2 August 2008 by Gally Young (talk | contribs) (add new cat)

Basic lip sync has been added to version 1.20 RC8, but it is disabled by default. To enable it, you first have to enable the "Advanced" menu by pressing Ctrl-Alt-D (all three keys together, not to be confused with Ctrl-Alt-Delete). Then in the Advanced menu select Character, then Enable Lip Sync (Beta). Whenever someone uses voice chat and you see the green waves above the avatar's head, you should also see the avatar's lips move. When an avatar has an attachment covering its head, though, the attachment is not animated. Sorry, furries.

Introduction

So let's look at a little background for lip sync. Avatars use morphs for the mouth shapes used for emotes (LLEmote derived from LLMotion) but these are triggered and have fade-in and fade-out times. For lip sync, the duration is not known at the time the viseme starts and the transition time is much shorter than those used for emotes.

Lost you there? OK, let's break it down.

Gestures

One of the first things you discover when you enter SecondLife is that you can make your avatar dance by typing in the chat bar. Gestures can also make your avatar smile, frown, and otherwise contort his face. So you'd think we'd have the raw material for creating lip sync by using gestures. But...we don't.

See, when you type a gesture name into the chat bar, that is sent to the Linden Lab server somewhere across the world. The server sends a message to every user whose avatar is in the vicinity of your avatar telling them that your avatar wants to dance or whatever. Their viewers (and your viewer too) check to see whether they have that particular animation cached nearby. If not, the viewer asks the server to look up the animation in its database and send it to them. When the viewers get the animation they (and you) can finally see you dance. And if the gesture includes sound, you might notice that the sound plays out of sync with the animation if either has to be fetched from the server.

So you've probably figured that bouncing bits across the world networks is no way to get fine synchronization between speech and mouth movement. But, well, the pieces are there. It's just a matter of putting them together the right way.

Morphs

Avatars are nothing but skin and bones. No, really. An avatar actually does have a skeleton with bones. Virtual bones, anyway. And the bones are covered with skin. Well, triangulated skin anyway, called meshes.

Avatars can move their body parts in two ways. They can move their bones, and have the corresponding skin follow along stiffly, or they can stretch and distort their skin. When you change your avatar's appearance you're really stretching its skin and changing the length of its bones. These are adjustments to your avatar's visual parameters. The stretching and contorting uses a 3D construction called morphs. Some morphs change your avatar's appearance and other morphs make him move.

The difference between appearance morphs and animation morphs is just a practical one. For animation morphs, you can be sure that every avatar has them set to the same values. So after the animation is done, you can set each morph back to its default value. If you tried using appearance morphs for animation, you wouldn't know where to start or finish for each particular avatar. Yeah, in theory, you could do some relative calculation, but 3D is hard enough already.

Now unfortunately for us doing lip sync, most of the morphs that come in the Linden avatar definition file are meant for expressing emotions: surprise, embarrassed, shrug, kiss, bored, repulsed, disdain, tooth smile. And although a tooth smile kind of looks like someone saying "see," it's a stretch. (Yeah, I like puns.) But, it's all we've got.

Phonemes

So far, we've been just looking. Let's stop and listen for a moment.

When we write or type, we use letters, and as we all know, words are made of letters. But when we talk, they're not. The little pieces of a spoken word that take the place of letters are called phonemes. Sometimes there is a one to one correspondence between letters and phonemes. In US English, that's true for V, I think, but not much else. That's why [photi] sounds like "fish." But linguists (that's what they call people who make a living dreaming about photi and [colorless green ideas] while furiously sleeping) make up new alphabets to name phonemes. SH is a phoneme that sounds like "shh." Well, of course it would. And AX is that unstressed "a" sound called a schwa. At least to linguists who work on speech recognition. Most everybody else turns a [little "e" upside down]. Anybody else just get an image of [Dale Jr. in a bad wreck]? Wrong crowd, I guess.

Visemes

Now let's go back to looking.

When you say a word starting with the letter "f", it has a distinctive sound, and it has a distinctive appearance: your teeth touch your bottom lip. But when you say the letter "v", it has a different sound, yet it looks just like an "f": your teeth touch your bottom lip. You might have guessed that the photi fans have a name for these things that look alike. Yep, they call them visemes. Something like "visible phonemes" but shorter.

When someone is lipreading, they are translating visemes into phonemes. And since several phonemes can share the same look, you can see why lipreading is so difficult. Did he say he's "fairy wed" or "very wet"?

Now lip sync, then, is like lipreading in reverse. We need to map phonemes to visemes.
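To make the mapping concrete, here is a minimal sketch of a phoneme-to-viseme table in Python. The viseme names and the choice of phonemes are made up for illustration; they are not taken from the viewer or from any particular speech toolkit.

```python
# Illustrative table only: the viseme names are hypothetical.
PHONEME_TO_VISEME = {
    "F": "teeth_on_lip",
    "V": "teeth_on_lip",        # looks just like F
    "T": "tongue_behind_teeth",
    "D": "tongue_behind_teeth", # looks just like T
    "SH": "pursed",
    "AX": "relaxed_open",
}


def visemes_for(phonemes):
    """Map a phoneme sequence to a viseme sequence of the same length.

    Phonemes missing from the table fall back to a generic "neutral" shape.
    """
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]


# "fairy wed" and "very wet" differ in sound but not in looks.
fairy_wed = visemes_for("F EH R IY W EH D".split())
very_wet = visemes_for("V EH R IY W EH T".split())
```

Because several phonemes land on the same viseme, the table is easy to apply in this direction and impossible to invert exactly, which is the lipreader's problem.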

Verisimilitude

You can consider three types of lip sync. The most elaborate provides accurate mouth shapes and jaw movements (visemes) based on the sound that is spoken (phonemes). If the voice is created synthetically, also known as text-to-speech (TTS), this is pretty straightforward since there is an intermediate representation in most TTS systems between the text and the speech that is the phoneme sequence with timing information. It is a fairly simple task to translate the phonemes to visemes which then get implemented as facial morphs.

For live speech, it isn't so easy. Most of the lip sync that you see in movies like Shrek and The Incredibles is hand crafted for accuracy and uses a large set of 3D morphing targets as visemes.

The audio can be automatically decoded into a phoneme sequence, which is a fairly complicated task, but it can be done using speech recognition technology. I won't be shy about giving a shameless plug for [the IBM ViaVoice Toolkit for Animation], which is my own project.

Similitude

A simpler form of lip sync just looks at the relative energy of the speech signal and uses a smaller set of visemes to represent the mouth movement. This still requires decoding of the audio stream, but it's easier than phonetic decoding. You can see an example of this technology in [the IBM ViaVoice Toolkit for Animation] as well. For those technically inclined: it's basically just a refinement of automatic gain control.
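As a rough sketch of the idea (not the ViaVoice implementation), the Python below turns raw audio into one mouth-opening value per frame. It assumes the samples are floats between -1 and 1; the frame size, decay constant and noise floor are arbitrary values picked for the example.

```python
import math


def mouth_openings(samples, frame_size=400, decay=0.95, floor=1e-4):
    """Return one mouth-opening value in [0, 1] per frame of audio.

    Each frame's RMS energy is divided by a slowly decaying running peak.
    That division is the gain-control part: quiet and loud speakers both
    end up using the full range of mouth movement.
    """
    openings = []
    peak = floor
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / frame_size)
        # The peak follows loud frames at once and forgets them slowly.
        peak = max(rms, peak * decay, floor)
        openings.append(rms / peak)
    return openings


# A loud frame, a quieter frame, then silence.
demo = mouth_openings([0.5] * 400 + [0.25] * 400 + [0.0] * 400)
```

The floor keeps faint background noise from being amplified into full mouth movement.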

Tude

The crudest form of lip sync just loops a babble animation while the speaker is speaking. That is, it just repeats the same sequence of frames over and over while the character is talking. The visuals are not actually synchronized to the audio; they just start and stop at the same time. This is what you'll find used for anime and a lot of Japanese animated shows on TV because it doesn't really matter which language is used for the sound track. The characters don't have to be reanimated for each language.

Lip sync for SecondLife

Unlike gestures, which are sent from the server, lip sync must happen entirely on the client. This is the only way to ensure synchronization.

The choice of which of the three forms of lip sync to use depends on the level of reality expected, which in turn depends on the level of reality of the animated characters. For SecondLife, the energy-based lip sync is probably appropriate. We don't need to implement realistic visemes, so the lack of a nice set of usable morphs is not a problem, but...there's another problem.

Voice chat

SecondLife has a voice chat feature that lets avatars speak, so we have the audio stream, but unfortunately, we can't get to it. The audio for voice chat never enters the SecondLife viewer. Instead it is processed by a parallel task called SLVoice, written for Linden Lab by Vivox. SLVoice is not currently open source, but Linden Lab has expressed a desire to make it so in the future.

Voice visualization

But the viewer does get some information from SLVoice, in the form of ParticipantPropertiesEvent messages. These messages provide a measure of the speech energy, but it is averaged over several phonemes, so they cannot provide enough detail for energy based lip sync. They are used to generate the green waves above the avatar's head indicating how loud someone is speaking.

Oohs and Aahs

So we can only generate babble loops with the information provided.

We can use the "Express_Closed_Mouth" morph for silence and the "Express_Open_Mouth" morph for loud sounds. Morphing allows us to blend between the morph targets, so we can get any level of mouth opening by weighting these two targets.

The "Express_Open_Mouth" morph is the viseme for the "aah" sound. It turns out that the "Express_Kiss" morph looks similar to the viseme for the "ooh" sound. So we can get a variety of different mouth shapes by blending the three morphs. "aah" gives us the vertical dimension and "ooh" gives us horizontal.
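Blending comes down to adding weighted displacements to the base mesh. The sketch below shows the arithmetic in Python, assuming each morph is stored as a sparse table of per-vertex offsets. The data layout and the two-vertex "mesh" are invented for the example and are not the viewer's own structures.

```python
def blend(base, morphs, weights):
    """Return the base vertices displaced by a weighted sum of morph deltas.

    base:    list of (x, y, z) vertex positions
    morphs:  dict of morph name -> {vertex index: (dx, dy, dz)}
    weights: dict of morph name -> weight, normally between 0 and 1
    """
    result = [list(v) for v in base]
    for name, weight in weights.items():
        for index, delta in morphs[name].items():
            for axis in range(3):
                result[index][axis] += weight * delta[axis]
    return [tuple(v) for v in result]


# Toy data: vertex 0 is the lower lip, vertex 1 is a mouth corner.
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
morphs = {
    "aah": {0: (0.0, -1.0, 0.0)},   # drops the lip (vertical)
    "ooh": {1: (-0.5, 0.0, 0.0)},   # pulls the corner in (horizontal)
}
half_open_pursed = blend(base, morphs, {"aah": 0.5, "ooh": 1.0})
```

Since the weights are independent, any mix of jaw opening and lip pursing is just a different pair of numbers.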

By using different length loops for "ooh" and "aah", we effectively create a loop whose length is the least common multiple of the two loop lengths. (And you never thought you'd ever hear about least common multiples after high school.)
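For the curious, here is a small Python check of that claim using the two default sequences listed in the settings later on this page. The helper name is ours.

```python
import math

ooh = "1247898743223344444443200000"  # default LipSyncOoh
aah = "257998776531013446642343"      # default LipSyncAah


def combined_period(a, b):
    """Least common multiple of the two loop lengths."""
    return len(a) * len(b) // math.gcd(len(a), len(b))


period = combined_period(ooh, aah)

# Step through both loops side by side for two full combined periods.
pairs = [(ooh[i % len(ooh)], aah[i % len(aah)]) for i in range(2 * period)]

# The (ooh, aah) pairs only start over after `period` steps.
repeats = pairs[:period] == pairs[period:]
```

Dividing `period` by the LipSyncOohAahRate setting gives the length of the combined loop in seconds.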

Unfortunately, there is a problem using the "Express_Kiss" morph. It not only purses the lips, it also closes the eyes and lowers the eyebrows. This gives the avatar a nervous appearance if the morph is changed too quickly, and it gives a tired appearance if done too much.

So, can we extract just the mouth movement from the Express_Kiss morph and make our own Express_Ooh morph? Why not? When we look at the mesh file definition we note that all of the vertices used in the morphs are indexed to the mesh definition using the vertexIndex field. So we just take those vertices from the Express_Kiss morph that are also used by Express_Open_Mouth or Express_Closed_Mouth and voila! We have an Express_Ooh morph. We add a visual param to the avatar definition file and there you have it.
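The vertex filtering can be sketched in a few lines of Python. This assumes each morph has already been read into a table keyed by vertexIndex; parsing the mesh file is left out, and the sample numbers are invented.

```python
def mouth_only(kiss, open_mouth, closed_mouth):
    """Keep the Express_Kiss entries whose vertex index is also moved by
    Express_Open_Mouth or Express_Closed_Mouth.

    Each morph is a dict of vertexIndex -> (dx, dy, dz).
    """
    mouth_vertices = set(open_mouth) | set(closed_mouth)
    return {i: delta for i, delta in kiss.items() if i in mouth_vertices}


# Toy data: vertices 10 and 11 are lips, 50 is an eyelid, 60 an eyebrow.
express_kiss = {
    10: (0.0, 0.0, 0.2),
    11: (-0.1, 0.0, 0.2),
    50: (0.0, -0.3, 0.0),
    60: (0.0, -0.1, 0.0),
}
express_open_mouth = {10: (0.0, -1.0, 0.0)}
express_closed_mouth = {10: (0.0, 0.5, 0.0), 11: (0.0, 0.25, 0.0)}

express_ooh = mouth_only(express_kiss, express_open_mouth, express_closed_mouth)
```

The eyelid and eyebrow entries drop out because neither mouth morph touches those vertices.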

Smile when you say that

Because lip sync uses the same visual params as the emotes, we either have to disable emotes during lip sync or blend the two together. As it turns out, morphs are made for blending. So with no extra work, it turns out that the emotes blend just fine with the lip sync morphs. Well, just about.

If we use Express_Open_Mouth in a gesture while we're doing lip sync, we get a conflict because both want to set different weights for the same morph. So we really want to have lip sync morphs separate from the emote morphs. So instead of Express_Ooh, we'll call it Lipsync_Ooh and we'll copy Express_Open_Mouth to Lipsync_Aah. We may get a mouth opened to twice its usual maximum, but the maximum was just arbitrary anyway.

There's still one catch, though. The base mesh defines a mouth with lips parted, but the default pose uses the Express_Closed_Mouth morph at its full weight. The emotes blend between the Express_... morphs and the Express_Closed_Mouth morph. We could make a Lipsync_Closed_Mouth morph to blend with the Lipsync_Ooh and Lipsync_Aah morphs, but then the default pose would have the mouth closed twice. We could just forget about a separate blending for lip sync, but then the Lipsync_Aah would not open the mouth as much as Express_Open_Mouth because it would be blended with the fully weighted Express_Closed_Mouth. So, we add the negative of the Express_Closed_Mouth to Lipsync_Ooh and Lipsync_Aah to get the same effect as the emotes and then we don't have to blend to a default.
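The arithmetic behind that last step is easy to check. The Python below is a sketch, assuming morphs are tables of per-vertex offsets (our own simplification) with toy values: it builds Lipsync_Aah as Express_Open_Mouth minus Express_Closed_Mouth, then confirms that adding it at full weight on top of the always-on closed mouth lands exactly on the open mouth.

```python
def add_negative(morph, closed_mouth):
    """Return `morph` minus `closed_mouth`, vertex by vertex.

    Both arguments are dicts of vertexIndex -> (dx, dy, dz).
    """
    out = dict(morph)
    for i, (cx, cy, cz) in closed_mouth.items():
        x, y, z = out.get(i, (0.0, 0.0, 0.0))
        out[i] = (x - cx, y - cy, z - cz)
    return out


def total_offset(i, weighted_morphs):
    """Sum the offsets of vertex i over (morph, weight) pairs."""
    total = [0.0, 0.0, 0.0]
    for morph, weight in weighted_morphs:
        delta = morph.get(i, (0.0, 0.0, 0.0))
        for axis in range(3):
            total[axis] += weight * delta[axis]
    return tuple(total)


express_open_mouth = {0: (0.0, -1.0, 0.0)}
express_closed_mouth = {0: (0.0, 0.5, 0.0), 1: (0.0, 0.25, 0.0)}
lipsync_aah = add_negative(express_open_mouth, express_closed_mouth)

# Default pose (closed mouth at full weight) plus full Lipsync_Aah.
lip_sync_pose = [
    total_offset(i, [(express_closed_mouth, 1.0), (lipsync_aah, 1.0)])
    for i in (0, 1)
]
# The emote at full strength: closed mouth faded out, open mouth at full weight.
emote_pose = [total_offset(i, [(express_open_mouth, 1.0)]) for i in (0, 1)]
```

At a partial weight w the same algebra gives (1 - w) times the closed mouth plus w times the open mouth, which is the blend the emotes produce.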

Energetic Oohs and Aahs

In the future we hope to have access to the audio stream allowing us to compute precise energy values. We can map these to the "aah" morph, while using a babble loop for the "ooh" morph to give us some variety.

The "ooh" sounds, created by pursing the lips, have a more bass quality to them than "aah" sounds. This is a reflection of their audio spectral features, called formants. It should be possible to make a simple analysis of the speech spectrum to get a real estimate of when to increase the "ooh" morph amount, rather than just babbling it. This could provide a realism better than simple energy based lip sync, though still below phonetic lip sync.

Says who

Right now, the audio streams for all avatars other than your own are combined together. SLVoice tells us when each avatar starts speaking and when he stops speaking, and a little bit about how loud he is speaking, so the information is there, but it doesn't tell us how to untangle the audio. In order to do good quality energy-based lip sync, we would need a way of identifying the audio with the correct avatar.

How to babble

This section describes the settings that can be made in SecondLife for lip sync.

You can set your preferences directly in your Documents and Settings\<user>\Application Data\SecondLife\user_settings\settings.xml file, or you can set them from the Client menu of the viewer.

If you haven't already done so, you can enable the Client menu using Ctrl-Alt-D (not to be confused with Ctrl-Alt-Delete).

To get the Debug Settings window, select

Client -> Debug Settings

In the selection box, hit the triangle to bring up the list of debug variables and select the setting that you wish to change.

Here is a list of settings used for lip sync together with their default values and descriptions.

LipSyncEnabled
1
0 disables lip sync and 1 enables the babble loop. In the future there may be options 2 and on for other forms of lip sync.
LipSyncOohAahRate
24 (per second)
The rate at which the Ooh and Aah sequences are processed. The morph target is updated at this rate, but the rate at which the display gets updated still determines the actual frame rate of the rendering.
LipSyncOoh
1247898743223344444443200000
A sequence of digits that represent the amount of mouth puckering. This sequence is repeated while the speaker continues to speak. This drives one of the morphs for the mouth animation loop. A value "0" means no puckering. A value "9" maximizes the puckering morph. The sequence can be of any length. It need not be the same length as LipSyncAah. Setting the sequence to a single character essentially disables the loop, and the amount of puckering is just modulated by the Vivox power measurement. Setting it to just zeros completely disables the ooh morphing.
LipSyncAah
257998776531013446642343
A sequence of digits that represent the amount of jaw opening. This sequence is repeated while the speaker continues to speak. This drives one of the morphs for the mouth animation loop. A value "0" means closed. A value "9" maximizes the jaw opening. The sequence can be of any length. It need not be the same length as LipSyncOoh. Setting the sequence to a single character essentially disables the loop, and the amount of jaw opening is just modulated by the Vivox power measurement. Setting it to just zeros completely disables the aah morphing.
LipSyncOohPowerTransfer
0012345566778899
The amplitude of the animation loops for ooh and aah is modulated by the power measurements made by the Vivox voice client. This function provides a transfer function for the ooh modulation. The ooh sound is not directly related to the speech power, so this isn't a linear function. The sequence can be of any length. Setting it to a single digit essentially disables the modulation and keeps it at a fixed value.
LipSyncAahPowerTransfer
0000123456789
The amplitude of the animation loops for ooh and aah is modulated by the power measurements made by the Vivox voice client. This function provides a transfer function for the aah modulation. The aah sound is pretty well correlated with the speech power, but to prevent low power noise from making the lips move, we put a few zeros at the start of this sequence. The sequence can be of any length. Setting it to a single digit essentially disables the modulation and keeps it at a fixed value.
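To show how these settings fit together, here is a Python sketch of one way the two morph weights could be computed from them. It is a reading of the descriptions above, not the viewer's code. In particular, it assumes the Vivox power measurement has been scaled to the range 0 to 1, that the transfer tables are looked up by nearest step with no interpolation, and that the loop position comes straight from elapsed speaking time. The function names are ours.

```python
DEFAULTS = {
    "LipSyncOohAahRate": 24,
    "LipSyncOoh": "1247898743223344444443200000",
    "LipSyncAah": "257998776531013446642343",
    "LipSyncOohPowerTransfer": "0012345566778899",
    "LipSyncAahPowerTransfer": "0000123456789",
}


def loop_level(sequence, step):
    """Digit at `step` in a repeating sequence, scaled so "9" becomes 1.0."""
    return int(sequence[step % len(sequence)]) / 9.0


def transfer(table, power):
    """Look up a power value (assumed 0..1) in a digit table; result 0..1."""
    power = min(max(power, 0.0), 1.0)
    index = min(int(power * len(table)), len(table) - 1)
    return int(table[index]) / 9.0


def babble_weights(elapsed_seconds, power, settings=DEFAULTS):
    """Return (ooh, aah) morph weights for a speaker at the given moment."""
    step = int(elapsed_seconds * settings["LipSyncOohAahRate"])
    ooh = (loop_level(settings["LipSyncOoh"], step)
           * transfer(settings["LipSyncOohPowerTransfer"], power))
    aah = (loop_level(settings["LipSyncAah"], step)
           * transfer(settings["LipSyncAahPowerTransfer"], power))
    return ooh, aah


quiet = babble_weights(0.0, 0.0)
loud = babble_weights(0.0, 1.0)
```

With a one-digit transfer table, `transfer` returns the same value for every power, and with a one-digit loop, `loop_level` returns the same value for every step, which matches the "single digit" and "single character" cases described above.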

Tip for machinimators: If you have a close-up shot for which you want accurate lip sync and you're willing to hand tune the lip sync, here's how you can do that.

Record the phrase that you want to sync. If you have a tool that gives you a phoneme sequence from an audio clip, use that to get the time intervals for each phone. If not, guess. Use that data to define the LipSyncAah parameter for the entire phrase. Include zeros at the end for the duration of the clip so the loop doesn't restart if you talk too long.

Set the LipSyncAahPowerTransfer to a single digit that will define your maximum lip sync amplitude. Set LipSyncOohPowerTransfer to a single zero to disable the ooh babble. Then speak without stopping for the duration of the original phrase as you record the video. You can say anything; you're not using this audio, you just want to keep the babble loop going. Finally, combine the recorded audio and video to see how well it matches. Note where you need to adjust the babble loop.

After you have the aahs figured out, enable the ooh loop by setting LipSyncOohPowerTransfer to a single digit (probably the max, 9). Then you can set the oohs using the LipSyncOoh parameter to adjust the mouth widths.
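If you'd rather not type the digits by hand, a short script can build the LipSyncAah string from your timing notes. The Python below is one possible helper, not part of the viewer. It assumes you have a list of (duration in seconds, opening from 0 to 9) pairs and that LipSyncOohAahRate is left at its default of 24.

```python
def aah_sequence(intervals, rate=24, clip_seconds=None):
    """Build a LipSyncAah digit string from timed mouth openings.

    intervals:    list of (duration_seconds, opening) with opening in 0..9
    rate:         digits per second (the LipSyncOohAahRate setting)
    clip_seconds: if given, pad with zeros up to this total length
    """
    digits = ""
    elapsed = 0.0
    for duration, opening in intervals:
        if not 0 <= opening <= 9:
            raise ValueError("opening must be a digit from 0 to 9")
        elapsed += duration
        # Work from cumulative time so rounding errors don't pile up.
        target = round(elapsed * rate)
        digits += str(opening) * (target - len(digits))
    if clip_seconds is not None:
        padding = round(clip_seconds * rate) - len(digits)
        digits += "0" * max(padding, 0)
    return digits


# A quarter second closed, half a second wide open, a quarter second
# barely open, padded with zeros to fill a two-second clip.
phrase = aah_sequence([(0.25, 0), (0.5, 9), (0.25, 3)], clip_seconds=2)
```

The same helper works for LipSyncOoh once you have the aahs settled; just feed it puckering amounts instead of jaw openings.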

It's getting better all the time

This is hopefully just the first step of several to implement lip sync in SecondLife. It's still not very good, but it's better than nothing. We hope to get real-time energy-based lip sync if we can get access to the audio streams. After that, we hope to be able to automatically estimate the appropriate amount of ooh morphing to make the mouth movement more realistic. Stay tuned.