Talk:Linden Lab Bounties


Category:Bounties

This article should be added to Category:Bounties at the nearest available opportunity, as should any future articles pertaining to individual bounties.

Don't forget to create Category:Linden Lab Bounties at the appropriate time either. SignpostMarv Martin 19:24, 8 January 2007 (PST)

Keyboard Input

I have been thinking briefly about the keyboard hack thing. It sounds to me like a classic case of having to implement a producer/consumer pattern. Have a set of platform-dependent producers, possibly extendable via plugins, produce input events from various devices, e.g. keyboard, mouse, etc.

Abstract the input events in such a fashion that new input event types are possible.

Either create a new thread for driving the producers/consumers, or drive them from the existing main loop.

At the same time, tools and the toolset should be refactored as input event consumers; the current way of having globals defined for tools does not appear to be sufficiently flexible for future needs. Duffy Langdon
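To make the idea concrete, here is a minimal, self-contained sketch of such a pipeline. None of these class names exist in the viewer; they are only placeholders for the pattern described above, and the single "pump" in main() could equally run on its own thread or be driven once per frame from the existing loop.

// Minimal sketch of a producer/consumer input pipeline.
// All class names here are hypothetical, not existing viewer classes.
#include <deque>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Abstract input event: a type tag plus generic data, so that new
// event types (keyboard, mouse, joystick, ...) can be added later.
struct InputEvent
{
    std::string type;   // e.g. "key_down", "mouse_move"
    int         code;   // key code, button id, ...
};

// A producer turns platform-dependent input into abstract InputEvents.
class InputProducer
{
public:
    virtual ~InputProducer() {}
    virtual void poll(std::deque<InputEvent>& queue) = 0;
};

// A consumer (a tool, the camera, the chat bar, ...) handles events.
class InputConsumer
{
public:
    virtual ~InputConsumer() {}
    // Return true if the event was consumed and should not propagate further.
    virtual bool handle(const InputEvent& event) = 0;
};

// Example producer: fakes a key press instead of reading real hardware.
class FakeKeyboardProducer : public InputProducer
{
public:
    void poll(std::deque<InputEvent>& queue) override
    {
        queue.push_back({"key_down", 'W'});
        queue.push_back({"key_up", 'W'});
    }
};

// Example consumer: a "tool" that only reacts to key presses.
class MoveTool : public InputConsumer
{
public:
    bool handle(const InputEvent& event) override
    {
        if (event.type == "key_down")
        {
            std::cout << "MoveTool saw key " << static_cast<char>(event.code) << "\n";
            return true;
        }
        return false;
    }
};

int main()
{
    std::deque<InputEvent> queue;
    std::vector<std::unique_ptr<InputProducer> > producers;
    std::vector<std::unique_ptr<InputConsumer> > consumers;
    producers.emplace_back(new FakeKeyboardProducer());
    consumers.emplace_back(new MoveTool());

    // One pump of the pipeline: producers fill the queue, consumers drain it.
    for (auto& p : producers) p->poll(queue);
    while (!queue.empty())
    {
        InputEvent event = queue.front();
        queue.pop_front();
        for (auto& c : consumers)
        {
            if (c->handle(event)) break;   // first interested consumer wins
        }
    }
    return 0;
}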

Move avatar mouths in a believable way based on voice audio chat streams being played

See User:SignpostMarv_Martin/Marv's_silly_ideas#Webcam_support

SignpostMarv Martin 08:55, 13 January 2007 (PST)

I've been looking at the code to see how lipsync might be implemented. Avatars use morphs for the mouth shapes used for emotes (LLEmote, derived from LLMotion), but these are triggered and have fade-in and fade-out times. For lipsync, the duration is not known at the time the viseme starts, and the transition times are much shorter than those used for emotes.

You can consider three types of lipsync. The most elaborate provides accurate mouth shapes and jaw movements (visemes) based on the sounds that are spoken (phones). For text-to-speech (TTS) this is pretty straightforward, since most TTS systems have an intermediate representation between the text and the speech: the phone sequence with timing information. It is a fairly simple task to translate the phones to visemes, which are then implemented as facial morphs (LLVisualParam). For live speech, the audio has to be decoded into a phone sequence, which is a fairly complicated task, but it can be done.
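For the TTS case, the phone-to-viseme step really is just a lookup, since several phones share one mouth shape. A toy sketch, with made-up phone labels and viseme names:

// Toy phone-to-viseme translation for the TTS case.
// The phone set and viseme names here are placeholders, not viewer assets.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct TimedPhone
{
    std::string phone;    // e.g. "AA", "M", "F"
    float       start;    // seconds
    float       duration; // seconds
};

struct TimedViseme
{
    std::string viseme;
    float       start;
    float       duration;
};

// Many-to-one mapping: several phones map to the same mouth shape.
static const std::map<std::string, std::string> PHONE_TO_VISEME = {
    {"AA", "viseme_open"},   {"AE", "viseme_open"},
    {"M",  "viseme_closed"}, {"B",  "viseme_closed"}, {"P", "viseme_closed"},
    {"F",  "viseme_fv"},     {"V",  "viseme_fv"},
    {"sil","viseme_closed"},
};

std::vector<TimedViseme> phonesToVisemes(const std::vector<TimedPhone>& phones)
{
    std::vector<TimedViseme> out;
    for (const TimedPhone& p : phones)
    {
        auto it = PHONE_TO_VISEME.find(p.phone);
        std::string v = (it != PHONE_TO_VISEME.end()) ? it->second : "viseme_closed";
        out.push_back({v, p.start, p.duration});   // timing carries over unchanged
    }
    return out;
}

int main()
{
    std::vector<TimedPhone> phones = {
        {"M", 0.00f, 0.08f}, {"AA", 0.08f, 0.15f}, {"sil", 0.23f, 0.20f}
    };
    for (const TimedViseme& v : phonesToVisemes(phones))
    {
        std::cout << v.start << "s: " << v.viseme << " for " << v.duration << "s\n";
    }
    return 0;
}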

A simpler form of lipsync just looks at the relative energy of the speech signal and uses only two or three visemes to represent the mouth movement. This still requires decoding of the audio stream, but it's basically just a refinement of automatic gain control.
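A minimal sketch of that energy measurement, assuming 16-bit PCM frames; the frame size and smoothing constants are arbitrary choices:

// Energy-based lipsync, step one: measure smoothed RMS energy per audio frame.
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Returns the RMS level of one frame of 16-bit PCM samples, normalized to 0..1.
float frameRMS(const std::vector<std::int16_t>& frame)
{
    if (frame.empty()) return 0.f;
    double sum = 0.0;
    for (std::int16_t s : frame)
    {
        double x = s / 32768.0;
        sum += x * x;
    }
    return static_cast<float>(std::sqrt(sum / frame.size()));
}

int main()
{
    // Fake 20 ms frame of "speech" at 16 kHz: a decaying tone burst.
    std::vector<std::int16_t> frame(320);
    for (size_t i = 0; i < frame.size(); ++i)
    {
        frame[i] = static_cast<std::int16_t>(8000.0 * std::sin(0.3 * i) * std::exp(-0.005 * i));
    }

    // Smooth across frames so the mouth doesn't flicker. A faster attack
    // than decay behaves much like automatic gain control. In real code
    // 'smoothed' would persist from frame to frame.
    float smoothed = 0.f;
    const float attack = 0.5f;
    const float decay  = 0.1f;
    float rms = frameRMS(frame);
    float k = (rms > smoothed) ? attack : decay;
    smoothed += k * (rms - smoothed);

    std::cout << "raw RMS = " << rms << ", smoothed = " << smoothed << "\n";
    return 0;
}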

The crudest form of lipsync just loops a babble animation while the speaker is speaking. This is what you'll find used for a lot of Japanese animated shows on TV, because it doesn't really matter which language is used for the soundtrack.

The choice of which of these three forms is used depends primarily on the level of realism expected, which in turn depends on the level of realism of the animated characters. For Second Life, the energy-based lipsync with three visemes is probably appropriate, ranging from "Express_Closed_Mouth" for silence or low energy to two weightings of "Express_Open_Mouth" for quiet and loud sounds.
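Turning that smoothed energy into the three-viseme scheme could be as simple as two thresholds (the values below are guesses, not tuned numbers):

// Map a smoothed speech energy (0..1) onto the three-viseme scheme:
// Express_Closed_Mouth for silence, and two weightings of
// Express_Open_Mouth for quiet and loud speech.
#include <initializer_list>
#include <iostream>

struct MouthWeights
{
    float closedMouth;  // weight for "Express_Closed_Mouth"
    float openMouth;    // weight for "Express_Open_Mouth"
};

MouthWeights energyToVisemes(float energy)
{
    const float silence = 0.02f;  // below this, treat as not speaking
    const float loud    = 0.20f;  // above this, fully open mouth

    if (energy < silence) return {1.0f, 0.0f};   // silence
    if (energy < loud)    return {0.5f, 0.5f};   // quiet speech
    return {0.0f, 1.0f};                         // loud speech
}

int main()
{
    for (float e : {0.0f, 0.05f, 0.3f})
    {
        MouthWeights w = energyToVisemes(e);
        std::cout << "energy " << e
                  << " -> closed " << w.closedMouth
                  << ", open " << w.openMouth << "\n";
    }
    return 0;
}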

The visual params for lipsync could be implemented with a looping LLMotion with zero fade-in and fade-out that is activated and deactivated as the visemes change, or the visual params could be set directly depending on the state of the speech.
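A sketch of the first option, switching a zero-fade looping motion only when the viseme actually changes; the avatar and motion types here are stand-ins, not the real LLMotion API:

// Sketch of the "looping motion" option: keep one lipsync motion active
// and only swap it when the viseme changes.
#include <iostream>
#include <string>

struct FakeAvatar
{
    void startMotion(const std::string& name) { std::cout << "start " << name << " (fade-in 0)\n"; }
    void stopMotion (const std::string& name) { std::cout << "stop  " << name << " (fade-out 0)\n"; }
};

class LipsyncDriver
{
public:
    explicit LipsyncDriver(FakeAvatar& av) : mAvatar(av) {}

    // Call once per audio frame with the current viseme name
    // (empty string means the avatar is not speaking).
    void setViseme(const std::string& viseme)
    {
        if (viseme == mCurrent) return;          // unchanged: nothing to do
        if (!mCurrent.empty()) mAvatar.stopMotion(mCurrent);
        if (!viseme.empty())   mAvatar.startMotion(viseme);
        mCurrent = viseme;
    }

private:
    FakeAvatar& mAvatar;
    std::string mCurrent;
};

int main()
{
    FakeAvatar av;
    LipsyncDriver lips(av);
    lips.setViseme("Express_Closed_Mouth");
    lips.setViseme("Express_Open_Mouth");
    lips.setViseme("Express_Open_Mouth");   // same viseme: no restart
    lips.setViseme("");                     // speech ended
    return 0;
}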

Because lipsync would use the same visual params as the emotes, we would either have to disable emotes during lipsync or blend the two together. Emotes usually blend to "Express_Closed_Mouth" but it wouldn't be very difficult to blend to the visual params for speech instead.
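The blend itself would just be a weighted combination of the two sets of morph weights, something like this purely illustrative example:

// Blending an emote's mouth morph with the lipsync morph instead of
// letting the emote settle to Express_Closed_Mouth.
#include <initializer_list>
#include <iostream>

// blend = 1.0 means the emote fully owns the mouth, 0.0 means lipsync does.
float blendMouthWeight(float emoteWeight, float lipsyncWeight, float blend)
{
    return blend * emoteWeight + (1.0f - blend) * lipsyncWeight;
}

int main()
{
    float emoteOpen   = 0.2f;  // a slight smile keeps the mouth a bit open
    float lipsyncOpen = 0.8f;  // the current viseme wants the mouth open
    for (float blend : {1.0f, 0.5f, 0.0f})
    {
        std::cout << "blend " << blend << " -> Express_Open_Mouth weight "
                  << blendMouthWeight(emoteOpen, lipsyncOpen, blend) << "\n";
    }
    return 0;
}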

We might want to go further though and apply the emotes for longer durations during the speech so that we could have smiling speech or pouting speech.

Unlike gestures, which are sent from the server, lipsync would happen entirely on the client. This is the only way to ensure synchronization. There would have to be some way of associating the audio channel with the correct avatar. The visemes would be computed from the streamed audio as the buffered audio is being sent to the sound system.
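Associating streams with avatars could be as simple as a map from channel id to avatar id, maintained by whatever code sets up the voice channel. Again, only a sketch with made-up types:

// Sketch: associate each voice audio channel with an avatar so the
// visemes computed from that stream drive the right face.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

typedef std::uint32_t ChannelID;
typedef std::string   AvatarID;   // stand-in for the viewer's UUID type

class LipsyncRouter
{
public:
    void bind(ChannelID channel, const AvatarID& avatar) { mChannelToAvatar[channel] = avatar; }
    void unbind(ChannelID channel)                       { mChannelToAvatar.erase(channel); }

    // Called from the audio path as each buffered block is handed to the
    // sound system; 'energy' would come from the RMS step sketched earlier.
    void onAudioBlock(ChannelID channel, float energy)
    {
        std::map<ChannelID, AvatarID>::iterator it = mChannelToAvatar.find(channel);
        if (it == mChannelToAvatar.end()) return;   // not a voice channel we know
        std::cout << "avatar " << it->second << ": energy " << energy << "\n";
    }

private:
    std::map<ChannelID, AvatarID> mChannelToAvatar;
};

int main()
{
    LipsyncRouter router;
    router.bind(7, "avatar-1234");
    router.onAudioBlock(7, 0.12f);   // drives avatar-1234's mouth
    router.onAudioBlock(9, 0.30f);   // unknown channel: ignored
    return 0;
}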

The Agent's own speech would have to be decoded directly with as little buffering as possible so that it appears realistic to the viewer when the camera is positioned to view the Agent's face.

All of this is, as they say, a simple matter of programming. --Mm Alder 07:48, 23 February 2007 (PST)