Five minutes ago, she was just a text prompt. Now she has a voice, a face, and she's talking to you.
Watch that video.
That avatar didn't exist 10 minutes before I recorded this. No stock footage. No motion capture. No expensive software. Just a text description, some API calls, and a pipeline I'm about to show you.
By the end of this guide, you'll be able to create your own AI avatar that:
- Has a consistent, unique face
- Speaks with lip-synced video
- Uses any voice you want
Let's build it.
If you haven't subscribed yet, click the subscribe button to stay up to date on cool upcoming OpenClaw tutorials.
Why Give Your AI Agent an Avatar?
AI agents are powerful. They can manage your calendar, write code, answer questions, automate your life. But they're invisible.
Adding an avatar changes everything:
- Emotional connection — It's easier to trust something you can see
- Video messages — Your agent can send you personalized updates
- The "wow" factor — This is the future, and you can build it today
Here's the avatar image that started that video:

One image. One text prompt. A working video avatar in minutes.
You can find my tutorial on how to set up OpenClaw here.
The Pipeline
Here's what we're building:
[Text Prompt]
↓ (Gemini Image Generation)
[Your Avatar's Face]
↓ (Veo 3 Video Generation)
[Lip-Synced Video with Generated Voice]
↓ (Optional: ElevenLabs Speech-to-Speech)
[Video with YOUR Custom Voice]
The key insight: Veo 3 generates native lip sync. The mouth movements actually match the speech. Then we can optionally swap the voice to any voice we want using ElevenLabs.
Prerequisites
You'll need:
- OpenClaw installed (docs.openclaw.ai)
- Google AI Studio API key (with Veo 3 access)
- ElevenLabs API key (optional, for custom voice)
- FFmpeg (brew install ffmpeg on Mac)
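Before going further, a quick sanity check helps. Here's a minimal sketch of my own (not part of OpenClaw or any SDK) that verifies FFmpeg and the API keys are in place; the ELEVENLABS_API_KEY environment variable is my assumption for the optional step:

# Minimal environment check (my own sketch, not official tooling)
import os
import shutil

assert shutil.which("ffmpeg"), "FFmpeg not found - install it first"
assert os.environ.get("GEMINI_API_KEY"), "Set GEMINI_API_KEY before running the scripts"
if not os.environ.get("ELEVENLABS_API_KEY"):
    print("Note: ELEVENLABS_API_KEY is only needed for the optional voice swap in Step 3")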
Step 1: Create Your Avatar
First, we generate a consistent face for our avatar using Gemini's image generation.
# Using Gemini's image generation (adjust path to your setup)
uv run generate_image.py \
  --prompt "full body shot, professional woman with blonde hair, \
confident smile, modern office setting, warm lighting, photorealistic" \
  --filename avatar.png \
  --resolution 1K
Tips for good avatar prompts:
- Be specific about features (hair, expression, clothing)
- Include the setting/background
- Use "photorealistic" for realistic results
- Full body or mid-shot works best for video
Building Consistency
Once you have a base avatar, use it as a reference for variations:
uv run generate_image.py \
  --prompt "same person in evening dress, city rooftop at sunset" \
  -i avatar.png \
  --filename avatar-evening.png
The -i flag uses your existing image as a reference, keeping the face consistent across different scenes.
The Image Generation Script
Here's the script that powers avatar creation:
#!/usr/bin/env python3
"""
generate_image.py - Create images with Gemini's image generation
"""
import argparse
import os
from pathlib import Path
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image as PILImage


def main():
    parser = argparse.ArgumentParser(description="Generate images with Gemini")
    parser.add_argument("--prompt", "-p", required=True, help="Image description")
    parser.add_argument("--filename", "-f", required=True, help="Output filename")
    parser.add_argument("--input-image", "-i", dest="input_image", help="Reference image for editing")
    parser.add_argument("--resolution", "-r", choices=["1K", "2K", "4K"], default="1K")
    args = parser.parse_args()

    # Initialize client (uses GEMINI_API_KEY env var)
    client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))

    # Build content - image first if editing, then prompt
    if args.input_image:
        input_img = PILImage.open(args.input_image)
        contents = [input_img, args.prompt]
        print("Editing image with prompt...")
    else:
        contents = args.prompt
        print("Generating new image...")

    # Generate
    response = client.models.generate_content(
        model="gemini-3-pro-image-preview",
        contents=contents,
        config=types.GenerateContentConfig(
            response_modalities=["TEXT", "IMAGE"],
            image_config=types.ImageConfig(image_size=args.resolution)
        )
    )

    # Save the first image part in the response
    output_path = Path(args.filename)
    for part in response.parts:
        if part.inline_data is not None:
            image_data = part.inline_data.data
            image = PILImage.open(BytesIO(image_data))
            image.convert('RGB').save(str(output_path), 'PNG')
            print(f"✅ Saved: {output_path.resolve()}")
            return

    print("❌ No image generated")


if __name__ == "__main__":
    main()
Install dependencies:
pip install google-genai pillow
# or with uv:
uv pip install google-genai pillow
Set your API key:
export GEMINI_API_KEY="your-api-key-here"
Step 2: Generate Lip-Synced Video with Veo 3
Here's where the magic happens. Veo 3 can generate video with speech — and the lip sync is built-in.
from google import genai
from google.genai import types
import time

client = genai.Client()

# Load your avatar
with open("avatar.png", "rb") as f:
    image_bytes = f.read()

# The prompt is CRITICAL
prompt = """no music, no singing, no sound effects.
Person walking toward camera with confident smile.
Speaking directly to camera: 'Five minutes ago, I was just a text prompt.
Now I have a voice, a face, and I'm talking to you. What will you create?'
Engaging eye contact, natural gestures."""

operation = client.models.generate_videos(
    model="veo-3.0-generate-001",
    prompt=prompt,
    image=types.Image(image_bytes=image_bytes, mime_type="image/png"),
)

# Wait for generation (usually 30-60 seconds)
for _ in range(30):
    time.sleep(10)
    result = client.operations.get(operation)
    if result.done:
        if result.response.generated_videos:
            video = result.response.generated_videos[0]
            video_data = client.files.download(file=video.video)
            with open("avatar_video.mp4", "wb") as f:
                f.write(video_data)
            print("✅ Video generated!")
        else:
            print("❌ Generation failed or was filtered")
        break
Prompt Engineering Tips
Always start with:
no music, no singing, no sound effects.
This prevents Veo from adding background music or making your avatar randomly burst into song (yes, it will try).
Structure your prompt in this order (a small helper that enforces it is sketched after the list):
- Audio instructions (no music, etc.)
- Visual description (pose, movement, setting)
- Exact dialogue in quotes
- Body language cues
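If you're generating prompts programmatically (say, from an agent's output), a tiny helper keeps that structure consistent. This is just a sketch; build_veo_prompt is my own name, not part of any SDK:

# Hypothetical helper (my own, not an SDK function) that assembles a prompt
# in the order above: audio guard, visuals, quoted dialogue, body language.
def build_veo_prompt(visuals: str, dialogue: str, body_language: str) -> str:
    return (
        "no music, no singing, no sound effects.\n"
        f"{visuals}\n"
        f"Speaking directly to camera: '{dialogue}'\n"
        f"{body_language}"
    )

prompt = build_veo_prompt(
    visuals="Person at desk, morning light through window.",
    dialogue="Good morning. Here are your three priorities for today.",
    body_language="Slight smile, making eye contact.",
)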
Example prompts for different vibes:
Professional briefing:
no music, no singing, no sound effects.
Person at desk, morning light through window.
Speaking professionally: 'Good morning. Here are your three priorities for today.'
Slight smile, making eye contact.
Casual check-in:
no music, no singing, no sound effects.
Person in cozy setting, relaxed posture.
Speaking warmly: 'Hey! Just wanted to check in. How's your day going?'
Friendly expression, natural gestures.
Step 3 (Optional): Swap to Your Custom Voice
The video from Veo 3 includes generated speech, but it's not your voice. If you want a specific voice, here's how to swap it.
Extract the original audio
ffmpeg -y -i avatar_video.mp4 \
  -vn -acodec pcm_s16le -ar 44100 -ac 1 \
  original_audio.wav
Convert with ElevenLabs Speech-to-Speech
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="your-api-key")

with open("original_audio.wav", "rb") as f:
    audio_data = f.read()

response = client.speech_to_speech.convert(
    voice_id="your-voice-id",  # From ElevenLabs voice library
    audio=audio_data,
    model_id="eleven_english_sts_v2",
    output_format="mp3_44100_128"
)

with open("custom_voice.mp3", "wb") as f:
    for chunk in response:
        f.write(chunk)
The Speech-to-Speech API preserves the timing while changing the voice. This keeps the lip sync perfect!
Merge back together
ffmpeg -y \
  -i avatar_video.mp4 \
  -i custom_voice.mp3 \
  -c:v copy -c:a aac \
  -map 0:v:0 -map 1:a:0 \
  -shortest \
  final_video.mp4
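If you'd rather script this whole step than run the commands by hand, here's a sketch that chains the three steps with subprocess. swap_voice is my own helper name, and it assumes FFmpeg is on your PATH and your key is in an ELEVENLABS_API_KEY environment variable:

# Sketch of Step 3 end to end; swap_voice and the intermediate
# file names are my own, not from any official tooling.
import os
import subprocess

from elevenlabs import ElevenLabs


def swap_voice(video_path: str, voice_id: str, output_path: str = "final_video.mp4") -> str:
    """Replace the Veo-generated voice in video_path with an ElevenLabs voice."""
    # 1. Extract the generated speech as mono 16-bit WAV
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn", "-acodec", "pcm_s16le", "-ar", "44100", "-ac", "1",
         "original_audio.wav"],
        check=True,
    )

    # 2. Convert to the target voice; Speech-to-Speech preserves timing
    client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])  # assumed env var
    with open("original_audio.wav", "rb") as f:
        converted = client.speech_to_speech.convert(
            voice_id=voice_id,
            audio=f.read(),
            model_id="eleven_english_sts_v2",
            output_format="mp3_44100_128",
        )
    with open("custom_voice.mp3", "wb") as f:
        for chunk in converted:
            f.write(chunk)

    # 3. Remux: keep the video stream, swap in the new audio track
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-i", "custom_voice.mp3",
         "-c:v", "copy", "-c:a", "aac",
         "-map", "0:v:0", "-map", "1:a:0", "-shortest",
         output_path],
        check=True,
    )
    return output_path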
The Complete Script
Here's everything packaged into one reusable function:
#!/usr/bin/env python3
"""
AI Avatar Video Generator
Create lip-synced videos with custom voice
"""
import time

from google import genai
from google.genai import types


def create_avatar_video(
    avatar_path: str,
    message: str,
    output_path: str = "avatar_video.mp4"
):
    """Generate a video of your avatar speaking."""
    client = genai.Client()

    # Load avatar image
    with open(avatar_path, "rb") as f:
        image_bytes = f.read()

    # Craft the prompt
    prompt = f"""no music, no singing, no sound effects.
Person speaking naturally to camera with friendly expression.
Speaking in a warm tone: '{message}'
Natural body language, engaging eye contact."""

    # Generate video
    print("🎬 Generating video...")
    operation = client.models.generate_videos(
        model="veo-3.0-generate-001",
        prompt=prompt,
        image=types.Image(image_bytes=image_bytes, mime_type="image/png"),
    )

    # Poll until the operation finishes (usually 30-60 seconds)
    for _ in range(30):
        time.sleep(10)
        result = client.operations.get(operation)
        if result.done:
            if result.response.generated_videos:
                video = result.response.generated_videos[0]
                video_data = client.files.download(file=video.video)
                with open(output_path, "wb") as f:
                    f.write(video_data)
                print(f"✅ Saved to {output_path}")
                return output_path
            break  # done but filtered/failed - stop polling
    return None


if __name__ == "__main__":
    # Basic usage - generates video with Veo's voice
    # For custom voice, see Step 3 (Speech-to-Speech swap)
    create_avatar_video(
        avatar_path="avatar.png",
        message="Hi! I'm your AI avatar. What would you like me to say?",
        output_path="my_avatar_video.mp4"
    )
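If you also want the custom voice, you can chain this with the swap_voice sketch from Step 3; the voice ID below is a placeholder from your ElevenLabs voice library:

# Hypothetical end-to-end run: generate with Veo's built-in voice, then swap it
video = create_avatar_video(
    avatar_path="avatar.png",
    message="Hi! I'm your AI avatar. What would you like me to say?",
)
if video:
    swap_voice(video, voice_id="your-voice-id", output_path="final_video.mp4")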
Ideas for Your Avatar
Now that you have the pipeline, here's what you can build (a worked briefing example follows the list):
Daily briefings
- Morning summary: weather, calendar, priorities
- End-of-day recap
Notifications
- "Your deployment succeeded!"
- "Meeting in 15 minutes"
- Custom alerts with personality
Content creation
- YouTube intros/outros
- Tutorial narration
- Social media content
Personal touch
- Birthday messages
- Celebration videos
- Check-in messages
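As a concrete taste of the briefing idea, here's a hypothetical morning script; get_briefing() is a stand-in for however your agent assembles the day's summary:

# Hypothetical daily briefing - get_briefing() is a placeholder for your own
# agent logic (calendar, weather, task list, etc.)
def get_briefing() -> str:
    return "Good morning. You have three meetings today, and it's sunny out."

create_avatar_video(
    avatar_path="avatar.png",
    message=get_briefing(),
    output_path="morning_briefing.mp4",
)

Hook that up to cron or your agent's scheduler and you've got a video briefing waiting for you every morning.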
If you haven't subscribed yet, click the subscribe button to stay up to date on cool upcoming OpenClaw tutorials.
Wrapping Up
You just built something most people don't know is possible:
✅ An AI-generated avatar with a consistent face
✅ Lip-synced video from a single image
✅ Optional custom voice that matches perfectly
The avatar in that video at the top? She didn't exist until I wrote this article. Now she can say anything I want.
What will your avatar say?