Aliyun Wan AI Video Generation Suite

Aliyun Wan is Alibaba Cloud's flagship visual generation model family from Tongyi Wanxiang Lab. Now integrated into Elser AI, Wan lets creators generate cinematic videos, animate still images, create talking digital humans, and produce synchronized audio-visual content — all without expensive GPUs or complex setups.

Why Create with Aliyun Wan on Elser AI

Native Audio-Video Joint Generation & Digital Human Lip Sync

Unlike traditional models that generate silent video first and add audio afterward, Aliyun Wan 2.5 and later output synchronized video with dialogue, sound effects, ambient sounds, and background music in a single forward pass. Lip sync is phoneme-level and covers at least 8 languages, including English, Chinese, Japanese, and Spanish.

Try Aliyun Wan Now

Native Multimodal Diffusion Transformer Architecture (MD-DiT)

Aliyun Wan 2.5 and later adopt a native multimodal diffusion transformer architecture that runs visual, audio, and text generation in parallel within a single inference process. It is presented as the industry's first model to generate audio and video natively in sync.

Director-Level Camera Control & Multi-Shot Narrative

Aliyun Wan handles complex camera operations that other video models struggle with, including push-pull shots, focus switching, tracking shots, perspective switching, and crane shots, and it blends them smoothly within a single clip. Wan 2.7 supports multi-shot compositing, which keeps character appearance consistent across scene transitions.

How to Use Aliyun Wan on Elser AI

Step 1: Sign Up & Choose Your Model

Create a free Elser AI account. In the video model selector, choose your Wan model: Wan 2.7, Wan 2.6, or Wan 2.6 Flash.

Step 2: Enter Your Prompt & Upload References

Describe your video idea in natural language, including camera movement, lighting, action, and mood. Wan understands professional filmmaking terminology and complex motion descriptions. Upload a still image for image-to-video, or upload reference images and videos for reference-to-video to lock character appearance and voice across multiple shots.
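
If you draft prompts in bulk, the four ingredients named above (action, camera movement, lighting, and mood) can be kept as separate fields and joined into one sentence. The sketch below is only an illustration: `ShotPrompt` and its field names are hypothetical and are not part of Elser AI or Wan. It simply produces text you could paste into the prompt box.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """Hypothetical container for the prompt ingredients (not an Elser AI or Wan type)."""
    action: str
    camera: str = ""
    lighting: str = ""
    mood: str = ""

    def to_text(self) -> str:
        # Join only the non-empty parts, with the action first.
        parts = [self.action, self.camera, self.lighting, self.mood]
        return ", ".join(p.strip() for p in parts if p.strip())


prompt = ShotPrompt(
    action="a chef plates a dessert in a busy kitchen",
    camera="slow push-in with a rack focus to the plate",
    lighting="warm tungsten light",
    mood="calm and focused",
)
print(prompt.to_text())
```

Keeping the fields separate makes it easy to reuse the same lighting and mood across several shots while changing only the action or camera move.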

Step 3: Customize & Generate

Adjust video duration (up to 15 seconds, depending on the model), resolution (720p or 1080p), and aspect ratio (16:9, 9:16, 1:1, 4:3, or 3:4). Generate your video and export it as an MP4 with a synchronized audio track, ready for social media, ads, or storyboards.

What Can You Do with Aliyun Wan?

Create Cinematic AI Videos from Text or Images

Generate cinematic multi-shot videos from text prompts, images, or multimedia references. Describe a scene, upload character references, or provide action examples. Wan delivers dynamic visuals with smooth camera movement, accurate lip sync, and immersive native audio.

Perfect for:

  • Short films & narrative shorts
  • Brand storytelling & ads
  • Social media clips & B-roll

Generate Consistent Characters Across Scenes (Reference-to-Video)

Wan Reference-to-Video maintains character identity, clothing, and facial features across multiple shots — eliminating the face-drift problem that plagues older video models. It also supports multi-character interaction videos using people or objects as protagonists.

You can:

  • Tell multi-scene stories with the same protagonist
  • Keep brand mascots & character designs on-model
  • Produce series-ready short dramas & episodic content

Create Talking Digital Humans

Animate a single portrait image with any audio clip to produce a talking digital human with natural lip sync and expressions. Drive presenters, avatars, and spokespeople directly from voice — no actor, studio, or motion capture required.

Great for:

  • Spokesperson, explainer & training videos
  • Talking avatars created from a single portrait
  • Multilingual lip-synced dialogue

People Are Talking About Aliyun Wan

The native audio sync on Wan saved me hours of post-production. No more manually syncing voiceovers to video.

— Sarah C., video editor

Finally, a model that understands complex camera movements like dolly zoom and rack focus.

— David L., AI researcher

I generated a 15-second product video with voiceover and background music in under two minutes. Wan is a game changer for e-commerce.

— Jessica W., digital marketing manager

The character consistency across multiple shots is unreal. No more face drift — I can actually tell a short story with the same protagonist.

— Michael T., indie animator

We used Wan's digital human for a pitch video. The client thought it was a real actor. Native lip sync made all the difference.

— Derek P., agency producer

As a YouTuber, I now create cinematic B-roll inserts just from text prompts. It saves me days of shooting and stock footage hunting.

— Linda Z., content creator

FAQs

Aliyun Wan is Alibaba Cloud's next-generation AI visual generation model family, developed by the Tongyi Wanxiang Lab — the same team behind China's leading open-source video generation models. Wan creates high-quality, realistic videos from text, images, and audio.

Wan uses a native multimodal diffusion transformer architecture that combines the cognitive capabilities of large language models with high-fidelity pixel synthesis. It analyzes multimodal inputs (text, image, audio, video) and generates synchronized video and audio outputs in a unified framework.

Elser AI offers a free tier for Wan with limited monthly credits (up to 10 video generations). Paid plans unlock higher resolutions, longer durations, priority rendering, and access to the latest Wan 2.7 features. Wan's open-source models are also available for self-hosting at no cost.

Aliyun Wan offers several unique advantages:

  • Native audio-video joint generation: synchronized speech, SFX, and BGM in a single pass
  • Digital human audio-driven animation: animate a single portrait image with any audio clip
  • Open-source MoE architecture: roughly 50% computational savings with cinematic-grade output
  • Multimodal input support: text, image, audio, and video can all be used as inputs

Wan 2.7 supports clips from 2 to 15 seconds, while Wan 2.6 and Wan 2.6 Flash support 5, 10, or 15 seconds. For longer narratives, use the video continuation feature in Wan 2.7 to extend existing clips while maintaining visual coherence.
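
If you plan clip lengths in a script before generating, the duration rules above can be written down as a small lookup. This is a sketch under stated assumptions: the model labels are illustrative strings rather than official identifiers, and Wan 2.7 is assumed to accept whole-second values between 2 and 15.

```python
# Duration rules as stated above. The keys are illustrative labels,
# not official model identifiers.
ALLOWED_DURATIONS = {
    "wan-2.7": set(range(2, 16)),   # assumption: whole seconds from 2 to 15
    "wan-2.6": {5, 10, 15},
    "wan-2.6-flash": {5, 10, 15},
}


def is_valid_duration(model: str, seconds: int) -> bool:
    """Return True if the clip length is allowed for the given model label."""
    allowed = ALLOWED_DURATIONS.get(model)
    if allowed is None:
        raise ValueError(f"unknown model label: {model}")
    return seconds in allowed


print(is_valid_duration("wan-2.7", 8))   # True
print(is_valid_duration("wan-2.6", 8))   # False
```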

Wan generates video at 720p or 1080p and 24 fps. Aspect ratios include 16:9, 9:16, 1:1, 4:3, and 3:4, covering YouTube widescreen, TikTok/Reels vertical, Instagram square, and traditional broadcast formats.
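
As a quick worked example of these numbers, the sketch below computes how many frames a clip contains at 24 fps and classifies each listed aspect ratio by orientation. The function names are hypothetical helpers for illustration only and do not correspond to any Elser AI or Wan interface.

```python
from fractions import Fraction

FPS = 24  # frame rate stated above
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")


def frame_count(seconds: int, fps: int = FPS) -> int:
    """Number of frames in a clip of the given length."""
    return seconds * fps


def orientation(aspect: str) -> str:
    """Classify a 'W:H' aspect ratio as landscape, portrait, or square."""
    width, height = (int(part) for part in aspect.split(":"))
    ratio = Fraction(width, height)
    if ratio > 1:
        return "landscape"
    if ratio < 1:
        return "portrait"
    return "square"


print(frame_count(15))  # 360 frames in a 15-second clip
for aspect in ASPECT_RATIOS:
    print(aspect, orientation(aspect))
```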

Wan supports phoneme-level lip sync for 8+ languages including English, Chinese (Mandarin), Japanese, Spanish, French, German, Korean, and Russian. More languages are coming in future updates.

The three models differ as follows:

  • Wan 2.7: the latest suite, with multimodal input (text, image, audio, video), a Thinking Mode that interprets intent before rendering, first-and-last-frame generation, video continuation, and reference tracking for up to 5 subjects
  • Wan 2.6: focused on reference-to-video role-playing, intelligent multi-shot storytelling, and 1080p output of up to 15 seconds
  • Wan 2.6 Flash: the speed-optimized variant for rapid iteration

No special hardware is required. You only need a device with internet access: all processing happens on Elser AI's cloud servers, so there is no need for a GPU, large amounts of RAM, or software installation. If you self-host Wan's open-source models instead, a single 24 GB GPU is sufficient for inference.

Bring Your Stories to Life with Aliyun Wan

Sign up on Elser AI and unlock the power of Aliyun Wan, from text-to-video and image-to-video to talking digital humans and native audio sync. Generate professional cinematic videos in minutes, with no special skills or GPU required.

Try Aliyun Wan on Elser AI