- Abstract
Z-Image is a family of 6B-parameter image generation foundation models open-sourced by Tongyi-MAI and built on the Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture. Unlike Z-Image-Turbo, which prioritizes speed, Z-Image is positioned as a full-capacity, non-distilled backbone for creators, researchers, and developers who need finer control, broader style coverage, and higher generative diversity.
- Core features
- Non-distilled base model: retains the full training signal and supports full CFG (classifier-free guidance), which suits complex prompt engineering and professional workflows.
- Broad aesthetic and style coverage: from photorealism and cinematic imagery to illustration, anime, and a range of stylized looks, suited to wide-ranging creative exploration.
- Higher output diversity: composition, facial identity, and lighting vary more noticeably across random seeds, so each person in a multi-person scene is more likely to get a distinct face.
- Robust negative prompts: the model responds more reliably to negative prompts, which can be used to suppress artifacts, control composition, and reduce unwanted elements.
- Built for downstream development: a natural base for LoRA fine-tuning that can also be extended with structural conditioning (such as ControlNet) and semantic conditioning.
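
To make the CFG feature above more concrete, here is a minimal sketch of the standard classifier-free guidance blend in plain Python. It is the generic textbook formula, not code from the Z-Image repository; the function name `cfg_combine` and the toy vectors are illustrative assumptions. In many diffusion pipelines the negative prompt is fed into the unconditional branch, which is why guidance and negative prompts tend to be tuned together.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Blend unconditional and conditional predictions with a guidance scale.

    Hypothetical helper for illustration; real pipelines apply the same
    arithmetic to tensors at every denoising step.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    # scale = 0 returns the unconditional prediction, scale = 1 the conditional
    # one, and scale > 1 extrapolates further in the prompt's direction.
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy stand-ins for the two model predictions at one denoising step.
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]

guided = cfg_combine(uncond, cond, scale=4.0)
print(guided)  # [4.0, 1.0, -6.0]
```

Some implementations use a slightly different convention for the scale, so the numeric value that works well may differ between tools.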
- Installation
- Get the code: clone the official GitHub repository, then create a Python environment and install dependencies as described in the repository instructions.
- Get the weights: download the variant you need (Z-Image / Turbo / Omni-Base / Edit) from Hugging Face or ModelScope.
- Run inference: follow the repository's Quick Start or sample scripts, and choose parameters such as step count, CFG scale, and resolution to fit your memory and speed requirements.
- Typical use cases
- Style exploration and creative divergence: most useful when you need many candidate images that differ substantially (in composition, lighting, or character appearance).
- Professional prompt engineering: rely on CFG, negative prompts, and multiple rounds of iteration to steer the image toward a more controllable result.
- Downstream fine-tuning: use Z-Image or Omni-Base as the base for training style LoRAs, character LoRAs, and industry-specific asset LoRAs.
- Image editing: use Z-Image-Edit for natural-language-driven local edits, style transfer, and consistency-preserving edits.
- Development integration: embed generation into existing workflows (poster drafts, batch asset generation, A/B comparison of visual concepts).
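
For the exploration and batch-generation cases above, one simple pattern is to enumerate jobs over a grid of seeds and guidance scales before handing them to whatever inference script you use. The sketch below only builds the job list; `build_sweep`, the dictionary keys, and the example prompt are hypothetical names chosen for illustration, not part of the Z-Image codebase.

```python
from itertools import product
from typing import Dict, List, Sequence


def build_sweep(prompt: str, seeds: Sequence[int], cfg_scales: Sequence[float]) -> List[Dict]:
    """Enumerate one generation job per (seed, cfg_scale) pair."""
    return [
        {"prompt": prompt, "seed": seed, "cfg_scale": cfg}
        for seed, cfg in product(seeds, cfg_scales)
    ]


jobs = build_sweep(
    "poster draft, summer music festival",
    seeds=range(4),
    cfg_scales=[3.0, 5.0],
)
print(len(jobs))  # 8 jobs: 4 seeds x 2 guidance scales
```

Keeping the sweep as plain data makes it easy to log, resume, or split across workers, and each job carries the seed needed to regenerate a chosen candidate later.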
- Ecosystem and alternatives
- Ecosystem: code and weights are distributed through GitHub, Hugging Face, and ModelScope, and online demos and galleries are available for trying the models.
- Comparison with alternatives: compared with common distilled, accelerated models, Z-Image emphasizes base capability, controllability, and fine-tunability. Compared with closed-source commercial models, its advantage is that it is open source, transparent, and customizable, but the final result still depends on the quality of your prompts, parameters, and downstream fine-tuning.
- Limitations and precautions
- Because the base model favors generative freedom, reliably reproducing the same image requires stricter management of seeds, parameters, and versions.
- CFG scale, resolution, and step count significantly affect quality and speed, so it is advisable to establish team-wide default configurations and regression test cases.
- For scenarios such as multi-person consistency and complex text layout, generating several samples, selecting manually, and retouching afterward is still recommended.
- The variants target different needs: Turbo suits high throughput and low latency; Z-Image suits creative work and fine-tuning; Edit is for editing tasks; Omni-Base serves more as a general-purpose base.
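
One way to approach the seed, parameter, and version management mentioned above is to keep every generation setting in a single immutable record and derive a short fingerprint from it. The sketch below is an assumption about how a team might do this; the class `GenerationConfig`, its field names, and its default values are placeholders, not settings recommended by the Z-Image project.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationConfig:
    # All field names and default values are illustrative placeholders,
    # not recommended settings from the Z-Image project.
    prompt: str
    negative_prompt: str = ""
    seed: int = 0
    steps: int = 30
    cfg_scale: float = 4.0
    width: int = 1024
    height: int = 1024
    model_version: str = "unspecified"

    def fingerprint(self) -> str:
        """Stable short hash of every field, for naming outputs and regression cases."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = GenerationConfig(prompt="a red bicycle leaning on a brick wall", seed=42)
same = GenerationConfig(prompt="a red bicycle leaning on a brick wall", seed=42)
other = GenerationConfig(prompt="a red bicycle leaning on a brick wall", seed=43)

print(base.fingerprint() == same.fingerprint())   # True
print(base.fingerprint() == other.fingerprint())  # False
```

Storing the fingerprint alongside each output image makes it straightforward to tell whether two images came from identical settings, and a small set of saved configs can double as regression cases when the model version or defaults change.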
- Project address
https://github.com/Tongyi-MAI/Z-Image
- Frequently asked questions
Q: What is the core difference between Z-Image and Z-Image-Turbo?
A: Z-Image leans toward a full-capacity, non-distilled base with CFG controllability and fine-tunability, while Turbo leans toward distillation-based acceleration and faster generation in fewer steps.
Q: Why is Z-Image better suited as a LoRA/ControlNet base?
A: Non-distilled models usually retain more complete representational capacity and training signal, which makes it easier to inject new styles and conditional control downstream.
Q: How can negative prompts improve the stability of Z-Image outputs?
A: List common failure modes explicitly in the negative prompt (artifacts, deformities, duplicated limbs, low sharpness, incorrect text, and so on), then tune the CFG scale and step count alongside it.
Q: What editing tasks is Z-Image-Edit suitable for?
A: It is best suited to instruction-based editing, such as local replacement, style transfer, background adjustment, and repainting while preserving subject consistency.
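
As a small illustration of the negative-prompt answer in this FAQ, the sketch below merges reusable groups of negative terms into one string and drops duplicates. The helper `build_negative_prompt` and the term lists are hypothetical examples; which terms actually help depends on your prompts and settings.

```python
from typing import Iterable


def build_negative_prompt(*term_groups: Iterable[str]) -> str:
    """Merge groups of negative terms into one comma-separated string, without duplicates."""
    seen = set()
    merged = []
    for group in term_groups:
        for term in group:
            cleaned = term.strip()
            key = cleaned.lower()  # compare case-insensitively
            if cleaned and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return ", ".join(merged)


# Example term groups; these are illustrative, not an official list.
ARTIFACTS = ["blurry", "low resolution", "jpeg artifacts"]
ANATOMY = ["deformed hands", "duplicate limbs", "Blurry"]
TEXT = ["garbled text"]

negative = build_negative_prompt(ARTIFACTS, ANATOMY, TEXT)
print(negative)
# blurry, low resolution, jpeg artifacts, deformed hands, duplicate limbs, garbled text
```

Keeping the groups separate lets a team version them independently and switch a group off when it starts suppressing something the image actually needs.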