Back to AI is open source
MiniMax open-source VTP: Scalable pre-training allows Visual Tokenizer to directly improve the quality of DiT generation

MiniMax open-source VTP: Scalable pre-training allows Visual Tokenizer to directly improve the quality of DiT generation

AI is open source Admin 82 views

1. Abstract

VTP (Visual Tokenizer Pre-training) is an open-source visual tokenizer pre-training framework developed by the MiniMax (Hailuo) team, which is aimed at next-generation generative models such as diffusion models and Diffusion Transformer (DiT). The project pointed out that the traditional "rebuild-only" tokenizer training will bias the latent space towards low-level pixel information, resulting in the problem of pre-training scaling that "reconstruction is more accurate but not necessarily better generation". VTP will jointly optimize representation learning and compression-reconstruction, so that the tokenizer can more stably translate into downstream generation quality improvement when the model scale, data, and computing power expand, and try not to change the standard DiT training specifications.

2. Core features

  1. Joint optimization of three types of objectives: graphic and text comparison learning, self-supervised learning and reconstruction goal joint training, taking into account semantic representation and decoding.
  2. Scalable tokenizer scaling: Emphasizing that investing computing power in tokenizer pre-training can bring downstream generation benefits, rather than just pursuing lower reconstruction errors.
  3. Generation-oriented evaluation link: Provide an integrated control of understanding (zero-shot/linear probing), reconstruction (rFID) and generation (FID based on LightningDiT).
  4. Open source weights and multi-size models: Hugging Face provides model specifications such as Small/Base/Large to facilitate trade-offs based on resources and effects.

3. Installation

  1. Create an environment: conda creates a Python 3.10 environment and activates it.
  2. Initialize submodules: The warehouse uses submodules to manage some dependent code, which needs to be pulled recursively.
  3. Install dependencies: Install Python dependencies according to requirements.
  4. Run the evaluation script: modify the path according to the script description and execute the zero-sample, linear detection, reconstruction and generation evaluation script; Generate an evaluation link and use LightningDiT-related scripts to complete feature extraction, training, and sampling.

4. Typical use cases

  1. Stage-1 tokenizer of DiT/diffusion model: Verify the influence of "stronger latent" on generation quality and convergence speed without changing the generator structure.
  2. Visual representation extraction: used for retrieval, classification, clustering or downstream light tasks (zero-shot and linear probing).
  3. Research reconstruction-semantic trade-off: Compared with the traditional VAE/VQ tokenizer, the semantics and generative learnability changes of the latent space after adding representation learning are analyzed.
  4. Reproduce the experimental curve: Based on the open source script, the scaling comparison of parameters/data/computing power dimensions is used to construct the correlation curve between tokenizer and generation performance.

5. Ecology and competing products

  1. Related ecology: The training and evaluation link involves comparative learning, self-supervised representation learning, and DiT generation evaluation processes, which is convenient for aligning with mainstream visual representation and diffusion generation systems.
  2. Direction of competing products: traditional LDM commonly used reconstructed VAE, VQ-VAE/VQGAN, etc. as tokenizers; There are also improved routes to enhance latent space through distillation or regularity. The difference of VTP is that it takes "understanding/characterization" as the key driver of generative scalability and verifies its gain to downstream generation with systematic evaluation.

6. Limitations and Precautions

  1. Resource threshold: Complete reproduction of large-scale tokenizer pre-training and generation evaluation requires strong computing power, data, and engineering pipelines.
  2. Engineering integration cost: Before replacing the existing tokenizer, it is necessary to evaluate the latent variable interface, compression ratio, decoding speed, and end-to-end stability.
  3. Results depend on training formula: different data distributions, sampling strategies and generator settings will affect the final indicators, and it is recommended to do strict budget comparison and visual inspection.
  4. The project is still evolving: Some models/scripts and instructions may be adjusted with version updates, and it is recommended to refer to the latest content of the repository and model pages.

7. Project address

https://github.com/MiniMax-AI/VTP

8. FAQs

Q: What is the core problem solved by VTP (Visual Tokenizer Pre-training)?

A: Solve the "visual tokenizer pre-training scale problem", that is, it is difficult for traditional tokenizers that only rebuild training to stably convert more computing power into downstream DiT/diffusion generation quality improvement.

Q: Why is VTP emphasizing representational learning more important for generation (Diffusion Transformer/DiT)?

A: The idea is to generate a learnable latent space that relies more on high-level semantics and structure; Only pursuing pixel-level reconstruction accuracy can easily make the latent space low-level information, resulting in stagnation of generation revenue.

Q: Can VTP improve the generation quality without increasing the generator's training power?

A: The goal is to put the main increment on the tokenizer pre-training side and try to keep the standard DiT training specifications comparable, so as to lead to better generation with better latency.

Q: How should I choose VTP-Small/Base/Large on Hugging Face?

A: Generally, larger tokenizers have stronger representation capabilities but higher resource requirements; You can use Small/Base to go through the evaluation link first, and then evaluate the benefits of Large under the same budget.

Q: What should I focus on when replacing the VAE/VQ tokenizer of an existing LDM?

A: Focus on latent variable shape and interface compatibility, compression rate and decoding speed, generation training stability, and FID/convergence speed and subjective quality comparison under the same training budget.

VTP Visual Tokenizer Pre-Training Framework VTP joint optimization improves generation quality VTP solves the problem of rebuilding the strong and the weak VTP is a tokenizer solution for DiT VTP comparative learning enhances subspatial semantics VTP self-supervised and scalable training VTP reconstruction and representation learning are trained together VTP allows tokenizer scaling Full analysis of VTP integrated evaluation link VTP assessed the quality of reconstruction with rFID VTP is based on LightningDiT to measure FID VTP open source weights from Small to Large VTP is compatible with the diffusion model stage1 module VTP does not change the DiT specification to improve generation VTP makes computing power investment more focused on tokenizers VTP improves the learning structure of the latent space VTP vs. traditional VAE reconstruction routes VTP compared with VQGAN latent space advantages VTP is used for zero-shot visual characterization extraction VTP is used for linear probing downstream tasks VTP supports retrieval, classification, and clustering applications VTP studies reconstruct semantic trade-off curves VTP reproduces the computing power of experimental parameter data VTP replaces LDM tokenizer notes VTP focuses on latent variable interface compatibility VTP focuses on compression and decoding speed VTP focuses on training stability and convergence speed VTP and budget to improve FID performance VTP reduces the risk of stagnation in generating returns VTP copes with large-scale pre-training bottlenecks VTP makes the latent space no longer biased towards low-level pixels VTP emphasizes semantic structure-driven generation VTP is suitable for enterprise-level generation of R&D pipelines VTP open source repository MiniMax-AI project VTP script overrides understanding reconstruction generation VTP generation evaluation includes feature extraction process VTP supports Small to run through the entire link first VTP then compares Base and Large returns VTP resource threshold and computing power demand prompt VTP Engineering Integration Cost Assessment Guide The VTP results are explained by the influence of data distribution VTP recommends strict budget visualization checks VTP project iterations need to pay attention to the latest instructions VTP was used for the DiT convergence velocity control experiment VTP is used as a subspatial regularization alternative Analysis of VTP and distillation enhanced routes VTP is geared towards the next generation of generative model bases VTP helps to steadily improve the quality of diffusion generation VTP makes tokenizer pre-training more controllable VTP from characterization to decoding joint optimization

Recommended Tools

More