Open source LongCat-Flash-Omni: 128K context + text/image/audio/video to speech output

I. Abstract

LongCat-Flash-Omni is an open-source, multimodal (Omni-modal) model from Meituan's LongCat team. It extends the ScMoE architecture of LongCat-Flash by providing unified modeling for text, images, audio, and video. It has approximately 560 bytes of parameters and 27 bytes of activations, primarily targeting millisecond-level end-to-end voice dialogue, 128K context, and real-time audio and video interaction scenarios exceeding 8 minutes. Its key features include early multimodal fusion training, decoupled modal parallel infrastructure, and the accompanying LongCat-Audio-Codec for high-quality speech output.

II. Core Features

Full-modal I/O: Input can be any combination of text, image, audio, or video, and output text or voice, adapting to real-time agents.
Low-latency speech: End-to-end speech understanding and synthesis latency is controlled at the millisecond level, which is suitable for "interrupted" dialogue.
Long context: Native 128K, which can support long meetings, multi-turn voice and long video understanding.
ScMoE architecture: 560B total parameters + 27B activations, with computational cost approaching the efficiency of pure text training.
Unified training paradigm: Integrate multimodal training in the early stages to avoid losing points in a single modality, and take into account listening, watching and speaking.

III. Installation

1. Clone the GitHub repository: git clone https://github.com/meituan-longcat/LongCat-Flash-Omni and enter the directory.

Install dependencies according to the environment instructions provided in the repository. You can choose between vLLM/SGLang/self-developed inference service. A GPU is required and it is recommended that the video memory be ≥40GB. Multiple GPUs can be used in parallel.

3. Pull the corresponding weights and examples from Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Flash-Omni; If voice output is required, install LongCat-Audio-Codec simultaneously.

After deployment, conduct text/voice tests via REST/WebSocket or the official LongCat.AI frontend.

IV. Typical Use Cases

Real-time voice assistant: outbound calls, customer service, and companionship interactions, requiring low latency and multi-turn memory.
AV Scene Understanding: Extract key points and answer questions from audio and video inputs for meetings/live broadcasts/courses.
Text and audio explanation: Input screenshots/photos/documents to generate audio explanations or multilingual summaries.
Agent Project Entry Point: Hands over the video/voice perception results to the toolchain or business process for further execution.

V. Ecology and Competitors

Ecosystem: Complementary to LongCat-Flash-Chat, LongCat-Flash-Thinking, and LongCat-Audio-Codec, enabling unified versions and training paradigms within the same organization.
Competitors: The capabilities of Qwen series Omni, InternLM/GLM speech multimodal versions, and MiniCPM-O/Omni-like models from various communities are comparable; LongCat's long context + millisecond-level speech is the differentiating factor.
Application side: The official website provides iOS/Android App and Web experience site to facilitate verification of voice link performance.

VI. Limitations and Precautions

True low latency relies on end-to-end voice links and high-bandwidth inference services, which cannot be fully reproduced on local or low-spec machines.
Video/long audio input will significantly increase video memory and computing power, so it is necessary to trim or segment according to the scenario.
While early multimodal fusion can improve consistency, it is sensitive to data format and annotation quality. Secondary training must strictly align with the official examples.
Open source repositories are updated frequently, and deployment scripts, quantization methods, and model sharding should be based on the latest versions.

VII. Project Address

https://github.com/meituan-longcat/LongCat-Flash-Omni

VIII. Frequently Asked Questions

Q: Does LongCat-Flash-Omni require an internet connection to perform inference?

A: The weights are open source and can be deployed locally or privately, but for speech synthesis and large-scale multimodal inference, it is recommended to use a GPU cluster to achieve the real-time performance shown in the official documentation.

Q: In what scenarios is the 128K context primarily used?

A: Suitable for long meetings, segmented understanding of long videos, and maintaining the state of multi-turn voice dialogues. It can also be used as a long document input window for multimodal RAGs.

Q: If only voice input and output are needed, is it necessary to load the full 560B?

A: The official architecture is ScMoE, with an actual activation of approximately 27 bytes. It can be combined with quantization/pruning and single-task fine-tuning to reduce resource consumption; please refer to the repository deployment instructions for details.

Related Articles

24-Hour AI News: Merger Rumors Align with Asia-Pacific Initiatives, Domestic Compliance and Application Advancement Go Hand in Hand

24-Hour AI News: New Asia-Pacific Governance Proposals, Industrial Implementation, and Security Compliance Advance Simultaneously

Is Mem0 worth integrating with an agent? Long-term memory is useful, but you need to manage boundaries

What kind of team is Haystack suitable for? It is more like a composable RAG engineering framework

Recommended Tools