DeepSeek Engram Explained: O(1) Lookup-Based Conditional Memory Adds a "New Sparsity Axis" to Large Models


1. Abstract

Engram is an open-source "conditional memory" module from DeepSeek. The core idea is to add a scalable table-lookup memory primitive to the Transformer: relatively static patterns and knowledge are stored in an N-gram memory table, retrieved in approximately O(1) time during inference, and fused with the current hidden state. According to the official repository, under iso-parameter and iso-FLOPs constraints, Engram-27B shows consistent gains over the MoE baseline on knowledge, reasoning, code, and math tasks. The mechanistic analysis suggests that Engram relieves the early layers of the burden of reconstructing static patterns, which preserves effective depth for more complex reasoning.

2. Core features

1. O(1) table-lookup conditional memory

Static N-gram memory is addressed and retrieved deterministically, so part of the "knowledge lookup" is separated from dense neural computation and takes up less of the compute path.
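
To make the idea of deterministic addressing concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: the polynomial hash, the table size, the N-gram order, and the names `ngram_slot` and `lookup` are all assumptions chosen for illustration. It only shows that the slot depends on the token IDs alone, and that the cost per position does not grow with table size or sequence length.

```python
import random

N = 2                  # N-gram order (assumed for illustration)
TABLE_SIZE = 1 << 12   # number of memory slots (assumed)
DIM = 8                # width of each memory vector (assumed)

random.seed(0)
# Stand-in for a trained memory table: one DIM-sized vector per slot.
memory_table = [[random.gauss(0.0, 1.0) for _ in range(DIM)]
                for _ in range(TABLE_SIZE)]

def ngram_slot(token_ids, pos, n=N, table_size=TABLE_SIZE):
    """Map the (up to) n tokens ending at `pos` to a slot with a fixed hash."""
    start = max(0, pos - n + 1)
    h = 0
    for t in token_ids[start:pos + 1]:
        h = (h * 1_000_003 + t + 1) % table_size
    return h

def lookup(token_ids):
    """Return one memory vector per position via a single index operation each."""
    return [memory_table[ngram_slot(token_ids, i)] for i in range(len(token_ids))]

tokens = [17, 42, 99, 17, 42]
vectors = lookup(tokens)
# Positions 1 and 4 both end the bigram (17, 42), so they address the same slot.
print(ngram_slot(tokens, 1) == ngram_slot(tokens, 4))
```

Because no learned router or similarity search is involved, the same N-gram always resolves to the same slot, which is what makes the lookup predictable.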

2. A "new sparsity axis" complementary to MoE

MoE scales capacity through conditional computation, while Engram scales capacity through conditional memory: one computes, the other looks up. Combined, they can allocate model capacity more effectively under the same FLOPs.
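
The contrast between the two kinds of sparsity can be sketched side by side. The toy code below is an illustration under assumptions, not DeepSeek's code: `moe_forward` picks experts from the hidden state and then runs them, while `memory_forward` picks a table row from token IDs and simply returns it. The expert form, the router, and the hash are hypothetical.

```python
import random

random.seed(0)
DIM, NUM_EXPERTS, TOP_K, TABLE_SIZE = 4, 8, 2, 1024

# Conditional computation: each toy "expert" is a per-dimension scaling.
expert_scales = [[random.uniform(0.5, 1.5) for _ in range(DIM)]
                 for _ in range(NUM_EXPERTS)]
router = [[random.gauss(0.0, 1.0) for _ in range(DIM)]
          for _ in range(NUM_EXPERTS)]

def moe_forward(h):
    """Score experts from the hidden state, then run only the top-k of them."""
    scores = [sum(w * x for w, x in zip(row, h)) for row in router]
    chosen = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    out = [0.0] * DIM
    for e in chosen:
        for d in range(DIM):
            out[d] += expert_scales[e][d] * h[d] / TOP_K
    return out, chosen

# Conditional memory: the row is selected from token IDs and returned as stored.
table = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def memory_forward(prev_token, token):
    """Select one stored vector with a fixed hash of the last two token IDs."""
    slot = (prev_token * 1_000_003 + token) % TABLE_SIZE
    return table[slot], slot

out, chosen = moe_forward([1.0, 0.5, -0.5, 2.0])
vec, slot = memory_forward(3, 5)
print("experts run:", chosen, "| memory slot read:", slot)
```

In both cases only a small part of the total parameters is touched per token. The difference is that the MoE path spends FLOPs on the selected experts, while the memory path spends a read.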

3. A U-shaped scaling law for capacity allocation

The official repository formulates the trade-off between computational capacity (MoE) and static memory capacity (Engram) and reports a U-shaped scaling law that can guide this engineering trade-off.

4. A mechanistic explanation that matches engineering intuition

The repository explicitly notes that Engram may relieve early layers from repeatedly reconstructing static patterns, leaving depth and representational capacity for the more critical reasoning that follows. This can be read as "effectively deepening the network for reasoning".

5. System efficiency and deployability

Because addressing is deterministic, very large embedding tables can be offloaded to host memory while the added inference overhead is kept as small as possible.
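
One way to see why deterministic addressing helps with offloading: the slots a batch needs can be computed from the token IDs before any layer runs, so only those rows have to be staged from host memory. The sketch below illustrates that reasoning under assumptions. The names `ngram_slots`, `prefetch`, and `gather` are hypothetical, and a dictionary stands in for a host-to-device copy.

```python
import random

random.seed(0)
DIM, TABLE_SIZE = 4, 1 << 14
# Stand-in for a large table kept in host (CPU) memory.
host_table = [[random.random() for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def ngram_slots(token_ids, n=2, table_size=TABLE_SIZE):
    """Compute every slot for a sequence from token IDs alone (no model state)."""
    slots = []
    for pos in range(len(token_ids)):
        h = 0
        for t in token_ids[max(0, pos - n + 1):pos + 1]:
            h = (h * 1_000_003 + t + 1) % table_size
        slots.append(h)
    return slots

def prefetch(token_ids):
    """Stage only the unique rows this sequence needs (mock host-to-device copy)."""
    slots = ngram_slots(token_ids)
    staged = {s: host_table[s] for s in set(slots)}
    return slots, staged

def gather(slots, staged):
    """Read the staged rows back in sequence order."""
    return [staged[s] for s in slots]

tokens = [5, 6, 5, 6, 5, 6]
slots, staged = prefetch(tokens)
rows = gather(slots, staged)
print(f"{len(slots)} lookups served by {len(staged)} staged rows")
```

In a real system the staging step would overlap with computation of earlier layers. That scheduling is outside the scope of this sketch.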

3. Installation

1. Prepare the environment

Python 3.8+ is recommended, ideally in an isolated environment (venv/conda).

2. Install dependencies

Following the repository's Quick Start, install dependencies such as torch, numpy, transformers, and sympy.
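
Before running anything, it can help to confirm that these packages are visible to the current interpreter. The following is a small convenience sketch, not part of the repository. The helper name `check_dependencies` is hypothetical, and it relies only on the standard-library `importlib.util.find_spec`.

```python
import importlib.util

# Package names taken from the dependency list above.
REQUIRED = ["torch", "numpy", "transformers", "sympy"]

def check_dependencies(names=REQUIRED):
    """Return {package: bool} indicating whether each package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_dependencies()
for name, ok in status.items():
    print(f"{name:14s} {'ok' if ok else 'MISSING'}")
```

Any package reported as MISSING can then be installed into the active environment with pip.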

3. Run the demo

The repository provides engram_demo_v1.py to demonstrate Engram's core data flow. This version mocks some standard components (e.g. Attention and MoE) to highlight how the Engram module works.
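
For readers who want a mental model before opening the script, the data flow described above can be approximated as follows. This is a rough sketch under stated assumptions, not a copy of engram_demo_v1.py: the sigmoid gate, the position of the fusion step inside the block, and the names `engram_lookup`, `fuse`, and `block` are all hypothetical.

```python
import math
import random

random.seed(0)
DIM, TABLE_SIZE = 4, 256
table = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def mock_attention(hidden):
    """Placeholder for attention: returns its input unchanged."""
    return hidden

def mock_moe(hidden):
    """Placeholder for the MoE feed-forward: returns its input unchanged."""
    return hidden

def engram_lookup(token_ids):
    """Fetch one memory vector per position from a hash of the last two tokens."""
    out = []
    for i, tok in enumerate(token_ids):
        prev = token_ids[i - 1] if i > 0 else -1
        out.append(table[((prev + 1) * 1_000_003 + tok) % TABLE_SIZE])
    return out

def fuse(hidden, memory):
    """Assumed fusion rule: add the memory vector scaled by a sigmoid gate."""
    fused = []
    for h, m in zip(hidden, memory):
        gate = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(h, m))))
        fused.append([a + gate * b for a, b in zip(h, m)])
    return fused

def block(token_ids, hidden):
    """One toy layer: memory fusion, then mocked attention and MoE."""
    hidden = fuse(hidden, engram_lookup(token_ids))
    hidden = mock_attention(hidden)
    return mock_moe(hidden)

token_ids = [1, 2, 3]
hidden = [[0.0] * DIM for _ in token_ids]
output = block(token_ids, hidden)
print(len(output), "positions,", len(output[0]), "dims each")
```

The gate lets the hidden state decide how much of the retrieved vector to admit, which is one common way to fuse a lookup result with a residual stream. The actual fusion used by Engram should be taken from the paper and the demo.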

4. Typical use cases

1. Knowledge-intensive Q&A and factual recall

When a task relies mainly on stable knowledge or fixed expressions, lookup memory can reduce the repetitive pattern reconstruction the model performs in its first few layers.

2. Reuse of stable fragments in long contexts

Recurring short fragments (fixed phrases, code templates, common formats) hit static memory, which reduces wasted computation in long contexts.

3. Templated structures in code and math

In tasks with many common derivation routines or code skeletons, the memory path takes on more of the static structure, while the compute path focuses on composition and reasoning.

4. Cost-effective scaling combined with MoE

When total parameters and total FLOPs are constrained, placing part of the capacity in a static memory table can yield a higher effective capacity density.

5. Ecosystem and alternatives

1. Ecosystem status

The official repository currently consists mainly of the paper, architecture diagrams, experimental figures, and a demo. It is well suited to quickly understanding the new "conditional memory" component and evaluating how it could be combined with an existing MoE stack.

2. Alternatives and adjacent directions

Neighboring ideas include RAG (retrieval-augmented generation from external sources), kNN-LM and nearest-neighbor retrieval, traditional N-gram models and caching, and various sparse attention and sparse routing architectures. Engram differs in that it uses a trainable static memory table as a primitive inside the model and emphasizes division of labor and joint scaling with MoE. Actual gains still need to be verified against the specific data distribution, training recipe, and deployment constraints.

6. Limitations and precautions

1. Paper details and reproduction standards

The repository provides the key conclusions and a demo, but for large-scale training details, the addressing implementation, and the complete ablations, the paper remains the authoritative source.

2. Memory and deployment trade-offs

Offloading huge memory tables to host memory relieves GPU memory pressure, but it introduces new constraints on bandwidth and latency and adds engineering complexity.
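
A back-of-the-envelope calculation helps frame this trade-off. The helper names and every number below (slot count, vector width, bytes per value, tokens per step) are hypothetical placeholders and are not figures from the paper or repository. Substitute values from your own configuration.

```python
def table_bytes(num_slots, dim, bytes_per_value=2):
    """Total size of a memory table stored at `bytes_per_value` per element."""
    return num_slots * dim * bytes_per_value

def bytes_fetched_per_step(tokens_per_step, dim, lookups_per_token=1,
                           bytes_per_value=2):
    """Upper bound on bytes moved from host memory in one step (no deduplication)."""
    return tokens_per_step * lookups_per_token * dim * bytes_per_value

# Hypothetical configuration, for illustration only.
size = table_bytes(num_slots=100_000_000, dim=1024)
traffic = bytes_fetched_per_step(tokens_per_step=4096, dim=1024)

print(f"table size:      {size / 2**30:.1f} GiB")
print(f"fetch per step:  {traffic / 2**20:.1f} MiB")
```

The table itself may be far too large for accelerator memory even when the per-step traffic is modest, which is why the bottleneck tends to shift from capacity to host-to-device bandwidth and latency.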

3. Applicability depends on the type of task

If a task's main bottleneck is dynamic reasoning or compositional generalization rather than reuse of static patterns and knowledge, the gains may be less pronounced than on knowledge-intensive tasks.

4. Integration cost with existing training systems

Connecting the new module to an existing MoE/attention implementation and parallelism strategy requires evaluating training stability, throughput, and monitoring metrics (such as hit rate and table capacity utilization).
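
As a starting point for such monitoring, the sketch below computes two table-level statistics from a mapping of distinct N-grams to slots. The metric definitions are assumptions made for illustration, since the source does not define them: "utilization" is the fraction of slots addressed at least once, and "collision_rate" is the fraction of distinct N-grams that share a slot with another N-gram. The function name `table_metrics` is hypothetical.

```python
from collections import defaultdict

def table_metrics(ngram_to_slot, table_size):
    """Summarize how a set of distinct N-grams is spread over the table."""
    by_slot = defaultdict(set)
    for ngram, slot in ngram_to_slot.items():
        by_slot[slot].add(ngram)
    used_slots = len(by_slot)
    # Count N-grams that land in a slot holding more than one distinct N-gram.
    collided = sum(len(group) for group in by_slot.values() if len(group) > 1)
    return {
        "utilization": used_slots / table_size,
        "collision_rate": collided / max(1, len(ngram_to_slot)),
    }

# Toy mapping: two bigrams collide in slot 0, one sits alone in slot 1.
observed = {(1, 2): 0, (3, 4): 0, (5, 6): 1}
metrics = table_metrics(observed, table_size=4)
print(metrics)
```

Tracking these values over training can indicate whether the table is oversized (low utilization) or undersized (high collision rate) for the observed data.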

7. Project address

https://github.com/deepseek-ai/Engram

8. Frequently asked questions

Q: What are Engram's core keywords, and what problem does it solve?

A: The keywords are Conditional Memory, Scalable Lookup, O(1) lookup memory, and N-gram memory. Engram tries to give the Transformer a native knowledge-lookup capability so that some static patterns and knowledge are separated from dense computation.

Q: What is the relationship between Engram and MoE?

A: MoE scales capacity through conditional computation, and Engram scales capacity through conditional memory. The two complement each other, dividing the work between computing (computation) and looking up (memory).

Q: What does the official mechanistic analysis mean by "effectively deeper"?

A: The repository's view is that Engram reduces the burden of reconstructing static patterns in the early layers, so network depth can be devoted to the complex reasoning that follows. In effect, depth is reserved for the parts that matter most.

Q: How can I quickly verify how Engram works?

A: Run the engram_demo_v1.py script provided in the repository to understand the data flow and where fusion happens. The demo mocks common components to highlight Engram.

Q: Is Engram suitable as an alternative to RAG?

A: It is better viewed as complementary. RAG retrieves from external documents that can be updated, while Engram is a static memory primitive inside the model that divides labor between computation and memory. Whether one can substitute for the other depends on whether the task needs externally updatable knowledge and a controllable retrieval pipeline.
