Frank (@franklucky001) posted in "How are activated parameters calculated in MoE-architecture LLMs?":

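As the title says: DeepSeek-V2-Lite has 2 shared experts, 64 total routed experts, and 6 active routed experts, so the activated share of the MLP expert parameters is (2 + 6) / (2 + 64) ≈ 1/8. But the embedding and attention layers are far larger than the MLP expert layers, so why is the model spec 16B/A2.4B? How is the 2.4B activated-parameter figure calculated?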
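One way to sanity-check the figure is to count parameters component by component. The sketch below is a rough estimate, not an official accounting. Only the three expert counts come from the post. Every other dimension (hidden size, layer count, intermediate sizes, vocabulary size, attention head dimensions) is an assumption recalled from the publicly released model config and should be verified against the actual `config.json`. The names `MoEConfig` and `count_params` are hypothetical, and norms and biases are ignored.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Expert counts taken from the post.
    n_shared_experts: int = 2
    n_routed_experts: int = 64
    n_active_routed: int = 6
    # ASSUMED values (not in the post); verify against the model's config.json.
    hidden_size: int = 2048
    n_layers: int = 27
    n_dense_layers: int = 1          # leading layers that use a plain dense MLP
    dense_intermediate: int = 10944
    expert_intermediate: int = 1408
    vocab_size: int = 102400
    n_heads: int = 16
    qk_nope_dim: int = 128
    qk_rope_dim: int = 64
    v_dim: int = 128
    kv_lora_rank: int = 512


def count_params(cfg: MoEConfig) -> dict:
    """Estimate total and per-token activated parameters (norms/biases ignored)."""
    h = cfg.hidden_size
    # Assumed attention layout: full-rank query, low-rank compressed key/value.
    attn = (
        h * cfg.n_heads * (cfg.qk_nope_dim + cfg.qk_rope_dim)                # query projection
        + h * (cfg.kv_lora_rank + cfg.qk_rope_dim)                           # KV down-projection
        + cfg.kv_lora_rank * cfg.n_heads * (cfg.qk_nope_dim + cfg.v_dim)     # KV up-projection
        + cfg.n_heads * cfg.v_dim * h                                        # output projection
    )
    expert = 3 * h * cfg.expert_intermediate      # gate, up, down matrices
    dense_mlp = 3 * h * cfg.dense_intermediate
    router = h * cfg.n_routed_experts
    n_moe = cfg.n_layers - cfg.n_dense_layers
    embed = cfg.vocab_size * h                    # one table; input and output assumed untied

    # Parameters every token touches, regardless of routing.
    always_on = (
        cfg.n_layers * attn
        + cfg.n_dense_layers * dense_mlp
        + n_moe * (cfg.n_shared_experts * expert + router)
        + 2 * embed
    )
    routed_total = n_moe * cfg.n_routed_experts * expert
    routed_active = n_moe * cfg.n_active_routed * expert
    return {
        "total": always_on + routed_total,
        "active": always_on + routed_active,
        "active_excl_input_embedding": always_on + routed_active - embed,
        "attention": cfg.n_layers * attn,
        "routed_total": routed_total,
    }


counts = count_params(MoEConfig())
for name, value in counts.items():
    print(f"{name}: {value / 1e9:.2f}B")
```

Under these assumed dimensions the estimate comes to roughly 15.7B total and about 2.45B to 2.66B activated, depending on whether the input embedding table is counted. That is in the neighborhood of the quoted 16B/A2.4B, although the exact counting convention behind the official figure is not stated in the post. The same numbers also suggest the premise is reversed: attention (about 0.37B) and the embeddings (about 0.42B) are small next to the routed experts (about 14.4B), so the 1/8 ratio applies to the bulk of the model rather than to a minor part of it.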