DeepSeek's 38-page V4 technical report walks through their MoE routing, expert specialization patterns, and a compute breakdown that puts pretraining at 3.8M H800-equivalent hours.

Most interesting: the routing analysis suggests certain experts effectively specialize as language-pair translators, supporting the 'innate multilingual structure' hypothesis.

Weights are not released; only the paper.