In the paper, this is the content of Section 3.2.2:
3.2.2. Efficient Implementation of Cross-Node All-to-All Communication
In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to specific GPUs that host their target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes × 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
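To make the channel split in the quoted passage a bit more concrete, here is a rough host-side sketch in Python. It only models the bookkeeping: 20 SMs grouped into 10 channels, and a warp budget divided among the three dispatch roles according to load. The paper does not say how the dynamic adjustment is done, so the proportional policy, the function names (`partition_sms`, `allocate_warps`), and the warp budget of 8 in the test values are all assumptions made for illustration, not the actual kernel logic.

```python
# Dispatch-side roles named in the quoted section.
DISPATCH_ROLES = ("ib_send", "ib_to_nvlink_forward", "nvlink_recv")


def partition_sms(num_sms=20, num_channels=10):
    """Group SM indices into equally sized communication channels."""
    if num_sms % num_channels != 0:
        raise ValueError("num_sms must be divisible by num_channels")
    per_channel = num_sms // num_channels
    return [list(range(c * per_channel, (c + 1) * per_channel))
            for c in range(num_channels)]


def allocate_warps(workload, total_warps):
    """Split total_warps across roles in proportion to workload.

    Assumed policy (not from the paper): every role keeps at least one
    warp, and the rest is shared proportionally, with leftover warps
    going to the largest fractional remainders.
    """
    roles = list(workload)
    if total_warps < len(roles):
        raise ValueError("need at least one warp per role")
    alloc = {role: 1 for role in roles}
    spare = total_warps - len(roles)
    total_load = sum(workload.values())
    if total_load == 0:
        # No load information: spread the spare warps evenly.
        shares = {role: spare / len(roles) for role in roles}
    else:
        shares = {role: spare * workload[role] / total_load for role in roles}
    for role in roles:
        alloc[role] += int(shares[role])
    leftover = total_warps - sum(alloc.values())
    by_remainder = sorted(roles,
                          key=lambda role: shares[role] - int(shares[role]),
                          reverse=True)
    for role in by_remainder[:leftover]:
        alloc[role] += 1
    return alloc
```

With the defaults, `partition_sms()` gives 10 channels of 2 SMs each. Calling `allocate_warps` again with fresh workload numbers is the stand-in here for "dynamically adjusted"; a real kernel would have to do this on the device.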
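Put simply, the goal is efficient cross-node all-to-all communication. The underlying problem is essentially the same as the two-machine direct-copy scenario described in 唐家山's journal post. Generally, GPUs within one machine talk over NVLink, while GPUs on different machines depend on the IB network. NVLink is about 3.2 times as fast as IB, so some optimization is needed to get a better transfer strategy. What the paper describes is a complete scheme for this.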
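The numbers and the two-hop path from the paper can be worked through in a few lines of Python. This is only a sketch under stated assumptions: the figure of 8 GPUs per node and the global rank layout (`rank = node * gpus_per_node + local_index`) are not given in the quoted section, and reading the "3.2 experts per node" as the NVLink/IB bandwidth ratio is an interpretation of the quoted figures. The function names are made up for illustration.

```python
NVLINK_GB_PER_S = 160      # from the quoted section
IB_GB_PER_S = 50           # from the quoted section
MAX_NODES_PER_TOKEN = 4    # from the quoted section
GPUS_PER_NODE = 8          # assumption for illustration only


def bandwidth_ratio():
    """NVLink bandwidth divided by IB bandwidth."""
    return NVLINK_GB_PER_S / IB_GB_PER_S


def max_routed_experts():
    """4 nodes x 3.2 experts/node, before rounding."""
    return MAX_NODES_PER_TOKEN * bandwidth_ratio()


def two_hop_route(src_rank, dst_rank, gpus_per_node=GPUS_PER_NODE):
    """Return the list of (link, from_rank, to_rank) hops for one token.

    Cross-node traffic first goes over IB to the GPU with the same
    in-node index on the target node, then over NVLink to the GPU that
    hosts the target expert.
    """
    src_node, src_local = divmod(src_rank, gpus_per_node)
    dst_node, _ = divmod(dst_rank, gpus_per_node)
    hops = []
    current = src_rank
    if dst_node != src_node:
        relay = dst_node * gpus_per_node + src_local  # same in-node index
        hops.append(("IB", current, relay))
        current = relay
    if current != dst_rank:
        hops.append(("NVLink", current, dst_rank))
    return hops
```

For example, under these assumptions a token on rank 3 (node 0, local index 3) headed for rank 21 (node 2, local index 5) crosses IB once to rank 19 and is then forwarded over NVLink to rank 21. Several experts on the same target node share that single IB transfer, which is why capping the number of nodes per token reduces IB traffic.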
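My understanding is that PTX is used here to customize thread execution more precisely and to reduce interference between the allocation and the transfer of communication chunks.

The aim is not to bypass CUDA. On the contrary, it is to make CUDA run more efficiently.

As an analogy: suppose you found that, when copying certain memory blocks between machines, the NIC driver ends up serialized with the application's threads and efficiency drops. You would then go around the interface the operating system defines for the NIC driver and optimize directly with the instruction set the NIC itself supports.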