NCCL

NVIDIA Collective Communications Library https://developer.nvidia.com/nccl

Introduction

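NCCL provides efficient collective communication between multiple GPUs, for example ncclBroadcast, ncclReduce, ncclAllReduce, ncclReduceScatter, and ncclAllGather.
These functions are collective operations, unlike the point-to-point calls of traditional MPI (Message Passing Interface) such as MPI_Send / MPI_Recv: every GPU in the communicator must take part in the same operation.
Since version 2.7, NCCL also supports GPU point-to-point communication through ncclSend and ncclRecv; older releases support collectives only (broadcast, reduction, AllGather, and so on).
Point-to-point transfers can also be done with CUDA-aware MPI.
NCCL supports both single-node multi-GPU and multi-node multi-GPU setups.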
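
To make the semantics of these collectives concrete, here is a host-only sketch in plain C. It does not call NCCL at all: the ref_* function names are hypothetical and invented for this illustration, each "rank" is simply an array in host memory, and only the sum reduction is shown. It describes what the operations compute, not how NCCL implements them.

```c
#include <assert.h>
#include <stddef.h>

/* AllReduce (sum): bufs[r] holds the `count` floats contributed by rank r.
 * On return, every rank's buffer holds the element-wise sum over all ranks. */
void ref_all_reduce_sum(float *bufs[], int nranks, size_t count) {
    assert(nranks > 0);
    for (size_t i = 0; i < count; ++i) {
        float sum = 0.0f;
        for (int r = 0; r < nranks; ++r) sum += bufs[r][i];  /* reduce */
        for (int r = 0; r < nranks; ++r) bufs[r][i] = sum;   /* give all ranks the result */
    }
}

/* ReduceScatter (sum): send[r] has nranks * chunk floats, recv[r] has chunk.
 * Rank r receives only the r-th chunk of the element-wise sum. */
void ref_reduce_scatter_sum(float *send[], float *recv[], int nranks, size_t chunk) {
    assert(nranks > 0);
    for (int r = 0; r < nranks; ++r) {
        for (size_t i = 0; i < chunk; ++i) {
            float sum = 0.0f;
            for (int s = 0; s < nranks; ++s) sum += send[s][(size_t)r * chunk + i];
            recv[r][i] = sum;
        }
    }
}

/* AllGather: send[s] has chunk floats, recv[r] has nranks * chunk.
 * Every rank receives the chunks of all ranks, concatenated in rank order. */
void ref_all_gather(float *send[], float *recv[], int nranks, size_t chunk) {
    assert(nranks > 0);
    for (int r = 0; r < nranks; ++r)
        for (int s = 0; s < nranks; ++s)
            for (size_t i = 0; i < chunk; ++i)
                recv[r][(size_t)s * chunk + i] = send[s][i];
}
```

Running ref_reduce_scatter_sum and then ref_all_gather on its output leaves every rank with the same data that ref_all_reduce_sum produces, which is one way to see how the collectives relate to each other.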

Installation and Configuration

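2025/07: in my test, CUDA Toolkit 12 did not include NCCL (at least not the header file; I did not check for the .so files). NCCL is distributed separately from the CUDA Toolkit.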

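Build it yourself: https://github.com/NVIDIA/nccl
After cloning the repository, run: make -j src.build
This builds the required header (really just nccl.h) and the .so files under the build directory, ready to use.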

[zh-ge@gekko build]$ pwd
/home/zh-ge/git/nccl/build
[zh-ge@gekko build]$ ls -lh
total 16K
drwxr-xr-x 2 zh-ge ppl 4.0K Jul 17 15:04 bin
drwxr-xr-x 2 zh-ge ppl 4.0K Jul 17 15:04 include
drwxr-xr-x 3 zh-ge ppl 4.0K Jul 17 15:08 lib
drwxr-xr-x 9 zh-ge ppl 4.0K Jul 17 15:04 obj

Basic Usage

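// Generated by AI

#include <cuda_runtime.h>
#include <nccl.h>

int nranks = 4;   // total number of ranks, one per GPU
int myrank = 0;   // rank of this process; each process must use a different value
cudaSetDevice(myrank);  // assumes rank i drives GPU i on a single node

// The unique ID must be created exactly once (here on rank 0) and copied to
// every other rank before ncclCommInitRank is called. NCCL does not do that
// copy itself; it is left to the application (MPI, sockets, a shared file, ...).
ncclUniqueId id;
if (myrank == 0) ncclGetUniqueId(&id);
ncclComm_t comm;
ncclCommInitRank(&comm, nranks, id, myrank);

float* sendbuff;
float* recvbuff;
size_t count = 1024;
cudaMalloc((void**)&sendbuff, count * sizeof(float));
cudaMalloc((void**)&recvbuff, count * sizeof(float));

ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum,
              comm, 0);    // 0 selects the default CUDA stream
cudaStreamSynchronize(0);  // the call is asynchronous; wait before reading recvbuff

ncclCommDestroy(comm);
cudaFree(sendbuff);
cudaFree(recvbuff);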

Launching a Program

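MPI programs are launched with mpirun.
NCCL does not ship a launcher of its own. A single process can drive several GPUs through ncclCommInitAll, in which case the program is started like any other executable. With one process per GPU (ncclCommInitRank), something has to start the processes and pass the ncclUniqueId between them; mpirun is a common choice but not a requirement.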
