The error message may look like this: unhandled cuda error, NCCL version 2.4.8

Set the following environment variables to see the NCCL error log:

export NCCL_SOCKET_IFNAME=enp6s0

export NCCL_IB_DISABLE=1

export NCCL_DEBUG=INFO

Note: in export NCCL_SOCKET_IFNAME=enp6s0 above, enp6s0 is the name of my local network interface. Replace it with your own, which you can find with ifconfig.
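If ifconfig is not installed, the interface names can also be read from sysfs. The following is a minimal sketch, assuming a Linux machine where every interface appears as an entry under /sys/class/net. The helper name list_ifaces is made up for illustration.

```shell
# Hypothetical helper: print candidate values for NCCL_SOCKET_IFNAME.
# Assumes Linux, where each interface appears as an entry in /sys/class/net.
# A different directory can be passed as the first argument.
list_ifaces() {
    ifaces_dir="${1:-/sys/class/net}"
    for iface_path in "$ifaces_dir"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$iface_path" ] || continue
        iface_name="${iface_path##*/}"
        # The loopback interface is not useful for NCCL sockets, so skip it.
        if [ "$iface_name" != "lo" ]; then
            printf '%s\n' "$iface_name"
        fi
    done
}
```

Calling list_ifaces with no argument prints one interface name per line. Pick the one that carries the IP address you expect NCCL to use.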

When the CUDA versions do not match, the log contains output like the following:

znsoft-virtual-machine:102553:102553 [0] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.113<0>
NCCL version 2.4.8+cuda10.2
znsoft-virtual-machine:102620:102620 [1] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
znsoft-virtual-machine:102620:102695 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Channel 00 :    0   1
znsoft-virtual-machine:102620:102695 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled
znsoft-virtual-machine:102620:102695 [1] NCCL INFO comm 0x7f0438002580 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE
znsoft-virtual-machine:102553:102694 [0] NCCL INFO comm 0x7fbb600025a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE

znsoft-virtual-machine:102620:102620 [1] enqueue:197 NCCL WARN Cuda failure 'invalid device function'
znsoft-virtual-machine:102620:102620 [1] NCCL INFO misc/group:148 -> 1
znsoft-virtual-machine:102553:102553 [0] NCCL INFO Launch mode Parallel

znsoft-virtual-machine:102553:102553 [0] enqueue:197 NCCL WARN Cuda failure 'invalid device function'

Note the last line: enqueue:197 NCCL WARN Cuda failure 'invalid device function'

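This happens because the CUDA version PyTorch was compiled with differs from the CUDA version installed on the machine. The log confirms it: the NCCL bundled with PyTorch reports that it was built for CUDA 10.2 (NCCL version 2.4.8+cuda10.2).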
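With NCCL_DEBUG=INFO the warning is easy to miss among the INFO lines. A small filter such as the sketch below prints only the lines that contain NCCL WARN. The function name nccl_warnings is an illustrative choice, not part of NCCL.

```shell
# Hypothetical helper: print only the NCCL warning lines.
# Reads the files given as arguments, or standard input when none are given.
# Exits with status 1 when no warning line is found (standard grep behavior).
nccl_warnings() {
    grep -F 'NCCL WARN' "$@"
}
```

For example, pipe the combined output of your training command into it: your_training_command 2>&1 | nccl_warnings (your_training_command is a placeholder).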

Note that the NCCL package must be installed. I built it with the following commands:

git clone https://github.com/NVIDIA/nccl.git

cd nccl 

export NVCC_GENCODE=-gencode=arch=compute_80,code=compute_80

make CUDA_HOME=/usr/local/cuda  

make install
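The 80 in NVCC_GENCODE above stands for compute capability 8.0, which matches the "Min Comp Cap 8" line in the log. For a different GPU the flag has to change accordingly. The sketch below builds a flag in the same form from a capability string such as 8.0. The helper name gencode_flag is hypothetical, and how you obtain the capability of your GPU is left open.

```shell
# Hypothetical helper: turn a compute capability such as "8.0" into an
# NVCC_GENCODE value in the same form as the one used above.
gencode_flag() {
    # Drop the dot: "8.0" becomes "80".
    cc_digits=$(printf '%s' "$1" | tr -d '.')
    printf '%s\n' "-gencode=arch=compute_${cc_digits},code=compute_${cc_digits}"
}

# Usage, with 8.0 taken from this article's GPU:
# export NVCC_GENCODE="$(gencode_flag 8.0)"
```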

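Solution:

Install a PyTorch build whose CUDA version matches the CUDA version on the machine.

The version reported by nvidia-smi should match the CUDA version of the PyTorch build you install. Mine reports: CUDA Version: 11.7

So PyTorch should be installed with a close CUDA version, such as 11.6 or 11.7.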
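To check the two versions against each other, something like the following sketch can be used. It relies on two assumptions: that "close" means the same major version, which is only a rough reading of the advice above, and that the helper names extract_version and same_major are free to invent for illustration. torch.version.cuda is the CUDA version a PyTorch build was compiled with. Keep in mind that nvidia-smi shows the CUDA version supported by the installed driver.

```shell
# Hypothetical helper: print the first "major.minor" number found on stdin.
extract_version() {
    grep -Eo '[0-9]+\.[0-9]+' | head -n 1
}

# Hypothetical helper: succeed when both versions share the same major number.
# Treating "same major" as "close enough" is an assumption of this sketch.
same_major() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

# Usage (commented out because it needs PyTorch and an NVIDIA driver):
# torch_cuda=$(python -c 'import torch; print(torch.version.cuda)')
# driver_cuda=$(nvidia-smi | sed -n 's/.*CUDA Version: *//p' | extract_version)
# same_major "$torch_cuda" "$driver_cuda" || echo "CUDA mismatch: $torch_cuda vs $driver_cuda"
```

With the versions from this article, 10.2 against 11.7 would be flagged as a mismatch, while 11.6 against 11.7 would pass.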
