Raising the locked-memory limit for k8s containers
Reference: https://access.redhat.com/solutions/1257953
Problem
Running mpirun in the nccl-tests container docker.io/library/nccl-tests:24.12 with the buffer size set to NCCL_BUFFSIZE=503316480 fails with an out-of-memory error (a reproduction sketch follows the log below):
pod-1:78:91 [0] include/alloc.h:114 NCCL WARN Cuda failure 'out of memory'
pod-1:78:91 [0] include/alloc.h:119 NCCL WARN Failed to CUDA host alloc -268435456 bytes
pod-1:78:91 [0] NCCL INFO transport/net.cc:517 -> 1
pod-1:78:91 [0] NCCL INFO transport/net.cc:719 -> 1
pod-1:78:93 [0] NCCL INFO transport.cc:193 -> 1
pod-1:78:93 [0] NCCL INFO group.cc:133 -> 1
pod-1:78:93 [0] NCCL INFO group.cc:75 -> 1 [Async thread]
pod-1:78:91 [0] proxy.cc:1620 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection
pod-1:78:78 [0] NCCL INFO group.cc:426 -> 1
pod-1:78:78 [0] NCCL INFO group.cc:566 -> 1
pod-1:78:78 [0] NCCL INFO group.cc:106 -> 1
pod-1: Test NCCL failure sendrecv.cu:57 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
.. pod-1 pid 78: Test failure common.cu:383
.. pod-1 pid 78: Test failure common.cu:592
.. pod-1 pid 78: Test failure sendrecv.cu:103
.. pod-1 pid 78: Test failure common.cu:625
.. pod-1 pid 78: Test failure common.cu:1123
.. pod-1 pid 78: Test failure common.cu:893
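
For context, the failing launch looked roughly like the sketch below. Only NCCL_BUFFSIZE=503316480 and the sendrecv test come from the report above; the binary path, process count, and size flags are illustrative assumptions.

# hypothetical reproduction of the failing run (paths and flags are assumptions)
mpirun -np 2 --allow-run-as-root \
  -x NCCL_DEBUG=INFO \
  -x NCCL_BUFFSIZE=503316480 \
  ./build/sendrecv_perf -b 128M -e 512M -f 2 -g 1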
Confirming the problem
Running ulimit -a inside the container shows that max locked memory is only 64 kB.
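
CUDA pinned host allocations (the "CUDA host alloc" in the log above) count against this locked-memory limit, so a 64 kB cap cannot hold NCCL buffers of roughly 480 MB. The limit can also be checked from outside the container; the pod name below is a placeholder.

# check the effective locked-memory limit inside the pod (pod name is a placeholder)
kubectl exec pod-1 -- sh -c 'ulimit -l'
# prints 64 (kbytes) before the fix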

Raising the container's max locked memory limit
Add LimitMEMLOCK=infinity to /etc/systemd/system/docker.service.
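
LimitMEMLOCK belongs in the [Service] section of the unit file; a minimal excerpt is shown below, with everything else left as shipped. A systemd drop-in created with systemctl edit docker works just as well as editing the unit file directly.

# /etc/systemd/system/docker.service (excerpt)
[Service]
# ... existing Docker service settings remain unchanged ...
LimitMEMLOCK=infinity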

Then reload systemd and restart Docker:
systemctl daemon-reload
systemctl restart docker
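
Existing pods keep the old limit, so recreate them after the restart. A quick sanity check using the same image (whether the image ships a shell is an assumption):

# verify the new default limit inherited from dockerd
docker run --rm docker.io/library/nccl-tests:24.12 sh -c 'ulimit -l'
# expected output after the fix: unlimited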
Summary
The above reflects personal experience and is offered as a reference.