System Configuration:
Issue Details:
It appears there is a leak in the cluster's GPU quota management, where GPU resources are not being properly released and accounted for after pod termination. Any advice on how to diagnose and resolve this GPU quota leak would be greatly appreciated. Thank you in advance!
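For reference, one way to see whether the scheduler's GPU accounting really disagrees with what running pods request is to compare each node's allocatable GPUs against the sum of GPU requests from pods scheduled on it. A minimal sketch, assuming the Python kubernetes client, a valid kubeconfig/oc login, and the standard nvidia.com/gpu extended resource name exposed by the NVIDIA device plugin:

```python
# Minimal diagnostic sketch (assumptions: Python "kubernetes" client installed,
# valid kubeconfig, GPUs exposed as the usual "nvidia.com/gpu" resource).
from kubernetes import client, config

GPU_RESOURCE = "nvidia.com/gpu"

config.load_kube_config()  # use config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    name = node.metadata.name
    allocatable = int(node.status.allocatable.get(GPU_RESOURCE, "0"))
    if allocatable == 0:
        continue  # skip nodes without GPUs

    # Sum GPU requests from every non-terminated pod scheduled on this node;
    # this is what the scheduler counts against the node's GPU capacity.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={name},status.phase!=Succeeded,status.phase!=Failed"
    ).items
    requested = 0
    for pod in pods:
        for c in pod.spec.containers:
            reqs = (c.resources.requests or {}) if c.resources else {}
            requested += int(reqs.get(GPU_RESOURCE, "0"))

    print(f"{name}: allocatable={allocatable}, requested by pods={requested}")
```

If the requested total matches the allocatable count, the GPUs are genuinely held by pods (possibly ones still terminating) rather than leaked by the quota accounting.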
@aluciade Yes, I agree, it does look like a leak. What is the termination grace period set to for those pods? (One way to check is sketched below, after the link.) If you reboot the GPU node, does the GPU count go back to normal?
What about compatibility (GPU Operator / driver / OpenShift version)? See:
https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/troubleshooting-gpu-ocp.html
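To answer the grace-period question, a minimal sketch (same assumptions as above: Python kubernetes client, valid kubeconfig) that lists every pod requesting a GPU together with its phase and terminationGracePeriodSeconds:

```python
# Minimal sketch: list every pod that requests an nvidia.com/gpu together with
# its phase and termination grace period, to see how long the kubelet waits
# after a delete before the GPU can be reclaimed.
from kubernetes import client, config

GPU_RESOURCE = "nvidia.com/gpu"

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    wants_gpu = any(
        c.resources and c.resources.requests and GPU_RESOURCE in c.resources.requests
        for c in pod.spec.containers
    )
    if wants_gpu:
        print(
            f"{pod.metadata.namespace}/{pod.metadata.name}: "
            f"phase={pod.status.phase}, "
            f"grace={pod.spec.termination_grace_period_seconds}s"
        )
```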
Hi @Chetan_Tiwary_,
Thank you very much for your response. I just discovered that I was tracking the wrong problem and jumped to the wrong conclusion.
What actually happened is that the new deployment I'm trying to create is crashing (CrashLoopBackOff). However, even though it is crashing, one GPU is still allocated to it.
When I was trying to debug the crash, I received an error message stating that there were no GPUs available. This led me to the incorrect conclusion that the pod was crashing due to a lack of GPUs.
After I stopped the other deployment that was holding a GPU, one GPU became available again. When I tried to debug the crashed pod again, I noticed that OpenShift actually creates a second, temporary pod for debugging, which tries to allocate an additional GPU. So the error message simply meant there was no GPU free for this temporary pod, which is expected behavior.
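That matches how oc debug behaves: the temporary debug pod is a copy of the original pod spec, so it carries the same resources.requests, including the GPU. A minimal sketch to confirm this while a debug session is open (the namespace "demo" and pod name "myapp-debug" are hypothetical; the "-debug" suffix is the one oc debug normally appends):

```python
# Minimal sketch with hypothetical names ("demo" namespace, "myapp-debug" pod):
# inspect the temporary debug pod's containers to confirm it inherited the
# original container's resource requests, including nvidia.com/gpu.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

debug_pod = v1.read_namespaced_pod(name="myapp-debug", namespace="demo")
for c in debug_pod.spec.containers:
    reqs = (c.resources.requests or {}) if c.resources else {}
    print(f"{c.name}: requests={reqs}")
    # Expect something like {'nvidia.com/gpu': '1'} here, which is why the
    # debug pod needs a second free GPU while the original pod still holds one.
```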
My mistake came from not knowing that debugging creates a new pod.
Anyway, sorry for the confusion and for jumping to the wrong conclusion, and thank you once again for your attention.
OK @aluciade, glad that it is resolved and clear for you!