System Configuration:
Issue Details:
It appears there is a leak in the cluster's GPU quota management, where GPU resources are not being properly released and accounted for after pod termination. Any advice on how to diagnose and resolve this GPU quota leak would be greatly appreciated. Thank you in advance!
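For reference, one way to see whether the scheduler's GPU accounting really disagrees with what running pods request is to compare each node's allocatable GPUs against the sum of GPU requests from pods scheduled on it. A minimal sketch, assuming the Python kubernetes client, a valid kubeconfig/oc login, and the standard nvidia.com/gpu extended resource name exposed by the NVIDIA device plugin:

```python
# Minimal diagnostic sketch (assumptions: Python "kubernetes" client installed,
# valid kubeconfig, GPUs exposed as the usual "nvidia.com/gpu" resource).
from kubernetes import client, config

GPU_RESOURCE = "nvidia.com/gpu"

config.load_kube_config()  # use config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    name = node.metadata.name
    allocatable = int(node.status.allocatable.get(GPU_RESOURCE, "0"))
    if allocatable == 0:
        continue  # skip nodes without GPUs

    # Sum GPU requests from every non-terminated pod scheduled on this node;
    # this is what the scheduler counts against the node's GPU capacity.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={name},status.phase!=Succeeded,status.phase!=Failed"
    ).items
    requested = 0
    for pod in pods:
        for c in pod.spec.containers:
            reqs = (c.resources.requests or {}) if c.resources else {}
            requested += int(reqs.get(GPU_RESOURCE, "0"))

    print(f"{name}: allocatable={allocatable}, requested by pods={requested}")
```

If the requested total matches the allocatable count, the GPUs are genuinely held by pods (possibly ones still terminating) rather than leaked by the quota accounting.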
@aluciade Yes, I agree, it does look like a leak. What is the termination grace period set to for those pods? (One way to check is sketched below, after the link.) If you reboot the GPU node, does the GPU count go back to normal?
What about compatibility (GPU Operator / driver / OpenShift version)? See:
https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/troubleshooting-gpu-ocp.html
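To answer the grace-period question, a minimal sketch (same assumptions as above: Python kubernetes client, valid kubeconfig) that lists every pod requesting a GPU together with its phase and terminationGracePeriodSeconds:

```python
# Minimal sketch: list every pod that requests an nvidia.com/gpu together with
# its phase and termination grace period, to see how long the kubelet waits
# after a delete before the GPU can be reclaimed.
from kubernetes import client, config

GPU_RESOURCE = "nvidia.com/gpu"

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    wants_gpu = any(
        c.resources and c.resources.requests and GPU_RESOURCE in c.resources.requests
        for c in pod.spec.containers
    )
    if wants_gpu:
        print(
            f"{pod.metadata.namespace}/{pod.metadata.name}: "
            f"phase={pod.status.phase}, "
            f"grace={pod.spec.termination_grace_period_seconds}s"
        )
```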
Hi @Chetan_Tiwary_,
Thank you very much for your response. I just discovered that I was tracking the wrong problem and jumped to the wrong conclusion.
What actually happened is that the new deployment I'm trying to create is crashing (CrashLoopBackOff). However, even though it is crashing, one GPU is still allocated to it.
When I was trying to debug the crash, I received an error message stating that there were no GPUs available. This led me to the incorrect conclusion that the pod was crashing due to a lack of GPUs.
After I stopped the other deployment that was holding a GPU, one GPU became available again. When I tried to debug the crashed pod again, I noticed that OpenShift actually creates a second, temporary pod for debugging, which tries to allocate an additional GPU. So the error message simply meant there was no GPU free for this temporary pod, which is expected behavior.
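That matches how oc debug behaves: the temporary debug pod is a copy of the original pod spec, so it carries the same resources.requests, including the GPU. A minimal sketch to confirm this while a debug session is open (the namespace "demo" and pod name "myapp-debug" are hypothetical; the "-debug" suffix is the one oc debug normally appends):

```python
# Minimal sketch with hypothetical names ("demo" namespace, "myapp-debug" pod):
# inspect the temporary debug pod's containers to confirm it inherited the
# original container's resource requests, including nvidia.com/gpu.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

debug_pod = v1.read_namespaced_pod(name="myapp-debug", namespace="demo")
for c in debug_pod.spec.containers:
    reqs = (c.resources.requests or {}) if c.resources else {}
    print(f"{c.name}: requests={reqs}")
    # Expect something like {'nvidia.com/gpu': '1'} here, which is why the
    # debug pod needs a second free GPU while the original pod still holds one.
```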
My mistake came from not knowing that debugging creates a new pod.
Anyway, sorry for the confusion and for jumping to the wrong conclusion, and thank you once again for your attention.
OK @aluciade, glad that it is resolved and clear for you!