LLM training involves a large number of GPUs. How do we integrate, manage, and allocate GPU resources on OpenShift? If anyone has experience with this, please share it here so we can discuss.
Hello @Tengfei!
Could you please elaborate a little on this issue? What is this LLM training on OpenShift?
It is AI Large Language Model (LLM) training, which is currently one of the most active research areas in AI. It requires large amounts of GPU memory and GPU compute. So how can we manage these resources to meet the compute requirements of LLM training, and improve GPU efficiency both within and across AI compute servers?
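As a concrete starting point, here is a minimal sketch of how a single training pod can request dedicated GPUs on OpenShift. This assumes the NVIDIA GPU Operator (which deploys the device plugin) is installed on the cluster; the pod name, container image, and GPU count below are hypothetical placeholders, not a definitive setup.

```yaml
# Minimal sketch: a pod requesting 4 GPUs on one node.
# Assumes the NVIDIA GPU Operator is installed; image and names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: llm-train-worker
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: quay.io/example/llm-trainer:latest  # hypothetical training image
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 4  # the scheduler places the pod on a node with 4 free GPUs
```

For multi-node training, operators such as the Kubeflow Training Operator or the MPI Operator can schedule and coordinate groups of such pods, and GPU-sharing features like MIG partitioning or time-slicing can be configured through the GPU Operator to improve utilization.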