By Chetan_Tiwary_ (Community Manager)

Red Hat AI Inference Server: Your LLM, Your Cloud

Red Hat AI Inference Server helps you run LLMs efficiently, no matter where you deploy or what hardware you use.

You can use it with:

  • Red Hat OpenShift AI
  • RHEL AI
  • Standalone container

The goal? Speed, flexibility, and easy deployment across all types of cloud and hardware setups.

What Makes It Special?


1. It’s Built on vLLM
At its heart is vLLM, a high-throughput LLM inference and serving engine built for GPUs. It uses techniques like:

  • PagedAttention: Manages the KV cache in small blocks, cutting memory waste and fragmentation.
  • Continuous Batching: Adds new requests to in-flight batches to keep GPU utilization high.
  • OpenAI-compatible API: Makes it simple to connect existing apps (see the example below).
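
To make that concrete, here is a minimal sketch of calling the server's OpenAI-compatible endpoint with the openai Python client. It assumes a server is already listening on http://localhost:8000 and serving the model named below (the port and model id are taken from the Podman example later in this post); the prompt is just a placeholder.

from openai import OpenAI

# Assumes a Red Hat AI Inference Server / vLLM instance is already listening on
# localhost:8000 (the port published in the Podman example later in this post).
client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # no key required unless you configured one
)

response = client.chat.completions.create(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Summarize what an inference server does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)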

2. Smarter Use of GPUs

  • Tensor Parallelism: Splits each layer’s weights across multiple GPUs.
  • Pipeline Parallelism: Splits the model into stages that run on different GPUs or nodes.
  • Expert Parallelism: Efficient support for “mixture of experts” (MoE) models.
  • Data Parallelism: Runs model replicas to serve more requests in parallel; for MoE models such as DeepSeek V3 it can be combined with expert parallelism to avoid wasting GPU memory. (A configuration sketch follows this list.)
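
As a rough sketch of how these knobs are exposed, the snippet below uses vLLM's offline Python API; tensor_parallel_size and pipeline_parallel_size are engine arguments, while the model name and GPU counts are illustrative placeholders rather than a recommendation.

from vllm import LLM, SamplingParams

# Sketch only: the counts below assume an 8-GPU setup (4-way tensor parallelism
# x 2 pipeline stages) and are placeholders, not a recommendation.
llm = LLM(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    tensor_parallel_size=4,    # shard each layer's weights across 4 GPUs
    pipeline_parallel_size=2,  # split the layer stack into 2 stages (e.g. across nodes)
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)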

 

3. It Compresses Models
Red Hat includes a tool called LLM Compressor, built on Neural Magic’s work. It makes models smaller by:

  • Quantizing weights and activations (for example to FP8 or INT4)
  • Pruning redundant weights (sparsification)

That means models use less GPU memory and respond faster, while keeping accuracy close to the original.
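
Below is a hedged sketch of the one-shot quantization workflow from the upstream llmcompressor library. The import paths, the FP8_DYNAMIC scheme, and the model id follow the project's published examples as I understand them, but they change between releases, so treat this as an outline and check the llmcompressor documentation for your version.

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot                      # older releases: llmcompressor.transformers
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"          # placeholder source model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize weights and activations of all Linear layers to FP8, leaving the
# output head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(save_dir)                        # the result is ready to serve with vLLM
tokenizer.save_pretrained(save_dir)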

4. Speculative Decoding
A smaller draft model proposes the next few tokens, and the large model verifies them in a single pass: accepted tokens are kept, rejected ones are recomputed by the large model. The result? Less waiting per token.
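
The sketch below shows roughly how a draft model is attached in vLLM's Python API. Speculative-decoding options have moved around between vLLM releases (older versions took speculative_model / num_speculative_tokens keyword arguments, newer ones a speculative_config dict), and both model names are placeholders, so verify the exact parameters against your version's documentation.

from vllm import LLM, SamplingParams

# Sketch only: parameter names vary across vLLM releases, and both model ids
# below are illustrative placeholders.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",         # the large "target" model
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",   # small draft model that guesses ahead
        "num_speculative_tokens": 5,                   # tokens proposed per verification step
    },
)

out = llm.generate(["Why does speculative decoding cut latency?"],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)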

Easy to Deploy Anywhere

AI Inference Server runs as a container, which means you can use the same setup on:

  • OpenShift
  • RHEL
  • Kubernetes clusters
  • Linux

For example, starting the CUDA image with Podman:

$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager \
  --model RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

The same image gives you consistent behavior across all of these platforms.

It Works with All Major Accelerators:

  • NVIDIA GPUs
  • AMD GPUs
  • Google TPUs

This helps you cut costs, adopt new hardware as it arrives, and avoid being locked into a single vendor.

It Comes with Optimized Models
You get easy access to top LLMs (like Llama and Mistral) from the RedHatAI organization on Hugging Face. They’re already compressed and tuned for vLLM. Just pick, pull, and deploy.
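
As a small illustration of “pick, pull, and deploy”, the snippet below pre-downloads the same pre-optimized model used in the Podman example, using the huggingface_hub package; the repo id comes from that example, and downloading ahead of time is optional since the server can also fetch the model on first start.

from huggingface_hub import snapshot_download

# Pre-download the compressed model from the RedHatAI organization on Hugging Face.
# Gated models may require HUGGING_FACE_HUB_TOKEN to be set, as in the Podman example.
local_path = snapshot_download("RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic")
print("Model files cached at:", local_path)
# Point --model (or LLM(model=...)) at the same repo id, or at local_path, to serve it.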

What’s Under the Hood?

  • EngineCore: Runs the inference loop, batching and scheduling work on the accelerator.
  • EngineCoreClient: Accepts API requests and forwards them to the engine.
  • Queue system: Balances short and long queries so neither starves the other.

This setup helps you serve many users at once with little delay.
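
To see that in action, here is a minimal concurrency sketch using the async OpenAI client against the same assumed local endpoint as the earlier example: firing many requests at once lets the scheduler batch them together instead of serving them one by one.

import asyncio
from openai import AsyncOpenAI

# Assumes the same local endpoint and model as the earlier client example.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Sixteen concurrent requests: the engine interleaves them via continuous
    # batching, so short answers are not stuck behind long ones.
    prompts = [f"Give me one fact about containers (#{i})." for i in range(16)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    for answer in answers:
        print(answer)

asyncio.run(main())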

 


For more details, see:

https://learn.redhat.com/t5/Red-Hat-Training-Videos/Red-Hat-AI-Inference-Server-Technical-Overview/b... 

https://www.redhat.com/en/products/ai/inference-server 

https://interact.redhat.com/share/pQrkrC0aCXDmJM2GeF3V 
