zafarali
Mission Specialist

VMs not migrated after node failure until node recovers


While testing, I observed unexpected behavior with VM eviction/migration during a node-failure scenario.

  1. A VM was running on worker-3.

  2. We powered off worker-3 abruptly.

  3. The VM’s VirtualMachineInstance remained in the Running phase even though the worker node was NotReady, and the virt-launcher pod associated with this VM was stuck in the Terminating state.

  4. After 10 minutes, once worker-3 was powered back on, the VM was migrated to worker-2.

Expected Behavior (based on documentation):

  • Kubernetes should evict pods from an unreachable node after 5 minutes (the kube-apiserver default --default-unreachable-toleration-seconds=300).

  • With runStrategy: Always and evictionStrategy: LiveMigrate, we expected the VM to restart or migrate to another healthy node after 5 minutes of node downtime, without waiting for the original node to come back.
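For context, a minimal sketch of a VirtualMachine carrying the two settings above (the VM name, empty device list, and memory request are placeholder values, not taken from this thread):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: test-vm                # hypothetical name
spec:
  runStrategy: Always          # controller keeps a VMI running at all times
  template:
    spec:
      evictionStrategy: LiveMigrate   # live-migrate on eviction instead of shutting down
      domain:
        devices: {}
        resources:
          requests:
            memory: 1Gi
```

Note that evictionStrategy: LiveMigrate governs graceful evictions such as node drains; an abrupt power-off is a different failure path.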

Observed Behavior:

  • VM migration/rescheduling only occurred after the failed node was powered back on (10 minutes later).

  • The VM did not restart/migrate while the node was powered off.

Zafar Ali
OpenShift Engineer

Accepted Solutions
Chetan_Tiwary_
Community Manager

@zafarali yes to all, but it depends on the underlying architecture you are using and what best fits your use-case scenario. For a DR setup, you need proper documentation and failover testing done in the QA stage before implementing it in production.

Something else you can consider is SNR (Self Node Remediation) together with NHC (Node Health Checks).

Once the node is fenced, one of two things will happen: either the node quickly comes back online, allowing the old pod to finish its termination, or, if the node stays down, it is replaced by the Machine Health Check (MHC). In either case, the old virt-launcher pod is fully removed, and the VM is re-created on a different node because its runStrategy: Always policy is still in effect.
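As a rough sketch of what an NHC + SNR pairing can look like (the name, selector, durations, namespace, and template name below are illustrative assumptions; check the linked docs for the values that match your release):

```yaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-workers            # hypothetical name
spec:
  minHealthy: 51%              # never remediate below this healthy fraction
  selector:                    # watch worker nodes only
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
  unhealthyConditions:         # fence after 5 minutes of NotReady/Unknown
    - type: Ready
      status: "False"
      duration: 300s
    - type: Ready
      status: Unknown
      duration: 300s
  remediationTemplate:         # delegate fencing to Self Node Remediation
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    namespace: openshift-workload-availability
    name: self-node-remediation-automatic-strategy-template
```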

Refer here: https://docs.redhat.com/en/documentation/workload_availability_for_red_hat_openshift/23.2/html/remed...

 

https://docs.okd.io/4.11/nodes/nodes/eco-node-health-check-operator.html 

 

https://www.redhat.com/en/blog/keeping-virtual-machines-available-by-allowing-nodes-to-self-repair 

 

Also, apart from affinity/anti-affinity rules, you can consider topology spread constraints for DR: https://stackoverflow.com/questions/73157345/kubernetes-spread-pods-across-nodes-using-podantiaffini...
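A hedged sketch of a topology spread constraint on a KubeVirt VM template (the VM name and the app grouping label are placeholders; this assumes you run several similarly labeled VM replicas you want spread across nodes):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ha-vm-1                # hypothetical name
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        app: ha-vm             # hypothetical grouping label
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                       # keep per-node VM counts within 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule # hard requirement, not best-effort
          labelSelector:
            matchLabels:
              app: ha-vm
      domain:
        devices: {}
        resources:
          requests:
            memory: 1Gi
```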

Chetan_Tiwary_
Community Manager

@zafarali I think when a virt-launcher pod is stuck in the Terminating state, it prevents any new pod from being scheduled until Kubernetes confirms its deletion. The KubeVirt controllers deliberately wait for that old pod to fully vanish before spinning up a replacement, and this is why your VM only re-appeared after the node came back.


KubeVirt avoids initiating an automatic migration during a node outage because it can't guarantee whether the VM on the failed node is still alive or writing to shared storage. Starting another instance elsewhere could risk data corruption or "split-brain" scenarios.


Once worker-3 came back online, the kubelet and virt-handler restored their heartbeat with the control plane. That allowed the stuck virt-launcher pod to finally terminate. With the field clear, the controllers saw that the VM was no longer running but its runStrategy: Always still required it to be alive, so they scheduled a new virt-launcher pod, which landed on another healthy node, worker-2 in your case.

That’s why your VM popped back up in the Running phase but this time, comfortably on a different host.

zafarali
Mission Specialist

Thank you for the detailed explanation. I understand that KubeVirt deliberately avoids automatic migration during a node outage to prevent data corruption or split-brain scenarios, and that the virt-launcher pod must fully terminate before a replacement can be scheduled.

However, in our environment, we have a Disaster Recovery (DR) requirement: whenever a node goes down, the VM should come up on another healthy node automatically without waiting for the failed node to recover. This is critical to meet our uptime and availability SLAs.

Could you advise on the recommended approach or configuration in KubeVirt/OpenShift Virtualization to achieve automatic failover of VMs across nodes during node failures while still ensuring data integrity? For example, should we consider Live Migration with OnNodeFailure runStrategy, anti-affinity rules, or any high-availability setup within KubeVirt?

We want to implement this in a way that ensures the VM is immediately available on a healthy node while avoiding split-brain scenarios.

Zafar Ali
OpenShift Engineer
zafarali
Mission Specialist

Thank you for the detailed guidance and references. I understand that the behavior of virt-launcher pods during node failures is influenced by the underlying architecture, runStrategy, and node remediation mechanisms like SNR and MHC.

For our Disaster Recovery (DR) scenario, the requirement is that whenever a node goes down, the VM should come up on another healthy node without waiting for the failed node to recover, ensuring zero or minimal downtime.

We will of course follow proper documentation and perform failover testing in QA before production implementation. However, I would appreciate guidance on the recommended approach to meet our DR requirement, including:

  • Use of Machine Health Checks (MHC) and Self Node Remediation (SNR) in conjunction with KubeVirt.

  • Configurations involving Topology Spread constraints, affinity/anti-affinity rules, or runStrategy adjustments that ensure immediate VM availability on a healthy node.

  • Best practices to avoid split-brain scenarios while enabling automatic failover during node outages.

The references you shared are helpful, and any additional guidance specific to OpenShift Virtualization DR setup would be highly appreciated.

Zafar Ali
OpenShift Engineer
garvchaudhary
Flight Engineer

Hi @zafarali, I have noticed that as long as a node is part of the cluster, even in a powered-off or failed state, the VM and VMI stay bound to it. I believe you need to remove that node from the cluster when it crashes or goes down; only then will the VMI be rescheduled and the runStrategy take effect.
