OpenShift Virtualization employs a comprehensive VM scheduling strategy that encompasses node selection, affinity rules, taints and tolerations, scheduler profiles, and node failure handling mechanisms. This multi-layered approach ensures efficient VM placement, tolerance of node-level constraints, and resilience to node failures.
The virt-handler daemonset plays a crucial role in VM scheduling by communicating with the libvirt instances that manage the VM lifecycle on each node. If virt-handler loses its connection to the cluster's API server, the node cannot communicate its status. The node enters a failed state, and the VMs running on it cannot be live migrated to healthy nodes.
In the event of a node failure, the virt-handler's absence triggers a sequence of actions:
Node Detection: Kubernetes detects the virt-handler's absence within minutes, because the node stops reporting its status.
Node Marking: Control plane nodes (master nodes) mark the failed node as unschedulable.
Workload Migration: The failed node's workloads, including VMs, are migrated to healthy nodes according to resource placement and scheduling rules.
VM Placement Strategy: OpenShift Virtualization employs a two-pronged approach to VM placement: an eviction strategy and node placement rules. The eviction strategy dictates how VMs are redistributed when nodes become unavailable, while node placement rules govern the initial allocation of VMs to nodes. The available placement mechanisms are:
Node Selector
The node selector ensures that VMs are scheduled on nodes that match specific label criteria. This mechanism enables granular control over VM placement, allowing users to align VMs with specific hardware configurations or resource availability.
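For example, a minimal fragment of a VirtualMachine manifest that restricts scheduling to labeled nodes; the VM name and the tier=database label are illustrative choices, not values from this article:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel9-db-vm               # illustrative name
spec:
  template:
    spec:
      # The VM is scheduled only on nodes carrying this label,
      # e.g. after: oc label nodes <node-name> tier=database
      nodeSelector:
        tier: database
```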
Affinity and Anti-affinity Rules
Affinity and anti-affinity rules provide more nuanced control over VM placement. Affinity rules specify preferences for co-locating VMs with certain characteristics, while anti-affinity rules prevent VMs with specific labels from residing on the same node. These rules can be used to optimize resource utilization, enforce isolation requirements, or improve application performance.
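As a sketch, assuming VMs labeled app=db that must not share a node and a preference to co-locate with pods labeled app=cache (both labels are illustrative), the rules go under spec.template.spec.affinity of the VirtualMachine:

```yaml
spec:
  template:
    spec:
      affinity:
        # Prefer nodes that already run pods labeled app=cache (co-location).
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: cache
              topologyKey: kubernetes.io/hostname
        # Never place two VMs labeled app=db on the same node (isolation).
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: db
            topologyKey: kubernetes.io/hostname
```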
Tolerations and Taints
Taints and tolerations act as a filter for VM scheduling. A taint marks a node so that the scheduler avoids placing workloads on it unless they explicitly tolerate that taint, while a toleration added to a VM allows it to be scheduled on, and remain on, tainted nodes. This lets administrators reserve nodes for particular workloads while still allowing designated VMs to run there.
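For example, a node could be reserved for virtualization workloads with a taint, and the VM given a matching toleration; the virtualization=true key used here is an illustrative choice:

```yaml
# Taint the node so only workloads tolerating it are scheduled there:
#   oc adm taint nodes <node-name> virtualization=true:NoSchedule
# Matching toleration in the VirtualMachine template spec:
spec:
  template:
    spec:
      tolerations:
      - key: "virtualization"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
```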
Scheduler Profiles
Scheduler profiles offer a broader perspective on VM placement, influencing the overall distribution of VMs across the cluster. The three available profiles – LowNodeUtilization, HighNodeUtilization, and NoScoring – cater to different resource utilization strategies and scheduling priorities.
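A scheduler profile is set cluster-wide on the Scheduler resource and affects all pods, including the virt-launcher pods that run VMs; a minimal sketch:

```yaml
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  # One of: LowNodeUtilization (default), HighNodeUtilization, NoScoring
  profile: HighNodeUtilization
```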
Note:
1. The eviction strategy determines whether VMs on a node are live migrated to another node or terminated when the node is placed into maintenance or drained:
LiveMigrate: The VM is live migrated so that it is not interrupted if the node is placed into maintenance or drained.
Not defined: The VM is terminated if the node is placed into maintenance or drained.
2. The .spec.runStrategy field in a VirtualMachine manifest in OpenShift Virtualization defines the restart policy for the virtual machine. It specifies how the virtual machine should be handled if it enters a non-running state. Options are Always, RerunOnFailure, Manual, and Halted.
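As a minimal sketch, both settings can appear in the same VirtualMachine manifest (the VM name is illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: rhel9-web-vm        # illustrative name
spec:
  # Restart policy: keep the VM running and recreate the VMI if it stops.
  runStrategy: Always
  template:
    spec:
      # Live migrate the VM instead of shutting it down when the node
      # is drained or placed into maintenance.
      evictionStrategy: LiveMigrate
```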
@Chetan_Tiwary_ Very good read, thank you
@Wasim_Raja Thanks!
@Chetan_Tiwary_ Great insight. Please send me the link to this.
I have a question.
According to the doc on ocp-v
If a VMI uses the LiveMigrate eviction strategy, it automatically migrates when the node that the VMI runs on is placed into maintenance mode.
Is this also applicable if the node becomes unhealthy and not ready due to a network/disk/kernel problem? Do we have any documentation or examples available anywhere?
Thanks.
@nihar_redhat AFAIK, no, it does not LiveMigrate due to unplanned issues on the node side, e.g. the node becomes unhealthy, unready, or unreachable.
When a node remains tainted as NotReady or unreachable for longer than the default five-minute toleration, the Kubernetes node controller evicts all pods on that node, including the VMI pod, by marking them for termination and deleting them.
Once the VMI pod has been removed, the virt-controller observes the VM's runStrategy (Always or RerunOnFailure) and spawns a new VMI on a healthy node. Additionally, if the node encounters memory or disk pressure, the kubelet may trigger node-pressure eviction, immediately terminating pods (including VMI pods) to reclaim resources.
https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
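For reference, the automatically added tolerations on a pod (including the virt-launcher pod backing a VMI) look roughly like this; 300 seconds is the upstream default and can be tuned:

```yaml
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```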
Thanks @Chetan_Tiwary_ .
I did try something like below.
1. The VM1 VMI was running on master01.
2. I blocked the outgoing traffic for the node m01.
3. The kubelet stopped sending heartbeats and the node became NotReady.
4. The pod for the VM is stuck in the Terminating phase and never gets evicted.
Have you noticed something like this before?