Solved: rhvm takes a long time to come up after lab start

SubuRama · 02-07-2024

Every time the lab environment is started after a previous stop, RHVM doesn't come up for almost five minutes. In one instance I had to stop and restart the whole env again. There are *no* steps to debug something like this in the course material.

Now, even 10 minutes after the env started, RHVM is still not up.

Looks like I have to delete the env and recreate it.

Sigh.

Subu

Travis · 02-07-2024

@SubuRama -

The worst thing you can do is delete and recreate the environment. RHVM does take a long time to come up, and this is normal. The systems aren't really meant to be powered off all the time. The reason things take so long is that the Utility server provides the storage domain for RHVM.

The Utility server must be up for a while, and the RHV-H hosts must be up as well. The RHVM server is a VM and is self-hosted on ServerA. The RHV-H hosts must scan and verify the storage domains and then initiate a startup of the RHVM server. RHVM then scans the storage domains and checks the status of the remaining RHV-H hosts, marking them as online. All RHV-H hosts must be able to access the storage domain.
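
To make that ordering concrete, here is a small Python sketch that models the cold-start chain as a dependency graph and prints one valid startup order. The stage names are hypothetical labels for the steps described in this post, not real RHV service names, and this only illustrates why each stage has to wait on the one before it; it is not how RHV actually schedules its startup.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical stage names for the cold-start chain described above.
# Each stage maps to the set of stages that must finish before it can begin.
STARTUP_DEPENDENCIES = {
    "utility-storage-up": set(),
    "rhvh-hosts-verify-storage": {"utility-storage-up"},
    "hosted-engine-vm-start": {"rhvh-hosts-verify-storage"},
    "rhvm-marks-hosts-online": {"hosted-engine-vm-start"},
}


def startup_order(deps):
    """Return one valid startup order (prerequisites first) for a stage -> prerequisites map."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(startup_order(STARTUP_DEPENDENCIES), start=1):
        print(f"{step}. {stage}")
```

Because every stage depends on the one before it, a delay anywhere in the chain (for example, the Utility server being slow to provide storage) pushes everything after it back.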

Travis Michette, RHCA XIII
https://rhtapps.redhat.com/verify?certId=111-134-086
SENIOR TECHNICAL INSTRUCTOR / CERTIFIED INSTRUCTOR AND EXAMINER
Red Hat Certification + Training

SubuRama · 02-08-2024

Understood. I might be mistaken, but is High Availability for RHVM in the picture here? Perhaps not, since the HostedEngine VM is only on one host.

Since we don't install the HostedEngine ourselves, I'm not sure how it was installed. Is it *guaranteed* to come up if the hosts are up? How long should we wait before deciding something is wrong?

It's just that 10 minutes seems like an eternity. And it's *not* consistent; sometimes it's just a couple of minutes.

Yes, I understand that deleting and recreating the infra is the last resort, but a lot of the time RH support seems to offer it as the first solution.

Thank you for taking time to explain.

A question: how does RH436 (High Availability Clustering, Pacemaker, etc.) compare with RHV?

Subu

Travis · 02-08-2024

@SubuRama -

So technically, this environment was installed with the HostedEngine (HE) in a highly available setup. Initially, all hypervisor hosts are part of the Default DataCenter/Cluster, so the Engine appliance could theoretically run on any of them. However, due to resource constraints, HostA is the only system with enough memory to run the HostedEngine. When the storage domain comes up, the HA availability checks run, and the only host they find with enough resources to start the engine is HostA.
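
As a rough illustration of why the engine always lands on HostA, here is a simplified Python sketch of that placement check: only hosts that can see the storage domain and have enough free memory are candidates. The host names other than HostA, the memory figures, and the "most free memory wins" rule are assumptions for illustration only; the real HA agent (ovirt-ha-agent) uses its own scoring, which this does not reproduce.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    free_memory_mib: int        # memory currently free for the engine VM
    sees_storage_domain: bool   # has this host connected to the HE storage domain yet?


# Assumption: illustrative engine VM size, not the lab's real value.
ENGINE_MEMORY_MIB = 16384


def pick_engine_host(hosts: List[Host], required_mib: int = ENGINE_MEMORY_MIB) -> Optional[Host]:
    """Simplified placement check (not ovirt-ha-agent's real scoring).

    A host is a candidate only if it can see the storage domain and has
    enough free memory; among candidates, the one with the most free memory wins.
    Returns None when no host qualifies yet.
    """
    candidates = [
        h for h in hosts
        if h.sees_storage_domain and h.free_memory_mib >= required_mib
    ]
    if not candidates:
        return None  # e.g. storage not verified yet: nothing can start the engine
    return max(candidates, key=lambda h: h.free_memory_mib)


if __name__ == "__main__":
    # Hypothetical lab: only HostA has room for the engine VM.
    lab = [
        Host("HostA", 20480, True),
        Host("HostB", 8192, True),
        Host("HostC", 8192, True),
    ]
    chosen = pick_engine_host(lab)
    print(chosen.name if chosen else "no host can start the engine yet")
```

Note the `None` case: until the storage domain is visible, no host qualifies, which matches the waiting you see right after the lab starts.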

One thing you can do is log in to HostA and run the hosted-engine command. Specifically, hosted-engine --vm-status will give you the status. A lot of the time, you can see that it is waiting for the storage domain to come up, be scanned, and be marked as available. Since the HE went down as if in a power outage, once the storage is up, connected, and guaranteed available, the HE will try to start up by itself on one of the hosts in the cluster (again, HostA is the only one) after checking all hosts for their ability to start it.
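
On the question of how long to wait: one possible approach is to poll hosted-engine --vm-status on HostA until the engine reports healthy, and only start troubleshooting once a deadline passes. This sketch assumes the status output contains "health": "good" once the engine is up (check the actual output on your host and adjust the marker if needed). The 20-minute timeout and 30-second interval are arbitrary values, not numbers from the course.

```python
import subprocess
import time


def wait_for_engine(cmd=("hosted-engine", "--vm-status"),
                    ready_marker='"health": "good"',
                    timeout_s=20 * 60,
                    interval_s=30):
    """Poll `cmd` until its stdout contains `ready_marker`, or give up at the deadline.

    Assumptions: this runs on a hosted-engine host (HostA here); the marker string
    and the timing values are illustrative, not official thresholds.
    Returns True once the marker is seen, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Run the status command and capture its text output.
        result = subprocess.run(list(cmd), capture_output=True, text=True)
        if ready_marker in result.stdout:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling wait_for_engine() on HostA returns True as soon as the engine reports healthy. A False result is the signal to read the full --vm-status output rather than deleting the environment.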

The High Availability (HA) provided by RHV is nothing like RH436 clustering. In RH436, you use HA components installed on the servers to establish and assess quorum and to manage "mirrors" of an application. In the clustering course, you often have two systems with the same configuration running the same application, so if one goes down, there is no impact to service.

Within a virtualization environment (RHV or any other), HA works differently: if a VM or hypervisor goes down, the VMs marked as HA get restarted elsewhere in the environment. However, in this scenario, even though the applications and systems come back up, they can still be down for a short window.

Emanuel_Haine · 02-07-2024

The same happens to me, and I just accepted it, considering that this is a virtual environment with limited resources.

Travis · 02-08-2024

@Emanuel_Haine -

Even with more resources, and even in a physical environment, it would take time. Keep in mind, this isn't just regular VMs going down. It is the virtual infrastructure going down.

So if you brought this out to the physical world, you would be pulling the plug on the storage nodes, the hypervisors, and anything else connected. We are not shutting down the RHVM VM or going through any shutdown process. The VMs power off with a timer or when you click the button. This is fine for normal VMs.

The RHV environment isn't normal VMs. You are shutting down the storage server and the hypervisors, and because those get shut down, the self-hosted RHV-M appliance is shut down too (but not in maintenance mode or anything else). So when powering the systems back up, the storage domain needs to come online, and the hypervisors need to check and verify the storage domain and mark it active on HostA before the hosted engine can be restarted. Things are slightly slower because of limited resources and because everything is virtualized (we have nested virtualization happening here), but even in a physical environment, all these checks would need to happen before the system can come back online.

Emanuel_Haine · 02-08-2024

@Travis

I understand your point and you are right. It is not fair to compare this course lab with another one, such as RHCSA. It is a different and more complex environment.