Ansible VMware, iLO and iDRAC integration

santoshkkhade · 01-13-2021

Hello,
I need to run automated checks after unexpected Linux server reboots, on both VMs and bare-metal servers.

The requirement is to automate this so that:

1. I can fetch DRS, datastore, or any other issues related to the VM at the VMware level.

2. For physical servers, I can check for hardware failures on iLO or iDRAC.

Any suggestions for automating these tasks with Ansible would be a great help.

I have already created a shell script that validates the critical services on the server, but I am stuck on checking for hardware-level errors after unexpected reboots.

I am reluctant to rely on OS-level logs for hardware issues in automation, as the messages are very inconsistent. I checked a few links, such as https://access.redhat.com/articles/206873 on the Red Hat site, but the error patterns mentioned there do not always match.

Fran_Garcia · 08-09-2021

I'd argue Ansible is the wrong tool for root cause analysis of server crashes or reboots. Ansible is a fantastic tool for server configuration automation, for server deployment, and for many other use cases, but it won't help you correlate events or analyze crashes in the way you need.

First of all, I'd start by sending all those messages to a log aggregation service (a plain central rsyslog server, or something more elaborate), so alarms can be triggered in advance.
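As a rough sketch of the alarm side, something like the following could scan aggregated log lines for hardware-related messages. The patterns and the function name `find_hardware_events` are illustrative assumptions, not a complete or authoritative list; as noted in the question, real messages vary, so they would need tuning against what your own hardware logs.

```python
import re

# Illustrative patterns only (assumptions). Tune them to the messages
# your kernel and hardware actually emit.
HARDWARE_PATTERNS = [
    re.compile(r"\[Hardware Error\]"),
    re.compile(r"Machine check", re.IGNORECASE),
    re.compile(r"\bEDAC\b"),
]


def find_hardware_events(lines):
    """Return (line_number, line) pairs for lines matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in HARDWARE_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Any iterable of lines works as input, for example an open file on the central log host; each hit could then be passed to whatever alerting mechanism you already use.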

Then I'd ensure I have a working kdump configuration: https://access.redhat.com/solutions/6038 . A working kdump configuration is critical for analyzing kernel dumps and for getting the most out of Red Hat Support. ABRT is also very useful when analyzing application crashes: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/system_administ... .

For HP iLO / Dell iDRAC, I'd ensure the output of "hpasmcli -s 'show iml'" is sent to a monitoring system. Alternatively, the iLO/iDRAC can be configured to send its events directly to the central log collection system, so that environmental issues (faulty RAM, power supply, RAID, temperature, etc.) can be handled proactively.
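A minimal sketch of the first option, assuming hpasmcli is installed on the host and a syslog collector accepts UDP messages. The helper names (`collect_lines`, `build_logger`, `forward_iml`) and the logger name are hypothetical, and the output is forwarded line by line as-is rather than parsed, since the exact log format is not something to rely on here.

```python
import logging
import logging.handlers
import subprocess


def collect_lines(command_output):
    """Return the non-empty lines of command output, stripped of whitespace."""
    return [line.strip() for line in command_output.splitlines() if line.strip()]


def build_logger(host, port=514):
    """Create a logger that sends records to a remote syslog server over UDP."""
    logger = logging.getLogger("hw-events")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.SysLogHandler(address=(host, port)))
    return logger


def forward_iml(logger):
    """Run hpasmcli and forward each line of its output to the logger."""
    result = subprocess.run(
        ["hpasmcli", "-s", "show iml"],
        capture_output=True,
        text=True,
        check=True,  # raise if hpasmcli exits with an error
    )
    for line in collect_lines(result.stdout):
        logger.info("iml: %s", line)
```

Calling `forward_iml(build_logger("loghost.example.com"))` from cron or a systemd timer would push the entries on a schedule; the hostname is a placeholder for your own log collector.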

Finally, for RHEL users it's also advisable to leverage Red Hat Insights, a free tool that can proactively detect configuration issues and propose fixes in advance: https://www.redhat.com/en/technologies/management/insights .

Hope this helps