Hello everyone,
This is going to be a more generic question, but I would love to know what tools you use today for monitoring your production environment and setting up alerting.
It would be great to learn more about your experiences as well.
My team currently uses Zabbix to collect data and alert, combined with Prometheus data from our OpenShift clusters. The environment in question produces over 500k items and ~250k triggers for alerting, covering multiple OpenShift clusters and roughly 500 hosts.
We use automation (Ansible/Python/RH Satellite) to set it all up for each host, and on the hosts we use Zabbix item autodiscovery extensively. For example, we let it autodetect all systemd services on the host and alert on any that are enabled but not running.
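To make the systemd example concrete, here is a minimal Python sketch of the idea. It is an illustration under assumptions, not the setup described in this post: the function names and the {#SERVICE.NAME} discovery macro are hypothetical, and the text being parsed is assumed to come from `systemctl list-unit-files --type=service --no-legend`, with per-unit states taken from `systemctl is-active`.

```python
import json


def parse_enabled_services(unit_files_output):
    """Return the names of enabled services.

    Expects text shaped like the output of
    `systemctl list-unit-files --type=service --no-legend`,
    i.e. one unit per line: "<unit> <state> [<preset>]".
    """
    services = []
    for line in unit_files_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].endswith(".service") and fields[1] == "enabled":
            services.append(fields[0])
    return services


def to_zabbix_lld(services):
    """Build a Zabbix low-level discovery payload.

    {#SERVICE.NAME} is a hypothetical macro name chosen for this sketch.
    """
    return json.dumps({"data": [{"{#SERVICE.NAME}": name} for name in services]})


def enabled_but_not_running(enabled, active_states):
    """Return enabled services whose reported state is not "active".

    `active_states` maps unit name to the string printed by
    `systemctl is-active <unit>`. Caveat: enabled oneshot units are
    legitimately inactive after they finish, so a real check would
    need an exclusion list for them.
    """
    return [name for name in enabled if active_states.get(name) != "active"]
```

The discovery payload would feed item and trigger prototypes on the Zabbix side, while the last function mirrors the "enabled but not running" alert condition. Keeping the parsing separate from the systemctl calls makes the logic easy to test without a running systemd.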
How are you dealing with monitoring and alerting?
- Rudolf Kastl
Hello Rudolf,
When you say monitoring your environment, do you mean monitoring just the Linux systems, or the network infrastructure as a whole?
Hello Trevor,
We are monitoring our whole Linux infrastructure (we exclusively run Linux on servers) and hardware devices like switches and firewalls.
I am interested in both scenarios: monitoring the Linux infrastructure and monitoring the network infrastructure as a whole. Thank you for the question!
- Rudi
Red Hat Learning Community