It's rather difficult to automate troubleshooting. The only approach that is really useful is configuration management:
checking all the systems with tooling like Ansible/Puppet/Chef/SaltStack to enforce working configurations and fixes for known bugs that have already been solved (and properly analysed with root cause analysis etc.).
Besides that, log analysis tools like ELK can help you detect and pinpoint failures in complex automation chains.
Maybe someone has found or built a tool that tries to fix problems automatically, but I'd expect it to have a low success rate (and I would definitely validate the solution before implementing it).
Hi, Full configuration management is an option, but it's not troubleshooting. There are already tools for troubleshooting:
- xsos (checks sosreport data or current system), can report information on some network buffers or memory load
- sarstats (can check historical sar data to automate 'peak' detection on load, etc.)
- lynis (can check on known security bugs or misconfigurations)
- last year we gave a talk at DevConf.cz (available on Google), "Detect pitfalls on OSP deployments", about the things we worked on for this.
There are also other tools that can apply remediations.
Company wikis or knowledge bases usually store the information on known issues. Instead of looking it up by hand the next time an issue happens, that's the kind of check that can be automated. Once the known issues have been ruled out, you can go back to the traditional step-by-step approach (and then document the result to feed it back into the detection process).
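To make the known-issue check above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `KnownIssue` structure, the `match_known_issues` helper, the two example patterns, and the `wiki/...` article names are hypothetical and would come from your own knowledge base.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    name: str      # short identifier for the issue
    pattern: str   # regular expression that identifies it in a log
    article: str   # hypothetical pointer to the wiki/kbase write-up


# Illustrative entries only; real ones would be fed from your documentation.
KNOWN_ISSUES = [
    KnownIssue("oom-killer", r"Out of memory: Kill(ed)? process", "wiki/oom-killer"),
    KnownIssue("disk-full", r"No space left on device", "wiki/disk-full"),
]


def match_known_issues(log_text, issues=KNOWN_ISSUES):
    """Return the known issues whose pattern appears in the given log text."""
    return [issue for issue in issues if re.search(issue.pattern, log_text)]
```

If the returned list is empty, the known causes are ruled out and you fall back to manual troubleshooting; whatever you find then becomes a new entry in the list.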
We use Instana to monitor and troubleshoot Java applications. We previously used AppDynamics, but it became too expensive.
You mentioned Lynis. We use Lynis for general security auditing and hardening (alongside OpenSCAP); however, I would not say that it's a troubleshooting tool.
You can write quick scripts to collect information about the system, but I think these are usually hand-rolled by admins with an eye toward problems commonly encountered in their specific environments. They do, however, make data collection fast and consistent. You might also pipe the collected output through an awk script that highlights anomalies.
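As a rough illustration of such a hand-rolled collector, here is a small Python sketch (Python instead of awk, just to keep it in one file). The metric names, the threshold values, and the `collect_metrics`, `flag_anomalies` and `report` helpers are all made up for this example; which metrics matter depends entirely on your environment. It assumes a Unix-like system, since `os.getloadavg()` is not available elsewhere.

```python
import os
import shutil

# Hypothetical thresholds; tune them for your own environment.
THRESHOLDS = {"load_per_cpu": 1.0, "root_disk_used_pct": 90.0}


def collect_metrics(path="/"):
    """Gather a few basic health metrics from the local system (Unix only)."""
    load1, _, _ = os.getloadavg()          # 1-minute load average
    cpus = os.cpu_count() or 1
    usage = shutil.disk_usage(path)        # total/used/free in bytes
    return {
        "load_per_cpu": load1 / cpus,
        "root_disk_used_pct": 100.0 * usage.used / usage.total,
    }


def flag_anomalies(metrics, thresholds=THRESHOLDS):
    """Return one line per metric, prefixed with '!!' when above its threshold."""
    lines = []
    for name, value in sorted(metrics.items()):
        limit = thresholds.get(name)
        marker = "!!" if limit is not None and value > limit else "  "
        lines.append(f"{marker} {name}={value:.2f}")
    return lines


def report():
    """Collect the metrics and print them with anomalies highlighted."""
    print("\n".join(flag_anomalies(collect_metrics())))
```

Calling `report()` prints every collected value and marks the ones over their limit, so the output looks the same on every host and the outliers stand out.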
One thing you might look at is Red Hat Insights.
Insights is a SaaS-based tool that's hosted on Red Hat Customer Portal. You send a small amount of system metadata to Insights (less than an sosreport), and it does an analysis for known issues, misconfigurations, security vulnerabilities (could be vulnerable packages, could be things like Spectre/Meltdown/L1TF) and so on. Support fairly frequently adds more "rules" for issues to check for to the Insights knowledge base, and they're typically tied to Customer Portal kbase articles. But the really cool thing is that to automate remediation of most issues, you can generate Ansible Playbooks or bash scripts, with an estimate of how "risky" the remediation might be. You can see and control the metadata it'll send to the Insights service, too.
Yeah, I work at Red Hat, but I legitimately think it's a useful tool and it sounds like it would address at least part of what you're looking for. If you've already got "Smart Management" subscriptions and a Satellite server, I think you should already have access to Insights, and you can set up the Satellite to act as a proxy to Insights and as a local UI to review reports and plan remediations. There's more info at https://access.redhat.com/products/red-hat-insights and https://www.redhat.com/en/technologies/management/insights.