Trevor
Commander

Failed Task

A playbook is configured to be executed on a single host.

That playbook has a single play. 

That single play has 4 tasks.  

If the first task fails on that single host, will the remaining 3 tasks execute?

 

Trevor "Red Hat Evangelist" Chandler
3 Replies
shashi01
Moderator

@Trevor 

If the first task fails on a host, Ansible won’t run the next tasks on that host; it stops right there. That’s the default behavior to avoid making things worse after something goes wrong. But if you want Ansible to keep going even when something fails, you can tell it to do so by adding ignore_errors: yes to that task. A better way, though, is to use a block with rescue and always sections, which lets you handle the failure more cleanly and still run follow-up steps if needed. This approach is clearer and is also recommended as a best practice.
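
To make the default behavior concrete, here is a minimal sketch of the scenario from the question: one play, one host, four tasks. The playbook, the task names, and the use of localhost are made up for illustration; the first task uses ansible.builtin.fail to force a failure.

```yaml
---
# Hypothetical playbook: a single play with four tasks on a single host.
- name: Show what happens when the first task fails
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Task 1 (forced to fail)
      ansible.builtin.fail:
        msg: "Simulated failure"

    # With default settings, none of the tasks below run on this host,
    # because the host is dropped from the play after the failure above.
    - name: Task 2
      ansible.builtin.debug:
        msg: "Task 2 ran"

    - name: Task 3
      ansible.builtin.debug:
        msg: "Task 3 ran"

    - name: Task 4
      ansible.builtin.debug:
        msg: "Task 4 ran"
```

Running this with ansible-playbook should report Task 1 as failed and never print the debug messages from Tasks 2 to 4.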

Chetan_Tiwary_
Community Manager

@Trevor Very important concept on Ansible error handling in playbooks.

By default, Ansible stops executing tasks on a host when a task fails on that host. 

You can use ignore_errors: yes on the failing task to allow the play to continue despite failure.

However, the ignore_errors directive only takes effect when a task actually runs and returns a failed result. It does not cover undefined variables, connection failures, execution problems such as missing packages, or syntax errors; Ansible still stops when it hits any of those.
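
A minimal sketch of ignore_errors on a single task. The service name example-app and the variable name stop_result are placeholders, not anything from a real environment.

```yaml
# Task-list fragment; place it under "tasks:" in a play.
- name: Stop a service that may not be installed (hypothetical service name)
  ansible.builtin.command: systemctl stop example-app
  register: stop_result
  ignore_errors: true   # a failed result is reported, but the play continues

- name: Still runs even if the previous task failed
  ansible.builtin.debug:
    msg: "Stop task failed: {{ stop_result is failed }}"
```

The failure is still shown in the output (marked as ignored), so it stays visible in the job log.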

You can use the ignore_unreachable keyword to skip task failures when a host is marked as ‘UNREACHABLE’. This lets Ansible bypass errors for those hosts and continue running tasks that follow, even if it couldn’t connect to them. Normally, when Ansible cannot reach a host, it flags the host as ‘UNREACHABLE’ and removes it from the active list for the remainder of the play.
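
A small sketch of ignore_unreachable at the task level, assuming the play targets remote hosts and one of them drops off the network. The task names are illustrative only.

```yaml
# Task-list fragment; place it under "tasks:" in a play against remote hosts.
- name: Unreachable result is ignored for this task only
  ansible.builtin.ping:
  ignore_unreachable: true

- name: Ansible still attempts this task on the same host
  ansible.builtin.ping:
  # No ignore_unreachable here, so an unreachable result at this point
  # removes the host from the rest of the play as usual.
```

The keyword can also be set at the play level if you want it to apply to every task in the play.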

Also, Ansible gives you control over task failures with the failed_when conditional. By default, if you specify a list of conditions under failed_when, the task only fails if all are true (implicit “and”). If you want the task to fail when any one condition is met, you need to combine them in a string using the explicit “or” operator.
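
The difference between the two forms of failed_when can be sketched like this. The command path /usr/bin/example-command and the strings being matched are hypothetical.

```yaml
# Task-list fragment; place it under "tasks:" in a play.
- name: Fail only when BOTH conditions are true (a list is an implicit "and")
  ansible.builtin.command: /usr/bin/example-command
  register: result
  failed_when:
    - result.rc != 0
    - "'No such' in result.stderr"

- name: Fail when EITHER condition is true (explicit "or" in one string)
  ansible.builtin.command: /usr/bin/example-command
  register: result
  failed_when: >
    result.rc != 0 or
    'ERROR' in result.stdout
```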

 

Last but not least:

You can manage how Ansible handles task failures by grouping tasks in a block and using the rescue and always sections.
If a task inside a block fails, the tasks under rescue will run automatically, letting you handle errors in a way that is similar to exception handling in programming.

The always section runs tasks no matter what, whether there was an error or not.


Keep in mind, though: if a failure is due to an invalid task (like a syntax mistake) or an unreachable host, neither rescue nor always will be triggered during playbook execution.
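
A minimal sketch of the block, rescue, and always layout described above. The command path is hypothetical; ansible_failed_task is the variable Ansible provides inside rescue to describe the task that failed.

```yaml
# Task-list fragment; place it under "tasks:" in a play.
- name: Handle a failure with block, rescue, and always
  block:
    - name: Run a step that might fail (hypothetical command)
      ansible.builtin.command: /usr/bin/example-command

  rescue:
    - name: Runs only if a task in the block returned a failed result
      ansible.builtin.debug:
        msg: "Task '{{ ansible_failed_task.name }}' failed, recovering"

  always:
    - name: Runs whether or not the block failed
      ansible.builtin.debug:
        msg: "Cleanup step"
```

If the rescue tasks all succeed, the host stays in the play and the tasks after the block continue to run.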

 

https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_error_handling.html 

https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_blocks.html#block-error-handling 

Travis
Moderator

@Chetan_Tiwary_ -

A very good and thorough overview of failures and how to handle them. I would like to add one more thing that I haven't always done, but running larger jobs and workflows in AAP Controller has made me think more about handling failures (think of a job that runs for well over an hour and fails towards the end ... how do you recover?).

Some modules can fail for reasons that have nothing to do with the system, the playbook, or even your local networking. block/rescue/always can sometimes help you recover, but it might not be sufficient for network blips or remote system load. So while it doesn't involve the fail keyword, I have found the following to be invaluable lately.

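    # Container image handling tasks
    - name: Pull images from external registries
      containers.podman.podman_image:
        name: "{{ item }}"
        validate_certs: false   # skip TLS certificate verification
        pull: true
      loop: "{{ container_images }}"   # list of image names to pull
      retries: 5    # retry a failed pull up to 5 times
      delay: 20     # wait 20 seconds between attempts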

What does this task do ... so glad you asked.

Essentially, this is a working task that pulls images from remote container registries. What happens if it can't get an image? Well, it fails. So assuming you are pulling a bunch of container images, and those might be coming from the same registry, any number of things can happen ...

  • Too many requests
  • Network load
  • Dropped packet
  • DNS error or routing error
  • Routers temporarily updating a path mid-connection, so the other end can't be reached for a couple of packets

Any of the above scenarios could cause the image download to fail. However, you know the images exist and you know the playbook works. One thing you can add at the task level is the retries and delay keywords: retries sets how many times Ansible retries the task, and delay sets how many seconds it waits before each retry. So one could argue that this does some self-healing from failures. This is different from block/rescue/always. There, the rescue section will often try to fix something and get you to the end goal (yes, rescue could just try the same thing again, like installing a service because it failed the first time), but not always, and block/rescue/always is still one and done.

The retries/delay combination gives the task a chance to keep trying until it works, with a wait between attempts to ride out things like API rate limiting and minor network glitches. This can be very valuable for long-running playbooks and workflows: rather than failing right away, the play stays at that task until it completes successfully or runs out of retries, based on the parameters that you've provided.
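
For anyone who wants to spell out the success condition, here is a sketch that pairs retries and delay with register and until. The URL and the variable name health are placeholders. As far as I know, retries without until only behaves as described on newer ansible-core releases (2.16 and later, I believe), so adding until is the safer form on older versions.

```yaml
# Task-list fragment; place it under "tasks:" in a play.
- name: Wait for a hypothetical API endpoint to respond
  ansible.builtin.uri:
    url: https://api.example.com/health   # placeholder URL
    status_code: 200
  register: health
  until: health is succeeded   # stop retrying as soon as the call works
  retries: 5                   # give up after 5 retries
  delay: 20                    # wait 20 seconds between attempts
```

If the condition is never met, the task fails after the last retry and the normal failure handling (ignore_errors, rescue, and so on) applies.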

Travis Michette, RHCA XIII
https://rhtapps.redhat.com/verify?certId=111-134-086
SENIOR TECHNICAL INSTRUCTOR / CERTIFIED INSTRUCTOR AND EXAMINER
Red Hat Certification + Training