Anatomy of an Ansible Bug

Tracking down Ansible bugs becomes difficult when you are playing with issues between local and remote systems.

For the last couple days I was racking my brains why my database import script, written in Ansible, was not importing the data into the database. I had 3 separate imports, and only 1 was working. I looked at everything, but it wasn't until I walked away, came back, and realized that I had mistyped one character in the path to the dump.

But lo, didn't Ansible tell me that everything was fine with the import? Yes it did. I repeated this several times to make sure I was right. I had to figure out why Ansible was telling me that it had changed the remote system successfully when importing data into MySQL, when really it had failed out and done nothing. What it should have done was fail completely and stop the rest of the provisioning.

I dug into the code to figure out why. I could have turned up the verbosity of Ansible, but Ansible tends to output A LOT more information than required when communicating what it is doing on the remote system.

As I installed Ansible via Homebrew on OSX, I first had to find where the files were installed:

$ brew info ansible
ansible: stable 1.6.2, HEAD
http://www.ansible.com/home
/usr/local/Cellar/ansible/1.5.5 (798 files, 9.2M)
  Built from source
/usr/local/Cellar/ansible/1.6.2 (868 files, 9.9M) *
  Built from source
From: https://github.com/Homebrew/homebrew/commits/master/Library/Formula/ansible.rb

The code for the project is very well laid out and since it is written in Python, it is easy enough to read without worrying about compiled code.

$ cd /usr/local/Cellar/ansible/1.6.2
$ cd share/ansible/database
$ ls
mongodb_user          mysql_variables         redis
mysql_db              postgresql_db           riak
mysql_replication     postgresql_privs
mysql_user            postgresql_user

Open up the mysql_db file (yes, these are Python files even though the file extension is missing) and look for the db_import function.

def db_import(module, host, user, password, db_name, target, port, socket=None):
    cmd = module.get_bin_path('mysql', True)
    cmd += " --user=%s --password=%s" % (pipes.quote(user), pipes.quote(password))
    if socket is not None:
        cmd += " --socket=%s" % pipes.quote(socket)
    else:
        cmd += " --host=%s --port=%s" % (pipes.quote(host), pipes.quote(port))
    cmd += " -D %s" % pipes.quote(db_name)
    if os.path.splitext(target)[-1] == '.gz':
        cmd = 'gunzip < ' + pipes.quote(target) + ' | ' + cmd
    elif os.path.splitext(target)[-1] == '.bz2':
        cmd = 'bunzip2 < ' + pipes.quote(target) + ' | ' + cmd
    else:
        cmd += " < %s" % pipes.quote(target)
    rc, stdout, stderr = module.run_command(cmd, use_unsafe_shell=True)
    return rc, stdout, stderr

The problem exists in the following block:

if os.path.splitext(target)[-1] == '.gz':
    cmd = 'gunzip < ' + pipes.quote(target) + ' | ' + cmd
elif os.path.splitext(target)[-1] == '.bz2':
    cmd = 'bunzip2 < ' + pipes.quote(target) + ' | ' + cmd

These lines are executed if the target filename contains a .gz or .bz2 file extension. If this is the case, Ansible remotely commands the target machine to first run the file through an unzipping program (gunzip for .gz files, and bunzip2 for .bz2 files), then pipe the results into mysql.

If you run this command manually on a target machine you will notice the issue right away:

$ gunzip < /tmp/i-dont-exist.sql.gz | mysql -uroot testdb
-bash: /tmp/i-dont-exist.sql.gz: No such file or directory
$

Hmm. That's weird. It just exits immediately but it does show us an error message saying what we expect. A good thing to realize is that Ansible determines the success or failure of a task by the exit code that is returned by the command, not by what is output by the command. So let's try that again:

$ gunzip < /tmp/i-dont-exist.sql.gz | mysql -uroot testdb
-bash: /tmp/i-dont-exist.sql.gz: No such file or directory
$ echo $?
0
$

That's better. An exit code of 0 means success in Unix/Linux systems. Anything higher than 0 is a failure condition (in most cases). So if the remote machine is reporting 0, then Ansible is correct in reporting that the task finished successfully. But, we know that it should fail by the simple fact that there is an error message on the screen, so what happened?

The problem is that the pipe (|) operation swallowed the exit code from gunzip and instead used the exit code from mysql. If we break apart the two commands, you'll see the problem.

$ gunzip < /tmp/i-dont-exist.sql.gz
gzip: /tmp/i-dont-exist.sql.gz: No such file or directory
$ echo $?
1
$

So what is the best way to deal with a pipe that swallows an exit code? You perform any tests first and then only move to the next command if the test passes. If the test doesn't pass, it will return the correct exit code to the user. Unix systems have the test application for this. We are going to rewrite the command that Ansible uses:

$ test -e /tmp/i-dont-exist.sql.gz && gunzip < /tmp/i-dont-exist.sql.gz | mysql -uroot testdb
$ echo $?
1
$

Much better! There is no error message reported, which sucks for user feedback, but the exit code returned will now be recorded as a task failure by Ansible. We now know that the Ansible code in mysql_db should read:

# for gzip
cmd = 'test -e ' + pipes.quote(target) + ' && gunzip < ' + pipes.quote(target) + ' | ' + cmd

# for bzip2
cmd = 'test -e ' + pipes.quote(target) + ' && bunzip2 < ' + pipes.quote(target) + ' | ' + cmd

Let's step back for a second. I once had a mentee ask me "How do you get better at troubleshooting?" and the only answer I could give was to read lots of code and deal with a lot of different issues. There is no playbook for troubleshooting, and a lot of issues merely require a curious mind.

In this case, the bug in Ansible was not their fault, because the mysql command was swallowing the error exit code from gunzip and then giving a success exit code back to the end user. If blame is to be made, it is that the author of the Ansible code didn't realize that pipe operations swallow exit codes. That is a terribly easy mistake to make.

What's the next step? Report the bug to Ansible, but instead of just reporting it, we can now fix the error for them and submit it as a pull request. Not only does it make their lives easier, it also allows us to build a record of how we have contributed to other projects. The Ansible project is on Github at https://github.com/ansible/ansible and all you need to do is fork the project into your own repository.

Once the repository is forked into your own account, you can make the changes and then issue a pull request to the main Ansible project.

$ git clone https://github.com/scottbrown/ansible
... lots of info here we can ignore ...
$ cd ansible
# create a new branch exclusively for this fix
$ git checkout -b bugfix-1178
Switched to bugfix-1178
$ cd library/database
# make the fix to the file (see above), save, and exit
$ vi mysql_db
# add the file and commit the change to the repository
$ git add mysql_db
$ git commit -m 'BUGFIX 1178: some descriptive info here'
# now we push the branch to Github so that it can be seen by others
$ git push origin bugfix-1178

Once that is done, tell Github to create a pull request of your branch to their devel branch. Be sure to explain in detail everything that was changed as well as why the change is required. There will undoubtedly be some discussions in the comment of the pull request, but at least a fix is there to act as the starting point.

See how easy it is to contribute to open source projects? The bug wasn't difficult to fix, but it was something that most people won't encounter. Regardless, we've now made the Ansible project a bit more resilient to errors.

Now get out there and fix all those low-hanging fruit!

UPDATE

I submitted this change as a pull request and jimi-c was kind enough to propose a better solution. This is why I love the Github-style of pull requests. You get instant code review and can have a conversation/audit trail of any changes made.

The code doesn't read well when I bolted on the file existence check (using test -e) onto each branch of the conditional. That, and I'm repeating myself. Instead, he proposed I check for file existence in one place and using pure Python.

I didn't realize that, when using os.path.exists, Ansible was issuing commands on the remote machine, so I didn't use it the first time. But now I was able to add the following to the top of the function:

def db_import(module, host, user, password, db_name, target, port, socket=None):
    if not os.path.exists(target):
        return module.fail_json(msg="target %s does not exist on the host" % target)

    ... rest of function unchanged ...

Not only is this a better solution, but it also conforms to the concept of fail fast, in which if the target file doesn't exist on the remote system, the function shouldn't be doing any more processing (checking for attributes, values, etc.) and simply fail and return an error message. This saves time and processing power.

In addition to this improved code, when Ansible fails it now reports the exact problem to the end user. This is something that was missing from my solution and it felt wrong (which is a good sign that it isn't the best solution).

TASK: [bugfix | import data from non-existent file] ***************************
failed: [default] => {"failed": true, "item": ""}
msg: target /tmp/missing.sql.bz2 does not exist on the host

All in all, I call this a good day for coding.