Debugging Bare Metal

Practical Tips and Tricks on Debugging OpenStack Ironic

 

Dmitry Tantsur

http://dtantsur.github.io/talks/fosdem2016/

Debugging Bare Metal

General Schema

  1. Self-sanity check: did I follow all the required steps?
  2. Know fragile parts (and when they break).
  3. Divide and conquer: figure out what part of deployment procedure failed.
    1. Pre-deployment validation and stack creation (TripleO and Heat)
    2. Bare metal scheduling (Nova and Ironic)
    3. Overcloud image deployment (Ironic)
    4. Overcloud node configuration (Heat, Puppet)
  4. Learn the current state: use related services commands to see where we are now.
  5. Grab the logs: look for details in the appropriate logs.

Debugging Bare Metal

General Schema

  1. Self-sanity check: did I follow all the required steps?
  2. Know fragile parts (and when they break).
  3. Divide and conquer: figure out what part of deployment procedure failed.
    1. Pre-deployment validation and stack creation (TripleO and Heat)
    2. Bare metal scheduling (Nova and Ironic)
    3. Overcloud image deployment (Ironic)
    4. Overcloud node configuration (Heat, Puppet)
  4. Learn the current state: use related services commands to see where we are now.
  5. Grab the logs: look for details in the appropriate logs
  6. ...
  7. Complain that OpenStack is too complex

Debugging Bare Metal

Covered in this Talk

  1. Self-sanity check: did I follow all the required steps?
  2. Know fragile parts (and when they break).
  3. Divide and conquer: figure out what part of deployment procedure failed.
    1. Pre-deployment validation and stack creation (TripleO and Heat)
    2. Bare metal scheduling (Nova and Ironic)
    3. Overcloud image deployment (Ironic)
    4. Overcloud node configuration (Heat, Puppet)
  4. Learn the current state: use related services commands to see where we are now.
  5. Grab the logs: look for details in the appropriate logs
  6. ...
  7. Complain that OpenStack is too complex

Debugging Bare Metal

Fragile Parts: Where Things Break

Introspection

  1. The valid Ironic node state for introspection is "manageable", and maintenance should be disabled (see the commands below).
    The TripleO command 'openstack baremetal introspection bulk start' also works with "available" nodes.
    ... yes, it's a TripleO command, it's not related to Ironic Inspector itself!
  2. Ironic Inspector sets up a DHCP+iPXE server listening to requests from bare metal nodes.
    The most common problem with introspection is that nodes cannot reach the DHCP server.
    The reasons include
    1. firewall settings
    2. switch configurations
    3. NIC ordering in BIOS
    4. PXE configuration in BIOS
    5. another DHCP server listening on the provisioning network.
  3. Ancient iPXE ROM (e.g. in CentOS 7).
    The biggest known problem with it is that you can't introspect a node with several NICs on the provisioning network.
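
Before starting introspection, the node state from item 1 can be checked and fixed with the regular Ironic CLI, for example (the node name is a placeholder, and the exact commands may vary slightly between releases):

$ ironic node-show <UUID or NAME> | grep -e provision_state -e maintenance
$ ironic node-set-maintenance <UUID or NAME> off
$ ironic node-set-provision-state <UUID or NAME> manage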

Debugging Bare Metal

Fragile Parts: Where Things Break

Scheduling

  1. Nova contacting Ironic for free node information.
    If this breaks, no nodes will be available for deployment, without any obvious reason.
  2. Ironic node states: only "available" nodes with maintenance disabled are used.
  3. Nova matching requested hardware properties (CPU count and architecture, RAM and disk sizes) from the flavor against Ironic node information.
  4. Nova matching requested capabilities in the flavor against those in Ironic node information.
    In the case of TripleO/RDO-Manager this includes the "boot_option" and "profile" capabilities.
  5. Nova waiting for a hypervisor stats update after some nodes are made available.
    It takes up to 2 minutes for Nova to learn about newly available nodes (see the commands below).
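
A quick way to compare what the flavor requests with what Ironic reports, and to see the aggregate stats Nova currently knows about (the flavor name 'baremetal' is only an example):

$ nova flavor-show baremetal
$ ironic node-show <UUID or NAME> | grep -A2 properties
$ nova hypervisor-stats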

Debugging Bare Metal

Fragile Parts: Where Things Break

Preparing for Deployment

  1. When Ironic fails to deploy, Nova usually cleans everything up and reschedules, meaning you won't get any helpful error messages from it. Check "ironic-conductor" logs if Nova says something like "Exceeded max scheduling attempts 3 for instance <UUID>".
  2. Ironic needs enough disk space to download all images (and sometimes convert them).
  3. Ironic needs access to the bare metal node management interface (BMC) for many tasks during deployment. Some management interface implementations are known to lock up from time to time, resulting in deployment failures and nodes going to maintenance mode.
  4. Ironic has different drivers. Some hardware is supported by more than one of them.
    Make sure to carefully choose a driver to use: vendor-specific drivers (like pxe_drac or pxe_ilo) are usually preferred over more generic ones (like pxe_ipmitool). Unless they don't work :)
  5. Use the 'ironic driver-list' command to ensure that some Ironic conductors support your driver (see the example below).
    And that there are some working Ironic conductors at all.
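
For example ('ironic node-validate' reports whether each driver interface of a node is actually usable, which helps with both of the last two points):

$ ironic driver-list
$ ironic node-validate <UUID or NAME>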

Debugging Bare Metal

Fragile Parts: Where Things Break

Deploying

  1. A bare metal node needs to get an address via DHCP from the Neutron DHCP server and PXE-boot from the Ironic conductor.
  2. The ancient iPXE ROM (e.g. in CentOS 7) may get in the way again.
  3. Setting up the bootloader is known to break sometimes.
    It sometimes helps to wipe the existing hard drive partitioning before deploying.
    If nothing works, disable local boot by removing the "boot_option" capability from both the node properties and the flavor (see the sketch below).
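
A rough sketch of that last resort, assuming a flavor called 'baremetal' and a node whose only remaining capability should be 'profile:compute' (node capabilities are stored as a single comma-separated string, so removing one means rewriting the whole string):

$ ironic node-update <UUID or NAME> replace properties/capabilities='profile:compute'
$ nova flavor-key baremetal unset capabilities:boot_option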

Debugging Bare Metal

Divide and Conquer

Check Heat Resources

$ heat resource-list overcloud | grep FAIL

Check Instance States

$ nova list
$ nova show <UUID or NAME>

Check Bare Metal Node States

$ ironic node-list
$ ironic node-show <UUID or NAME>

Debugging Bare Metal

Checking Heat Resources

Questions to ask

  1. Do we have a Heat stack at all?
    Answer "no" means that the pre-deployment validation has failed.
  2. Do we have failed resources?
    Answer "no" means that the deployment was successful, or we have a bug in TripleO.
  3. What type are the failed resources? (see the commands below)
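
These questions map to the following commands ('overcloud' is the default TripleO stack name; 'resource-show' displays the failure reason for a specific resource):

$ heat stack-list
$ heat resource-list overcloud | grep FAIL
$ heat resource-show overcloud <RESOURCE NAME>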

Debugging Bare Metal

Checking Instance States

Questions to ask

  1. Do we have Nova instances at all?
    Answer "no" means a failure at an earlier stage, probably during validation.
  2. What is the instance status?
    Answer "ERROR" means that deployment of that particular instance has failed (the commands below show how to find the fault message).

Debugging Bare Metal

Checking Bare Metal Node States

Questions to ask

  1. What is the bare metal node's provisioning state?
    Answer "available" means the node is ready to be deployed on.
    Answer "error" means the node deployment has failed.
    Answer "manageable" means the node was not designated for deployment.
    Answer "active" means the node was successfully deployed.
  2. What is the value of the bare metal node's maintenance flag?
    Answer "True" means that the node can't be used for deployment and probably requires operator intervention.
  3. What is the bare metal node's instance UUID?
    If one is assigned to a non-active node, it is an issue that requires manual intervention.
  4. Did all nodes deploy successfully?
    Answer "yes" means that the failure in question happened during the configuration step (see the commands below).

Debugging Bare Metal

Looking for Logs

  1. The system journal stores the latest logs and is easier to analyse.
  2. Logs for a longer period of time can be found as text files in /var/log (see the last example below).

How To

Get all recent errors from the Nova compute process (responsible for interacting with Ironic):

journalctl -u openstack-nova-compute --since '1 day ago' | grep ERROR

Get all Ironic and Nova log records concerning the node with a given UUID:

journalctl -u openstack-ironic-conductor -u openstack-nova-compute | grep 03731c5b-c72a-419e-9716-a60755019519

Get all introspection logs from Inspector (including DHCP) merged with Ironic logs:

journalctl -u openstack-ironic-conductor -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq

Get all introspection logs without noisy debug and iptables logs:

journalctl -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq | grep -v iptables | grep -v DEBUG
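
Search the plain-text logs under /var/log when the journal does not go back far enough (the paths below are the usual RDO defaults and may differ on your system):

grep ERROR /var/log/nova/nova-compute.log
grep 03731c5b-c72a-419e-9716-a60755019519 /var/log/ironic/ironic-conductor.log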

Note

Usage of "sudo" or root access is assumed in all commands above.

Case Study: Management Failure

Debugging Bare Metal

Checking Heat Resources

$ heat resource-list overcloud | grep FAIL
| Compute                                   | 31438bde-6ffe-493e-a424-e6ab98d4c4d1          | OS::Heat::ResourceGroup | CREATE_FAILED   | 2016-01-15T15:11:34 |
| Controller                                | bb1648c5-5d9a-4465-bda0-064efe1a3c2d          | OS::Heat::ResourceGroup | CREATE_FAILED   | 2016-01-15T15:11:34 |

Findings

Both Compute and Controller roles have failed to deploy.

Other resources are probably not related to bare metal deployment.

Debugging Bare Metal

Checking Instance States

$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| ID                                   | Name                    | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| aa27438e-e7ea-4746-92f8-dc394b6abb55 | overcloud-controller-0  | ERROR  | -          | NOSTATE     |          |
| ded02e79-6c9e-416a-a4df-e97e0fb66f6e | overcloud-novacompute-0 | ERROR  | -          | NOSTATE     |          |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
$ nova show overcloud-novacompute-0 | grep ' fault '
| fault                                | {"message": "No valid host was found.
  There are not enough hosts available.", "code": 500, "details": "
  File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 739, in build_instances |

Findings

"No valid host found" means one of:

  1. Nova unable to find a suitable bare metal node
  2. All matching nodes were tried, and all failed to deploy

Debugging Bare Metal

Checking Bare Metal Node States

$ ironic node-list
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name   | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| 03731c5b-c72a-419e-9716-a60755019519 | node-0 | None          | None        | available          | True        |
| 7c9d7690-45f8-4f7d-a6d1-0ecddf6bd6ec | node-1 | None          | None        | available          | True        |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+

Findings

Maintenance mode is set on a node to mark it as requiring operator intervention.

A node goes to maintenance mode automatically if Ironic is no longer able to connect to its management interface.

The empty power state field supports this hypothesis.

Debugging Bare Metal

Checking Bare Metal Node States

$ ironic node-list
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name   | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| 03731c5b-c72a-419e-9716-a60755019519 | node-0 | None          | None        | available          | True        |
| 7c9d7690-45f8-4f7d-a6d1-0ecddf6bd6ec | node-1 | None          | None        | available          | True        |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
$ ironic node-show node-0 | grep maintenance_reason -A3
| maintenance_reason     | During sync_power_state, max retries exceeded for node 03731c5b-c72a-  |
|                        | 419e-9716-a60755019519, node state None does not match expected state  |
|                        | 'power off'. Updating DB state to 'None' Switching node to maintenance |
|                        | mode.

Findings

Failed "sync_power_state" operation means that Ironic is unable to connect to node management interface.

  1. Check that the power credentials are still valid.
  2. Ensure that the node's BMC has not locked up; reset it if needed.
  3. Research the logs for more details on the problem.
    E.g. sometimes ipmitool crashes with IPMI v2. The workaround is to switch to IPMI v1.5 or to a vendor-specific driver (see the example below).
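
To rule Ironic itself out, the BMC can also be checked with ipmitool directly (address and credentials are placeholders; '-I lanplus' uses IPMI v2.0, '-I lan' falls back to v1.5):

$ ipmitool -I lanplus -H <BMC address> -U <user> -P <password> power status
$ ipmitool -I lan -H <BMC address> -U <user> -P <password> power status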

Case Study: Stack Deletion with a Failed Node

Debugging Bare Metal

Symptoms

Sometimes operators have to physically remove a broken node from the cloud. In this case manual cleanup may be required, as stack deletion won't work:

$ heat stack-delete overcloud
$ heat stack-list
+--------------------------------------+------------+---------------+---------------------+--------------+
| id                                   | stack_name | stack_status  | creation_time       | updated_time |
+--------------------------------------+------------+---------------+---------------------+--------------+
| 8223f727-d618-4776-9c7e-7f8924bc2d9e | overcloud  | DELETE_FAILED | 2016-01-15T15:11:34 | None         |
+--------------------------------------+------------+---------------+---------------------+--------------+

Debugging Bare Metal

Checking Node States

$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| ID                                   | Name                    | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| 0bfae06b-9c1c-4559-a033-c08bead6c884 | overcloud-novacompute-0 | ERROR  | -          | Running     |          |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
$ nova show overcloud-novacompute-0 | grep ' fault '
| fault                                | {"message": "Failed to validate power driver interface. Can not delete instance.
Error: SSH connection cannot be established: Failed to establish SSH connection to host 192.168.122.1. (HTTP 500)",
"code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 366, in decorated_function |
$ ironic node-list
+--------------------------------------+--------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name   | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------+--------------------------------------+-------------+--------------------+-------------+
| 03731c5b-c72a-419e-9716-a60755019519 | node-0 | None                                 | power off   | available          | False       |
| 7c9d7690-45f8-4f7d-a6d1-0ecddf6bd6ec | node-1 | 0bfae06b-9c1c-4559-a033-c08bead6c884 | None        | active             | True        |
+--------------------------------------+--------+--------------------------------------+-------------+--------------------+-------------+

Findings

We have removed the node from operation, but Ironic and Nova both still remember it and try to clean up the instance. We need to clear this instance record before we can delete the stack.

Debugging Bare Metal

The Solution

$ ironic node-set-maintenance node-1 on
$ ironic node-update node-1 remove instance_uuid
$ ironic node-list
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name   | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| 03731c5b-c72a-419e-9716-a60755019519 | node-0 | None          | power off   | available          | False       |
| 7c9d7690-45f8-4f7d-a6d1-0ecddf6bd6ec | node-1 | None          | None        | active             | True        |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
$ heat stack-delete overcloud

You may have to repeat the last command several times.

Debugging Bare Metal

Ask for help (and complain)

  1. IRC: #openstack-ironic on Freenode.
  2. Bug tracker: https://bugs.launchpad.net/ironic



Thank you for your attention

http://dtantsur.github.io/talks/fosdem2016/