$ heat resource-list overcloud | grep FAIL
$ nova list
$ nova show <UUID or NAME>
$ ironic node-list
$ ironic node-show <UUID or NAME>
Get all recent errors from the Nova compute process (responsible for interacting with Ironic):
journalctl -u openstack-nova-compute --since '1 day ago' | grep ERROR
Get all Ironic and Nova log records concerning the node represented by a given UUID:
journalctl -u openstack-ironic-conductor -u openstack-nova-compute | grep 03731c5b-c72a-419e-9716-a60755019519
Get all introspection logs from Inspector (including DHCP) merged with Ironic logs:
journalctl -u openstack-ironic-conductor -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq
Get all introspection logs without noisy debug and iptables logs:
journalctl -u openstack-ironic-inspector -u openstack-ironic-inspector-dnsmasq | grep -v iptables | grep -v DEBUG
Root access or use of "sudo" is assumed for all commands above.
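During an active deployment it can also help to follow these logs live instead of querying them afterwards; a minimal example using the same systemd units:
journalctl -f -u openstack-ironic-conductor -u openstack-nova-compute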
$ heat resource-list overcloud | grep FAIL
| Compute    | 31438bde-6ffe-493e-a424-e6ab98d4c4d1 | OS::Heat::ResourceGroup | CREATE_FAILED | 2016-01-15T15:11:34 |
| Controller | bb1648c5-5d9a-4465-bda0-064efe1a3c2d | OS::Heat::ResourceGroup | CREATE_FAILED | 2016-01-15T15:11:34 |
Both the Compute and Controller roles have failed to deploy.
Other resources are probably unrelated to the bare metal deployment.
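To see which nested resources inside these resource groups actually failed, the listing can be recursed; a sketch, assuming a heatclient version that supports the nested-depth option (the depth value is arbitrary):
$ heat resource-list -n 5 overcloud | grep FAIL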
$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| aa27438e-e7ea-4746-92f8-dc394b6abb55 | overcloud-controller-0 | ERROR | - | NOSTATE | |
| ded02e79-6c9e-416a-a4df-e97e0fb66f6e | overcloud-novacompute-0 | ERROR | - | NOSTATE | |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
$ nova show overcloud-novacompute-0 | grep ' fault '
| fault | {"message": "No valid host was found.
There are not enough hosts available.", "code": 500, "details": "
File \"/usr/lib/python2.7/site-packages/nova/conductor/manager.py\", line 739, in build_instances |
"No valid host found" means one of:
$ ironic node-list
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| 03731c5b-c72a-419e-9716-a60755019519 | node-0 | None | None | available | True |
| 7c9d7690-45f8-4f7d-a6d1-0ecddf6bd6ec | node-1 | None | None | available | True |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
Maintenance mode is set on a node to mark it as requiring an operator's intervention.
A node goes into maintenance mode automatically if Ironic is no longer able to connect to its management interface.
The empty power state field supports this hypothesis.
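To confirm that the management interface is indeed unreachable, the node's driver interfaces can be validated; a minimal check using the node name from the listing above:
$ ironic node-validate node-0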
$ ironic node-show node-0 | grep maintenance_reason -A3
| maintenance_reason | During sync_power_state, max retries exceeded for node 03731c5b-c72a- |
| | 419e-9716-a60755019519, node state None does not match expected state |
| | 'power off'. Updating DB state to 'None' Switching node to maintenance |
| | mode.
Failed "sync_power_state" operation means that Ironic is unable to connect to node management interface.
Sometimes operators have to physically remove a broken node from the cloud. In this case manual clean-up may be required, as stack deletion won't work on its own:
$ heat stack-delete overcloud
$ heat stack-list
+--------------------------------------+------------+---------------+---------------------+--------------+
| id | stack_name | stack_status | creation_time | updated_time |
+--------------------------------------+------------+---------------+---------------------+--------------+
| 8223f727-d618-4776-9c7e-7f8924bc2d9e | overcloud | DELETE_FAILED | 2016-01-15T15:11:34 | None |
+--------------------------------------+------------+---------------+---------------------+--------------+
$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
| 0bfae06b-9c1c-4559-a033-c08bead6c884 | overcloud-novacompute-0 | ERROR | - | Running | |
+--------------------------------------+-------------------------+--------+------------+-------------+----------+
$ nova show overcloud-novacompute-0 | grep ' fault '
| fault | {"message": "Failed to validate power driver interface. Can not delete instance.
Error: SSH connection cannot be established: Failed to establish SSH connection to host 192.168.122.1. (HTTP 500)",
"code": 500, "details": " File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 366, in decorated_function |
$ ironic node-list
+--------------------------------------+--------+--------------------------------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------+--------------------------------------+-------------+--------------------+-------------+
| 03731c5b-c72a-419e-9716-a60755019519 | node-0 | None | power off | available | False |
| 7c9d7690-45f8-4f7d-a6d1-0ecddf6bd6ec | node-1 | 0bfae06b-9c1c-4559-a033-c08bead6c884 | None | active | True |
+--------------------------------------+--------+--------------------------------------+-------------+--------------------+-------------+
We have removed a node from operation, but Ironic and Nova both still remember it and try to clean up the instance. The instance association has to be removed from the node before the stack can be deleted:
$ ironic node-set-maintenance node-1 on
$ ironic node-update node-1 remove instance_uuid
$ ironic node-list
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
| 03731c5b-c72a-419e-9716-a60755019519 | node-0 | None | power off | available | False |
| 7c9d7690-45f8-4f7d-a6d1-0ecddf6bd6ec | node-1 | None | None | active | True |
+--------------------------------------+--------+---------------+-------------+--------------------+-------------+
$ heat stack-delete overcloud
You may have to repeat the last command several times.
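If the physically removed node is never coming back, it can also be deleted from Ironic once the stack is gone; a suggestion not shown above (Ironic may refuse if the node is still associated with an instance or not in maintenance mode):
$ ironic node-delete node-1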