VIRL – Timed Out Waiting For A Reply to Message
The Problem
A few days ago, I updated VIRL. It was a standard upgrade, and nothing went wrong. Shortly after that, some of the server guys shut down the host that VIRL lives on, which caused some problems.
One problem occurred when starting a simulation. The simulation would start, but the nodes would stay in the BUILDING state for too long.

After a while, they would time out, and their status would change from BUILDING to ERROR with the message "Timed out waiting for a reply to message ID id-number".

To troubleshoot this, I logged into the UWM console, browsed to VIRL Server -> System Tools, and opened System Operation Check.
Clicking the Run Tests button showed that some services weren't running. This also caused the simulation tests to be skipped.


The failing services were Nova and Neutron services. To dig deeper, I went to VIRL Server -> System Tools and opened the System Console. From there I checked the service status, which was reported as up. I tried restarting the services anyway, just in case.
Rerunning the test still showed an error. The service named in the error changed from run to run, but it was always Nova or Neutron related.

The Solution
The next step is to check the state of these services as OpenStack itself sees them, rather than as the system's service status reports them.
To check Nova services, run nova service-list (don’t use sudo).

Some of the services show as down, although the one I restarted earlier is up. Restarting the remaining services brings them all into the up state.
For Neutron, run neutron agent-list. It shows that some of the agents are not alive; the only one alive is the one I restarted earlier.
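If you would rather not eyeball the table, the Nova check can be scripted. The Python below is a minimal sketch and not part of VIRL: it assumes nova service-list prints its usual ASCII table with Binary, Host and State columns (the exact set of columns varies between novaclient releases). The SAMPLE table is made up for illustration, and the function names are hypothetical.

```python
import subprocess


def parse_table(text):
    """Turn an OpenStack CLI ASCII table into a list of dicts keyed by column header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first |...| line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def down_nova_services(text):
    """Return (binary, host) for every Nova service whose State column reads 'down'."""
    return [(row["Binary"], row["Host"])
            for row in parse_table(text)
            if row.get("State") == "down"]


def live_output(cmd):
    """Run a CLI command and return its stdout.

    Assumes the shell environment already has the OpenStack credentials that
    make the command work for your user (no sudo).
    """
    return subprocess.check_output(cmd, universal_newlines=True)


# Hypothetical output, trimmed to a few columns; not captured from a real server.
SAMPLE = """
+------------------+------+----------+---------+-------+
| Binary           | Host | Zone     | Status  | State |
+------------------+------+----------+---------+-------+
| nova-consoleauth | virl | internal | enabled | down  |
| nova-scheduler   | virl | internal | enabled | down  |
| nova-compute     | virl | nova     | enabled | up    |
+------------------+------+----------+---------+-------+
"""

if __name__ == "__main__":
    # On the VIRL host, live_output(["nova", "service-list"]) could replace SAMPLE.
    for binary, host in down_nova_services(SAMPLE):
        print("down: %s on %s" % (binary, host))
```

The parser looks columns up by header name rather than by position, so extra columns such as Id or Updated_at should not break it.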

Once the rest of the agents are restarted, they are all listed as alive.
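The Neutron check can be scripted too. This is another illustrative Python sketch under stated assumptions: neutron agent-list is assumed to print host, alive and binary columns, with :-) in the alive column for a live agent and xxx for a dead one. The SAMPLE table, its IDs, and the function names are hypothetical.

```python
def parse_table(text):
    """Turn an OpenStack CLI ASCII table into a list of dicts keyed by column header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first |...| line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def dead_neutron_agents(text):
    """Return (binary, host) for every agent whose alive column is not ':-)'."""
    return [(row["binary"], row["host"])
            for row in parse_table(text)
            if row.get("alive") != ":-)"]


# Hypothetical output; IDs shortened to placeholders.
SAMPLE = """
+------+--------------------+------+-------+----------------+---------------------------+
| id   | agent_type         | host | alive | admin_state_up | binary                    |
+------+--------------------+------+-------+----------------+---------------------------+
| id-1 | DHCP agent         | virl | xxx   | True           | neutron-dhcp-agent        |
| id-2 | L3 agent           | virl | xxx   | True           | neutron-l3-agent          |
| id-3 | Linux bridge agent | virl | :-)   | True           | neutron-linuxbridge-agent |
+------+--------------------+------+-------+----------------+---------------------------+
"""

if __name__ == "__main__":
    # On the VIRL host, the text could instead come from:
    # subprocess.check_output(["neutron", "agent-list"], universal_newlines=True)
    for binary, host in dead_neutron_agents(SAMPLE):
        print("dead: %s on %s" % (binary, host))
```

Treating anything other than :-) as dead is deliberately strict: if a client release formats the column differently, agents get flagged for a manual look instead of being silently passed.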
Conclusion
Sometimes the OpenStack services do not come up correctly after an update, and a few reboots may not fix it. Use nova service-list and neutron agent-list to find the services causing the errors, then restart them.
After this, rerun the System Operation Check in UWM and confirm that there are no errors.