Azure quietly released some new features to help everyday Azure administrators troubleshoot the “no boot/no ssh” scenario.
Efforts in this direction started about 2 years ago with the public release of the “Serial Console” feature, which finally allowed customers to reach their VMs from the Azure Portal without needing an SSH connection.
This was continued with a nice blog post explaining how this new feature can be used and how to prepare your VMs to use it.
Meanwhile, it looks like some of the vendors have caught up with this option, and the new images available in Azure Marketplace are already configured to make use of it.
For a while now, however, I’ve seen a new option in the Azure Portal. It had already been available through scripting for some time, and now it is finally just a click of a button away.
So, what is all this good for?
The “Swap OS disk” and the “az vm repair” options are useful in troubleshooting the “no boot/no ssh” scenario.
So, how does it work?
For the “Swap OS disk”, in case you can’t access your VM or your VM no longer boots properly, you need to follow some simple steps:
Sounds easy, no? All of the above steps, except step 5 (which, of course, requires working on the copy of the affected VM’s OS disk from the rescue VM), can be performed from the Azure Portal using nothing but your mouse.
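Since the swap has been scriptable for a while, here is a minimal sketch of what the swap step can look like from the Azure CLI. It assumes the repaired disk is a managed disk and that the VM should be deallocated before the swap. The function name swap_os_disk and the AZ variable are my own, used only for illustration. Setting AZ="echo az" gives you a dry run that just prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: swap the OS disk of a VM for an already repaired managed disk.
# AZ is a hypothetical override; set AZ="echo az" to print instead of run.
AZ="${AZ:-az}"

swap_os_disk() {
    # $1 = resource group, $2 = VM name, $3 = name or ID of the repaired disk
    if [ "$#" -ne 3 ]; then
        echo "usage: swap_os_disk <resource-group> <vm-name> <disk>" >&2
        return 1
    fi
    # Deallocate the VM first (assumption: the swap needs a stopped VM),
    # then point the VM at the repaired disk and start it again.
    $AZ vm deallocate -g "$1" -n "$2" &&
    $AZ vm update -g "$1" -n "$2" --os-disk "$3" &&
    $AZ vm start -g "$1" -n "$2"
}
```

For a dry run, call it like this: AZ="echo az" followed by swap_os_disk "affected_vm_RG_name" "affected_VM_name" "repaired_disk_name".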
But to make it even faster, we have the “az vm repair” option.
The process is:
1. Open the CloudShell on your Azure Portal page (enable it if needed) and switch to “Bash”
2. Enable the “vm-repair” extension
az extension add -n vm-repair
3. Create a rescue VM and attach a copy of the affected VM OS disk to the rescue VM:
az vm repair create -g "affected_vm_RG_name" -n "affected_VM_name"
The above copies the OS disk of the problematic VM, creates a new rescue VM with the same characteristics as the original VM, and attaches the disk copy to it. It also prompts you for a username and password for the repair VM (12 characters minimum, at least one upper case character and a number required).
4. Now you can connect to the rescue VM and simply fix the issue. Once the problem is solved, you can use “az vm repair restore” instead of the “Swap OS disk” option to swap the OS disk back:
az vm repair restore -g "affected_vm_RG_name" -n "affected_VM_name"
The above automatically looks up the rescue VM and uses the data disk attached to it to replace the OS disk on the affected VM, then prompts you to delete the rescue VM and its related resources.
So, instead of searching for buttons to click in the Azure Portal, you only need to know the VM name and Resource Group: run 3 simple commands via AzCLI 2.0 in the Azure CloudShell and your repair environment is up and ready in a matter of 2-3 minutes.
You now have new tools available for your troubleshooting, so enjoy using them if you ever need them!
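To keep the whole flow in one place, the commands above can be wrapped in a couple of shell functions. This is only a sketch: the function names, the run helper and the DRY_RUN switch are my own additions, and RG and VM hold the same placeholder values used above. With DRY_RUN=1 (the default here) nothing is executed; the commands are only printed so you can review them first.

```shell
#!/usr/bin/env bash
# Sketch: wrap the "az vm repair" steps in reusable functions.
# RG, VM and DRY_RUN are illustrative variables, not part of the az CLI.
RG="${RG:-affected_vm_RG_name}"
VM="${VM:-affected_VM_name}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it as given.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repair_create() {
    # Steps 2 and 3: add the extension, then build the rescue environment.
    run az extension add -n vm-repair &&
    run az vm repair create -g "$RG" -n "$VM"
}

repair_restore() {
    # Step 4: swap the fixed disk back onto the affected VM.
    run az vm repair restore -g "$RG" -n "$VM"
}
```

Set DRY_RUN=0 once the printed commands look right, call repair_create, fix the disk from the rescue VM, then call repair_restore.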
I worked together with a friend on an AzCLI 2.0 script that makes this even easier for you: https://github.com/marinnedea/Repair-and-Restore-VM
The script simply prompts you for the subscription ID, the Resource Group name and the VM name of the affected VM, then creates the rescue environment for you and tells you the IP of the rescue VM, so you can easily ssh into it and continue troubleshooting the “no boot/no ssh” scenario.
Once done, you just need to call the same script again and tell it you wish to restore the VM. It will reuse the data you provided during the rescue VM creation phase, so you don’t need to enter that information again.
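If you are curious how a script can remember your answers between the two runs, one simple approach is a small state file. The sketch below is not taken from the linked repository. The file location, the function names and the variable names are all hypothetical, and the real script may store its data differently.

```shell
#!/usr/bin/env bash
# Sketch: persist the inputs of the "create" run for the later "restore" run.
# STATE_FILE is a hypothetical location chosen for this example.
STATE_FILE="${STATE_FILE:-$HOME/.vm-repair-state}"

save_state() {
    # $1 = subscription ID, $2 = resource group, $3 = VM name
    printf '%s\n%s\n%s\n' "$1" "$2" "$3" > "$STATE_FILE"
}

load_state() {
    # Read the three saved lines back into SUB_ID, RG and VM.
    # Returns non-zero if no rescue run was recorded.
    [ -f "$STATE_FILE" ] || return 1
    { read -r SUB_ID; read -r RG; read -r VM; } < "$STATE_FILE"
}
```

The create phase would call save_state after prompting, and the restore phase would call load_state and fall back to prompting only if it fails.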