Did you ever need to troubleshoot an Azure Linux “no boot/no ssh” scenario?
Azure started making efforts in this direction about 2 years ago, with the public release of the “Serial Console”. As a result, customers were able to reach their VMs from the Azure Portal, without needing an SSH connection.
This was followed by a nice blog post explaining how this new feature can be used and how to prepare your VMs to use it.
The Azure Portal also has some new options: “Swap OS disk” and “Disk Encryption”.
Therefore, what used to be possible only through scripting is now just a click of a button away.
So, what’s the use of all this?
The “Swap OS disk” is one of the useful options for troubleshooting the “no boot/no ssh” scenario.
How does that work?
For the “Swap OS disk”, in case you can’t access your VM or your VM is no longer booting properly, you need to follow some simple steps:
1. Create a rescue VM, as similar as possible to the originally affected VM.
2. Create a snapshot from the affected VM OS disk.
3. From the snapshot, deploy a new disk (make sure it has the same characteristics as the original OS disk on the affected VM).
4. Attach the new disk to the rescue VM.
5. Investigate and solve the issue on the affected disk.
6. Once the issue is found and fixed, detach the disk from the rescue VM.
7. Use the “Swap OS disk” option to replace the original OS disk on the affected VM with the one you just fixed on the rescue VM.
8. Reboot the affected VM.
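If you prefer the command line over the Portal, the steps above map roughly onto the Azure CLI calls below. This is a minimal sketch, not a tested procedure. The resource names are placeholders, the rescue VM (step 1) is assumed to already exist in the same resource group and region, and the affected VM is assumed to use a managed OS disk. By default the script only prints the commands. Set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch of the manual "Swap OS disk" flow with the Azure CLI.
# RG, VM, RESCUE_VM, SNAPSHOT and NEW_DISK are placeholder names (assumptions).
RG="${RG:-affected_vm_RG_name}"
VM="${VM:-affected_VM_name}"
RESCUE_VM="${RESCUE_VM:-rescue_VM_name}"
SNAPSHOT="${SNAPSHOT:-${VM}-os-snapshot}"
NEW_DISK="${NEW_DISK:-${VM}-os-fixed}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Steps 2-4: snapshot the OS disk, create a disk from it, attach it to the rescue VM.
prepare_rescue_disk() {
  local os_disk_id
  if [ "$DRY_RUN" = "1" ]; then
    os_disk_id="<os-disk-id>"   # not looked up in dry-run mode
  else
    os_disk_id="$(az vm show -g "$RG" -n "$VM" \
      --query "storageProfile.osDisk.managedDisk.id" -o tsv)"
  fi
  run az snapshot create -g "$RG" -n "$SNAPSHOT" --source "$os_disk_id"
  run az disk create -g "$RG" -n "$NEW_DISK" --source "$SNAPSHOT"
  run az vm disk attach -g "$RG" --vm-name "$RESCUE_VM" --name "$NEW_DISK"
}

# Steps 6-8: detach the fixed disk, swap it in as the OS disk, start the affected VM.
swap_fixed_disk() {
  run az vm disk detach -g "$RG" --vm-name "$RESCUE_VM" --name "$NEW_DISK"
  # Deallocating before the swap is the conservative choice.
  run az vm deallocate -g "$RG" -n "$VM"
  run az vm update -g "$RG" -n "$VM" --os-disk "$NEW_DISK"
  run az vm start -g "$RG" -n "$VM"
}
```

Call prepare_rescue_disk first, fix the disk on the rescue VM (step 5), then call swap_fixed_disk.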
Sounds easy, no? All the above steps except step 5 can be performed from the Azure Portal with just your mouse. Step 5, of course, requires working on the copy of the affected VM OS disk from inside the rescue VM.
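For step 5, the copied disk usually has to be mounted on the rescue VM before you can inspect it. The sketch below is one way to do that, under assumptions you should verify with `lsblk` first: the device name /dev/sdc1 and the mount point /rescue are only examples. One known pitfall is XFS: if the rescue VM was built from the same image, both root filesystems can share a UUID, and the mount then needs the nouuid option.

```shell
# Run on the rescue VM. DEVICE and MOUNTPOINT are assumptions: check `lsblk` first.
DEVICE="${DEVICE:-/dev/sdc1}"
MOUNTPOINT="${MOUNTPOINT:-/rescue}"

# Pick mount options by filesystem type.
# XFS refuses to mount a second filesystem with a duplicate UUID unless nouuid is set.
mount_options() {
  case "$1" in
    xfs) echo "nouuid" ;;
    *)   echo "defaults" ;;
  esac
}

# Detect the filesystem type of the copied disk and mount it.
mount_affected_disk() {
  fstype="$(lsblk -no FSTYPE "$DEVICE")"
  sudo mkdir -p "$MOUNTPOINT"
  sudo mount -o "$(mount_options "$fstype")" "$DEVICE" "$MOUNTPOINT"
}

# Unmount the disk before detaching it from the rescue VM.
unmount_affected_disk() {
  sudo umount "$MOUNTPOINT"
}
```

After mount_affected_disk, files such as /rescue/etc/fstab or the logs under /rescue/var/log can be checked and fixed. Run unmount_affected_disk before detaching the disk.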
But to make it even faster, we have the “az vm repair” option.
The process is:
1. Open the Cloud Shell on your Azure Portal page (enable it if needed) and switch to “Bash”
2. Install the “vm-repair” extension:
az extension add -n vm-repair
3. Create a rescue VM and attach a copy of the affected VM OS disk to the rescue VM:
az vm repair create -g "affected_vm_RG_name" -n "affected_VM_name"
The above will copy the OS disk from the problematic VM. It will also create a new rescue VM with the same characteristics as the original VM. In the end, it will attach the copy of the disk to the rescue VM. During the process it will ask you to set a username and a password for the repair VM.
4. Now you can connect to the rescue VM and simply fix the issue. Once the problem is solved, you can again take advantage of “az vm repair restore”, instead of using the “swap os disk” API to swap the OS disk back:
az vm repair restore -g "affected_vm_RG_name" -n "affected_VM_name"
The above will:
- automatically identify the repair VM;
- detach the repaired disk (attached as a data disk) from the repair VM;
- replace the OS disk on the affected VM, using the “swap os disk” feature;
- prompt you to delete the repair VM and related resources.
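The whole “az vm repair” flow can be wrapped in two small shell functions. This is only a sketch: the resource names are placeholders, and REPAIR_USER is an example value passed through the --repair-username option so that only the password is prompted for. By default the commands are printed, not run. Set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Sketch of the "az vm repair" flow. RG, VM and REPAIR_USER are placeholders (assumptions).
RG="${RG:-affected_vm_RG_name}"
VM="${VM:-affected_VM_name}"
REPAIR_USER="${REPAIR_USER:-rescueadmin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Install the extension, then create the repair VM with a copy of the OS disk attached.
repair_create() {
  run az extension add -n vm-repair
  run az vm repair create -g "$RG" -n "$VM" --repair-username "$REPAIR_USER" --verbose
}

# Swap the fixed disk back into the affected VM and offer to clean up the repair resources.
repair_restore() {
  run az vm repair restore -g "$RG" -n "$VM" --verbose
}
```

Run repair_create, fix the issue on the repair VM, then run repair_restore.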
Therefore, instead of searching for buttons to click in the Azure Portal, you can run 3 simple commands via AzCLI 2.0 and be done in a few minutes.
No headache, no complicated Bash or PowerShell commands, no scripts.
You now have new tools available for your troubleshooting, so enjoy using them if you ever need them!
I worked together with a friend on an AzCLI 2.0 script that makes this even easier for you: https://github.com/marinnedea/Repair-and-Restore-VM
The script simply prompts you for the subscription ID, the Resource Group name and the VM name of the affected VM, then creates the rescue environment for you and tells you the IP of the rescue VM, so you can easily ssh into it and continue troubleshooting the “no boot/no ssh” scenario.
Once done, you just need to call the same script and tell it you wish to restore the VM. It will use the data already provided during the rescue VM creation phase, so you don’t need to provide that information again.