April 25, 2019

Azure Linux – troubleshooting the “no boot/no ssh” scenario

Azure has quietly released some new features to help Azure administrators troubleshoot the “no boot/no ssh” scenario.

Efforts in this direction started about two years ago with the public release of the “Serial Console” feature, which finally allowed customers to reach their VMs from the Azure Portal without needing an SSH connection.

This was followed by a nice blog post explaining how the feature can be used and how to prepare your VMs for it.

Meanwhile, it looks like some vendors have caught up with this option, and the new images available in Azure Marketplace are already configured to use it.

In addition, for some time now I’ve noticed new options in the Azure Portal. They had already been available through scripting for a while, and now you finally have them at the click of a button:

  • the “Swap OS disk” option
  • the “Encryption” option, which lets you easily apply encryption to your disks – we will discuss this on another occasion.

So, what is all this good for?

The “Swap OS disk” option is one of the more useful tools for troubleshooting the “no boot/no ssh” scenario.

So, how does it work?

If you can’t access your VM, or it no longer boots correctly, using “Swap OS disk” takes a few simple steps:

  1. Create a rescue VM, as similar as possible to the affected VM.
  2. Create a snapshot of the affected VM’s OS disk.
  3. From the snapshot, deploy a new disk (make sure it has the same characteristics as the original OS disk on the affected VM).
  4. Attach the new disk to the rescue VM.
  5. Investigate and solve the issue on the attached copy of the affected disk.
  6. Once the issue is found and fixed, detach the disk from the rescue VM.
  7. Use the “Swap OS disk” option to replace the original OS disk on the affected VM with the one you just fixed on the rescue VM.
  8. Reboot the affected VM.
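
The steps above can also be sketched with the Azure CLI. This is only a rough sketch under a few assumptions: the rescue VM from step 1 already exists in the same resource group, the disks are managed disks, and the affected VM has to be deallocated before its OS disk can be swapped. The names rescue_VM_name, the "-os-snap" and "-os-fixed" suffixes, and the run helper are made up for illustration. With DRY_RUN=1 (the default here) the functions only print the commands they would run.

```shell
# Placeholders: replace with your own names.
DRY_RUN="${DRY_RUN:-1}"           # 1 = only print commands, 0 = execute them
RG="affected_vm_RG_name"          # resource group of the affected VM
VM="affected_VM_name"             # the VM that no longer boots
RESCUE_VM="rescue_VM_name"        # hypothetical rescue VM created in step 1
SNAPSHOT="${VM}-os-snap"          # hypothetical snapshot name
FIXED_DISK="${VM}-os-fixed"       # hypothetical name of the disk copy

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 2: snapshot the OS disk of the affected VM.
snapshot_os_disk() {
  local os_disk_id="<os-disk-id>"   # stand-in shown in dry-run output
  if [ "$DRY_RUN" != "1" ]; then
    os_disk_id="$(az vm show -g "$RG" -n "$VM" \
      --query "storageProfile.osDisk.managedDisk.id" -o tsv)"
  fi
  run az snapshot create -g "$RG" -n "$SNAPSHOT" --source "$os_disk_id"
}

# Steps 3-4: create a disk from the snapshot and attach it to the rescue VM.
attach_copy_to_rescue_vm() {
  run az disk create -g "$RG" -n "$FIXED_DISK" --source "$SNAPSHOT"
  run az vm disk attach -g "$RG" --vm-name "$RESCUE_VM" --name "$FIXED_DISK"
}

# Steps 6-8: detach the fixed disk, swap it in as the OS disk, start the VM.
swap_in_fixed_disk() {
  run az vm disk detach -g "$RG" --vm-name "$RESCUE_VM" --name "$FIXED_DISK"
  run az vm deallocate -g "$RG" -n "$VM"   # assumption: swap needs a deallocated VM
  run az vm update -g "$RG" -n "$VM" --os-disk "$FIXED_DISK"
  run az vm start -g "$RG" -n "$VM"
}
```

Call snapshot_os_disk and attach_copy_to_rescue_vm first, fix the disk on the rescue VM (step 5), then call swap_in_fixed_disk. Review the printed commands before setting DRY_RUN=0.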

Sounds easy, no? Every step above except step 5, which of course means working on the copy of the affected VM’s OS disk from the rescue VM, can be performed from the Azure Portal with just your mouse.
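
For step 5, the attached copy usually has to be mounted on the rescue VM before you can look at its files. A minimal sketch, assuming the copy shows up as /dev/sdc1 and that /rescue is a free mount point; both are guesses, so check the output of lsblk for the real device name. The run helper is again hypothetical and only prints the commands while DRY_RUN=1.

```shell
# Run on the rescue VM. DEVICE and MOUNTPOINT are assumptions, not fixed values.
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print commands, 0 = execute them
DEVICE="${DEVICE:-/dev/sdc1}"         # partition of the attached disk copy (check lsblk)
MOUNTPOINT="${MOUNTPOINT:-/rescue}"   # where the copy will be mounted

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List block devices, then mount the copy so its files can be inspected.
mount_affected_disk() {
  run lsblk
  run sudo mkdir -p "$MOUNTPOINT"
  # Note: an XFS copy that shares its UUID with the rescue VM's own disk
  # may additionally need the "-o nouuid" mount option.
  run sudo mount "$DEVICE" "$MOUNTPOINT"
}

# Unmount again before detaching the disk from the rescue VM.
unmount_affected_disk() {
  run sudo umount "$MOUNTPOINT"
}
```

Once mounted, the files of the affected system are available under the mount point, for example /rescue/etc/fstab.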

But to make it even faster, there is the “az vm repair” command.

The process is:

1. Open Cloud Shell from your Azure Portal page (enable it if needed) and switch to “Bash”

2. Install the “vm-repair” extension:

az extension add -n vm-repair

3. Create a rescue VM and attach a copy of the affected VM OS disk to the rescue VM:

az vm repair create -g "affected_vm_RG_name" -n "affected_VM_name"

The command above copies the OS disk of the problematic VM, creates a new rescue VM with the same characteristics as the original VM, attaches the disk copy to it, and prompts you for a username and password for the repair VM (12 characters minimum, with at least one uppercase character and one number).
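
If you would rather not be prompted, the credentials can be passed on the command line with --repair-username and --repair-password. The sketch below first checks the password against the rule quoted above; the function names are made up for illustration, and Azure’s real password policy may be stricter than this simple check.

```shell
RG="affected_vm_RG_name"   # placeholder: resource group of the affected VM
VM="affected_VM_name"      # placeholder: name of the affected VM

# Returns 0 when the password has at least 12 characters,
# one uppercase letter and one digit (the rule as quoted in this post).
valid_repair_password() {
  local pw="$1"
  [ "${#pw}" -ge 12 ] || return 1
  case "$pw" in *[[:upper:]]*) ;; *) return 1 ;; esac
  case "$pw" in *[[:digit:]]*) ;; *) return 1 ;; esac
  return 0
}

# Creates the repair VM without prompting, after validating the password.
create_repair_vm() {
  local user="$1" pw="$2"
  if ! valid_repair_password "$pw"; then
    echo "password does not meet the 12 characters/uppercase/number rule" >&2
    return 1
  fi
  az vm repair create -g "$RG" -n "$VM" \
    --repair-username "$user" --repair-password "$pw" --verbose
}
```

A call would look like create_repair_vm rescueadmin 'YourPasswordHere1', with both values being your own choice. Keep in mind that a password passed this way may end up in your shell history.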

4. Now you can connect to the rescue VM and fix the issue. Once the problem is solved, you can again take advantage of the extension and run “az vm repair restore”, instead of using the “Swap OS disk” API, to swap the OS disk back:

az vm repair restore -g "affected_vm_RG_name" -n "affected_VM_name"

The command above automatically finds the rescue VM and uses the data disk attached to it to replace the OS disk on the affected VM, then prompts you to delete the rescue VM and its related resources.

So, instead of searching for buttons to click in the Azure Portal, you only need to know the VM name and resource group: run 3 simple commands with Azure CLI 2.0 in Azure Cloud Shell and your repair environment is up and ready in 2-3 minutes.
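
As a rough illustration, the three commands can be wrapped in one small helper with a create and a restore mode. The repair_flow and run names are made up for this sketch, and the resource group and VM names are placeholders. With DRY_RUN=1 (the default here) it only prints what it would run.

```shell
DRY_RUN="${DRY_RUN:-1}"    # 1 = only print commands, 0 = execute them
RG="affected_vm_RG_name"   # placeholder: resource group of the affected VM
VM="affected_VM_name"      # placeholder: name of the affected VM

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# "create" installs the extension and builds the rescue environment;
# "restore" swaps the fixed disk back into the affected VM.
repair_flow() {
  case "$1" in
    create)
      run az extension add -n vm-repair
      run az vm repair create -g "$RG" -n "$VM" --verbose
      ;;
    restore)
      run az vm repair restore -g "$RG" -n "$VM" --verbose
      ;;
    *)
      echo "usage: repair_flow create|restore" >&2
      return 2
      ;;
  esac
}
```

Run repair_flow create, fix the disk on the rescue VM, then run repair_flow restore.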

You now have new tools available for troubleshooting, so enjoy using them if you ever need them!


Later edit:

I worked together with a friend on an Azure CLI 2.0 script that makes this even easier for you: https://github.com/marinnedea/Repair-and-Restore-VM

The script simply prompts you for the subscription ID, the resource group name and the VM name of the affected VM, then creates the rescue environment for you and tells you the IP of the rescue VM, so you can easily SSH into it and continue troubleshooting the “no boot/no ssh” scenario.

Once done, you just need to call the same script and tell it you wish to restore the VM. It will use the data already provided during the rescue VM creation phase, so you don’t need to provide that information again.

Enjoy!

Marin Nedea
I'm passionate about open source software and technologies. In my spare time I build simple and functional websites from scratch, using PHP+HTML5+CSS3+MySQL and when I'm bored, I write simple PHP_CLI or bash scripts to play around on my Linux machine.
