Data & System Recovery Considerations on Azure Confidential Computing (ACC)

If you’ve selected Azure Confidential Computing (ACC) for your workload, it’s because you demand the highest level of security available in the cloud for your most sensitive data. This extra security is great, but you should keep in mind a few considerations about how the system works to ensure your system’s availability and avoid potential data loss, especially if you’ve enabled Confidential OS Disk Encryption.
As with all production deployments, you should make sure you have a backup strategy in place and test your backups regularly.
A review of how it all works
Confidential Computing features, such as AMD SEV-SNP and Intel TDX, are enabled when running supported kernels in virtual machines on supported hypervisors. This is handled automatically when launching a Confidential-enabled Rocky Linux from CIQ (RLC) image onto a confidential-supporting instance on Azure.
The process begins the same way as any Trusted Launch (Secure Boot)-enabled boot:
- The UEFI firmware first loads the Secure Boot shim, which was signed by Microsoft and carries its own certificate.
- The shim then verifies and loads your operating system’s kernel and proceeds through the rest of the boot process.
During boot, hashes of these key system components are measured into the Platform Configuration Registers (PCRs) inside the Trusted Platform Module (TPM). Applications can read these PCRs through the TPM’s APIs to verify system integrity, for example during confidential guest attestation or automatic volume decryption.
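To make the measurement step concrete: a PCR cannot be written directly, only extended, so each new value is the hash of the previous value concatenated with the new measurement. The sketch below reproduces that rule for a SHA-256 bank using standard tools. The helper names (`hex_to_bin`, `measure`, `pcr_extend`) and the sample strings are illustrative only, and a real boot records many more events than shown here.

```shell
#!/usr/bin/env bash
# Sketch of the TPM "extend" rule for a SHA-256 PCR bank:
#   new_pcr = SHA256(old_pcr || measurement_digest)
# All function names here are hypothetical helpers, not system commands.

# Turn a hex string into raw bytes on stdout (relies on bash printf \xHH).
hex_to_bin() {
    printf "$(printf '%s' "$1" | sed 's/../\\x&/g')"
}

# Hash an arbitrary string, standing in for measuring a boot component.
measure() {
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# Usage: pcr_extend <old_pcr_hex> <digest_hex>   -> prints the new PCR value
pcr_extend() {
    hex_to_bin "$1$2" | sha256sum | cut -d' ' -f1
}

# A PCR holds all zeros after reset (64 hex digits = 32 bytes).
PCR_ZERO=$(printf '%064d' 0)

# Example:
#   pcr=$(pcr_extend "$PCR_ZERO" "$(measure 'shim')")
#   pcr=$(pcr_extend "$pcr" "$(measure 'kernel')")
```

Because every value depends on the whole chain before it, changing any measured component, or the order in which components are measured, produces a different final PCR value. That property is what the sealing described next relies on.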
The TPM can also be used to generate and store cryptographic keys securely and make them available only in select cases. This comes into use when selecting Confidential OS Disk Encryption, which enables LUKS on your OS partition and configures the system to automatically load the appropriate key from the TPM after a verified boot process. This gives you encryption in use and at rest while allowing your system to boot up normally without having to manually enter a password.
To ensure a secure system, the encryption key is sealed to the initial expected PCR values based on your system’s UEFI firmware, the shim, and the kernel certificates and signatures. This step is performed by the Azure hypervisor during the provisioning process before the first boot so that encryption is ready and running as soon as the system boots the first time.
Making changes to your system’s Secure Boot components, such as replacing your Secure Boot shim, changing kernels, or installing certain types of drivers or modules, could inadvertently change the PCR measurements and cause the automatic unlocking process to fail.
Below we outline these situations and how to avoid or resolve them:
PCR mismatches
During boot, the signatures of the Secure Boot components are measured into the TPM PCRs described above. When using Confidential OS Disk Encryption, at the point where it’s time to open the encrypted volume, systemd queries the TPM for the sealed key.
If the PCRs currently loaded from the most recent boot exactly match the values used to seal the key, everything works seamlessly: the TPM releases the key to the operating system, the LUKS volume unlocks and decrypts automatically, and you complete the boot process. Everything must be exactly the same as when the key was sealed for a successful boot.
If anything at all is different, the TPM won’t release the key. This can happen even if the chain of trust is still valid but differs from what was expected, such as after certain initrd changes, adding additional trusted keys, or making other configuration changes to the Secure Boot chain of trust.
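One way to see which registers a change affected is to record the PCR values while the system is healthy and compare them after a later boot. The sketch below assumes the kernel exposes the SHA-256 bank under `/sys/class/tpm/tpm0/pcr-sha256/` (an assumption about your kernel version; `tpm2_pcrread` from tpm2-tools is an alternative source for the same values). The function names and the snapshot file locations are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: snapshot PCR values and report which indexes changed between boots.
# Assumes the sysfs PCR interface exists; PCR_DIR can be overridden.

PCR_DIR=${PCR_DIR:-/sys/class/tpm/tpm0/pcr-sha256}

# Print one "index value" line for every PCR readable in PCR_DIR.
pcr_snapshot() {
    local i
    for i in $(seq 0 23); do
        [ -r "$PCR_DIR/$i" ] && printf '%s %s\n' "$i" "$(cat "$PCR_DIR/$i")"
    done
    return 0
}

# Usage: pcr_changed <baseline_file> <current_file>
# Prints the index of every PCR whose value differs from the baseline.
pcr_changed() {
    awk 'NR == FNR { base[$1] = $2; next }
         ($1 in base) && base[$1] != $2 { print $1 }' "$1" "$2"
}

# Example:
#   pcr_snapshot > /root/pcr-baseline.txt        # while the system boots cleanly
#   pcr_snapshot > /tmp/pcr-now.txt              # after a boot that needed the passphrase
#   pcr_changed /root/pcr-baseline.txt /tmp/pcr-now.txt
```

An empty result from `pcr_changed` means the compared registers are unchanged; any index printed points at the class of component that was modified.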
If the PCR values don’t match, your VM will enter a running state but won’t be accessible by normal means such as SSH. In the Azure Serial Console for your instance, you’ll see a prompt for a fallback passphrase.
Enter the fallback passphrase to unlock the volume and finish booting normally. You can then reseal the keys or revert your configuration change to resolve the problem.
Very important:
If you’re using a customer-managed key backed by an Azure Premium HSM, you might have this passphrase handy, depending on how you created and loaded your keys.
If you’ve selected a platform-managed key, however, that passphrase is never shown – you would be completely locked out of the system forever if this happens. Fortunately, you can create your own fallback passphrase to avoid being locked out. This will let you continue the boot process from the serial console in the event of an error.
CIQ strongly recommends creating your own fallback passphrase to ensure you have emergency access to your system. Do this as one of your first tasks after launching a new instance.
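Add your own fallback passphrase to the encrypted volume using this command, which unlocks the volume through its existing TPM token to authorize the new passphrase: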
cryptsetup luksAddKey /dev/sdX --token-type systemd-tpm2
The value for /dev/sdX is the partition that holds the LUKS volume; in lsblk output, it’s the parent of the device with type crypt. Enter and confirm your new secure passphrase, then print it out and store it in a secure location like a safe.
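If it is unclear which device to pass, the LUKS container can be picked out of `lsblk` output by its filesystem type, which is reported as `crypto_LUKS`. This is a minimal sketch that assumes a single encrypted OS partition; `luks_partitions` is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Sketch: list block devices that hold a LUKS header.
# Reads "NAME FSTYPE" pairs on stdin, as produced by: lsblk -prno NAME,FSTYPE
luks_partitions() {
    awk '$2 == "crypto_LUKS" { print $1 }'
}

# Example (run as root on the VM):
#   dev=$(lsblk -prno NAME,FSTYPE | luks_partitions | head -n 1)
#   cryptsetup luksDump "$dev"   # inspect the Keyslots section before and after luksAddKey
```

Comparing the `luksDump` output before and after adding the passphrase is a simple way to confirm that a new keyslot was created.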
To test the passphrase you just created, we’ve provided an inactive certificate on the system. Activating it by renaming it, as below, changes the PCR measurements, so the next reboot prompts for the fallback passphrase:
cd /boot/efi/EFI/rocky
mv shim_certificate.efi.bak shim_certificate.efi
reboot
After the reboot, use the serial console to enter the passphrase. Then rename the certificate back and reboot again, and you’ll be back to normal:
mv shim_certificate.efi shim_certificate.efi.bak
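In addition to the reboot test, a quicker, non-disruptive check is possible: `cryptsetup open --test-passphrase` verifies a passphrase against the LUKS header without mapping the device. The sketch below also assumes a cryptsetup version that supports `--disable-external-tokens`, so that the TPM token does not satisfy the unlock before the passphrase prompt appears. `check_passphrase` and the `CRYPTSETUP` override variable are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a passphrase unlocks the volume, without rebooting.
# CRYPTSETUP can be overridden to point at a different binary (assumption
# made for this example; cryptsetup itself does not read this variable).
check_passphrase() {
    local dev=$1
    if "${CRYPTSETUP:-cryptsetup}" open --test-passphrase --disable-external-tokens "$dev"; then
        echo "passphrase accepted for $dev"
    else
        echo "passphrase NOT accepted for $dev" >&2
        return 1
    fi
}

# Example (run as root; prompts for the passphrase to test):
#   check_passphrase /dev/sdX
```

This only proves the passphrase opens a keyslot. The reboot test above remains the way to confirm that the serial console prompt works end to end.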
With certain types of driver or module installations, it might be necessary to reseal against the new PCRs. If you’re a customer eligible for CIQ Standard or Premium Support, contact us for assistance. If resealing is required by a CIQ update, we’ll handle that for you as a part of the update packaging.
With the fallback passphrase, you can bring your system back online to operate normally while getting the issue resolved, so it is not a critical emergency if you end up in this situation. It will just require a bit of manual intervention during boot.
You should also consider creating a recovery key for your encrypted volume, which would let you attach your system’s VHD to another VM, then unlock and mount the drive to access the data in an emergency. If you end up in a PCR-mismatch scenario where the system prompts for a passphrase and you have neither the fallback passphrase nor a recovery key, cryptography will do its job and leave you with no way to access your data. You’d need to restore from a recent backup or otherwise rebuild from scratch.
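The recovery-key workflow might look like the sketch below. It assumes `systemd-cryptenroll` is available on the image (its `--recovery-key` option generates a key, enrolls it in a free keyslot, and prints it once) and that the root filesystem sits directly on the LUKS volume; adjust if LVM is in between. The device names, mapping name, and mount point are placeholders, and setting `DRY_RUN=1` only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: enroll a recovery key, and unlock the disk from a rescue VM.

# run: execute a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# On the healthy VM: enroll a recovery key into the LUKS volume.
# Expect a prompt for an existing passphrase, such as the fallback one.
enroll_recovery_key() {
    run systemd-cryptenroll --recovery-key "$1"
}

# On a rescue VM with the affected disk attached as a data disk:
# unlock with the recovery key or fallback passphrase, then mount read-only.
rescue_mount() {
    local dev=$1 name=${2:-rescue_root} mnt=${3:-/mnt/rescue}
    run mkdir -p "$mnt" &&
    run cryptsetup open "$dev" "$name" &&
    run mount -o ro "/dev/mapper/$name" "$mnt"
}

# Example:
#   DRY_RUN=1 rescue_mount /dev/sdc2     # preview the commands
#   rescue_mount /dev/sdc2               # run them as root
```

Store the recovery key the same way as the fallback passphrase: offline, in a secure location.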
Shim changes
Secure Boot shims are small pieces of code that primarily serve to build the chain of trust between the UEFI firmware and the operating system. They’re signed with a certificate that the UEFI firmware trusts, and they hold the chain of trust for the OS components.
Updates to shims are very rare and are typically only needed if there is a critical CVE against the integrity of Secure Boot protections. The shim is signed by Microsoft, and no other signatures will work.
Don’t install a shim from any other source, and don’t make any changes to the shim unless they’re explicitly called for. If a shim update is needed to maintain your system’s security, CIQ will provide an update procedure.
Shim / Kernel signature mismatches
Your kernel must be signed by a certificate trusted by the shim for Secure Boot. If you try to launch a Linux kernel with a signature that isn’t trusted by the shim, this is a Secure Boot violation and the system won’t boot. The most common reason to end up in this situation is installing a different kernel than the one provided.
The shim and kernel provided on your RLC system are both signed by CIQ. If you install a kernel signed by another party or an unsigned kernel, the system will break. This includes any kernel signed by the RESF (community developers of Rocky Linux) as well as upstream kernels from kernel.org, kernels provided by another vendor or repository, or kernels you’ve compiled yourself.
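To see who signed a kernel image before booting it, the signature can be inspected with `pesign`. This is a minimal sketch, assuming `pesign` is installed and that its `-S` output contains a line of the form "The signer's common name is ..."; the helper name `signer_names` is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: extract signer names from `pesign -S` output read on stdin.
# The matched line format is an assumption about pesign's output.
signer_names() {
    sed -n "s/^The signer's common name is //p"
}

# Example (compare a candidate kernel against the one currently running):
#   pesign -S -i "/boot/vmlinuz-$(uname -r)" | signer_names
#   pesign -S -i /boot/vmlinuz-<candidate-version> | signer_names
#   mokutil --sb-state        # reports whether Secure Boot is enabled
```

If the candidate kernel reports no signer, or a different signer than the running kernel, treat it as one that the shim is unlikely to trust.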
We’ve blocked accidental ways to get into this situation, but it is possible to end up here with determination. If it does happen, you may be able to recover by mounting the disk to another system and manually editing /boot to put the correct kernel back and update the boot order.
CIQ strongly recommends against using any kernel on your system that wasn’t installed by default.
Conclusion
Azure Confidential Computing offers top security for your most sensitive workloads, but extra pitfalls accompany this extra security, which can trip you up if you’re not careful.
CIQ has taken active steps to keep you from encountering any of these problems. For the best performance, security, and reliability, we recommend using 5th or 6th generation DC- and EC-family instance types. Older instance types may not support the full suite of Confidential Computing technologies available today.
If you need further help or guidance working through these issues or other aspects of OS configuration, we recommend pairing your RLC fleet on Azure with CIQ Standard or Premium Support to help you run with the highest level of security and stability.