Add TSG: Secure Boot (AzStackHci_Hardware_Test_Secure_Boot) #304
1008covingtonlane wants to merge 6 commits
New Environment Validator troubleshooting guide for the Hardware Secure Boot check (Test-SecureBoot, which runs Confirm-SecureBootUEFI). Covers where the failure appears (portal, on-box validator, and the AzStackHciEnvironmentChecker event log), how to enable UEFI Secure Boot in firmware, and a BitLocker precaution: suspend BitLocker before the firmware change and resume after, since Azure Local enables data-at-rest encryption by default and a Secure Boot change is measured into TPM PCR 7. Indexed in the EnvironmentValidator README.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a new Environment Validator troubleshooting guide (TSG) for the Azure Local Secure Boot hardware validation check (AzStackHci_Hardware_Test_Secure_Boot, aggregated as AzStackHci_Hardware_SecureBoot), documenting where the failure surfaces, how to remediate it in firmware, and the BitLocker precaution/workflow around firmware changes.
Changes:
- Introduces a new Secure Boot TSG covering portal/on-box symptoms, validation commands, remediation steps, and escalation guidance.
- Adds the new TSG to the EnvironmentValidator index README.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md | New TSG detailing diagnosis/remediation for the Secure Boot validator and BitLocker precautions. |
| TSG/EnvironmentValidator/README.md | Adds an index entry linking to the new Secure Boot TSG. |
…yed-node path
Address PR review: the realistic BitLocker case is an already-deployed, encrypted cluster member, so enabling Secure Boot reboots a live node. Add a drain-first step (Suspend-ClusterNode -Drain with cluster-health and quorum pre-checks, one node at a time) and a resume-node step (Resume-ClusterNode, wait for storage resync) around the existing BitLocker suspend/resume flow. Reconcile the Overview scope line to acknowledge the deployed-node path, and cite Suspend-ClusterNode and Resume-ClusterNode. The BitLocker suspend/resume content is unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1008covingtonlane left a comment
Re-reviewed at 280334a.
The deployed-node path is now covered. Previously the TSG was scoped purely pre-deployment, but the BitLocker section addressed already-encrypted machines, which are deployed cluster members; rebooting one into firmware to change Secure Boot needed cluster-safety steps that were missing.
This revision adds them:
- Step 1 drains the node first (health pre-checks on `Get-ClusterNode`/`Get-VirtualDisk`/`Get-StorageJob`, then `Suspend-ClusterNode -Drain`), one node at a time.
- Step 6 resumes the node and waits for storage resync before the next node.
- The Overview now reconciles the scope (primarily a pre-deployment gate, with the deployed-member drain path called out).
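The drain and resume steps described in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the text of the TSG: `Node01` is a hypothetical node name, the health thresholds are a simplification, and the commands assume an elevated session on a cluster node with the FailoverClusters and Storage modules available.

```powershell
# Sketch only. 'Node01' is a hypothetical node name, not taken from the TSG.
$node = 'Node01'

# Pre-checks: every node Up, every virtual disk healthy, no storage jobs still in flight.
$nodesNotUp   = Get-ClusterNode | Where-Object State -ne 'Up'
$disksNotOk   = Get-VirtualDisk | Where-Object HealthStatus -ne 'Healthy'
$jobsInFlight = Get-StorageJob  | Where-Object JobState -ne 'Completed'

if ($nodesNotUp -or $disksNotOk -or $jobsInFlight) {
    throw 'Cluster is not healthy enough to drain a node; resolve this before continuing.'
}

# Drain roles off this one node and pause it before the firmware reboot.
Suspend-ClusterNode -Name $node -Drain -Wait

# ... BitLocker suspend, firmware change, reboot, and BitLocker resume happen here ...

# Bring the node back, then wait for storage resync before touching the next node.
Resume-ClusterNode -Name $node -Failback Immediate
while (Get-StorageJob | Where-Object JobState -ne 'Completed') {
    Start-Sleep -Seconds 30
}
```

Stopping on any unhealthy signal before the drain is the conservative choice here: a node that is paused while a repair job is running reduces redundancy further.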
Verified during this pass:
- Step renumbering and every cross-reference are consistent (BitLocker resume references step 2, node resume references step 1, "repeat steps 1 through 6").
- The `Suspend-ClusterNode`/`Resume-ClusterNode` reference links resolve, and only the TSG file changed.
- The BitLocker guidance itself is correct: `Suspend-BitLocker -RebootCount 0` holds across the firmware change and reboot, and resume reseals to the new measured-boot (PCR 7) state. The UEFI-mode / GPT prerequisite is handled accurately.
No remaining findings. Ready for maintainer review.
…rimary, drain gated)
Apply the same framing the TPM Version TSG (Azure#305) landed and that PR Azure#170 captured into the harness skill. Test-SecureBoot runs in the Hardware validator (Deployment and Add Node) and the readiness/bootstrap set, with no upgrade-renamed variant, so the machine it flags is a host being validated to become a node, not a deployed member.
- BitLocker check/suspend is now the primary step 1, because a host being vetted may have been recycled from a prior project with BitLocker already enabled; a Secure Boot change (TPM PCR 7) would trip an encrypted volume into recovery regardless of deployment state.
- The cluster-drain/quorum steps move into a gated 'If the machine is already a deployed, encrypted cluster member' section instead of being front-loaded as step 1.
- Steps are now: 1 check/suspend BitLocker, 2 enable Secure Boot, 3 confirm, 4 resume BitLocker, plus the gated deployed-member drain section.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
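As a rough illustration of the four-step order in this commit (check/suspend BitLocker, enable Secure Boot, confirm, resume BitLocker), a minimal sketch follows. It assumes an elevated PowerShell session on the host with the BitLocker module present; the volume selection logic is an assumption for illustration and is not quoted from the TSG. Because step 2 involves a reboot, the two halves run in separate sessions.

```powershell
# --- Before the firmware change (steps 1) ---
# Check for BitLocker-protected volumes, even on a host believed to be fresh.
$protected = Get-BitLockerVolume | Where-Object ProtectionStatus -eq 'On'

# Suspend protection; -RebootCount 0 keeps it suspended until explicitly resumed.
$protected | ForEach-Object {
    Suspend-BitLocker -MountPoint $_.MountPoint -RebootCount 0
}

# --- Step 2 is manual: enable Secure Boot in firmware, then boot back into Windows ---

# --- After the reboot (steps 3 and 4), in a new session ---
# Confirm-SecureBootUEFI returns $true when Secure Boot is enabled.
if (Confirm-SecureBootUEFI) {
    # Resume only volumes that are encrypted but currently unprotected (assumed filter).
    Get-BitLockerVolume |
        Where-Object { $_.VolumeStatus -ne 'FullyDecrypted' -and $_.ProtectionStatus -eq 'Off' } |
        ForEach-Object { Resume-BitLocker -MountPoint $_.MountPoint }
}
```

Resuming only after the confirmation succeeds means the volume is resealed against the final Secure Boot state rather than an intermediate one.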
1008covingtonlane left a comment
Re-reviewed at 6f70f29 (the "reframe for the pre-deployment scenario" commit). This resolves the framing concern cleanly and is a clear improvement for the customer.
The check runs in Deployment / Add Node validation, so the machine is a host being vetted to become a node, and the new structure now matches that:
- BitLocker is the primary step 1, with the right rationale: do it even on a fresh host, because a host can be recycled from a prior project with BitLocker already enabled. That keeps the measured-boot (PCR 7) protection in the main path regardless of deployment state.
- The cluster-drain / quorum steps are moved into a clearly gated "If the machine is already a deployed, encrypted cluster member" section, which is the uncommon case.
- Nice detail: the data-volume example changed from `C:\ClusterStorage\Volume1` to `D:`, correct for a pre-deployment host that has no CSV yet.
Verified: the new section anchor matches its heading, the step renumbering (1 BitLocker, 2 firmware, 3 confirm, 4 resume) and all cross-references are consistent, and the forward-pointer from step 2 to the deployed-member section is correct. No remaining findings. Ready for maintainer review.
Resolve the reviewer note: the AzStackHciEnvironmentChecker event can carry either the per-check name (AzStackHci_Hardware_Test_Secure_Boot) or the aggregated name (AzStackHci_Hardware_SecureBoot), so the Get-WinEvent filter now matches both, mirroring the SystemDrive Free Space TSG (AzStackHci_Hardware_(Test_SystemDrive_Free_Space|SystemDriveFreeSpace)).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
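A filter that matches both names might look like the sketch below. It is not the exact filter from the TSG: it assumes the log can be queried by the name `AzStackHciEnvironmentChecker` and that the check name appears in the event message text. The alternation pattern follows the same shape as the SystemDrive Free Space example quoted above.

```powershell
# Matches the per-check name and the aggregated name.
$pattern = 'AzStackHci_Hardware_(Test_Secure_Boot|SecureBoot)'

# Assumption: the log name is 'AzStackHciEnvironmentChecker' and the check name
# is present in the rendered message of each event.
Get-WinEvent -LogName 'AzStackHciEnvironmentChecker' -ErrorAction SilentlyContinue |
    Where-Object { $_.Message -match $pattern } |
    Select-Object -First 5 TimeCreated, Id, Message
```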
…f checklist
From the 10-persona usability read, the single highest-leverage change: a 'Before you start' box ahead of 'How to fix it' that (a) routes ownership (server/hardware admin with BMC does the firmware change; Windows admin confirms BitLocker; network provides BMC access only; first-line staff stop), and (b) consolidates the must-confirm proof points before the firmware reboot (BitLocker recovery key escrowed, machine is UEFI + GPT not legacy/MBR, and drain first if this is any deployed cluster member). Resolves the majority of the personas' 'wants improved' (who-should-do-this, proof-point checklist, boot-mode/GPT check, stop-and-hand-off, any-deployed-member, hard stop).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… menu pointer table
Two reviewer asks (network-engineer and OEM-field personas):
- Cross-check the firmware boot mode from Windows with bcdedit (winload.efi = UEFI, winload.exe = legacy), alongside Get-Disk PartitionStyle and Confirm-SecureBootUEFI.
- Add a starting-point table for where Secure Boot and boot mode live per vendor (Dell iDRAC/BIOS, HPE iLO/RBSU, Lenovo XClarity/UEFI), with an explicit caveat that paths vary by model and firmware version, so confirm against current vendor docs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
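The three cross-checks named in this commit can be run together from an elevated PowerShell session, roughly as sketched below. The selected output columns are an illustrative choice, and the exact wording in the TSG may differ.

```powershell
# Boot loader path: winload.efi indicates UEFI, winload.exe indicates legacy BIOS.
bcdedit /enum '{current}' | Select-String 'winload'

# Partition style of the boot disk: GPT is expected for UEFI, MBR suggests legacy.
Get-Disk | Where-Object IsBoot | Select-Object Number, FriendlyName, PartitionStyle

# Secure Boot state: $true or $false on UEFI systems. On a legacy-boot machine the
# cmdlet raises an error instead, which is itself a sign the boot mode is wrong.
try {
    Confirm-SecureBootUEFI
} catch {
    "Secure Boot state unavailable: $($_.Exception.Message)"
}
```

Agreement between all three (winload.efi, GPT, and a boolean from `Confirm-SecureBootUEFI`) gives more confidence than any single check before rebooting into firmware.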
Summary
Adds a public Environment Validator troubleshooting guide for the Hardware Secure Boot check (`AzStackHci_Hardware_Test_Secure_Boot`, aggregated `AzStackHci_Hardware_SecureBoot`). The check runs `Test-SecureBoot` / `Confirm-SecureBootUEFI` and is Critical (it blocks deployment until UEFI Secure Boot is enabled on the machine).
What the TSG covers
- Where the failure appears: the portal, the on-box validator (`Invoke-AzStackHciHardwareValidation -Include Test-SecureBoot`), and the `AzStackHciEnvironmentChecker` event log (Event ID 17205), with the real failure detail.
- The BitLocker precaution: `Suspend-BitLocker -RebootCount 0` before the change and `Resume-BitLocker` after.
Validation
The check was validated end to end on a live Azure Local (masonenode) lab cluster: injecting Secure Boot OFF made the validator return FAILURE, and enabling Secure Boot returned it to SUCCESS.
Indexed in the EnvironmentValidator README.