Add TSG: Secure Boot (AzStackHci_Hardware_Test_Secure_Boot) #304

Open
1008covingtonlane wants to merge 6 commits into Azure:main from 1008covingtonlane:tsg-hardware-secure-boot

@1008covingtonlane

Collaborator

Summary

Adds a public Environment Validator troubleshooting guide for the Hardware Secure Boot check (AzStackHci_Hardware_Test_Secure_Boot, aggregated as AzStackHci_Hardware_SecureBoot). The check runs Test-SecureBoot, which calls Confirm-SecureBootUEFI, and its severity is Critical: it blocks deployment until UEFI Secure Boot is enabled on the machine.

What the TSG covers

  • Where the failure appears: the Azure portal Validation step, the on-box single validator (Invoke-AzStackHciHardwareValidation -Include Test-SecureBoot), and the AzStackHciEnvironmentChecker event log (Event ID 17205), with the real failure detail.
  • How to fix it: enable UEFI Secure Boot in firmware (including the UEFI-mode / GPT prerequisite and the standard-keys note), then re-validate.
  • BitLocker precaution: a Secure Boot change is measured into TPM PCR 7, so on a machine with BitLocker enabled the next boot stops at the recovery screen. Azure Local enables data-at-rest encryption by default, so the TSG documents Suspend-BitLocker -RebootCount 0 before the change and Resume-BitLocker after.

Validation

The check was validated end to end on a live Azure Local (masonenode) lab cluster: injecting Secure Boot OFF made the validator return FAILURE, and enabling Secure Boot returned it to SUCCESS.

Indexed in the EnvironmentValidator README.

New Environment Validator troubleshooting guide for the Hardware Secure Boot check
(Test-SecureBoot, which runs Confirm-SecureBootUEFI). Covers where the failure appears
(portal, on-box validator, and the AzStackHciEnvironmentChecker event log), how to enable
UEFI Secure Boot in firmware, and a BitLocker precaution: suspend BitLocker before the
firmware change and resume after, since Azure Local enables data-at-rest encryption by
default and a Secure Boot change is measured into TPM PCR 7. Indexed in the
EnvironmentValidator README.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Adds a new Environment Validator troubleshooting guide (TSG) for the Azure Local Secure Boot hardware validation check (AzStackHci_Hardware_Test_Secure_Boot, aggregated as AzStackHci_Hardware_SecureBoot), documenting where the failure surfaces, how to remediate it in firmware, and the BitLocker precaution/workflow around firmware changes.

Changes:

  • Introduces a new Secure Boot TSG covering portal/on-box symptoms, validation commands, remediation steps, and escalation guidance.
  • Adds the new TSG to the EnvironmentValidator index README.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| TSG/EnvironmentValidator/Troubleshooting-Hardware-Test-Secure-Boot.md | New TSG detailing diagnosis/remediation for the Secure Boot validator and BitLocker precautions. |
| TSG/EnvironmentValidator/README.md | Adds an index entry linking to the new Secure Boot TSG. |

…yed-node path

Address PR review: the realistic BitLocker case is an already-deployed, encrypted cluster
member, so enabling Secure Boot reboots a live node. Add a drain-first step (Suspend-ClusterNode
-Drain with cluster-health and quorum pre-checks, one node at a time) and a resume-node step
(Resume-ClusterNode, wait for storage resync) around the existing BitLocker suspend/resume flow.
Reconcile the Overview scope line to acknowledge the deployed-node path, and cite Suspend-ClusterNode
and Resume-ClusterNode. The BitLocker suspend/resume content is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@1008covingtonlane 1008covingtonlane left a comment

Collaborator Author

Re-reviewed at 280334a.

The deployed-node path is now covered. Previously the TSG was scoped purely pre-deployment, but the BitLocker section addressed already-encrypted machines, which are deployed cluster members; rebooting one into firmware to change Secure Boot needed cluster-safety steps that were missing.

This revision adds them:

  • Step 1 drains the node first (health pre-checks on Get-ClusterNode / Get-VirtualDisk / Get-StorageJob, then Suspend-ClusterNode -Drain), one node at a time.
  • Step 6 resumes the node and waits for storage resync before the next node.
  • The Overview now reconciles the scope (primarily a pre-deployment gate, with the deployed-member drain path called out).
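
For illustration, the drain and resume steps listed above might be sketched as follows. This is a minimal sketch under stated assumptions, not the TSG's exact text: the function names (Test-ClusterReadyForDrain, Suspend-NodeForFirmwareChange, Resume-NodeAfterFirmwareChange), the 30-second polling interval, and the precise health criteria are hypothetical. It assumes an elevated session on a cluster member with the FailoverClusters and Storage modules available.

```powershell
# Hypothetical helper: $true only when every node is Up, every virtual disk is
# Healthy, and no storage job is still running. The exact criteria are an assumption.
function Test-ClusterReadyForDrain {
    $nodesNotUp     = @(Get-ClusterNode | Where-Object { $_.State -ne 'Up' })
    $unhealthyDisks = @(Get-VirtualDisk | Where-Object { $_.HealthStatus -ne 'Healthy' })
    $runningJobs    = @(Get-StorageJob  | Where-Object { $_.JobState -eq 'Running' })
    return ($nodesNotUp.Count -eq 0 -and
            $unhealthyDisks.Count -eq 0 -and
            $runningJobs.Count -eq 0)
}

# Hypothetical helper: refuse to drain unless the pre-checks pass, then drain one node.
function Suspend-NodeForFirmwareChange {
    param([Parameter(Mandatory)][string]$NodeName)
    if (-not (Test-ClusterReadyForDrain)) {
        throw "Cluster is not healthy; do not drain $NodeName yet."
    }
    Suspend-ClusterNode -Name $NodeName -Drain -Wait
}

# Hypothetical helper: resume the node, then poll until no storage job is running.
# Repair jobs may take a moment to appear after the resume, so re-check
# Get-VirtualDisk health before moving on to the next node.
function Resume-NodeAfterFirmwareChange {
    param(
        [Parameter(Mandatory)][string]$NodeName,
        [int]$PollSeconds = 30
    )
    Resume-ClusterNode -Name $NodeName -Failback Immediate
    while (@(Get-StorageJob | Where-Object { $_.JobState -eq 'Running' }).Count -gt 0) {
        Start-Sleep -Seconds $PollSeconds
    }
}
```

The blocks only define functions, so nothing touches the cluster until one is called, for example `Suspend-NodeForFirmwareChange -NodeName 'node1'` (the node name is a placeholder). Throwing when the pre-checks fail keeps the one-node-at-a-time rule from being skipped by accident.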

Verified during this pass:

  • Step renumbering and every cross-reference are consistent (BitLocker resume references step 2, node resume references step 1, "repeat steps 1 through 6").
  • The Suspend-ClusterNode / Resume-ClusterNode reference links resolve, and only the TSG file changed.
  • The BitLocker guidance itself is correct: Suspend-BitLocker -RebootCount 0 holds across the firmware change and reboot, and resume reseals to the new measured-boot (PCR 7) state. The UEFI-mode / GPT prerequisite is handled accurately.

No remaining findings. Ready for maintainer review.

…rimary, drain gated)

Apply the same framing that the TPM Version TSG (Azure#305) landed on and that PR Azure#170 captured
into the harness skill. Test-SecureBoot runs in the Hardware validator (Deployment
and Add Node) and the readiness/bootstrap set, with no upgrade-renamed variant, so
the machine it flags is a host being validated to become a node, not a deployed
member.

- BitLocker check/suspend is now the primary step 1, because a host being vetted may
  have been recycled from a prior project with BitLocker already enabled; a Secure
  Boot change (TPM PCR 7) would trip an encrypted volume into recovery regardless of
  deployment state.
- The cluster-drain/quorum steps move into a gated 'If the machine is already a
  deployed, encrypted cluster member' section instead of being front-loaded as step 1.
- Steps are now: 1 check/suspend BitLocker, 2 enable Secure Boot, 3 confirm, 4 resume
  BitLocker, plus the gated deployed-member drain section.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@1008covingtonlane 1008covingtonlane left a comment

Collaborator Author

Re-reviewed at 6f70f29 (the "reframe for the pre-deployment scenario" commit). This resolves the framing concern cleanly and is a clear improvement for the customer.

The check runs in Deployment / Add Node validation, so the machine is a host being vetted to become a node, and the new structure now matches that:

  • BitLocker is the primary step 1, with the right rationale: do it even on a fresh host, because a host can be recycled from a prior project with BitLocker already enabled. That keeps the measured-boot (PCR 7) protection in the main path regardless of deployment state.
  • The cluster-drain / quorum steps are moved into a clearly gated "If the machine is already a deployed, encrypted cluster member" section, which is the uncommon case.
  • Nice detail: the data-volume example changed from C:\ClusterStorage\Volume1 to D:, correct for a pre-deployment host that has no CSV yet.

Verified: the new section anchor matches its heading, the step renumbering (1 BitLocker, 2 firmware, 3 confirm, 4 resume) and all cross-references are consistent, and the forward-pointer from step 2 to the deployed-member section is correct. No remaining findings. Ready for maintainer review.
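
As a rough illustration of the four-step order verified above (1 BitLocker, 2 firmware, 3 confirm, 4 resume), the PowerShell side might look like this. It is a sketch under stated assumptions rather than the TSG's wording: the three function names are hypothetical, and selecting volumes by ProtectionStatus and VolumeStatus is an assumption about how to find the volumes to suspend and resume. Confirm the recovery key is escrowed before running step 1.

```powershell
# Step 1 (hypothetical helper): suspend protection on every volume where it is On.
# -RebootCount 0 keeps protection suspended until Resume-BitLocker is run.
function Suspend-BitLockerBeforeFirmwareChange {
    Get-BitLockerVolume |
        Where-Object { $_.ProtectionStatus -eq 'On' } |
        ForEach-Object { Suspend-BitLocker -MountPoint $_.MountPoint -RebootCount 0 }
}

# Step 2 is manual: reboot into firmware setup and enable UEFI Secure Boot.

# Step 3 (hypothetical helper): $true only when Secure Boot is enabled.
# Confirm-SecureBootUEFI raises an error on non-UEFI (legacy BIOS) machines,
# which this helper reports as $false.
function Test-SecureBootEnabled {
    try {
        return [bool](Confirm-SecureBootUEFI -ErrorAction Stop)
    }
    catch {
        return $false
    }
}

# Step 4 (hypothetical helper): resume protection on encrypted volumes that are
# still suspended. Assumption: suspended volumes report ProtectionStatus Off
# while remaining FullyEncrypted.
function Resume-BitLockerAfterFirmwareChange {
    Get-BitLockerVolume |
        Where-Object { $_.ProtectionStatus -eq 'Off' -and $_.VolumeStatus -eq 'FullyEncrypted' } |
        ForEach-Object { Resume-BitLocker -MountPoint $_.MountPoint }
}
```

After the firmware change and reboot, `Test-SecureBootEnabled` should return `$true` before `Resume-BitLockerAfterFirmwareChange` is called; the validator can then be re-run with `Invoke-AzStackHciHardwareValidation -Include Test-SecureBoot`, as named in the PR description.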

1008covingtonlane and others added 3 commits June 26, 2026 15:07
Resolve the reviewer note: the AzStackHciEnvironmentChecker event can carry
either the per-check name (AzStackHci_Hardware_Test_Secure_Boot) or the
aggregated name (AzStackHci_Hardware_SecureBoot), so the Get-WinEvent filter
now matches both, mirroring the SystemDrive Free Space TSG
(AzStackHci_Hardware_(Test_SystemDrive_Free_Space|SystemDriveFreeSpace)).
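
A filter of the kind described above could be sketched as follows. This is illustrative only: the function name Get-SecureBootCheckEvent and the $MaxEvents default are hypothetical, the log name is taken from the PR description, and matching against the event message text is an assumption about where the check name appears.

```powershell
# One pattern covering both the per-check and the aggregated check name.
$checkNamePattern = 'AzStackHci_Hardware_(Test_Secure_Boot|SecureBoot)'

# Hypothetical helper: read recent Event ID 17205 entries and keep those whose
# message mentions either Secure Boot check name.
function Get-SecureBootCheckEvent {
    param([int]$MaxEvents = 50)
    Get-WinEvent -FilterHashtable @{
            LogName = 'AzStackHciEnvironmentChecker'   # log name as given in the PR description
            Id      = 17205
        } -MaxEvents $MaxEvents -ErrorAction SilentlyContinue |
        Where-Object { $_.Message -match $checkNamePattern }
}
```

Because `-MaxEvents` is applied before the name match, a small value can hide older Secure Boot events behind newer events from other checks; raise it if nothing is returned.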

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…f checklist

From the 10-persona usability read, the single highest-leverage change: a 'Before you start'
box ahead of 'How to fix it' that (a) routes ownership (server/hardware admin with BMC does
the firmware change; Windows admin confirms BitLocker; network provides BMC access only;
first-line staff stop), and (b) consolidates the must-confirm proof points before the firmware
reboot (BitLocker recovery key escrowed, machine is UEFI + GPT not legacy/MBR, and drain first
if the machine is any deployed cluster member). Resolves most of the personas' 'wants improved' items
(who-should-do-this, proof-point checklist, boot-mode/GPT check, stop-and-hand-off, any-deployed-member, hard stop).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… menu pointer table

Two reviewer asks (network-engineer and OEM-field personas):
- Cross-check the firmware boot mode from Windows with bcdedit (winload.efi = UEFI,
  winload.exe = legacy), alongside Get-Disk PartitionStyle and Confirm-SecureBootUEFI.
- Add a starting-point table for where Secure Boot and boot mode live per vendor
  (Dell iDRAC/BIOS, HPE iLO/RBSU, Lenovo XClarity/UEFI), with an explicit caveat that
  paths vary by model and firmware version, so confirm against current vendor docs.
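
The boot-mode cross-check in the first item could be sketched like this. It is a minimal sketch, not the TSG's exact commands: both function names are hypothetical, the `path` label assumes English bcdedit output, and using the boot disk for the partition-style check is an assumption. Run it elevated on the Windows host.

```powershell
# Hypothetical helper: classify the firmware boot mode from the Windows loader path.
# winload.efi = UEFI, winload.exe = legacy BIOS (-match is case-insensitive).
function Get-BootModeFromLoaderPath {
    param([string]$LoaderPath)
    if ($LoaderPath -match 'winload\.efi$') { return 'UEFI' }
    if ($LoaderPath -match 'winload\.exe$') { return 'Legacy' }
    return 'Unknown'
}

# Hypothetical helper: gather the three signals side by side.
function Get-FirmwareBootSummary {
    # bcdedit prints a line such as "path    \Windows\system32\winload.efi".
    $pathLine = bcdedit /enum '{current}' |
        Where-Object { $_ -match '^path\s+' } |
        Select-Object -First 1
    $loader = ("$pathLine" -replace '^path\s+', '').Trim()

    # Confirm-SecureBootUEFI raises an error on non-UEFI machines.
    $secureBoot = try { Confirm-SecureBootUEFI -ErrorAction Stop } catch { 'NotSupported' }

    [pscustomobject]@{
        BootMode       = Get-BootModeFromLoaderPath -LoaderPath $loader
        PartitionStyle = (Get-Disk | Where-Object { $_.IsBoot }).PartitionStyle
        SecureBoot     = $secureBoot
    }
}
```

A machine ready for Secure Boot would be expected to show BootMode UEFI and PartitionStyle GPT; a Legacy or MBR result means the boot-mode prerequisite has to be resolved first.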

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>