Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions TSG/EnvironmentValidator/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
This folder contains the TSG's related to Environment Validators.

* [Troubleshooting External Connectivity Failures in Environment Checker](./Troubleshooting-External-Connectivity-Failures-in-Environment-Checker.md)
* [Troubleshooting Connectivity Test DNS](./Troubleshooting-Connectivity-Test-Dns.md)
* [Troubleshooting Test NetAdapter API Failure](./Troubleshooting-Test-NetAdapter-API.md)
* [Troubleshooting Test PhysicalDisk API Failure](./Troubleshooting-Test-PhysicalDisk-API.md)
* [Troubleshooting Test System Drive Free Space](./Troubleshooting-Test-SystemDrive-Free-Space.md)
Expand Down
365 changes: 365 additions & 0 deletions TSG/EnvironmentValidator/Troubleshooting-Connectivity-Test-Dns.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,365 @@
# AzStackHci_Connectivity_Test_Dns

> **At a glance**
> - **Owner:** the customer's network or DNS administrator. This is not a Microsoft software defect and not an OEM hardware or firmware issue.
> - **Impact:** Critical. It blocks Azure Local deployment and updates until external DNS resolution works on every node.
> - **Effort and downtime:** small. A per-node DNS change applies immediately, with no reboot, no cluster drain, and no impact to running VMs or live migration.
> - **Typical time to resolve:** about 15 to 30 minutes per affected node for a DNS-client fix once you have the correct DNS server addresses. Allow longer if the fix is an upstream DNS server change (a forwarder or firewall rule) that must be coordinated with whoever owns that server.
> - **Before you change anything:** do not guess DNS server IP addresses. Get the cluster's intended DNS servers from your network or DNS administrator first.

## Overview

This Environment Validator check confirms that each Azure Local node can resolve an
external (public) DNS name. On every node, for each DNS server configured on every
network adapter that is up, the check resolves the public name `microsoft.com` and
expects at least one A record back. If any configured DNS server returns no records
(or no DNS server is configured at all), the check fails for that node.

- **Severity:** Critical. When this check fails on a node and no proxy is in use,
the validator stops the remaining connectivity tests for that node, so a single
DNS failure can also hide other connectivity findings.
- **When it runs:** pre-deployment readiness, deployment, add-node, and the
pre-update health check. In practice you will most often see it block a pending
Azure Local update.
- **Newer builds:** on recent Azure Local builds this external-DNS test was moved
into a dedicated DNS validator and is reported under one of two names,
`AzStackHci_DNS_ExternalDnsResolution` or
`AzStackHci_DNS_Test_External_Hostname_Resolution` (both are in use across current
builds, so search the health-check results for either). The dedicated validator
resolves `management.azure.com` rather than `microsoft.com` and retries before it
fails, so its `Detail` adds an `(Attempt: n/3)` suffix and lists each failing node
as its own bullet. The cause and the fix in this guide are the same; only the
validator name and the queried hostname differ.

**Who owns this fix.** This is a customer network and DNS configuration check. The
owner is the customer's network or DNS administrator. It is not a Microsoft software
defect, and it is not an OEM hardware or firmware issue, so it does not require a
hardware vendor or a Microsoft product fix. A Microsoft support engineer can guide
the customer through it, but the change itself is made in the customer's DNS
infrastructure or in a node's network configuration.

**Where a node's DNS comes from.** A node's DNS servers are set at deployment time from
the deployment configuration's management network settings and applied to the management
network adapter; they are not baked into the OEM factory image. If you are an OEM or
field engineer checking your own imaging process, confirm the image does not pin DNS
servers and leaves them to be set by deployment, so each cluster picks up the customer's
intended DNS rather than a stale value carried over from imaging.

## Requirements

- Administrative (local administrator) access to each Azure Local node, or a remote
PowerShell session to the nodes.
- The list of DNS servers the cluster is supposed to use, from the deployment's
network configuration.
- Access to, or coordination with, whoever administers those DNS servers, in case an
upstream server needs a forwarder or an external-resolution fix.
- No maintenance window is required. A node DNS-client change applies immediately and
does not need a reboot or a cluster drain.

## Troubleshooting Steps

### 1. Confirm the failure and see where it appears

The same failure surfaces in several places depending on how it was noticed. Pick the
entry point that matches; they all converge on the same `Detail` string.

#### Option A: Health-check result files on the cluster shared volume (recommended)

Every pre-update health check writes one JSON result file to the cluster's
infrastructure share. Read the newest one and filter to this check:

```powershell
$base = 'C:\ClusterStorage\Infrastructure_1\Shares\SU1_Infrastructure_1\Updates\HealthCheck\System'
if (-not (Test-Path $base)) {
# Fallback: walk all ClusterStorage volumes for the HealthCheck folder.
$base = Get-ChildItem 'C:\ClusterStorage' -Directory -ErrorAction SilentlyContinue |
ForEach-Object { Join-Path $_.FullName 'Shares\SU1_Infrastructure_1\Updates\HealthCheck\System' } |
Where-Object { Test-Path $_ } | Select-Object -First 1
}

$latest = $null
if ($base) {
$latest = Get-ChildItem $base -Filter 'HealthCheckResult.*.json' -ErrorAction SilentlyContinue |
Sort-Object LastWriteTime -Descending | Select-Object -First 1
}

if (-not $latest) {
Write-Warning "No HealthCheck result file found on this node (the folder is missing or no health check has run yet). Use Option B, C, or D, or run this on a different node."
}
else {
Write-Host "Reading: $($latest.FullName)"
Get-Content $latest.FullName -Raw | ConvertFrom-Json |
Where-Object { $_.Name -eq 'AzStackHci_Connectivity_Test_Dns' -and $_.Status -ne 0 -and $_.Status -ne 'SUCCESS' } |
Select-Object Severity,
@{ n = 'Source'; e = { $_.AdditionalData.Source } },
@{ n = 'Detail'; e = { $_.AdditionalData.Detail } },
Remediation
}
```

Each row is one currently-failing DNS server on one node. The `Detail` column is the
precise error string; the `Remediation` column points at the public
[Azure Local network requirements](https://learn.microsoft.com/azure/azure-local/concepts/firewall-requirements)
documentation.

#### Option B: `Get-SolutionUpdate` (is an update being blocked?)

If the failure was noticed because a pending update will not start, this is the
fastest confirmation:

```powershell
Get-SolutionUpdate |
Select-Object DisplayName, Version, State, HealthCheckResult, HealthCheckDate |
Format-Table -AutoSize
```

`HealthCheckResult = Failure` with a recent `HealthCheckDate` means the pre-update
validators failed. Use Option A to see which validator caused it.

#### Option C: Windows event log (when the result files are missing)

The same data is written to the Windows event log on each node. In Event Viewer, open
`Applications and Services Logs` then `AzStackHciEnvironmentChecker` and filter for
Event ID 17205, or from PowerShell:

```powershell
Get-WinEvent -LogName AzStackHciEnvironmentChecker -FilterXPath "*[System[(EventID=17205)]]" |
ForEach-Object {
try { $r = $_.Message | ConvertFrom-Json } catch { return }
if ($r.Name -eq 'AzStackHci_Connectivity_Test_Dns' -and $r.Status -ne 0 -and $r.Status -ne 'SUCCESS') {
[pscustomobject]@{
TimeCreated = $_.TimeCreated
Source = $r.AdditionalData.Source
Detail = $r.AdditionalData.Detail
}
}
} | Sort-Object TimeCreated -Descending | Select-Object -First 20
```

#### Option D: Azure portal

In the Azure portal, open the Azure Local cluster, then the **Updates** tab. When a
pre-update health check fails, the portal shows a banner naming the failing
validators. `Test DNS` (the display name of this check) appearing there is the same
failure.

### 2. What it looks like: example failure signatures

The check emits one of these `Detail` strings per failing DNS server (the IP address,
the node name, and the record count vary):

```
Queried dns server 10.0.0.10 for microsoft.com on AzL-Node-01. Result returned 0 A records. Expected at least 1.
```

```
No DNS server configured
```

The first signature means the node reached the DNS server at that IP, but the server
returned no A records for the external name `microsoft.com`. The second means the
node's up adapters have no DNS server configured at all.

A passing node, for contrast, reports a count of one or more and lists the resolved
addresses, for example `Result returned 1 A records: <address>, expected at least 1.`

> If a proxy is configured on a node (WinHTTP proxy), the check is skipped on that
> node and reported as success, with a `Detail` of
> `Skipping DNS resolution test on <node> because a proxy is configured.` That is
> expected behavior, not a failure.

On builds that use the dedicated DNS validator (see "Newer builds" in the overview),
the same failure reads slightly differently: it resolves `management.azure.com`,
retries up to three times, and lists each failing node as its own bullet, for example:

```
- AzL-Node-01
- Queried dns server 10.0.0.10 for management.azure.com on AzL-Node-01 (Attempt: 3/3). Result returned 0 A records. Expected at least 1. Error:
```

The meaning is the same as the first signature above (the server was reached but
returned no A records); only the queried hostname, the `(Attempt: n/3)` suffix, and
the per-node bullet layout differ.

### 3. Identify the affected nodes

Run across all nodes to see exactly which ones are failing and against which DNS
server:

```powershell
Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock {
$base = 'C:\ClusterStorage\Infrastructure_1\Shares\SU1_Infrastructure_1\Updates\HealthCheck\System'
if (-not (Test-Path $base)) {
# Same fallback as Step 1 Option A: walk all ClusterStorage volumes.
$base = Get-ChildItem 'C:\ClusterStorage' -Directory -ErrorAction SilentlyContinue |
ForEach-Object { Join-Path $_.FullName 'Shares\SU1_Infrastructure_1\Updates\HealthCheck\System' } |
Where-Object { Test-Path $_ } | Select-Object -First 1
}
$latest = $null
if ($base) {
$latest = Get-ChildItem $base -Filter 'HealthCheckResult.*.json' -ErrorAction SilentlyContinue |
Sort-Object LastWriteTime -Descending | Select-Object -First 1
}
if (-not $latest) {
# Emit an explicit NO DATA row so this node is never silently treated as passing.
return [pscustomobject]@{ Detail = 'NO DATA: no HealthCheck result file on this node (use Option C, the event log)' }
}
$failing = Get-Content $latest.FullName -Raw | ConvertFrom-Json |
Where-Object { $_.Name -eq 'AzStackHci_Connectivity_Test_Dns' -and $_.Status -ne 0 -and $_.Status -ne 'SUCCESS' } |
Select-Object @{ n = 'Detail'; e = { $_.AdditionalData.Detail } }
if ($failing) { $failing }
else { [pscustomobject]@{ Detail = 'PASS: no failing DNS result in the latest health check' } }
} | Sort-Object PSComputerName | Select-Object PSComputerName, Detail
```

Every node now reports one of three things: a failing `Detail` (the node is failing,
fix it), `PASS` (the check passed on that node), or `NO DATA` (the result could not be
read on that node, so confirm it with Option C rather than assuming it passed).

### 4. Consequences if you do not fix this

This check is Critical. When it fails on a node and no proxy is configured, the
validator stops the remaining connectivity tests for that node, so one DNS failure
can mask other connectivity problems and fails the pre-update health check overall.
Concretely:

- A pending Azure Local update or a deployment that runs this readiness check will not
proceed until external DNS resolution succeeds on every node.
- External name resolution is required for the cluster to reach Azure for Arc,
updates, billing, and telemetry. The cluster keeps running local workloads, but its
cloud-managed lifecycle is impaired until external DNS works.

### 5. Remediation

The check fails when a DNS server configured on a node cannot resolve the external
name `microsoft.com`. The fix is a customer-side DNS change, either on the node's
DNS-client configuration or on the upstream DNS server. Work through this on each node
identified in Step 3.

**Most common fix (start here).** The usual cause is a node pointed at a DNS server that
cannot resolve external names. Re-point that node's management adapter at a DNS server
that can (step 2 below), or add an external-resolving forwarder on the current server
(step 4 below). The numbered steps confirm which of these applies; most failures are
resolved by one of the two.

_New to any DNS term used below (A record, forwarder, split-horizon, WinHTTP proxy)? See
the [Glossary](#glossary) at the end of this guide._

1. List the DNS servers currently configured on the node:

```powershell
Get-DnsClientServerAddress -AddressFamily IPv4 |
Where-Object ServerAddresses |
Select-Object InterfaceAlias, @{ n = 'DnsServers'; e = { $_.ServerAddresses -join ', ' } }
```

2. Confirm these are the DNS servers the cluster is supposed to use, comparing against
your documented management DNS servers. If the node has no DNS server on its
management adapter (the `No DNS server configured` signature), or the configured
servers are wrong, set the correct ones (per node, applies immediately, no reboot).

First identify which adapter is the management adapter, so the placeholders below are
concrete. It is the up adapter whose IPv4 address is the node's management IP; match
that IP to an `InterfaceAlias` here, then reuse the same `InterfaceAlias` from Step 1
to see the DNS servers currently on it:

```powershell
Get-NetIPConfiguration | Where-Object { $_.IPv4Address } |
Select-Object InterfaceAlias, @{ n = 'IPv4'; e = { $_.IPv4Address.IPAddress -join ', ' } }
```

The correct `<dns1>`,`<dns2>` are your deployment's documented management DNS servers
(the same ones the healthy nodes resolve against). Record the original values first so
the change can be rolled back, then set them on the management adapter:

```powershell
Set-DnsClientServerAddress -InterfaceAlias '<ManagementAdapter>' -ServerAddresses '<dns1>','<dns2>'
```

3. Test each configured DNS server the same way the validator does, resolving the
external name directly against that server:

```powershell
foreach ($dns in ((Get-DnsClientServerAddress -AddressFamily IPv4).ServerAddresses | Sort-Object -Unique)) {
$count = (Resolve-DnsName -Name microsoft.com -Server $dns -Type A -DnsOnly -QuickTimeout -ErrorAction SilentlyContinue).Count
'{0}: {1} A record(s)' -f $dns, ([int]$count)
}
```

A server reporting `0 A record(s)` is the failing one: the node reaches it, but it
cannot resolve external names.

4. Fix the failing DNS server, choosing the option that matches the environment:

- If the configured server is wrong or stale, re-point the node at a DNS server
that can resolve external names (Step 2).
- If the configured server is correct but internal-only, add a forwarder on that
DNS server to a resolver that can answer external queries, or otherwise enable
external resolution on it. This change is made on the DNS server, not on the
Azure Local node, so coordinate with whoever owns that server.
- If the server resolves internal names but returns nothing for the external name,
an internal-only or split-horizon DNS zone may be shadowing external resolution;
add a forwarder or otherwise enable external resolution as above.
- Confirm that DNS traffic on port 53 from the nodes to the DNS servers is not
blocked by a firewall.

5. If the cluster intentionally has no direct outbound name resolution and uses a
proxy for all outbound traffic, configure the WinHTTP proxy on each node. When a
proxy is present, this check self-skips and reports success. Only do this if a
proxy is genuinely part of the design.

**Risk:** LOW for re-pointing a node's DNS client, which is per-node, immediate, and
reversible by restoring the previous servers. MEDIUM for changes on an upstream DNS
server, which can affect other systems that use it, so coordinate with its owner. No
node drain or reboot is required for DNS-client changes.

### 6. Verification: prove the failure cleared

Re-run the pre-update health check. This writes a fresh result file to the cluster
shared volume and fresh event-log entries on every node:

```powershell
Invoke-SolutionUpdatePrecheck
```

This typically takes several minutes depending on cluster size. When it finishes,
re-run any option from Step 1. The check should return no failing rows. You can also
confirm the underlying resolution directly on each node:

```powershell
Resolve-DnsName -Name microsoft.com -Type A
```

A result containing one or more A records means external DNS resolution is working. If
Step 1 still shows failures, re-read the new `Detail` text: the failure may have moved
to a different DNS server or a different node that needs the same fix.

> **Note:** the Azure portal readiness view and the cluster-wide `HealthCheckResult`
> file refresh only when a full health check or `Invoke-SolutionUpdatePrecheck` runs, not
> on a targeted per-node re-test. The portal can therefore lag a just-applied fix until
> the next precheck or the periodic (roughly daily) health check, so confirm the fix
> on-node with `Resolve-DnsName` rather than waiting on the portal.

## Glossary

Plain-language definitions of the DNS terms used in this guide. Experienced readers can
skip this section; it is here so the steps above stay short.

- **A record:** the basic DNS record that maps a name (such as `microsoft.com`) to an
IPv4 address. This check passes only when a configured DNS server returns at least one
A record for the external name.
- **DNS server / resolver:** the server a node asks to turn a name into an address. Each
node lists one or more on its network adapters; this check tests each one.
- **Forwarder:** a setting on a DNS server that hands off queries it cannot answer itself
(such as external or public names) to another resolver that can. An internal-only DNS
server usually needs a forwarder to resolve external names.
- **Conditional forwarder:** a forwarder that applies only to a specific domain, so a
server can send just some queries (for example external names) to a particular
resolver.
- **Split-horizon (split-brain) DNS:** a setup where the same DNS name resolves
differently for internal versus external clients. An internal-only zone can shadow an
external name, so the server answers internal lookups but returns nothing for the
public name this check asks for.
- **WinHTTP proxy:** a system-level outbound proxy configured on a node. When one is set,
the node routes outbound traffic through it, and this DNS check self-skips on that node
and reports success.