Skip to content

WIP: Application load balancer controller#3

Open
fischerman wants to merge 83 commits into
mainfrom
albc
Open

WIP: Application load balancer controller#3
fischerman wants to merge 83 commits into
mainfrom
albc

Conversation

@fischerman

@fischerman fischerman commented Jun 24, 2026

Copy link
Copy Markdown
Member

How to categorize this PR?

/kind enhancement

What this PR does / why we need it:

Implements the first alpha version of the STACKIT ALB controller.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Breaking changes:

@ske-prow ske-prow Bot added do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 24, 2026
@nschad nschad changed the title Draft: Application load balancer controller WIP: Application load balancer controller Jun 25, 2026
@ske-prow ske-prow Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2026
@nschad

nschad commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@fischerman either mark the PR as "Draft" or use "WIP" as the title. The PR will then automatically marked as WIP/Draft

@ske-prow

ske-prow Bot commented Jun 25, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dergeberl for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ske-prow ske-prow Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2026
@ske-prow ske-prow Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2026
Comment thread .golangci.yaml
Comment thread pkg/controller/ingress/setup.go Outdated

@dergeberl dergeberl left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my first batch of review comments. Only had a look into the /cmd/*, /pkg/controller/* (except the tests) and the *.md files, yet. I will continue with my review for the rest.

Maybe some comments are already resolved as you also pushed during my review (I did not crosscheck again, before submitting the comments now).

Comment thread docs/user.md Outdated
Comment thread docs/user.md Outdated
Comment thread docs/user.md Outdated
Comment thread docs/user.md
| `alb.stackit.cloud/traget-pool-tls-skip-certificate-validation`| Boolean | IngressClass, Ingress, Service | Optional | Enables TLS bridging but skips certificate validation. |
| `alb.stackit.cloud/allowed-source-ranges`| String | IngressClass | Accepts a comma-separated list of IP ranges. E.g. 10.0.0.0/24,1.2.3.4/32. If unset, all IPs are allowed. |

### Known Limitations

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we add the NodePort requirement here? Also there are limits in the amount of resources (like listeners) which we should add here (or somewhere else).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limits like listeners, targetpools and path should also be documented. I remeber you having a list what the limits. We should also document how the this translate to ingress objects, like for the listeners it is the ports.

Comment thread cmd/application-load-balancer-controller/main.go Outdated
Comment thread docs/user.md
### Limits

The following limitations are imposed directly by the STACKIT ALB API (not the controller itself):
- Maximum targets per pool: An individual target pool can contain a maximum of 250 targets.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The limit should be 1000 or?

Comment thread docs/user.md

#### When to watch out for the listener limit

Because each IngressClass provisions a dedicated ALB instance, hitting the 20-listener threshold is rarely an issue for a basic setup but becomes a real risk when you start stacking custom ports across multiple applications sharing that same ALB. If your Ingress resources use the `alb.stackit.cloud/http-port` or `alb.stackit.cloud/https-port` annotations to expose different apps on unique custom port numbers, each distinctive port allocates its own listener on the shared ALB instance. This risk compounds quickly when those applications also require TLS encryption; since the controller must keep an extra HTTP listener active alongside the HTTPS listener to smoothly process automated ACME certificate challenges, a single secure app immediately consumes two slots instead of one, accelerating how fast you approach the API limit if multiple unique custom ports are configured.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a rephrase. The ALB per ingress does not make it rarely to hit the limit (It is the default of 2 ports).

Comment thread docs/user.md

A "target" in a pool corresponds directly to a worker node in your cluster. If you run a large cluster with a high number of worker nodes, or expect your cluster to dynamically scale to a large size, keep this limit in mind since a single backend Service port mapping cannot route traffic to more than 250 worker nodes simultaneously.

#### When to watch out for the listener limit

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also something for the Known Limitations.

Comment thread docs/user.md

Because each IngressClass provisions a dedicated ALB instance, hitting the 20-listener threshold is rarely an issue for a basic setup but becomes a real risk when you start stacking custom ports across multiple applications sharing that same ALB. If your Ingress resources use the `alb.stackit.cloud/http-port` or `alb.stackit.cloud/https-port` annotations to expose different apps on unique custom port numbers, each distinctive port allocates its own listener on the shared ALB instance. This risk compounds quickly when those applications also require TLS encryption; since the controller must keep an extra HTTP listener active alongside the HTTPS listener to smoothly process automated ACME certificate challenges, a single secure app immediately consumes two slots instead of one, accelerating how fast you approach the API limit if multiple unique custom ports are configured.

### Configuration

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check that this are all, WAF seems to be missing

Comment thread README.md

@dergeberl dergeberl left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments for the stackit package and testutil

Comment thread pkg/stackit/config/config.go Outdated
return ctrl.Result{}, fmt.Errorf("failed to get load balancer: %w", err)
}

if *alb.Status != stackit.LBStatusReady {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest SDK (v0.15.0) there are const for that LOADBALANCERSTATUS_STATUS_READY.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some other changes that come with the update of the SDK. Can we postpone this to after the merge?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based in the changelog the new release only added the enums. The change should be minimal and only be a search and replace of the existing enums (as the prefix renamed). So not sure if it worth to postpone this.

https://github.com/stackitcloud/stackit-sdk-go/blob/main/services/alb/CHANGELOG.md

Comment thread pkg/stackit/applicationloadbalancer.go Outdated
type ProjectStatus string

const (
LBStatusReady = "STATUS_READY"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use the LOADBALANCERSTATUS_STATUS_READY from the SDK

Comment thread pkg/stackit/error.go
Comment thread pkg/stackit/suite_test.go Outdated
Comment thread pkg/stackit/error.go
@@ -0,0 +1,22 @@
package stackit

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread pkg/testutil/service/service.go Outdated
Comment thread pkg/testutil/ingress/ingress.go Outdated
Comment thread pkg/controller/ingress/spec/worktree.go Outdated
internalLB bool
externalIP string

listeners map[int16]*workTreeListener

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int16 is not big enough for ports. It is later converted again to int32. Let's enable the gosec lint (this is how I found this error)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the type to uin16 and added out of range checks.

@breuerfelix breuerfelix Jun 30, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets also add gosec to the golangci-lint

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in fc1c542

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fc1c542 does not add the gosec to the golangci-lint config

Comment thread cmd/application-load-balancer-controller/main.go Outdated
Comment thread pkg/controller/ingress/ingressclass_controller.go Outdated

if *alb.Status != stackit.LBStatusReady {
// ALB is not yet ready, requeue
return ctrl.Result{RequeueAfter: 10 * time.Second}, nil

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we write an event for this? In NLB controller we inform the user about this state via event

hosts := []albsdk.HostConfig{}
for hostname, host := range listener.hosts {
paths := slices.Collect(maps.Values(host.paths))
typeRank := map[networkingv1.PathType]int{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is static, can this live as a global variable?

networkingv1.PathTypeImplementationSpecific: 2,
networkingv1.PathTypePrefix: 3,
}
slices.SortFunc(paths, func(a, b *workTreePath) int {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to have this sort function as a standalone instead of inlined in this for loop.

Path: new("/healthz"),
OkStatuses: []string{"200"},
},
HealthyThreshold: new(int32(1)),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be constants for these defaults to easily find them in the code.

})
}

if len(listeners) == 0 {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels weird to me. If the load balancer does not yet contain any port, we shouldn't create it at all.

Comment thread pkg/controller/ingress/spec/worktree.go Outdated
DisableTargetSecurityGroupAssignment: new(true), // TODO: Make this configurable via flag.
Name: new(fmt.Sprintf("k8s-ingress-%s", t.ingressClass.UID)),
Labels: &map[string]string{
"ingress-class-uid": string(t.ingressClass.UID),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use SanitizeLabelValue

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also from a user perspective it might be useful to add a cluster name label or something likewise here, so one can easily see which alb is for which cluster.

@hown3d hown3d Jun 29, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file should be called reconcile.go or smth like that, has nothing to do with updating things.
Not sure in general why there are two files with the reconciler code.

duplicateCerts := []string{}
for _, cert := range ingressClassCertificates {
if id, exists := certIDMap[spec.CertificateFingerprint(*cert.Data.FingerprintSha256)]; exists {
duplicateCerts = append(duplicateCerts, id)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we add the id from the certIDMap map which would result in deleting the certs that want to keep.

Suggested change
duplicateCerts = append(duplicateCerts, id)
duplicateCerts = append(duplicateCerts, cert.Id)


if ptr.Deref(c.Name, "") != ptr.Deref(d.Name, "") ||
ptr.Deref(c.TargetPort, 0) != ptr.Deref(d.TargetPort, 0) ||
len(c.Targets) != len(d.Targets) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A len check is not enough here.

return len(c.CertificateConfig.CertificateIds) != len(d.CertificateConfig.CertificateIds)
}

func targetPoolsChanged(current, desired []albsdk.TargetPool) bool {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misses activeHealthCheck

}
for i := range ingressClassCertificates {
cert := &ingressClassCertificates[i]
if err := r.CertificateClient.DeleteCertificate(ctx, r.ALBConfig.Global.ProjectID, r.ALBConfig.Global.Region, *cert.Id); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an async call? If yes we should wait until certs are gone before removing the finalizer.

Comment thread pkg/controller/ingress/suite_test.go
region string,
) *albsdk.CreateLoadBalancerPayload {
listeners := []albsdk.Listener{}
for port, listener := range t.listeners {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map interactions are not deterministic. This will lead to issues while compare if a change is needed in

func listenersChanged(current, desired []albsdk.Listener) bool {

CertificateIds: []string{},
},
}
for fingerprint, cert := range t.certificates {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map interactions are not deterministic. If I remember correctly the order in the ALB matters, as always the first match will be used. This could lead in a flapping which cert is used if multiple are fitting.

WithRule("my-host.local", WithPath("/", new(networkingv1.PathTypePrefix), service.Name, networkingv1.ServiceBackendPort{Number: 80})),
)
testutil.CreateKubernetesResourceAndDeferDeletion(ctx, k8sClient, &ingress)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also shpuld check that the IP is set in the status of the ingress.

Comment thread pkg/controller/ingress/update_test.go Outdated
expected: true,
},
{
name: "https certificates changed",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add the same but with also 2 Ids.

Comment thread pkg/controller/ingress/update_test.go Outdated
expected: true,
},
{
name: "target pool tls validation changed",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add targets itself and test.

@@ -0,0 +1,302 @@
package ingress

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the update.go are multiple functions which also should be tested, like getServicesForIngresses, getTLSSecretsFromIngresses, getCertificatesForIngressClass. But also the logic for duplicateCerts should get tested.

Comment thread pkg/controller/ingress/update_test.go Outdated
)

Expect(errs).To(HaveLen(1))
Expect(errs[0].Description).To(Equal("invalid certificate: tls: failed to find any PEM data in certificate input"))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in test should return an error when the TLS secret isn't of type TLS

Comment thread pkg/controller/ingress/spec/worktree_test.go
Comment thread pkg/controller/ingress/spec/worktree.go Outdated

tree := &WorkTreeALB{
ingressClass: ingressClass,
planID: GetAnnotation(AnnotationPlanID, "", ingressClass),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate that we only get working PlanIDs. For NLBs we check that in cloud-provider-stackit

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ties together with https://github.com/stackitcloud/application-load-balancer-controller/pull/3/changes#r3511950744

However, this does not only break an ingress but the entire ALB. Because we introduce the webhook I would just return an error here and not process any updates.

@@ -0,0 +1,658 @@
package spec

@dergeberl dergeberl Jun 29, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add tests for:

  • the AnnotationHTTPSOnly and that it not serves HTTP.
  • the sorting based on AnnotationPriority and creation timestamp.
  • AnnotationPlanID
  • the handling when setting not allowed values in annotation (like no valid int/port, not valid cidr in source range)
  • GetUnusedCertificates

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added tests for 1, 2 and 5 in c9d5229

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comment

Comment thread deploy/deployment.yaml
@@ -0,0 +1,85 @@
apiVersion: apps/v1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should align this with the deploy of CSI and CCM in cloud-provider-stackit (which are also not perfect, see stackitcloud/cloud-provider-stackit#1162).
For example:

  • we can use also the kube-system namespace as this is also a system component like CSI and CCM
  • the cloud config don't need to be a secret
  • the mount path of the service account and cloud config
  • the apiEndpoints don't need to be in the example (this is only internally needed)
  • lets also create a /deploy folder instead if the /samples/deploy/
  • we need to find a image tag which is working and don't need to be updated like latest (not sure if this will be tagged on release)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With #6 we get a latest tag for tagged versions. Lets use that in the deployment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5248047

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for latest tag here. We can leave it like it is for now and solve this directly before merging. After merge I will create a release which then also creates the latest tag

Comment thread pkg/controller/ingress/ingressclass_controller.go Outdated
ConditionNodeTermination corev1.NodeConditionType = "Terminating"
)

func isNodeTerminating(node *corev1.Node) bool {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That changes also need to be added to the nodePredicate().

Comment on lines +161 to +175
if !httpsOnly && (httpPort <= 0 || httpPort > math.MaxUint16) {
errors = append(errors, ErrorEvent{
Ingress: ingress,
Description: "HTTP port is out of range.",
})
continue
}
if len(ingress.Spec.TLS) > 0 && (httpsPort <= 0 || httpsPort > math.MaxUint16) {
errors = append(errors, ErrorEvent{
Ingress: ingress,
Description: "HTTPS port is out of range.",
})
continue
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a test for that.

Comment thread cmd/application-load-balancer-controller/main.go
Comment thread cmd/application-load-balancer-controller/main.go
Comment thread hack/tools.mk Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/enhancement Enhancement, improvement, extension size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants