Life Connect
Table of contents
Architecture
Services
Swagger Docs
GitHub

Production Environment Configuration

Node Pools and Nodes Sizing

The cluster consists of two node pools: agentpool1, the system node pool (which hosts critical system pods such as CoreDNS), and nodepool1, the user node pool that runs the application workloads. For reliability, each pool contains two nodes, providing a minimum level of workload availability.

The node pools utilize different machine sizes: Standard_D2ds_v5 for agentpool1 and Standard_E2as_v5 for nodepool1. The infrastructure allows for horizontal scaling, enabling the addition of more nodes to the existing node pools or the creation of additional node pools.

Availability zones are leveraged to mitigate outages in supported regions. Both control plane components and node pool nodes are distributed across multiple availability zones. Note that availability zones can only be configured at the time of node pool creation and cannot be modified afterward. For further details, refer to the Microsoft documentation.

Pod Deployment and Affinity

To enhance availability and reliability, each core service—adb-accounting, adb-contracts, adb-parts, adb-persons, and adb-utilities—is deployed with two replicas, controlled through the replicas deployment attribute.
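In a Deployment manifest, this amounts to a single attribute. A minimal sketch for one of the services (the labels and image name are illustrative, not taken from the actual charts):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: adb-accounting
spec:
  replicas: 2               # two replicas per core service for availability
  selector:
    matchLabels:
      app: adb-accounting
  template:
    metadata:
      labels:
        app: adb-accounting
    spec:
      containers:
        - name: adb-accounting
          image: adb-accounting:latest   # placeholder image reference
```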

According to user feedback from approximately two years ago (from Kevin), the application performed significantly better in a development environment running on a single server than in its Azure deployment. Since pod resource limits and JVM options were identical in both environments, reduced network latency was inferred to be a significant factor in the performance difference.

Given that no major changes have been made to the network communication pattern, the deployment is configured to ensure that services crucial to business operations run within the same node. Using podAffinity and podAntiAffinity, adb-accounting pods are restricted to different availability zones, while adb-contracts, adb-parts, adb-persons, and adb-utilities must run on the same node as adb-accounting. Additionally, two pods of the same service cannot run on the same node. By setting internalTrafficPolicy to Local, network communication between these services remains within a single physical machine, reducing latency.
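The rules above can be sketched as the following manifest fragments, assuming illustrative `app` labels and port numbers (the real charts may name things differently):

```yaml
# adb-accounting pod template: required anti-affinity on the zone
# topology key forces the two replicas into different availability
# zones (and therefore onto different nodes).
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: adb-accounting
        topologyKey: topology.kubernetes.io/zone
---
# adb-contracts (likewise adb-parts, adb-persons, adb-utilities):
# affinity pins each replica to a node running adb-accounting, while
# anti-affinity keeps two replicas of the same service off one node.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: adb-accounting
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: adb-contracts
        topologyKey: kubernetes.io/hostname
---
# The Service keeps traffic on the node of the calling pod.
apiVersion: v1
kind: Service
metadata:
  name: adb-accounting
spec:
  internalTrafficPolicy: Local   # route only to endpoints on the same node
  selector:
    app: adb-accounting
  ports:
    - port: 8080                 # illustrative port
```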

Deployment Updates and Zero Downtime Rollout

To facilitate zero-downtime deployments, rolling updates are configured to prevent service interruptions. Without additional configuration, the cluster would try to keep two pods of the older version running while the new version's pods stayed pending due to insufficient resources. To avoid this, the RollingUpdate strategy ensures that Kubernetes shuts down one older pod before creating a pod of the newer version, continuing this process iteratively. More details on rolling updates can be found in the Kubernetes documentation.
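This "remove one old pod, then add one new pod" behavior corresponds to a strategy with `maxUnavailable: 1` and `maxSurge: 0`, sketched here as a Deployment fragment:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # take down at most one old pod at a time...
      maxSurge: 0         # ...and never schedule an extra pod beyond replicas
```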

Resource Requests and Limits

To prevent resource exhaustion during computation spikes (e.g., CFR generation), the pod resource requests and limits are set to the same values. This ensures that pods always have sufficient resources allocated and are not subject to unpredictable throttling. Additionally, the JVM -XX:MaxRAM option is set to match the pod memory limit, allowing the JVM to utilize all available memory within the pod efficiently.
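A sketch of this configuration for one container; the concrete CPU and memory values are illustrative, and passing the JVM flag via `JAVA_TOOL_OPTIONS` is an assumption about how the option reaches the JVM:

```yaml
containers:
  - name: adb-accounting
    resources:
      requests:
        cpu: "1"          # illustrative values
        memory: 2Gi
      limits:
        cpu: "1"          # requests == limits gives predictable scheduling
        memory: 2Gi       # (Guaranteed QoS class, no overcommit surprises)
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAM=2g"   # kept in sync with the pod memory limit
```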

Network Policy

The AKS cluster is configured with the open-source Calico network policy engine. While no Kubernetes NetworkPolicy resources currently restrict traffic between pods, such policies can be added in the future. Network policy support is enabled at cluster provisioning time, as it cannot be added afterward.
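As an example of what could be added later, a hypothetical policy restricting ingress to adb-accounting to the other core services might look like this (labels are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-core-to-accounting
spec:
  podSelector:
    matchLabels:
      app: adb-accounting      # policy applies to adb-accounting pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchExpressions:
              - key: app
                operator: In
                values: [adb-contracts, adb-parts, adb-persons, adb-utilities]
```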

Contributors: gregory