What Are Topology Spread Constraints? Meaning, Architecture, Examples, Use Cases, and How to Measure Them (2026 Guide)


Quick Definition (30–60 words)

Topology spread constraints are Kubernetes scheduling rules that distribute Pods across failure domains such as nodes, zones, or regions to reduce correlated failures. Analogy: like seating guests at different tables so one table's mishap doesn't spoil the whole party. Formal: a Kubernetes PodSpec field that guides the scheduler toward even Pod placement across the domains identified by a topologyKey.


What are Topology spread constraints?

Topology spread constraints are scheduling policies in Kubernetes that guide how replicas of workloads are distributed across defined topology domains (node, zone, region, custom labels). They are not a replacement for higher-level resilience like global active-active architecture or multi-cluster control planes.

What it is NOT:

  • Not a traffic load balancing mechanism.
  • Not an automatic failover or state replication system.
  • Not a guarantee against all correlated failures; it reduces probability.

Key properties and constraints:

  • Controlled via PodSpec fields: topologySpreadConstraints.
  • Supports maxSkew, topologyKey, whenUnsatisfiable, and labelSelector.
  • Two whenUnsatisfiable modes: DoNotSchedule and ScheduleAnyway (best-effort).
  • Works at scheduler time; reconciliation and evictions are separate concerns.
  • Depends on node labeling and cluster topology awareness.
  • Interacts with PodDisruptionBudgets, affinity/anti-affinity, and taints/tolerations.
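The properties above are set in the Pod template. A minimal Deployment sketch is shown below; the `app: web` labels and image are illustrative placeholders, not values from any particular system:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                # illustrative name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                               # domains may differ by at most 1 Pod
          topologyKey: topology.kubernetes.io/zone # group nodes by their zone label
          whenUnsatisfiable: DoNotSchedule         # strict: leave the Pod Pending instead
          labelSelector:
            matchLabels:
              app: web                             # Pods counted when computing skew
      containers:
        - name: web
          image: nginx:1.25                        # illustrative image
```

Switching `whenUnsatisfiable` to ScheduleAnyway turns the same rule into a best-effort preference.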

Where it fits in modern cloud/SRE workflows:

  • Reliability engineering: reduce blast radius by distributing replicas.
  • Capacity planning: informs placement choices alongside resource requests.
  • CI/CD: included in Pod specs during deployment pipelines.
  • Observability: telemetry to show distribution, imbalance, and scheduling failures.
  • Security: used to ensure workloads span availability boundaries for compliance.

Diagram description (text-only):

  • Imagine a grid of clusters, each cluster has zones, zones have nodes.
  • A Deployment with 12 replicas has a rule: maxSkew 1 across topologyKey zone.
  • Scheduler maps Pods so no zone differs by more than 1 Pod.
  • If a zone loses a node, pods in other zones remain within skew by rescheduling.

Topology spread constraints in one sentence

A Kubernetes scheduling construct that enforces spread of Pods across named topology domains to reduce correlated failures and improve availability.

Topology spread constraints vs related terms

| ID | Term | How it differs from Topology spread constraints | Common confusion |
| --- | --- | --- | --- |
| T1 | Pod anti-affinity | Prevents specific Pods from colocating on the same node or host | People think both do the same thing |
| T2 | Node affinity | Targets nodes by label for placement, not spread balancing | Often conflated with spread |
| T3 | PodDisruptionBudget | Limits voluntary disruptions, not placement distribution | PDBs don't schedule Pods |
| T4 | DaemonSet | Schedules one Pod per node rather than balancing replicas | Confused with guaranteed per-node placement |
| T5 | StatefulSet | Controls stable identities; spread is a scheduling concern | StatefulSet has its own ordering semantics |
| T6 | PriorityClass | Affects preemption, not distribution across topologies | High-priority Pods can still be placed unevenly |
| T7 | Taints/Tolerations | Block placement unless tolerated; spread needs labeled topology keys | Taints are not a distribution mechanism |
| T8 | Scheduler extender | Modifies scheduling decisions beyond built-in spread | People assume extenders are needed for spread |
| T9 | ReplicaSet | Ensures desired replica count, not topology distribution | ReplicaSet does not enforce skew |
| T10 | Multi-cluster controllers | Control cross-cluster placement, while spread is intra-cluster | Confused with global distribution |


Why do Topology spread constraints matter?

Business impact:

  • Revenue: Proper distribution reduces outages from single-zone failures, avoiding revenue loss during incidents.
  • Trust: Higher availability builds customer confidence.
  • Risk: Limits single-point failures that cause regulatory or contractual violations.

Engineering impact:

  • Incident reduction: Fewer correlated failures, fewer cascading incidents.
  • Velocity: Teams can deploy safer defaults and focus on features instead of manual placement tuning.
  • Complexity trade-off: Misconfiguration can reduce scheduling efficiency and leave Pods unschedulable.

SRE framing:

  • SLIs/SLOs: A spread-related SLI could be “fraction of replicas within allowed skew per workload”.
  • Error budgets: Violations tied to distribution can consume error budget and trigger remediation.
  • Toil: Automating topology-aware deployments reduces manual interventions.
  • On-call: Alerts surface topology imbalance and scheduling failures as high-severity issues.

What breaks in production (realistic examples):

  1. Zone outage concentrates all replicas in remaining zones causing over-capacity and degraded latency.
  2. Autoscaler launches many Pods on one node due to cloud provider API lag, violating spread and causing hot nodes.
  3. Stateful workload scheduled unevenly, leading to quorum loss in distributed databases.
  4. Misconfigured labels/topologyKey lead to all replicas placed on control-plane nodes, increasing blast radius.
  5. Interaction with PDBs prevents rescheduling during voluntary maintenance, causing capacity shortfall.

Where are Topology spread constraints used?

| ID | Layer/Area | How Topology spread constraints appear | Typical telemetry | Common tools |
| --- | --- | --- | --- | --- |
| L1 | Edge | Spread across edge nodes or POPs to avoid single-POP failure | Pod distribution by POP, latency variance | Kubernetes scheduler, Prometheus |
| L2 | Network | Avoid colocating critical network proxies on the same switch | Pod-to-switch mapping, packet loss | CNI metrics, node exporter |
| L3 | Service | Spread replicas of a microservice across AZs | Replica skew, request error rates | Service metrics, traces |
| L4 | App | App-level replicas distributed across nodes for resilience | Instance health, restart counts | App logs, Prometheus |
| L5 | Data | Distribute stateless cache frontends across zones | Cache hit ratio, evictions | Observability tools, Redis metrics |
| L6 | IaaS | Provider availability zones or host aggregates as the topologyKey | Cloud zone health, node labels | Cloud provider telemetry |
| L7 | PaaS/Kubernetes | Native Kubernetes PodSpec topologySpreadConstraints | Scheduler events, Pod scheduling failures | kubectl, kube-state-metrics, Prometheus |
| L8 | Serverless | Managed runtime may expose placement controls, or none | Invocation latency skew, cold starts | Provider metrics (varies) |
| L9 | CI/CD | Deployment manifests include topology rules in pipelines | Deployment validation, lint results | GitOps pipelines, test suites |
| L10 | Incident response | Mitigation to reduce impact scope during incidents | Distribution imbalance alerts, PDB status | On-call tools, runbooks |


When should you use Topology spread constraints?

When necessary:

  • Critical services with multiple replicas that must survive zone/node loss.
  • Stateful systems requiring quorum distribution.
  • Compliance or contractual needs to distribute across failure domains.

When optional:

  • Burst workloads that are cheap to recreate and have short lifetimes.
  • Development or test namespaces where availability is not critical.

When NOT to use / overuse it:

  • Small clusters where topology domains are too limited; constraints will block scheduling.
  • Highly dynamic, ephemeral jobs that add overhead to scheduler decisions.
  • When underlying cloud provider topology is unreliable or unlabeled.

Decision checklist:

  • If you need high availability across zones and you have >1 zone -> use topologySpreadConstraints.
  • If pods are stateful requiring unique identities -> consider StatefulSet + spread rules.
  • If cluster has limited nodes per topology domain -> prefer ScheduleAnyway and add capacity planning.
  • If Pod disruption budgets block rescheduling during maintenance -> coordinate PDBs and spread.

Maturity ladder:

  • Beginner: Use a single zone-based constraint with topologyKey=topology.kubernetes.io/zone and whenUnsatisfiable=ScheduleAnyway.
  • Intermediate: Combine spread with anti-affinity and PDBs; add scheduler metrics.
  • Advanced: Automated placement controllers, cross-cluster distribution, chaos testing and automated remediation.

How do Topology spread constraints work?

Components and workflow:

  • PodSpec contains topologySpreadConstraints list per workload.
  • Scheduler evaluates available nodes using constraints in score and filter phases.
  • maxSkew defines acceptable difference between most and least populated topology domains.
  • topologyKey is a node label key e.g., topology.kubernetes.io/zone or custom key.
  • whenUnsatisfiable decides strictness: DoNotSchedule blocks scheduling if constraint violated; ScheduleAnyway allows best-effort.
  • Label selector restricts set of Pods considered when computing skew.

Data flow and lifecycle:

  1. Deployment creates ReplicaSet which creates Pods with spread constraints.
  2. Scheduler checks nodes, gathers counts of matching Pods per topology domain.
  3. Scheduler filters nodes that would violate DoNotSchedule; otherwise ranks nodes by evenness.
  4. Pod is bound to node; kubelet starts container.
  5. A descheduler, eviction, or node failure triggers new scheduling decisions that recalculate spread.
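The skew check in steps 2 and 3 can be sketched in a few lines. This is a simplification of the real kube-scheduler logic, which also handles unlabeled nodes, minDomains, and node-affinity filtering:

```python
def satisfies_max_skew(counts, candidate_domain, max_skew):
    """Return True if placing one more matching Pod in candidate_domain keeps
    the difference between the fullest and emptiest domains within max_skew.

    counts: dict mapping topology domain -> number of matching Pods.
    Simplified sketch: the real scheduler also considers eligible domains,
    minDomains, and nodes that lack the topology label.
    """
    new_counts = dict(counts)
    new_counts[candidate_domain] = new_counts.get(candidate_domain, 0) + 1
    return max(new_counts.values()) - min(new_counts.values()) <= max_skew

# Three zones currently at 2/2/1 Pods with maxSkew=1: only placements that
# keep the spread within 1 pass the filter.
counts = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
print(satisfies_max_skew(counts, "zone-c", 1))  # True  -> 2/2/2
print(satisfies_max_skew(counts, "zone-a", 1))  # False -> 3/2/1
```

With DoNotSchedule, a False result filters the node out entirely; with ScheduleAnyway, it only lowers the node's score.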

Edge cases and failure modes:

  • Insufficient nodes in a topology domain cause DoNotSchedule rejections.
  • Missing or inconsistent node labels break domain grouping leading to uneven distribution.
  • Interaction with pod anti-affinity and taints can create unsatisfiable conflict.
  • Scale-up latency from cluster autoscaler may produce temporary imbalance.
  • Admission controllers or mutating webhooks that change labels can affect computed skew.

Typical architecture patterns for Topology spread constraints

  • Pattern: Zone-based even spread
  • Use when: Multi-AZ clusters requiring resilience to AZ failure.
  • Pattern: Node-group balancing
  • Use when: Need to avoid localized hot-spots on instance groups.
  • Pattern: Rack-aware spread
  • Use when: On-prem or bare-metal clusters with rack labels.
  • Pattern: Service-role spread
  • Use when: Spread across nodes with distinct hardware or NICs.
  • Pattern: Multi-cluster-aware spread controller
  • Use when: Global distribution across clusters; often combined with higher orchestration.
  • Pattern: Canary with spread override
  • Use when: Canary rollout should target specific topologies first.

Failure modes & mitigation

| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
| --- | --- | --- | --- | --- | --- |
| F1 | Scheduling blocked | Pods Pending indefinitely | DoNotSchedule with insufficient nodes | Use ScheduleAnyway or add capacity | Pending Pod age and events |
| F2 | Uneven distribution | One zone has many replicas | Missing node labels or scheduler issue | Fix labels or adjust selectors | Replica skew metric by zone |
| F3 | Thrash after node loss | Frequent reschedules and restarts | PDBs prevent evictions or capacity shortage | Adjust PDBs and autoscaler | Restart counts, eviction rate |
| F4 | Quorum loss | Distributed DB loses quorum | Replica placement clustered on few nodes | Enforce stronger spread and anti-affinity | Error spikes, DB leader changes |
| F5 | Performance hotspot | High CPU on nodes hosting many Pods | Scheduler scoring or affinities override spread | Tune scheduler weights and affinity | Node CPU, request latency |
| F6 | Conflict with anti-affinity | Unschedulable Pods due to conflicting rules | Conflicting Pod affinity rules | Reconcile rules or create exceptions | Scheduler events, filtering reasons |
| F7 | Over-constraining during scale | New replicas fail to schedule | Too-strict maxSkew or DoNotSchedule | Relax skew or provide more capacity | Deployment scaling failures |


Key Concepts, Keywords & Terminology for Topology spread constraints


  • Topology spread constraints — Scheduler rules in PodSpec to spread Pods across topology domains — Ensures replica distribution — Pitfall: over-constraining can block scheduling.
  • topologyKey — Node label key used as domain, e.g., zone — Determines grouping for spread — Pitfall: unlabeled nodes ignored.
  • maxSkew — Maximum allowed difference between counts in domains — Controls evenness — Pitfall: low value prevents scaling.
  • whenUnsatisfiable — Behavior when constraint cannot be met (DoNotSchedule or ScheduleAnyway) — Dictates strictness — Pitfall: DoNotSchedule can block pods.
  • labelSelector — Selects which Pods are counted for skew — Targets specific sets — Pitfall: wrong selector miscomputes skew.
  • Scheduler — Kubernetes component that binds Pods to nodes — Implements spread logic — Pitfall: custom extenders may override behavior.
  • PodSpec — Pod definition in Kubernetes — Hosts topologySpreadConstraints — Pitfall: forgetting to add constraints in templates.
  • Node labels — Key/value metadata on nodes used as keys — Used as topology domains — Pitfall: inconsistent labeling across nodes.
  • Zone — Cloud availability zone concept used as a topology layer — Common topologyKey — Pitfall: provider zones vary in naming.
  • Region — Higher-level topology domain above zones — Useful for multi-region clusters — Pitfall: cross-region latency.
  • Anti-affinity — Constraint to avoid colocating specific Pods — Related to spread but targets affinity between Pods — Pitfall: conflicts with spread constraints.
  • Affinity — Rules to prefer or require co-location — Opposite of anti-affinity — Pitfall: affinity can override spread preferences.
  • PDB (PodDisruptionBudget) — Limits voluntary disruptions — Ensures minimum available replicas — Pitfall: can block maintenances.
  • ReplicaSet — Controller ensuring replica count — Works with spread but doesn’t enforce it — Pitfall: scale spikes may affect skew.
  • StatefulSet — Controller for stateful apps with stable identities — Works with spread but ordering matters — Pitfall: startup order can delay distribution.
  • DaemonSet — Ensures Pods on each node — Not used for spread but used for per-node services — Pitfall: conflicts with capacity.
  • Taints — Node-level markers preventing placement unless tolerated — Affects where Pods can land — Pitfall: missing tolerations cause scheduling failures.
  • Tolerations — Pod settings to tolerate taints — Needed when using taints — Pitfall: over-tolerating can increase blast radius.
  • Scheduler Extender — External component to influence scheduling — Rarely needed for basic spread — Pitfall: complexity and maintenance.
  • kube-scheduler-policy — Deprecated approach for scheduler behavior — Historical context — Pitfall: version mismatch.
  • kube-state-metrics — Emits Kubernetes object metrics — Useful for counting Pod distribution — Pitfall: cardinality explosion if misused.
  • Prometheus — Monitoring system to record distribution metrics — Enables alerts — Pitfall: incomplete instrumentation.
  • Cluster Autoscaler — Scales node pools based on pending pods — Helps satisfy DoNotSchedule constraints — Pitfall: slow scale-up can leave pods pending.
  • NodePool — Group of nodes with similar properties — Often corresponds to topology unit — Pitfall: single nodepool in multiple zones causes misleading labels.
  • Quorum — Minimum nodes/replicas required for consistency — Important for stateful systems — Pitfall: poor distribution risks quorum loss.
  • Reachability — Network connectivity between nodes/zones — Underpins effective spread — Pitfall: network partitions invalidate benefit.
  • Scheduling Score — Numeric value the scheduler uses to prefer nodes — Spread contributes to scoring — Pitfall: scoring weights might not be tuned.
  • Scheduler Predicates — Pre-filter checks to allow candidate nodes — Spread can be enforced here for DoNotSchedule — Pitfall: predicates conflict can block scheduling.
  • Eviction — Forcing Pod removal from node — Interacts with spread during node failures — Pitfall: mass evictions cause reschedule storms.
  • Descheduler — Component or controller that may attempt to rebalance Pods — Not present by default — Pitfall: absence means imbalance persists.
  • Best-effort scheduling — whenUnsatisfiable=ScheduleAnyway — Allows placement disregarding strictness — Pitfall: temporary imbalance.
  • Admission controller — Validates or mutates Pod objects — Can inject topology defaults — Pitfall: misconfiguration mutates constraints unexpectedly.
  • GitOps — Declarative deployment patterns — Distributes constraints via manifests — Pitfall: drift between clusters.
  • Canary — Gradual rollout pattern — Spread ensures canary isn’t isolated — Pitfall: canary may not represent full topology.
  • Chaos engineering — Failure injection to validate spread resilience — Tests constraint effectiveness — Pitfall: risk if not scoped.
  • Observability signal — Telemetry that indicates spread health — Crucial for alerts — Pitfall: missing signals hide issues.
  • SLI — Service Level Indicator relevant to spread such as replica skew fraction — Measurement basis — Pitfall: wrong SLI leads to meaningless alerts.
  • SLO — Service Level Objective based on SLI — Drives error budget for spread violations — Pitfall: unrealistic SLO.
  • Error budget — Allowance for SLO violation — Used to guide remediation — Pitfall: exhaustion causes operational limits.
  • Load balancing — Distributes traffic across replicas — Complementary to spread — Pitfall: spread does not change LB behavior.
  • Multi-cluster — Managing workloads across clusters — Spread is intra-cluster; multi-cluster expands scope — Pitfall: relying solely on spread for global resilience.

How to Measure Topology spread constraints (Metrics, SLIs, SLOs)

| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
| --- | --- | --- | --- | --- | --- |
| M1 | Replica skew per topology domain | Degree of imbalance across domains | Count matching Pods grouped by topology label | <=1 difference for critical services | Label mismatches hide Pods |
| M2 | Pending Pods due to DoNotSchedule | Pods blocked by strict rules | Count Pending Pods with reason Unschedulable | 0 sustained Pending | Transient spikes possible |
| M3 | Scheduling failure rate | Frequency of scheduling errors | Count failed scheduling events per minute | <1% of scheduling ops | Event sampling can miss issues |
| M4 | Reschedule frequency per Pod | Pod churn due to evictions | Count restarts and reschedules per Pod per hour | <1 per day for stable services | Autoscaler churn inflates the metric |
| M5 | PDB violation incidents | How often a PDB blocked eviction | Count voluntary disruption failures | 0 for critical services | Necessary maintenance may trigger it |
| M6 | Node distribution variance | Statistical variance of Pods per domain | Compute variance across domain counts | Low variance per service | Small samples distort variance |
| M7 | Time to restore skew | Time to return to allowed skew after failure | Time from imbalance detected to restored | <10 minutes with autoscaling | Slow autoscale or manual steps increase it |
| M8 | Error rate correlated to skew | Application errors related to imbalance | Error rate during high-skew windows | SLO-based, e.g., <1% increase | Correlation requires traceability |
| M9 | Capacity shortfall events | Schedulable capacity is insufficient | Count scale-up request events | 0 frequent events | Spot instance interruptions can cause bursts |
| M10 | Scheduler latency under constraints | Effect of constraints on scheduling time | Measure scheduler decision latency per Pod | <200 ms | Large clusters increase latency |

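As a worked illustration of M1 and M6, skew and variance can be computed from a Pod-to-domain mapping. The helper below is hypothetical glue code, not part of any Kubernetes API; in practice the counts would come from kube-state-metrics or kubectl output:

```python
from collections import Counter
from statistics import pvariance

def distribution_stats(pod_domains):
    """Compute replica skew (M1) and distribution variance (M6) for one
    workload, given each Pod's topology domain as a list of strings.
    Note: domains with zero Pods are invisible here unless added explicitly.
    """
    counts = Counter(pod_domains)
    values = list(counts.values())
    return {
        "counts": dict(counts),
        "skew": max(values) - min(values),   # M1: fullest minus emptiest domain
        "variance": pvariance(values),       # M6: population variance of counts
    }

# 5 replicas over 3 zones: counts 2/2/1.
stats = distribution_stats(["zone-a", "zone-a", "zone-b", "zone-b", "zone-c"])
print(stats["skew"])  # 1 -> within a maxSkew of 1
```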

Best tools to measure Topology spread constraints

Tool — Prometheus

  • What it measures for Topology spread constraints:
  • Pod counts per topology domain and scheduling events.
  • Best-fit environment:
  • Kubernetes clusters with kube-state-metrics.
  • Setup outline:
  • Install kube-state-metrics.
  • Scrape metrics from kube-state-metrics and kube-scheduler.
  • Create recording rules for replica skew.
  • Build dashboards and alerts.
  • Strengths:
  • Query language for flexible metrics and alerts.
  • Widely adopted in cloud-native stacks.
  • Limitations:
  • High cardinality can increase storage cost.
  • Needs proper instrumentation for scheduler internals.
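As an illustration of the recording-rule step, a skew rule might look like the following sketch. The metric name `pod_zone_info` is a placeholder assumption: mapping Pods to zones usually requires joining kube_pod_info with node zone labels, and the exact series available depend on your kube-state-metrics version and label allow-list, so treat this as a shape to adapt rather than a drop-in rule:

```yaml
groups:
  - name: topology-spread
    rules:
      - record: workload:replica_skew:zone
        # Sketch: fullest zone minus emptiest zone per app label.
        # pod_zone_info is a hypothetical pre-joined series exposing
        # one sample per Pod with "app" and "zone" labels.
        expr: |
          max by (app) (count by (app, zone) (pod_zone_info))
          - min by (app) (count by (app, zone) (pod_zone_info))
```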

Tool — kube-state-metrics

  • What it measures for Topology spread constraints:
  • Kubernetes object state including Pod labels and counts.
  • Best-fit environment:
  • Any Kubernetes cluster feeding Prometheus.
  • Setup outline:
  • Deploy in cluster.
  • Ensure RBAC allows reading pods and nodes.
  • Map metrics to topology label aggregations.
  • Strengths:
  • Exposes native object metrics.
  • Low-level visibility.
  • Limitations:
  • Not a complete telemetry solution on its own.

Tool — kube-scheduler metrics / logs

  • What it measures for Topology spread constraints:
  • Scheduler decisions, filtering reasons, and latencies.
  • Best-fit environment:
  • Clusters where you can access control plane metrics.
  • Setup outline:
  • Enable scheduler metrics.
  • Collect logs and metric endpoints securely.
  • Create alerts for high filter rates.
  • Strengths:
  • Direct insight into scheduling decisions.
  • Limitations:
  • Access restricted in managed clusters.

Tool — Grafana

  • What it measures for Topology spread constraints:
  • Visualization and dashboards for spread metrics.
  • Best-fit environment:
  • Any environment with Prometheus.
  • Setup outline:
  • Create dashboards with panels for skew, pending pods, scheduler latency.
  • Provide templates for services and topology keys.
  • Strengths:
  • Flexible dashboards and annotations.
  • Limitations:
  • Needs backing data source; not a collector.

Tool — Cluster Autoscaler

  • What it measures for Topology spread constraints:
  • Scale-up requests triggered by pending pods and unschedulable conditions.
  • Best-fit environment:
  • Cloud-managed or self-hosted autoscaler-enabled clusters.
  • Setup outline:
  • Configure node pools and scale settings.
  • Monitor pending pods and scale events.
  • Strengths:
  • Automated capacity reactions to satisfy DoNotSchedule.
  • Limitations:
  • Scale-up latency; cost implications.

Tool — Observability/tracing (Jaeger/Tempo)

  • What it measures for Topology spread constraints:
  • Correlate user-facing errors to deployment topology imbalances.
  • Best-fit environment:
  • Microservice architectures with tracing.
  • Setup outline:
  • Instrument services with tracing.
  • Tag traces with node/zone metadata.
  • Analyze error spikes relative to skew events.
  • Strengths:
  • Deep root cause correlation across systems.
  • Limitations:
  • Instrumentation overhead and sampling limits.

Recommended dashboards & alerts for Topology spread constraints

Executive dashboard:

  • Panels:
  • Overall percent of workloads within allowed skew — executive-friendly availability metric.
  • Number of critical services impacted by imbalance — business impact view.
  • Recent severe scheduling incidents and time to restore.
  • Why:
  • High-level health and business exposure.

On-call dashboard:

  • Panels:
  • Replica skew per topologyKey for the service in question.
  • Pending pods with scheduling reasons and age.
  • Node CPU/memory across topology domains.
  • Scheduler filter/failure logs for recent events.
  • Why:
  • Rapid triage and root-cause identification.

Debug dashboard:

  • Panels:
  • Pod distribution heatmap by node and label.
  • Scheduler decision trace for sample pods.
  • PDB status and eviction history.
  • Autoscaler scale events and pending pod queue.
  • Why:
  • Deep troubleshooting during incidents.

Alerting guidance:

  • Page vs ticket:
  • Page: sustained pending pods due to DoNotSchedule for production critical services; time to restore exceeds threshold; quorum loss events.
  • Ticket: transient imbalance below severity thresholds, minor skew that auto-corrects within short windows.
  • Burn-rate guidance:
  • Tie spread-related SLOs to error budget; increased burn-rate due to distribution violations should trigger escalations and remediation playbooks.
  • Noise reduction tactics:
  • Dedupe by topologyKey and service.
  • Group alerts for a service rather than individual pods.
  • Suppress alerts during planned maintenance windows and autoscaler correction windows.

Implementation Guide (Step-by-step)

1) Prerequisites

  • Kubernetes cluster with multiple topology domains labeled.
  • RBAC to allow kube-state-metrics and observability tools.
  • CI/CD pipeline that deploys manifests.
  • Monitoring and alerting system (Prometheus + Grafana or equivalent).

2) Instrumentation plan

  • Add topologySpreadConstraints to Pod templates in manifests.
  • Ensure node labels exist and are consistent.
  • Emit custom metrics for replica skew and pending Pods.

3) Data collection

  • Deploy kube-state-metrics and scrape with Prometheus.
  • Collect kube-scheduler metrics and events.
  • Gather node exporter or cloud provider telemetry for topology health.

4) SLO design

  • Define the SLI: fraction of critical services within allowed skew.
  • Set the SLO: e.g., 99.9% of time replicas within allowed skew over 30 days (example starting point).
  • Define alert thresholds for sustained imbalance.

5) Dashboards

  • Build the executive, on-call, and debug dashboards described above.
  • Add templating to select service, namespace, and topologyKey.

6) Alerts & routing

  • Create alerts for pending Pods, skew breaches, and scheduling failures.
  • Route critical pages to SRE on-call; minor tickets to application owners.

7) Runbooks & automation

  • Create runbooks for common faults: pending Pods, imbalance, label drift.
  • Automate remediation: trigger the autoscaler, add node labels, or recreate failing nodes.

8) Validation (load/chaos/game days)

  • Run chaos experiments to simulate zone/node failures.
  • Test autoscaler response and restoration time.
  • Verify PDB interactions and failover behavior.

9) Continuous improvement

  • Weekly reviews of skew incidents and root causes.
  • Iterate constraints per workload maturity.
  • Automate recommended changes via GitOps.

Pre-production checklist:

  • Node labels present and consistent across nodes.
  • Metrics and dashboards show expected baseline.
  • CI/CD pipelines include topologySpreadConstraints linting.
  • PDBs and other constraints reviewed for conflicts.
  • Autoscaler behavior validated for pending pods.

Production readiness checklist:

  • Alerts configured and tested to page on real conditions.
  • Runbooks available and practiced.
  • Capacity buffer exists for expected failovers.
  • Chaos tests passed at least once in staging.

Incident checklist specific to Topology spread constraints:

  • Check pending pods and scheduling reasons.
  • Inspect node labels and topologyKeys.
  • Verify PDBs are not blocking evictions.
  • Check cluster autoscaler logs and scale events.
  • If necessary, relax whenUnsatisfiable or adjust maxSkew temporarily.
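For the last step, relaxing a constraint means editing the Pod template so the scheduler falls back to best-effort placement; a before/after sketch with illustrative values (note that changing the template triggers a rollout):

```yaml
# Before: strict placement that can leave Pods Pending during an incident
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web   # illustrative label
---
# After (temporary): best-effort placement so replicas can land somewhere;
# alternatively raise maxSkew instead of changing whenUnsatisfiable.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web
```

Revert the change once capacity or labels are fixed, so the strict guarantee is restored.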

Use Cases of Topology spread constraints


1) Highly available frontend service

  • Context: Public API served from a multi-AZ cluster.
  • Problem: Zone failure should not take down all replicas.
  • Why it helps: Ensures replicas are distributed across AZs.
  • What to measure: Replica skew by zone, error rate.
  • Typical tools: Prometheus, kube-state-metrics, Grafana.

2) Distributed database coordinator placement

  • Context: Small controller nodes managing DB shards.
  • Problem: Concentrated placement triggers quorum loss.
  • Why it helps: Keeps coordinators in different topology domains to maintain quorum.
  • What to measure: Leader changes, skew, latency.
  • Typical tools: Database metrics, scheduler logs.

3) Edge POP service distribution

  • Context: Edge compute across POPs or racks.
  • Problem: A single POP outage degrades regional service.
  • Why it helps: Ensures at least one replica in each POP.
  • What to measure: POP-level availability and latency.
  • Typical tools: Edge monitoring, node labels.

4) Stateful cache frontends

  • Context: Cache layer in front of a DB.
  • Problem: Co-located instances on one node cause CPU contention.
  • Why it helps: Spreads cache replicas to reduce hotspots.
  • What to measure: Node CPU, cache hit ratio.
  • Typical tools: Prometheus, node exporter.

5) Canary rollouts sensitive to topology

  • Context: Feature rollout in stages.
  • Problem: A canary placed entirely in one zone may not reveal cross-zone issues.
  • Why it helps: Places the canary across topologies for better validation.
  • What to measure: Error flux across zones.
  • Typical tools: CI/CD pipeline, metrics.

6) Multi-tenant isolation

  • Context: Tenants with SLA requirements.
  • Problem: Tenant Pods collocated, causing noisy-neighbor issues.
  • Why it helps: Topology keys direct tenant replicas across host groups.
  • What to measure: Tenant latency and node contention.
  • Typical tools: Kubernetes labels and scheduler constraints.

7) Compliance-driven distribution

  • Context: Data residency and redundancy requirements.
  • Problem: Regulation requires replicas across regions or racks.
  • Why it helps: Enforces placement across specific topology labels.
  • What to measure: Placement compliance and audit logs.
  • Typical tools: Policy-as-code and admission controls.

8) Autoscaling resilience

  • Context: Rapid traffic increases causing autoscaling.
  • Problem: New replicas placed unevenly, burdening nodes.
  • Why it helps: Distributes newly created Pods across nodes to avoid hotspots.
  • What to measure: New Pod distribution and node utilization.
  • Typical tools: Cluster Autoscaler and scheduler metrics.

9) Security isolation for critical workloads

  • Context: Critical microservices must avoid colocating with others.
  • Problem: Co-location increases attack surface.
  • Why it helps: Spread constraints combined with taints isolate critical services.
  • What to measure: Security incidents and placement adherence.
  • Typical tools: Admission controllers and policy engines.

10) Legacy workload modernization

  • Context: Migrating a monolith into microservices across zones.
  • Problem: Initial deployments concentrated in one topology.
  • Why it helps: Gradually enforce spread for safe migration.
  • What to measure: Service availability and skew during cutover.
  • Typical tools: GitOps and observability.


Scenario Examples (Realistic, End-to-End)

Scenario #1 — Kubernetes multi-AZ web service

Context: A customer-facing web service in a 3-AZ cluster.
Goal: Ensure at least one replica per AZ and limit skew to 1.
Why Topology spread constraints matters here: Prevents all replicas landing in single AZ during AZ failure.
Architecture / workflow: Deployment with replicas: 9, topologySpreadConstraints with topologyKey=topology.kubernetes.io/zone, maxSkew=1, whenUnsatisfiable=DoNotSchedule. PDB ensures minimal availability during maintenance.
Step-by-step implementation:

  1. Label nodes with zone labels correctly.
  2. Add topologySpreadConstraints to Deployment Pod template.
  3. Configure PDB to allow safe maintenance.
  4. Monitor replica counts per zone and pending pods.
  5. Test with simulated AZ failure via chaos test.

What to measure: Replica skew per zone, pending Pods due to DoNotSchedule, time to restore.
Tools to use and why: Prometheus, kube-state-metrics, Grafana, chaos tool for failover.
Common pitfalls: DoNotSchedule blocks Pods if capacity is limited; autoscaler latency causes pending Pods.
Validation: Run a zone failure simulation and confirm no service downtime and skew restored within the target time.
Outcome: Service remains available across AZs and recovers automatically.
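A sketch of the two manifests this scenario describes; names, labels, and the image are illustrative, and the minAvailable value assumes losing one of three zones (9 replicas, 3 per zone) is the failure to tolerate:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend                 # illustrative name
spec:
  replicas: 9
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                               # at most 1 Pod difference between zones
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-frontend
      containers:
        - name: web
          image: example.com/web:1.0               # illustrative image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  minAvailable: 6                    # tolerate losing one zone's worth of replicas
  selector:
    matchLabels:
      app: web-frontend
```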

Scenario #2 — Serverless managed PaaS with placement controls

Context: Managed PaaS offering allows specifying placement affinity via annotations.
Goal: Reduce correlated failures by allocating function instances across edge clusters or zones.
Why Topology spread constraints matters here: Even in managed environments, placement across domains reduces single-domain outages.
Architecture / workflow: Platform maps annotations into underlying Kubernetes topologySpreadConstraints or provider placement policies. The provider exposes a config to request spread across regions.
Step-by-step implementation:

  1. Request placement profile in application manifest.
  2. Platform validates and translates to topology keys.
  3. Operator reviews mapping to provider topologies.
  4. Deploy and measure instance distribution.

What to measure: Instance skew across domains, invocation latency variance.
Tools to use and why: Provider metrics, platform logs, Prometheus if available.
Common pitfalls: Provider limitations may cause placement hints to be ignored; lack of direct control.
Validation: Trigger a region outage test and observe failover.
Outcome: Functions continue to operate with reduced correlated-failure risk.

Scenario #3 — Incident response and postmortem for quorum loss

Context: A distributed DB loses quorum unexpectedly.
Goal: Identify placement causes and fix distribution to prevent recurrence.
Why Topology spread constraints matters here: Misplacement put too many replicas on the same node group, leading to a correlated failure.
Architecture / workflow: Examine deployment topologySpreadConstraints, PDB, and node labels. Use tracing to correlate DB leader changes with node events.
Step-by-step implementation:

  1. Run queries for pod placement at incident time.
  2. Check scheduler events and pending pods.
  3. Verify node failures or preemptions occurred.
  4. Update topologySpreadConstraints and PDBs as corrective action.
  5. Run a postmortem and test changes in staging with chaos tests.

What to measure: Time to quorum loss, number of replicas per topology domain at failure.
Tools to use and why: Scheduler logs, Prometheus, tracing, DB logs.
Common pitfalls: Missing historical metrics make it hard to correlate the sequence of events.
Validation: Confirm the new placement prevents quorum loss under the same failure scenario.
Outcome: Root cause attributed to uneven placement; constraints updated and verified.
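
A placement query of the kind used in step 1 might look like this in PromQL. Metric names come from kube-state-metrics; the zone-label join assumes node labels are allow-listed for export, and the `db` namespace is illustrative:

```promql
# Running replicas of the DB per zone at a point in time.
# Assumes kube-state-metrics runs with
# --metric-labels-allowlist=nodes=[topology.kubernetes.io/zone].
count by (label_topology_kubernetes_io_zone) (
  kube_pod_info{namespace="db", created_by_kind="StatefulSet"}
  * on (node) group_left (label_topology_kubernetes_io_zone)
    kube_node_labels
)
```

Evaluating this query at the incident timestamp shows whether replicas were concentrated in one domain when quorum was lost.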

Scenario #4 — Cost vs performance trade-off for spread constraints

Context: High-cost multi-AZ deployment with expensive cross-AZ networking.
Goal: Balance cost and resilience by selectively applying spread to critical services.
Why Topology spread constraints matters here: Spreading every workload across AZs inflates cross-AZ costs; selective use optimizes trade-off.
Architecture / workflow: Tier services into critical and non-critical. Apply strict spread to critical; use ScheduleAnyway for non-critical. Monitor costs and performance.
Step-by-step implementation:

  1. Classify services by criticality and cost impact.
  2. Update manifests appropriately.
  3. Run load tests to measure cross-AZ traffic and latency.
  4. Monitor cost telemetry and performance metrics.

What to measure: Cross-AZ network egress, replica skew, user latency.
Tools to use and why: Cost reporting tools, Prometheus, tracing.
Common pitfalls: Over-optimization can reduce resilience unintentionally.
Validation: Validate failover for critical services and acceptable degradation for non-critical ones.
Outcome: Cost optimized while maintaining SLOs for critical services.
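
The tiering above can be expressed as two Pod template fragments; the `tier` labels and maxSkew values are illustrative:

```yaml
# Critical tier: hard requirement; pods stay Pending rather than violate skew.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        tier: critical
---
# Non-critical tier: best-effort; the scheduler still prefers balance
# but never blocks placement, avoiding forced cross-AZ scale-ups.
topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        tier: batch
```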

Common Mistakes, Anti-patterns, and Troubleshooting

List of mistakes, each following Symptom -> Root cause -> Fix:

  1. Symptom: Pods Pending with Unschedulable reason -> Root cause: DoNotSchedule with not enough capacity -> Fix: Temporarily set ScheduleAnyway or add capacity.
  2. Symptom: All replicas on one node -> Root cause: Missing topology node labels -> Fix: Apply consistent labels to nodes.
  3. Symptom: High scheduler latency -> Root cause: Excessive constraints and high cluster size -> Fix: Simplify constraints or tune scheduler performance.
  4. Symptom: PDB blocks maintenance -> Root cause: Strict PDB plus DoNotSchedule -> Fix: Adjust PDBs or maintenance windows.
  5. Symptom: Quorum loss in DB -> Root cause: Replicas co-located in the same topology domain -> Fix: Enforce stronger spread with anti-affinity.
  6. Symptom: Excessive pod restarts after failure -> Root cause: Eviction storms due to mass rescheduling -> Fix: Stagger reschedules or add graceful draining.
  7. Symptom: Canary not representative -> Root cause: Canary instances in a single topology -> Fix: Spread canary across domains.
  8. Symptom: Unexpected placement in cloud provider -> Root cause: Provider overrides or mapping differences -> Fix: Validate provider topology mapping and labels.
  9. Symptom: Observability gaps during incidents -> Root cause: Topology metadata not embedded in traces -> Fix: Tag traces/metrics with node/zone metadata.
  10. Symptom: Alerts too noisy -> Root cause: Too-sensitive thresholds or no suppression during autoscale -> Fix: Add grouping, suppression windows, and dedupe.
  11. Symptom: Conflict between anti-affinity and spread -> Root cause: Contradictory Pod rules -> Fix: Reconcile rules or create exceptions.
  12. Symptom: Scale-up creates uneven nodes -> Root cause: Autoscaler adds nodes in one pool only -> Fix: Configure autoscaler with balanced node pools and scale policies.
  13. Symptom: Label drift over time -> Root cause: Automation or manual changes altering node labels -> Fix: Enforce labels via daemon or policy controller.
  14. Symptom: Metrics cardinality explosion -> Root cause: Metrics per pod with high label set -> Fix: Aggregate metrics by topology key, drop high-cardinality labels.
  15. Symptom: Misapplied constraints in CI -> Root cause: Incomplete manifests or templating issues -> Fix: Add linting and manifest validation to pipeline.
  16. Symptom: Timing-sensitive imbalance -> Root cause: Slow autoscaler reaction -> Fix: Pre-warm capacity or predictive scaling.
  17. Symptom: Rogue mutating webhook alters constraints -> Root cause: Webhook changing PodSpec labels -> Fix: Audit and restrict webhooks.
  18. Symptom: Inconsistent behavior across clusters -> Root cause: Cluster version differences and scheduler flags -> Fix: Standardize cluster configurations.
  19. Symptom: High network cost after spreading -> Root cause: Cross-AZ traffic increased by spreading stateful caches -> Fix: Selective spread and data locality considerations.
  20. Symptom: Insufficient telemetry for postmortem -> Root cause: Lack of historical scheduler metrics -> Fix: Retain scheduler and pod placement metrics for relevant retention period.
  21. Symptom: Overly complex policies -> Root cause: Mix of affinity, anti-affinity, taints and spread -> Fix: Simplify and document policy interactions.
  22. Symptom: Security boundary breach due to placement -> Root cause: Over-tolerating taints and wide spread -> Fix: Combine spread with strict taints and node isolation.
  23. Symptom: Unexpected cost spikes -> Root cause: Autoscaler scaling up to satisfy DoNotSchedule -> Fix: Monitor cost and tune autoscaler limits.

Observability pitfalls (all reflected in the mistakes above):

  • Missing topology metadata in traces prevents root cause correlation.
  • No historical scheduler metrics blocks incident analysis.
  • High metric cardinality hides useful aggregates.
  • Alerts fire without topology context causing noisy paging.
  • Lack of per-topology dashboards slows triage.

Best Practices & Operating Model

Ownership and on-call:

  • App teams own topologySpreadConstraints for their workloads.
  • SRE maintains cluster-level defaults and guidance.
  • On-call rotation should include SRE members with permissions to make cluster-level changes.

Runbooks vs playbooks:

  • Runbooks: step-by-step remediation for common symptoms.
  • Playbooks: higher-level decision flow for more complex, multi-service incidents.

Safe deployments:

  • Use canary or incremental rollouts combined with spread to minimize impact.
  • Always include PDBs reviewed against spread constraints.

Toil reduction and automation:

  • Automate label consistency with controllers.
  • Use admission controllers to inject standard spread constraints for critical namespaces.
  • Automate remediation: scale up node pools or toggle ScheduleAnyway through safe API paths.
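
One way to automate the injection described above is an admission policy. This is a sketch assuming Kyverno; the policy name, namespace pattern, and the convention that workloads label pods with `app` are all illustrative:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-zone-spread
spec:
  rules:
    - name: inject-spread
      match:
        any:
          - resources:
              kinds: ["Deployment"]
              namespaces: ["critical-*"]
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                # +(...) anchor: add only if the field is absent,
                # so teams can still override with their own constraints.
                +(topologySpreadConstraints):
                  - maxSkew: 1
                    topologyKey: topology.kubernetes.io/zone
                    whenUnsatisfiable: ScheduleAnyway
                    labelSelector:
                      matchLabels:
                        app: "{{request.object.spec.template.metadata.labels.app}}"
```

Using ScheduleAnyway as the injected default keeps the policy safe to roll out broadly; teams opt into DoNotSchedule explicitly where capacity planning supports it.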

Security basics:

  • Restrict who can mutate node labels or topology constraints.
  • Use taints for critical nodes and combine with spread to avoid accidental colocations.
  • Audit changes to PodSpec templates via GitOps.

Weekly/monthly routines:

  • Weekly: Review incidents and any skew alerts; validate runbook accuracy.
  • Monthly: Audit node labels and topology mappings; test autoscaler behavior.
  • Quarterly: Chaos game days and compliance checks on distribution.

Postmortem reviews:

  • Confirm whether placement or topology contributed to outage.
  • Review metrics like time to restore skew and pending pod counts.
  • Update runbooks, dashboards, and constraints based on findings.

Tooling & Integration Map for Topology spread constraints

| ID | Category | What it does | Key integrations | Notes |
| --- | --- | --- | --- | --- |
| I1 | Monitoring | Collects cluster and scheduler metrics | Prometheus, kube-state-metrics | Core for measurement |
| I2 | Visualization | Dashboards for skew and scheduling | Grafana | For executive and on-call views |
| I3 | Autoscaling | Adds node capacity to satisfy constraints | Cluster Autoscaler | Helps DoNotSchedule situations |
| I4 | Chaos engineering | Tests distribution resilience | Chaos tool | Validates spread under failure |
| I5 | CI/CD | Injects and validates manifest constraints | GitOps pipelines | Prevents misconfigurations |
| I6 | Policy enforcement | Enforces labels and constraints via policies | OPA/Admission controllers | Ensures consistency |
| I7 | Tracing | Correlates errors with placement | Tracing systems | Adds topology metadata to traces |
| I8 | Logging | Centralizes scheduler and node logs | Aggregated logging | Useful for incident analysis |
| I9 | State metrics | Exposes Kubernetes object state | kube-state-metrics | Enables domain counts |
| I10 | Cloud provider telemetry | Provides zone and host-level health | Provider metrics | Necessary for mapping and capacity |


Frequently Asked Questions (FAQs)

What is the default behavior if topologyKey is missing on nodes?

If a node lacks the label named by topologyKey, the scheduler skips it when computing skew, and with DoNotSchedule it will not place the Pod there; ensure labels are applied consistently across all nodes.

Can topology spread constraints ensure strict equal distribution?

maxSkew controls strictness, and DoNotSchedule can block scheduling to enforce it, though this may leave Pods Pending if capacity is insufficient. Even then, exact equality is not guaranteed; the scheduler only bounds the difference between domains by maxSkew.
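
A simplified model of the skew check helps make this concrete. It treats skew globally as max minus min across domains, which matches the scheduler's bound for these small examples:

```python
def skew(pods_per_domain):
    """Skew: maximum pod count minus minimum pod count across
    eligible topology domains."""
    counts = pods_per_domain.values()
    return max(counts) - min(counts)

def placement_allowed(pods_per_domain, candidate, max_skew=1):
    """Would placing one more pod in `candidate` keep skew <= max_skew?
    Mirrors DoNotSchedule: the placement is rejected if it would
    violate the constraint."""
    after = dict(pods_per_domain)
    after[candidate] = after.get(candidate, 0) + 1
    return skew(after) <= max_skew

zones = {"zone-a": 3, "zone-b": 3, "zone-c": 2}
print(skew(zones))                          # 1
print(placement_allowed(zones, "zone-c"))   # True: 3/3/3 keeps skew at 0
print(placement_allowed(zones, "zone-a"))   # False: 4/3/2 gives skew 2
```

Note that a 3/3/2 layout already satisfies maxSkew=1 even though it is not perfectly equal; the constraint bounds imbalance rather than eliminating it.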

Do managed Kubernetes offerings support topology spread constraints?

It depends on the provider; most managed offerings support this native Kubernetes feature, but access to scheduler internals and metrics may be limited.

How do topology spread constraints interact with PodDisruptionBudgets?

They are complementary; PDBs limit voluntary evictions while spread controls placement. Conflicts can prevent rescheduling.

Are topology spread constraints sufficient for stateful databases?

Not alone; stateful workloads often need StatefulSet semantics and explicit quorum placement in addition to spread.

What happens during horizontal scaling?

Scheduler recalculates domain counts and places new pods to respect maxSkew; autoscaler may be required for sufficient capacity.

Can topologyKey be a custom label?

Yes; any node label can be used as topologyKey, but labels must be consistent and present across nodes.

Should all workloads have topology spread constraints?

Not necessarily; apply to services requiring availability guarantees or low blast radius.

How to debug pods Pending due to spread constraints?

Inspect events for scheduling reasons, check node labels, PDBs, and cluster autoscaler activity.
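
Those checks map onto a few kubectl commands; this sketch assumes cluster access, and the autoscaler deployment name varies by installation:

```shell
# Why is the pod Pending? Look for "didn't match pod topology spread constraints".
kubectl describe pod <pending-pod> | sed -n '/Events:/,$p'

# Are the topology labels present and consistent on every node?
kubectl get nodes -L topology.kubernetes.io/zone

# Is a PDB constraining eviction or rescheduling?
kubectl get pdb -A

# Is the cluster autoscaler reacting? (deployment name varies by install)
kubectl -n kube-system logs deploy/cluster-autoscaler --tail=50
```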

Does spread affect performance or latency?

Potentially: spreading across regions or AZs may increase latency; measure cross-domain latency before broad application.

How to test spread policy before production?

Use staging clusters, simulate failures with chaos engineering, and run autoscaling scenarios.

How to automate enforcement of spread defaults?

Use admission controllers or GitOps policies to inject or validate topologySpreadConstraints in manifests.

What are reasonable starting targets for SLOs?

It depends on workload criticality; pick conservative targets based on criticality and test restoration windows via chaos experiments.

Can anti-affinity be used instead of spread?

They are different; anti-affinity prevents co-location of Pods while spread enforces balance; use both when needed.

How to avoid metric cardinality explosion when measuring skew?

Aggregate by topologyKey and service rather than per-pod labels; use recording rules.
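
A sketch of Prometheus recording rules implementing that aggregation; the rule names are illustrative, and the zone-label join assumes kube-state-metrics exports node labels via its allow-list:

```yaml
groups:
  - name: topology-skew
    rules:
      # Pre-aggregate per (workload, zone) so dashboards and alerts
      # never query per-pod series.
      - record: workload_zone:pods:count
        expr: |
          count by (created_by_name, label_topology_kubernetes_io_zone) (
            kube_pod_info
            * on (node) group_left (label_topology_kubernetes_io_zone)
              kube_node_labels
          )
      # Skew per workload: max zone count minus min zone count.
      - record: workload:pod_skew:zones
        expr: |
          max by (created_by_name) (workload_zone:pods:count)
          - min by (created_by_name) (workload_zone:pods:count)
```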

Do nodes in different AZs always have correct topology labels?

Not always; label mappings can vary by provider and cluster setup, so verify label presence.

How to handle legacy workloads without Pod templates that support spread?

Wrap with higher-level controller or recreate workloads with updated manifests; use admission controller to detect and flag.


Conclusion

Topology spread constraints are a practical scheduling tool to reduce correlated failures by distributing Pods across defined topology domains. They integrate into broader SRE practices—monitoring, autoscaling, PDBs, and runbooks—and require careful measurement and testing to avoid over-constraining clusters.

Next 7 days plan (5 bullets):

  • Day 1: Inventory critical services and current pod distribution by topology.
  • Day 2: Add or validate node topology labels and deploy kube-state-metrics.
  • Day 3: Apply topologySpreadConstraints to one critical service in staging and monitor.
  • Day 4: Create dashboards and record replica skew metrics; set alerts.
  • Day 5–7: Run a scoped chaos test simulating a topology failure; iterate config and update runbooks.

Appendix — Topology spread constraints Keyword Cluster (SEO)

  • Primary keywords
  • Topology spread constraints
  • topologySpreadConstraints Kubernetes
  • Kubernetes spread pods
  • pod scheduling spread
  • distribute pods across zones

  • Secondary keywords

  • maxSkew Kubernetes
  • whenUnsatisfiable DoNotSchedule ScheduleAnyway
  • topologyKey node labels
  • replica skew metric
  • pod distribution monitoring

  • Long-tail questions

  • How do topology spread constraints work in Kubernetes
  • What is maxSkew in topology spread constraints
  • How to prevent all pods in one availability zone
  • Why are my pods pending due to topology spread constraints
  • How to measure pod skew across zones
  • Can topology spread constraints cause scheduling delays
  • How do PDBs interact with topology spread constraints
  • Best practices for topologySpreadConstraints in production
  • How to test topology spread constraints with chaos engineering
  • How to visualize replica skew in Grafana
  • How to label nodes for topology spread constraints
  • How does cluster autoscaler affect topology spread constraints
  • How to design SLOs for pod distribution
  • How to balance cost and availability with spread constraints
  • How to handle topologyKey differences across cloud providers
  • How to debug unschedulable pods due to spread
  • How to use topology spread constraints with StatefulSet
  • How to inject default topology spread constraints with admission controller
  • How to log scheduler decisions for topology spread
  • How to design canary rollouts with topology spread

  • Related terminology

  • Pod anti-affinity
  • Node affinity
  • PodDisruptionBudget
  • Cluster Autoscaler
  • kube-state-metrics
  • Prometheus monitoring
  • Grafana dashboards
  • Scheduler events
  • Eviction storms
  • Scheduler latency
  • ReplicaSet vs StatefulSet
  • Taints and tolerations
  • Node labels and topologyKey
  • Cloud availability zones
  • Region-aware scheduling
  • Quorum placement
  • Chaos engineering tests
  • GitOps manifest validation
  • Admission controllers
  • Policy-as-code
  • Observability signals
  • SLIs and SLOs for placement
  • Error budget for distribution violations
  • Pod scheduling predicates
  • Scheduler extenders
  • Placement constraints
  • Rack-aware scheduling
  • Edge POP distribution
  • Cross-AZ network cost
  • Best-effort scheduling
  • DoNotSchedule impact
  • ScheduleAnyway behavior
  • Label selector for skew
  • Replica skew detection
  • Historical scheduler metrics
  • Postmortem for placement incidents
  • Runbooks for scheduling faults
  • Admission webhooks for placement
