DEV Community: Todea

Kubecost Explained: Kubernetes FinOps That Moves the Bill

Ivan Porta — Tue, 19 May 2026 01:38:37 +0000

Most platform teams are familiar with Kubernetes costs. The monthly cloud bill arrives, finance asks why it’s higher, and engineering can only respond with “more workloads.” This gap between what the bill shows and what platform teams can explain is exactly what FinOps aims to close. The real question is whether you need an enterprise platform for this. For most teams, the answer is no. OpenCost and Kubecost can give platform teams the visibility they need, as long as the tool is paired with an operating cadence.

The pressure is real

Kubernetes accounting is no longer just about a single cluster or a single cloud. Most teams now manage fleets of clusters, often across multiple cloud providers, and sometimes combine on-premises control planes with managed services. Containers move between nodes, nodes move between zones, and the same workload might run in several regions to meet latency or compliance needs, making the cost attribution even harder.

Industry data has shown the same problem for years. Back in 2021, a CNCF FinOps survey found that most teams couldn’t reliably measure their Kubernetes spending, with over-provisioning and lack of accountability as the main issues. The story repeated in 2025, when a fleet-telemetry benchmark using real cluster data from over 2,100 organizations on AWS, GCP, and Azure showed average CPU utilization at 10% and memory utilization at 23%. These numbers come from production telemetry, not just survey responses. The problem hasn’t changed in four years; if anything, it’s become clearer as fleets have grown.

The real issue isn’t that Kubernetes is too expensive. It’s that most teams don’t have enough visibility into what they’re spending.

What Kubecost actually is

Kubecost is a Kubernetes platform for cost allocation, optimization, and governance. After IBM acquired it in 2024, Kubecost now offers both open-source and enterprise versions. At its core, it sits on top of OpenCost, and runs as an in-cluster agent stack with several microservices: data collectors, a cloud-cost ingestor, a forecasting service, a per-node network-costs DaemonSet, and a fast ClickHouse-backed aggregator. By combining Kubernetes telemetry with cloud-provider billing data, Kubecost provides detailed cost allocation views, and accuracy improves when cloud billing integrations are configured and reconciled.

The platform is built around three primary pillars:

Cost allocation rolls spend up by namespace, label, service, workload, or Collection. Kubecost v3 adds a grouping concept that bundles Kubernetes and external cloud-side costs into a single deduplicated unit. Costs are tracked across CPU, memory, persistent volumes, GPUs, and network traffic.
Optimization recommendations suggest right-sized requests based on real usage, propose cheaper node types, and let teams configure quantile-based controls instead of accepting a one-size default. Recommendations can be archived for historical reference and exported as CSV or PDF.
Alerts and governance ship as configurable budget actions, scheduled reports, and Slack/email notifications. Alerts live next to the budget they belong to rather than as a separate alerting subsystem.

How a Kubecost install flows

Cost data collection begins inside your clusters. When it starts, the FinOps agent sets up a watch on the Kubernetes API for the pricing ConfigMap, so any custom pricing rules can be applied without a restart. It also resolves node pricing per node (falling back to a computed value when the cloud provider's price is unavailable). It then collects metrics from each Network Costs DaemonSet pod, creates a new binary snapshot, and, if configured, writes it to external storage such as Azure Blob Storage or AWS S3.

The Network Costs DaemonSet subscribes to the kernel's conntrack table via a netlink socket, parses each flow's per-direction byte and packet counters, and maintains an in-memory map of pod, node, service, and endpoint state via Kubernetes API watches. It uses this map to link observed connections to specific workloads.

While these agents send internal snapshots to shared storage, the Cloud Cost Ingestor manages external financial data. It runs on a schedule, connects to cloud provider billing exports, pulls daily CSVs, and backfills historical data. Because cloud providers release billing data with a several-hour delay, Kubecost reconciles cluster data with cloud billing after a short wait. This means the most recent day or two of cost data is only an estimate, while older data is fully reconciled (assuming a working cloud-billing integration).
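
The estimate-versus-reconciled distinction can be made explicit when reading cost data. The sketch below is not Kubecost code; it assumes a fixed 48-hour lag (the "day or two" above) and a hypothetical `cost_status` helper that labels a cost window.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed 48-hour reconciliation lag; the real delay varies by provider.
RECONCILIATION_LAG = timedelta(hours=48)


def cost_status(window_end: datetime, now: datetime,
                billing_integration: bool = True) -> str:
    """Label a cost window as 'estimated' or 'reconciled' (hypothetical helper)."""
    if not billing_integration:
        # Without a working billing integration nothing is ever reconciled.
        return "estimated"
    if now - window_end >= RECONCILIATION_LAG:
        return "reconciled"
    return "estimated"


now = datetime(2026, 5, 19, tzinfo=timezone.utc)
for days_ago in (0, 1, 2, 7):
    window_end = now - timedelta(days=days_ago)
    print(days_ago, "days ago:", cost_status(window_end, now))
```

A report built on top of this kind of label can show recent days in a different style so that nobody argues over numbers that are still estimates.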

The Aggregator is the main engine and uses an embedded ClickHouse database. In a multi-cluster deployment, it fans in snapshots from several agent clusters. It checks multiple ConfigMaps for configuration and falls back to defaults when none are present. It ingests agent snapshots and, when configured, external billing CSVs, then drives them through a multi-stage SQL pipeline that reconciles and de-duplicates overlapping costs and produces the final cost tables that other microservices consume. The Aggregator also manages data retention by setting per-table, per-resolution TTLs in ClickHouse, so fine-grained windows expire within days while rollups are kept for weeks or months.

Finally, the Forecasting Service serves as a predictive cost-monitoring tool, using this data to generate cost forecasts. At the same time, the Cluster Controller uses the Aggregator’s optimization insights to take actions, such as applying right-sizing recommendations directly in the cluster.

Allocate to a real owner first

Kubecost spreads node costs across the pods running on each node, typically weighted by resource requests, and rolls the result up by namespace, label, service, workload, or Collection, covering CPU, memory, PV, GPU, and network.
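
As a rough illustration of request-weighted allocation, the sketch below splits one node's hourly cost across pods in proportion to their CPU requests and rolls the result up by namespace. It is a simplification under stated assumptions: it ignores memory, GPU, PV, network, and idle capacity, and the function name and numbers are hypothetical, not Kubecost's implementation.

```python
from collections import defaultdict


def allocate_node_cost(node_hourly_cost, pods):
    """Split a node's hourly cost across pods, weighted by CPU request.

    pods: list of (namespace, cpu_request_millicores) tuples.
    Returns a dict of namespace -> allocated hourly cost.
    """
    total_requested = sum(cpu for _, cpu in pods)
    by_namespace = defaultdict(float)
    if total_requested == 0:
        # Nothing requested: there is no weight to allocate by.
        return {}
    for namespace, cpu in pods:
        by_namespace[namespace] += node_hourly_cost * cpu / total_requested
    return dict(by_namespace)


# Hypothetical node costing $0.40/hour with three pods from two namespaces.
pods = [("payments", 500), ("payments", 500), ("search", 1000)]
print(allocate_node_cost(0.40, pods))
```

Here each namespace requests half of the claimed CPU, so each is charged half of the node's cost.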

What makes this approach effective is good organization, not just technical setup. Each namespace should match a real owner, like a team, product, or department. When this mapping is in place, the allocation view shows which team is responsible for each cost, without needing a spreadsheet. Teams that already use labels like team, cost-center, or product can use these for the same purpose, and Collections help make label-based views easy to use.

Labels can change over time, namespaces can increase, and someone will eventually deploy into default. Reviewing the unlabeled bucket each week helps keep the data accurate and useful.

Right-size requests against actual usage

This is the lever that moves the bill the most. Often, developers set CPU and memory requests defensively, never revisit them, and the gap between request and use shows up directly on the invoice. The 10%-CPU / 23%-memory benchmark cited above is a useful authority anchor when finance asks for a number.

A pattern that reliably finds savings: plot requested vs. actual CPU and memory per workload over a few weeks, then walk each workload's request down to what it actually uses. One practical case from a service mesh deployment had proxy sidecars set to 100 millicores each. The node could host roughly 200 pods on paper, but the scheduler exhausted allocatable CPU at around 90 pods because every pod carried a 100-millicore sidecar request on top of its own. After the request-rightsizing pass, pod density per node tripled with no application change and no node fleet change.
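
The mechanism is simple arithmetic: the scheduler stops placing pods when the sum of requests reaches allocatable CPU, regardless of real usage. The sketch below uses illustrative numbers that are assumptions, not figures from the case above (18,000 millicores allocatable, a 100-millicore application request, and a 200-pod cap).

```python
def schedulable_pods(allocatable_cpu_m, app_request_m, sidecar_request_m, max_pods):
    """Pods that fit on a node when CPU requests are the binding constraint."""
    per_pod = app_request_m + sidecar_request_m
    return min(max_pods, allocatable_cpu_m // per_pod)


# Hypothetical node: 18,000m allocatable CPU, pod cap of 200.
before = schedulable_pods(18_000, app_request_m=100, sidecar_request_m=100, max_pods=200)
after = schedulable_pods(18_000, app_request_m=100, sidecar_request_m=10, max_pods=200)
print(before, after)
```

Lowering only the sidecar request changes how many pods the same node can accept, which is why sidecar defaults deserve the same scrutiny as application requests.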

Kubecost 3.0 makes this loop tighter. Container Request Sizing Insights show usage visualizations directly in the UI, recommendations can be archived, CSV/PDF exports include labels, and quantile-based controls let you set tighter recommendation percentiles for predictable services and looser ones for bursty workloads. The enterprise tier adds an Automated Container Request Sizing UI that operates across clusters with custom profiles, suspension controls, audit history, and a comparison between recommended and realized savings; the open-source tier gets a free allowance up to 250 cores on EKS primary clusters.

Using a percentile-based recommendation policy is usually the most effective approach in production.

Set the CPU request to the 90th percentile of actual CPU usage from the past week or month, and then add a safety margin. The Kubernetes VPA default is about 15% for CPU. Because CPU is time-shared, the kernel lets bursty workloads use extra capacity when it is available. Adding more padding for rare spikes usually just increases requests without much benefit.
Set the memory request to a high percentile of peak usage, about the 90th percentile of peaks, and then add a safety margin. Memory is not time-shared, so going over the limit can cause OOM kills instead of graceful degradation. The VPA default is about 20% for memory.
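
The policy above can be sketched in a few lines. This is a minimal illustration, assuming a nearest-rank percentile and the margins quoted above (15% for CPU, 20% for memory); the sample values and function names are hypothetical, and real tools such as VPA use decaying histograms instead of a flat sample list.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of samples, with q as an integer in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_request(samples, q=90, margin=0.15):
    """Percentile of observed usage plus a safety margin."""
    return percentile(samples, q) * (1 + margin)


# Hypothetical CPU usage samples in millicores; one rare spike at 400m.
cpu_usage_m = [100, 120, 110, 400, 130, 125, 115, 105, 140, 135]
# Hypothetical daily memory peaks in MiB.
memory_peaks_mib = [300, 310, 305, 320, 315, 330, 325, 340, 335, 512]

print("cpu request (m):", round(recommend_request(cpu_usage_m, margin=0.15)))
print("memory request (MiB):", round(recommend_request(memory_peaks_mib, margin=0.20)))
```

Note how the single 400m spike does not drive the CPU request: the 90th percentile lands on 140m, and the margin is added on top of that.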

These settings decide the QoS class. A pod is considered Guaranteed only if every container has CPU and memory requests set equal to their limits. If this is not the case for any container, the whole pod is treated as Burstable, or as BestEffort if no container sets any requests or limits. Burstable works well for most workloads. Reserve the fully Guaranteed class for critical workloads with strict latency SLAs. In those cases, you might waste some headroom, but you get the best eviction protection and, if needed, exclusive CPU pinning (which also requires the kubelet's static CPU manager policy and whole-number CPU requests).
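
The classification rules can be written down as a small function. This is a sketch of the rules as described above, not the kubelet's code: it compares quantities as literal strings (so "500m" and "0.5" would not be treated as equal) and takes a plain list of containers, both of which are simplifying assumptions.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    containers: list of dicts with optional 'requests' and 'limits' dicts
    keyed by 'cpu' and 'memory'.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pinned = {"requests": {"cpu": "500m", "memory": "256Mi"},
          "limits": {"cpu": "500m", "memory": "256Mi"}}
print(qos_class([pinned]))
print(qos_class([pinned, {"requests": {"cpu": "100m"}}]))
print(qos_class([{}]))
```

One container without matching limits is enough to move the whole pod out of Guaranteed, which is easy to miss when a sidecar is injected automatically.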

If you want additional feedback before automating any of this, you can run VPA in recommendation-only mode for several weeks, or use the open-source KRR tool. Comparing recommendations across KRR, VPA in recommendation mode, and Kubecost is more reliable than trusting any single tool's number.

Capacity-versus-request as the North Star metric

The ratio that tells you most about cluster efficiency is total pod requests ÷ total node allocatable capacity, across CPU and memory: how much of what you pay for is even claimed by a pod. It is also what Kubecost's request right-sizing is built on. Kubecost ships with several targets against which recommendations are computed: Production 0.65, Development 0.80, High Availability 0.50 (Cluster Right-Sizing API). Below the target you carry capacity nothing asks for; above it you've spent the headroom that cluster class should keep. Kubecost also picks the utilization it sizes against by context: development uses the trending 85th percentile, production the 98th, and HA the 99.9th. These percentiles apply only on a one-day window; longer windows use maximum usage. The often-quoted "85th percentile" is just the development one-day default, not a universal setting.
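
The ratio itself is easy to compute from values you already have. The sketch below uses the three targets quoted above; the 0.05 tolerance band and the function names are assumptions made for illustration, not Kubecost settings.

```python
# Targets quoted above; the tolerance band is a hypothetical choice.
TARGETS = {"production": 0.65, "development": 0.80, "high-availability": 0.50}


def request_ratio(pod_requests, node_allocatable):
    """Total pod requests divided by total node allocatable capacity."""
    total_allocatable = sum(node_allocatable)
    if total_allocatable == 0:
        raise ValueError("cluster has no allocatable capacity")
    return sum(pod_requests) / total_allocatable


def assess(ratio, profile, tolerance=0.05):
    """Compare a ratio with the target for a cluster profile."""
    target = TARGETS[profile]
    if ratio < target - tolerance:
        return "below"      # paying for capacity nothing asks for
    if ratio > target + tolerance:
        return "above"      # headroom for this cluster class is spent
    return "on-target"


# Hypothetical cluster: two 4,000m nodes, three pods requesting 4,000m in total.
ratio = request_ratio([2000, 1500, 500], [4000, 4000])
print(ratio, assess(ratio, "production"), assess(ratio, "high-availability"))
```

The same 0.5 ratio reads as wasteful for a production cluster and as healthy for a high-availability one, which is why the profile has to be part of the review.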

The ratio is what you watch; an autoscaler (Karpenter, Cluster Autoscaler) moves it — but only if requests are honest, which is why Kubecost's request right-sizing sits upstream of any autoscaling story. The autoscaler reacts to requests; Kubecost tells you whether they reflect reality.

Recent days are directional by design: reconciliation needs a full day of billing data, so for a roughly 48-hour window costs stay at public on-demand pricing unless a node is provably not on-demand. Spot is accurate sooner only through a separately configured AWS Spot data feed (Cloud Billing Integrations). Efficiency does not depend on reconciled pricing, so read it separately from cost.

Node Group Sizing (formerly Cluster Right-Sizing, rebuilt in v3.0) turns this into an action: it analyzes in-cluster CPU, RAM, and GPU utilization against node capacity over a configurable window and recommends, per node group, changing the node count or switching the instance type. It runs from a preset profile or a custom metric (usage.max/p95/p85/avg or request.max/avg) with a target-utilization threshold per resource, never below average requested resources. It detects node groups by each provider's standard label, so it works across EKS, AKS, and GKE without setup (v3.x docs).

Find the always-on workloads that don't need to be

Once you’ve handled allocation and right-sizing, look at workloads that run all day, every day, even when they don’t have to. In one platform team’s review, 31% of workloads used less than 25% CPU for almost the entire day, yet Kubernetes costs still went up by about 18% over the year. This happened because engineers spent a lot of time tuning capacity and dealing with alerts, and because each team set up its own autoscaling rules differently.

The triage falls into three buckets. Production services that are genuinely over-spec’d belong in the right-sizing loop above. Non-production environments (dev, integration, demo) rarely need to run on weekends or overnight; a scheduled scale-to-zero is the highest-ROI change in this category. Batch and stateless workloads with retry tolerance are candidates for Spot instances, which offer a discount of up to roughly 90% in exchange for interruption on short notice (two minutes or less, depending on the provider).
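
To see why scheduled scale-to-zero ranks so high, it helps to count hours. The sketch below assumes a hypothetical schedule of 12 hours a day on 5 days a week; the saving applies only to compute that actually scales down with the workloads, so storage and control-plane costs are outside this estimate.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_savings_fraction(days_on, hours_per_day):
    """Fraction of weekly compute hours avoided by running only on a schedule."""
    on_hours = days_on * hours_per_day
    if not 0 <= on_hours <= HOURS_PER_WEEK:
        raise ValueError("schedule exceeds the hours in a week")
    return 1 - on_hours / HOURS_PER_WEEK


# Hypothetical schedule: weekdays only, 12 hours a day.
fraction = weekly_savings_fraction(days_on=5, hours_per_day=12)
print(f"{fraction:.0%} of weekly compute hours avoided")
```

Under these assumptions the environment is idle for well over half of the week, before any right-sizing is applied.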

Kubecost helps you find underused workloads. With Kubecost 3.0’s Advanced Filters, you can quickly sort workloads by namespace, label, or service using AND/OR conditions right in the UI, instead of having to do it elsewhere.

Track commitment coverage and utilization separately

Reserved capacity and savings commitments are common sources of unnecessary cloud costs. Teams often either ignore them and pay full on-demand prices, or buy them and forget to check if they are being used, leaving discounts unused on resources that are no longer needed. There are two important metrics to watch, and they are easy to mix up:

Coverage means the portion of your regular usage that is protected by a commitment.
Utilization is how much of your commitment you actually use.

Each cloud provider has different tools, but they all fit into three main types, and the calculations work the same way everywhere:

Mechanism	Typical max discount vs on-demand	Commitment
Flexible spend commitment	~60–66%	1 or 3 yr; hourly $ commitment; applies across families/regions
Instance-specific reservation	~55–72%	1 or 3 yr; locked to a region + instance family/SKU
Spot / preemptible	up to ~90%	none; interruption notice from ~30 sec to ~2 min

A good approach is to aim for commitment utilization between 80% and 95%, instead of trying to reach 100%. Going for 100% leaves no room for normal changes, like removing unused instances, changing instance types, or handling a drop in traffic. It may look efficient in a quarterly review, but it can cause problems day-to-day. For coverage, aiming for 60% to 75% is reasonable. This range is high enough to get a good discount, but low enough to allow for changes each quarter. These ranges are based on practical experience, not rules set by the cloud provider.
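
Because the two metrics are easy to mix up, it can help to compute them side by side. The sketch below uses hypothetical hourly dollar figures and checks them against the practical ranges above; the function names are illustrative and do not correspond to any cloud provider API.

```python
def coverage(covered_usage, total_usage):
    """Share of steady usage that is covered by a commitment."""
    return covered_usage / total_usage


def utilization(used_commitment, purchased_commitment):
    """Share of the purchased commitment that is actually consumed."""
    return used_commitment / purchased_commitment


def in_range(value, low, high):
    return low <= value <= high


# Hypothetical hour: $100 of eligible usage, $70 of commitment bought, $63 of it consumed.
cov = coverage(covered_usage=63, total_usage=100)
util = utilization(used_commitment=63, purchased_commitment=70)

print(f"coverage {cov:.0%}, within 60-75%: {in_range(cov, 0.60, 0.75)}")
print(f"utilization {util:.0%}, within 80-95%: {in_range(util, 0.80, 0.95)}")
```

The denominators are what separate the two: coverage divides by what you use, utilization divides by what you bought.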

With Kubecost, costs are first estimated using public on-demand cloud provider prices until the actual cloud bill is ready. When the bill becomes available, usually within about 48 hours, Kubecost updates its estimates with the real costs. This update includes Reserved Instances, Savings Plans, committed-use discounts, and Spot pricing, along with any special rates you might have, such as Enterprise Discount Programs.

When to use a commercial FinOps platform instead

Most teams should start with the open-source chart. You can look at the commercial tiers later, once your needs grow.

Capability / Feature	Open-source Kubecost	Commercial FinOps platform
Cost allocation (namespace, label, workload)	✓	✓
Optimization recommendations (right-sizing)	✓ (manual application)	✓ + automated application across clusters
Cloud-billing reconciliation	✓ (basic)	✓ + EDP / RI / custom-discount aware
Multi-cluster aggregation	Manual / federation	✓ (built-in)
SSO, RBAC, audit log	Limited	✓
History retention	Limited by your storage layer	Long-term, vendor-managed
Collections (cloud + K8s dedup)	✓ (3.x)	✓
Automated Container Request Sizing UI	✕ (free tier limited)	✓
Quantile-based recommendation controls	✓ (3.x)	✓
Advanced filters (AND/OR)	✓ (3.x)	✓
Support / SLA	Community	Vendor SLA

If you just need per-namespace allocation, basic recommendations, and Slack alerts for a few clusters, the open-source version is enough. But if you manage many clusters across different clouds, need automated fixes, want vendor-managed history, or need to give your finance team detailed, reconciled discount numbers, then a commercial platform is worth it. The decision should be based on these needs, not just on how the dashboard looks.

Operational reality

ClickHouse and a unified agent replace the old stack. In v3, the 2.x DuckDB store is replaced with a ClickHouse database. This change makes allocation and cloud-cost API queries much faster and more reliable at scale. It also removes the need for Prometheus, which cuts down on memory use and makes deployment easier, while still providing OpenCost-standard metrics.
History is a deliberate choice, not a default. Whatever the storage backend, the retention window is the upper bound on the period-over-period reporting you can produce. Monthly reporting requires at least 30 days; year-over-year requires a year. Tier cold data to object storage once on-cluster retention costs more than the engineering time needed to set up the tiered pipeline.
Reconciliation lag is structural. The 24–48 hour billing-reconciliation delay is a property of cloud-provider billing exports, not of Kubecost. Build the operating model around it: argue about last week, not yesterday.
Multi-cluster needs a story. Open-source Kubecost can federate across clusters, but the experience is rougher than the commercial multi-cluster aggregator. Beyond five or six clusters, decide early whether to run per-cluster Kubecost and aggregate externally — into your own warehouse, for example — or pay for the commercial multi-cluster path. Either is defensible; drifting between the two is not.
The EKS add-on offers a quick way to get started. The Kubecost v3 free tier has a $100k USD spend limit over 30 days, while the Amazon EKS optimized Kubecost bundle is listed by AWS as exempt from that spend limit.
The operating model is the deliverable. If a team installs Kubecost but does not set up regular reviews, they will drift just like a team without any FinOps tools. The standard approach is to have a small FinOps group, such as a platform engineer, a finance analyst, and an SRE on rotation, meet each week to review the capacity-versus-request ratio, identify the most over-provisioned workloads, and check any namespace with a significant change in monthly cost. For smaller teams, a 30-minute review every two weeks with the platform engineer and CTO can achieve the same results.

A practical recommendation

If you are considering a FinOps approach for a Kubernetes platform and do not have a contractual obligation to choose a commercial product, start by piloting open-source Kubecost 3.x. Installation can be completed in an afternoon. Assign at least one namespace to a designated owner, provide a request-versus-usage dashboard to one team for two weeks, and share the capacity-versus-request ratio in a channel visible to the platform team. If regular reviews of these metrics become routine, you have achieved FinOps. If not, adopting a commercial platform will not resolve the underlying issues.

What Platform Teams Can Expect From Crossplane v2.2

Ivan Porta — Tue, 05 May 2026 05:21:20 +0000

A developer submits a ticket to request a database. Three days later, the platform team responds, but the configuration isn’t quite right. The developer files another ticket and waits again. By the third week, the database might finally be ready. This cycle repeats across every team and environment, creating a daily reality that most platform teams recognize: developers lose significant time waiting for infrastructure, while platform teams struggle to keep up with the constant flow of requests.

There are plenty of good tools for provisioning. Terraform is platform-agnostic and widely used. CloudFormation is the go-to option on AWS, and every cloud provider offers its own console and CLI. Each tool does its job well. This article isn’t about choosing the best one. Instead, it looks at the problem from a different perspective.

Crossplane offers a Kubernetes-native approach to managing cloud resources. Instead of setting up infrastructure outside the cluster and then linking it back, Crossplane brings that infrastructure under the same control loop as your applications. This fits naturally with how many teams already work. Paired with GitOps, a pull request becomes the primary way to manage changes, and the cluster continuously reconciles toward the desired state.

The project has moved quickly over the past year. In August 2025, Crossplane v2 introduced big changes, like removing Claims and adding namespaced composite and managed resources. The latest release, v2.2, adds an alpha Pipeline Inspector for troubleshooting, broader CEL validation, and more improvements.

What Crossplane actually is

Crossplane is a control plane framework for platform engineering. You install it into a Kubernetes cluster, known as the management cluster, and that cluster becomes the control plane for everything outside it: cloud accounts, SaaS APIs, internal tools, and even other Kubernetes clusters. All of these are managed through the same Kubernetes API your applications already use. The management cluster itself must be set up separately; Crossplane does not create it for you. Once Crossplane is running, there is no state file and no separate workflow. Drift is fixed by the same reconciliation loop that keeps your Deployments healthy.

Crossplane has four major components. You can use all four or only the ones you need.

Managed resources (MRs) represent external cloud resources as Kubernetes objects. For example, an S3 bucket in AWS or a ResourceGroup in Azure is modeled as an MR. Crossplane treats spec.forProvider as the source of truth and keeps the actual cloud resource in sync with it. You create MRs using kubectl, and the provider handles provisioning and reconciliation.
Composition lets you create custom APIs using a function pipeline. There are three main parts to understand:
- A CompositeResourceDefinition (XRD) defines a schema. It tells Kubernetes, “here’s a new custom API kind I’m creating, and these are its fields.” You can think of it as a CRD with added features for Crossplane.
- A Composition acts as a recipe. It says, “when someone creates an XR of kind Foo, run this set of functions to create these MRs or other Kubernetes resources.” In version 2, this always uses a function pipeline.
- A Composite Resource (XR) is an instance of the API you defined with an XRD. When a user creates an XR, Crossplane uses the matching Composition’s pipeline to generate the needed resources. You can write functions in YAML, KCL, Python, or Go.
Operations run function pipelines to completion, similar to a Kubernetes Job. There are three modes: Operation (one-time), CronOperation (scheduled), and WatchOperation (event-driven). Operations are currently in alpha.
The package manager handles installing and updating providers, configurations, and functions.

How a Crossplane request flows

There are two entry points into this flow, depending on what you're applying.

When a developer or a pipeline creates an XR in a namespace, the composition engine watches it, runs the configured function pipeline, and creates the needed resources. These resources can be other Kubernetes resources, managed resources, or both.
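
Conceptually, the pipeline passes accumulated desired state from one function to the next. The sketch below models that idea in plain Python; it does not use the Crossplane function SDK, and the step names and resource shapes are hypothetical.

```python
def add_database(xr, desired):
    """Hypothetical step: emit a database resource in the XR's region."""
    updated = dict(desired)
    updated["database"] = {"kind": "DatabaseInstance", "region": xr["spec"]["region"]}
    return updated


def add_firewall_rule(xr, desired):
    """Hypothetical step: add a firewall rule only if a database was composed."""
    updated = dict(desired)
    if "database" in updated:
        updated["firewall"] = {"kind": "FirewallRule", "target": "database"}
    return updated


def run_pipeline(xr, steps):
    """Run each step in order, feeding it the desired state built so far."""
    desired = {}
    for step in steps:
        desired = step(xr, desired)
    return desired


xr = {"kind": "Database", "metadata": {"name": "db-orders"},
      "spec": {"region": "westeurope"}}
composed = run_pipeline(xr, [add_database, add_firewall_rule])
print(sorted(composed))
```

Because each step sees what earlier steps produced, ordering matters: the firewall step composes nothing if it runs before the database step.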

When a user applies an MR directly, either by itself or as part of a Composition, the provider takes over. It monitors the MR through the Kubernetes API, calls the external system to create or update the real resource, and updates the status. After that, it keeps checking: if the real resource drifts from spec.forProvider, the provider fixes it. All state is stored in etcd, so there is no separate state file.
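
The drift-correction step boils down to comparing desired and observed state on every pass. The sketch below is a simplified model of that comparison, assuming flat dictionaries for both sides; real providers also handle deletion, status conditions, and late-initialized fields, none of which is shown here.

```python
def reconcile(desired, observed):
    """Decide what to do so the external resource matches the desired spec.

    desired: dict standing in for spec.forProvider.
    observed: dict describing the real resource, or None if it does not exist.
    Returns (action, fields_to_write).
    """
    if observed is None:
        return ("create", dict(desired))
    drift = {key: value for key, value in desired.items()
             if observed.get(key) != value}
    if drift:
        return ("update", drift)
    return ("noop", {})


desired = {"region": "eu-west-1", "versioning": True}
print(reconcile(desired, None))
print(reconcile(desired, {"region": "eu-west-1", "versioning": False}))
print(reconcile(desired, {"region": "eu-west-1", "versioning": True}))
```

Running this comparison continuously, instead of only when someone runs a command, is what distinguishes the control-loop model from a manual apply.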

When to use traditional IaC instead

Crossplane and tools like Terraform or CloudFormation overlap in scope (both can provision a cloud database) and differ in how. The right choice depends on where your platform already lives.

Capability / Feature	Terraform	CloudFormation	Crossplane
Control loop	Manual `apply` (or pipeline)	Manual stack create/update	Continuous reconciliation
Drift handling	Detect with `plan`; correct manually	Detect drift action; correct via stack update	Detected and corrected automatically
State	`tfstate` in a remote backend you secure (e.g., S3 with versioning, HCP Terraform)	AWS-managed (server-side)	Kubernetes API objects in the management cluster's etcd
Workflow	Separate from app deployment	Separate from app deployment	Same as `kubectl apply`
Composition	Modules	Nested stacks, Modules	XRDs + Compositions + functions
Languages	HCL, JSON	YAML, JSON	YAML, Go, Python, KCL, CUE, HCL (via composition functions)
Built-in policy	Variable validation and pre/postconditions (OSS); Sentinel and OPA integration in HCP Terraform / Enterprise	cfn-guard, Hooks	XRD CEL validations (incl. metadata in v2.2)
Multi-cloud	Provider per cloud, separate state	AWS-first (third-party types via the CloudFormation registry)	One control plane, one API surface
Footprint	A binary	AWS-managed service (CLI/SDK only)	A Kubernetes control plane (Crossplane core, providers, functions) backed by etcd
Operates outside Kubernetes	✓	✓	✕ — requires a management cluster

If your team does not use Kubernetes, Crossplane is not the best place to start. Terraform is simpler and does not need a control plane. But if you are on Kubernetes, especially if you already use Argo CD or Flux, it is easy to manage your infrastructure in the same way. Crossplane is the closest option for writing infrastructure as code and handling it like the rest of your declarative cluster state.

What's new in v2.2

v2.2 adds five things you'll notice in practice and one that quietly improves reliability. Each one closes a specific gap that platform teams have been hitting in production.

Pipeline inspector (alpha): Composition functions are powerful, but they have always been hard to debug. If a pipeline acted up on a running control plane, the only ways to see what each function received and returned were writing tests, running crossplane render locally, or adding your own instrumentation. v2.2 adds the pipeline inspector. When you turn on the feature flag, the Crossplane controller intercepts every RunFunctionRequest and RunFunctionResponse and forwards them over gRPC to a Unix socket you set up. A sidecar can read from this socket and handle the data however you need: stream it to stdout during development or send it to an audit pipeline in production. To use it, add --enable-pipeline-inspector to Crossplane. The default socket path is /var/run/pipeline-inspector/socket, but you can change it with --pipeline-inspector-socket.

  # Enable the pipeline inspector feature flag
  args:
    - --enable-pipeline-inspector
    - --pipeline-inspector-socket=/var/run/pipeline-inspector/socket

  # Inject the pipeline inspector sidecar
  sidecarsCrossplane:
    - name: pipeline-inspector
      image: xpkg.crossplane.io/crossplane/inspector-sidecar:v0.0.3
      args:
        - --socket-path=/var/run/pipeline-inspector/socket
        - --max-recv-msg-size=8388608  # 8MB
      volumeMounts:
        - name: pipeline-inspector-socket
          mountPath: /var/run/pipeline-inspector
      resources:
        requests: { cpu: 10m, memory: 64Mi }
        limits:   { cpu: 100m, memory: 128Mi }

  # Add the shared volume for Unix socket communication
  extraVolumesCrossplane:
    - name: pipeline-inspector-socket
      emptyDir: {}

  extraVolumeMountsCrossplane:
    - name: pipeline-inspector-socket
      mountPath: /var/run/pipeline-inspector

XRD validation outside spec: x-kubernetes-validations (Kubernetes' CEL-based validation rules) in an XRD used to work only on fields under an XR's spec. If you wanted to enforce rules like "all Database names must start with db-", you had to use an external admission controller such as Kyverno, OPA/Gatekeeper, or a custom webhook. With v2.2, that restriction is gone. Now, you can write CEL rules that reference fields outside of spec, such as metadata, and the API server enforces them at admission time.

  apiVersion: apiextensions.crossplane.io/v1
  kind: CompositeResourceDefinition
  metadata:
    name: databases.platform.example.org
  spec:
    group: platform.example.org
    names:
      kind: Database
      plural: databases
    versions:
      - name: v1alpha1
        served: true
        referenceable: true
        schema:
          openAPIV3Schema:
            type: object
            x-kubernetes-validations:
              - rule: "self.metadata.name.startsWith('db-')"
                message: "Database names must start with 'db-'"
            properties:
              spec:
                type: object
                properties:
                  region:
                    type: string

ImageConfig runtime for dependencies: A Crossplane package with a runtime, such as a Provider, runs as a Deployment. To customize the Deployment, such as by adding service account annotations, pod labels, or container arguments, use a DeploymentRuntimeConfig and reference it from the package.

  apiVersion: pkg.crossplane.io/v1
  kind: Provider
  metadata:
    name: provider-azure-network  # illustrative name
  spec:
    package: xpkg.crossplane.io/crossplane-contrib/provider-azure-network:v1.0.0
    runtimeConfigRef:
      name: azure-workload-identity

This approach works well when you install the package directly. However, Crossplane can also install packages as dependencies, and before v2.2 there was no way to get Workload Identity or any other runtime customization onto providers installed that way.

ImageConfig is a cluster-scoped resource that matches packages based on their image prefix, not on which Provider or Configuration object created them. In v2.2, a new field was added: spec.runtime.configRef. With this change, Crossplane applies the DeploymentRuntimeConfig to any package whose image is matched, regardless of how it was installed.

  apiVersion: pkg.crossplane.io/v1beta1
  kind: ImageConfig
  metadata:
    name: azure-workload-identity
  spec:
    matchImages:
      - prefix: xpkg.crossplane.io/crossplane-contrib/provider-azure-
      - prefix: xpkg.crossplane.io/crossplane-contrib/provider-family-azure
    runtime:
      configRef:
        name: azure-workload-identity

Every Azure family provider, whether installed directly or added as a dependency, receives the runtime config.
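
The matching rule is what makes this independent of how a package was installed: only the image reference is inspected. The sketch below models prefix matching in plain Python. It assumes that the longest matching prefix wins when several ImageConfigs match, which should be checked against the Crossplane documentation, and the helper name is hypothetical.

```python
def match_image_config(image, configs):
    """Return the name of the ImageConfig whose prefix best matches the image.

    configs: dict of ImageConfig name -> list of image prefixes.
    Assumption: the longest matching prefix wins; returns None if nothing matches.
    """
    best_name, best_length = None, -1
    for name, prefixes in configs.items():
        for prefix in prefixes:
            if image.startswith(prefix) and len(prefix) > best_length:
                best_name, best_length = name, len(prefix)
    return best_name


configs = {
    "azure-workload-identity": [
        "xpkg.crossplane.io/crossplane-contrib/provider-azure-",
        "xpkg.crossplane.io/crossplane-contrib/provider-family-azure",
    ],
}

print(match_image_config(
    "xpkg.crossplane.io/crossplane-contrib/provider-azure-network:v1.0.0", configs))
print(match_image_config(
    "xpkg.crossplane.io/crossplane-contrib/provider-aws-s3:v1.0.0", configs))
```

A provider pulled in as a dependency has the same image reference as one installed directly, so it matches the same prefix and receives the same runtime config.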

RequiredSchemas for functions: Composition functions sometimes need the OpenAPI schema of a resource to validate inputs, make schema-aware decisions, or generate resources dynamically. Before v2.2 you could ask Crossplane for the corresponding CRD as a RequiredResource, parse it, and extract the schema yourself, but only for custom resources, since built-in kinds like Deployment don't have a CRD. v2.2 introduces RequiredSchemas on the RunFunctionResponse, which lets a function obtain the schema for any kind, built-in or custom.
crossplane beta trace improvements: You can now pass a kind (and optionally a namespace) instead of a single resource and get the dependency tree for every instance. And --watch (alias -w) keeps the output live, the way kubectl get -w does.
Function packages no longer install bundled CRDs: CRDs included in a function package are not applied to the cluster anymore. Also, packages with unknown or disallowed kinds now install successfully and simply skip those objects. Previously, the install would fail in these cases.
The package cache layout has changed: Cache filenames now come from the package’s OCI source and digest instead of the PackageRevision’s Kubernetes name. This change affects side-loading used by some provider e2e suites.

Operational reality

The management cluster is your state. Crossplane does not use an external state file. All XRDs, Compositions, XRs, and managed resources are stored in the management cluster’s etcd. If you lose that cluster without backups, your cloud resources keep running, but Crossplane loses track of them and stops reconciling. Silent drift can build up. Treat the management cluster like any production-critical Kubernetes cluster: use a highly available control plane, back up etcd, and avoid running it on your laptop. Local k3s or kind clusters are fine for learning, demos, or the Get Started guide, but not for important state. This is the trade-off for not having a Terraform state file: you solve one operational problem but gain another that is easier to overlook.
Upgrade through v2.1, not directly. Crossplane performs CRD migrations with each minor version upgrade, so skipping versions can cause you to miss important migrations. If you are on v1.x, use the Crossplane v2 upgrade guide. If you are on v2.1, upgrade directly to v2.2.
v1.20 is not yet end-of-life. It is still supported, but it is a maintenance-only branch, so it’s a good time to start planning your upgrade to v2.x.
Pipeline inspector is alpha. The flag is off by default and the contract may still change. Sidecar image versioning is also not stable yet. Try it in development, since function pipelines are much easier to understand when you can see them, but do not add it to your incident-response runbook yet.
Namespaced MRs are not yet universal. AWS managed resources are fully namespaced. The Upbound Azure provider, which is widely used, and GCP are currently rolling out this feature.
v2 removed several things. Native patch-and-transform composition, the ControllerConfig type, external secret stores, composite resource connection details, and the default registry for packages are no longer available. Most users can upgrade without breaking changes, but if you use these features, you will need to do some cleanup. Before upgrading, run kubectl get pkg and make sure every package uses a fully qualified image, such as registry.example.com/repo/package:tag.

A practical recommendation

If you are considering a control plane for your Kubernetes platform and do not have a strong reason to stick with Terraform, try Crossplane v2.2 first. The Get Started guide can be completed in an afternoon on any Kubernetes cluster. If Crossplane meets your needs, you can manage both application and infrastructure workflows with one declarative model. If not, you will have a clear, documented reason to keep your current tools.

If you are already using Crossplane v2.1, upgrade to v2.2. Features like server-side apply on the MRD controller, dependency-aware runtime config, schema access for functions, and better trace output are valuable even if you do not use the pipeline inspector. If you are still on v1.x, pin to v1.20, migrate any deprecated features, then upgrade to v2.x and continue from there. v2 offers good backward compatibility, but the deprecations are real.