feat(metrics): Migrate from OpenCensus to OpenTelemetry#9043

Open
khrm wants to merge 7 commits into tektoncd:main from khrm:mig-otel
Conversation

@khrm
Contributor

@khrm khrm commented Sep 29, 2025

Changes

  • Updated pipelinerunmetrics and taskrunmetrics to use OpenTelemetry instruments (histograms, counters, gauges) for creating and recording metrics.
  • Introduced new OpenTelemetry configurations in config/config-observability.yaml for exporters and protocols.
  • Rewrote the test suites for pipelinerunmetrics and taskrunmetrics to be compatible with the new OpenTelemetry-based implementation.
  • Updated knative to 1.19

fixes #8969

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Breaking Changes in metrics

  1. Infrastructure Metric Renaming
  Infrastructure metrics (Go runtime, Workqueue, K8s Client) have been renamed from the tekton_pipelines_controller_ prefix to standard OpenTelemetry/Knative namespaces.


┌────────────┬───────────────────────────────────────────────────────────────┬───────────────────────────────────────────────┬────────────────────────────────────┐
│ Category   │ Old Metric Name (OpenCensus)                                  │ New Metric Name (OpenTelemetry)               │ Changes                            │
├────────────┼───────────────────────────────────────────────────────────────┼───────────────────────────────────────────────┼────────────────────────────────────┤
│ Workqueue  │ tekton_pipelines_controller_workqueue_adds_total              │ kn_workqueue_adds_total                       │ Prefix change                      │
│            │ tekton_pipelines_controller_workqueue_depth                   │ kn_workqueue_depth                            │ Prefix change                      │
│            │ tekton_pipelines_controller_workqueue_queue_latency_seconds   │ kn_workqueue_queue_duration_seconds           │ Renamed latency -> duration        │
│            │ tekton_pipelines_controller_workqueue_work_duration_seconds   │ kn_workqueue_process_duration_seconds         │ Renamed work -> process            │
│            │ tekton_pipelines_controller_workqueue_retries_total           │ kn_workqueue_retries_total                    │ Prefix change                      │
│            │ tekton_pipelines_controller_workqueue_unfinished_work_seconds │ kn_workqueue_unfinished_work_seconds          │ Prefix change                      │
│ K8s Client │ tekton_pipelines_controller_client_latency                    │ http_client_request_duration_seconds          │ Standard HTTP metric               │
│            │ tekton_pipelines_controller_client_results                    │ kn_k8s_client_http_response_status_code_total │ Detailed status tracking           │
│ Go Runtime │ tekton_pipelines_controller_go_*                              │ go_* (e.g. go_goroutines)                     │ Standard Prometheus Go collector   │
└────────────┴───────────────────────────────────────────────────────────────┴───────────────────────────────────────────────┴────────────────────────────────────┘
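
For anyone migrating dashboard or alert queries in bulk, the renames in the table above can be applied mechanically. The following is only a minimal sketch: `RewriteQuery` and `renamePairs` are hypothetical names, the pairs are copied from the table, and the sketch rewrites metric names only (it does not account for any label differences on the new metrics).

```go
package main

import (
	"fmt"
	"strings"
)

// renamePairs lists old/new metric names as given in the release-note table.
// The last pair is a prefix rewrite covering the Go runtime metrics.
var renamePairs = []string{
	"tekton_pipelines_controller_workqueue_adds_total", "kn_workqueue_adds_total",
	"tekton_pipelines_controller_workqueue_depth", "kn_workqueue_depth",
	"tekton_pipelines_controller_workqueue_queue_latency_seconds", "kn_workqueue_queue_duration_seconds",
	"tekton_pipelines_controller_workqueue_work_duration_seconds", "kn_workqueue_process_duration_seconds",
	"tekton_pipelines_controller_workqueue_retries_total", "kn_workqueue_retries_total",
	"tekton_pipelines_controller_workqueue_unfinished_work_seconds", "kn_workqueue_unfinished_work_seconds",
	"tekton_pipelines_controller_client_latency", "http_client_request_duration_seconds",
	"tekton_pipelines_controller_client_results", "kn_k8s_client_http_response_status_code_total",
	"tekton_pipelines_controller_go_", "go_",
}

// RewriteQuery (hypothetical helper) replaces every old metric name found in
// a query string with its new name and leaves all other text untouched.
func RewriteQuery(query string) string {
	return strings.NewReplacer(renamePairs...).Replace(query)
}

func main() {
	fmt.Println(RewriteQuery("rate(tekton_pipelines_controller_workqueue_adds_total[5m])"))
}
```

Because the replacement is a plain substring match, histogram series suffixes such as `_bucket` carry over unchanged, and the preserved PipelineRun/TaskRun metric names are not touched.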


  2. Label Changes in Tekton Metrics
  While the core metric names for PipelineRuns and TaskRuns have been preserved, some labels have changed to provide more detail or align with OTel standards.

   * `tekton_pipelines_controller_pipelinerun_duration_seconds`:
       * Added: reason label (e.g., Completed, Succeeded). This allows for more granular breakdown of durations but increases cardinality.
   * `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds`:
       * Added: reason label.
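
To illustrate what the added reason label means for consumers, here is a rough sketch of aggregating the new dimension away, in the spirit of a `sum without (reason)` aggregation. The `Sample` type, the `SumWithout` function, and the `pipeline` label name are assumptions made for illustration; they are not part of the PR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is a hypothetical, simplified metric sample: a label set and a value.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// SumWithout sums sample values after dropping one label, grouping by the
// remaining labels. The group key is the sorted "name=value" list.
func SumWithout(samples []Sample, drop string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		keys := make([]string, 0, len(s.Labels))
		for k := range s.Labels {
			if k != drop {
				keys = append(keys, k)
			}
		}
		sort.Strings(keys) // stable key regardless of map iteration order
		parts := make([]string, 0, len(keys))
		for _, k := range keys {
			parts = append(parts, k+"="+s.Labels[k])
		}
		out[strings.Join(parts, ",")] += s.Value
	}
	return out
}

func main() {
	samples := []Sample{
		{Labels: map[string]string{"pipeline": "p1", "reason": "Succeeded"}, Value: 3},
		{Labels: map[string]string{"pipeline": "p1", "reason": "Completed"}, Value: 2},
		{Labels: map[string]string{"pipeline": "p2", "reason": "Succeeded"}, Value: 1},
	}
	fmt.Println(SumWithout(samples, "reason"))
}
```

Each distinct reason value splits what used to be one series into several, which is why existing aggregations may need to drop the label to reproduce their previous results.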

  3. Removed Metrics
   * `tekton_pipelines_controller_reconcile_count` and `tekton_pipelines_controller_reconcile_latency`: These custom reconcile metrics are no longer emitted. Users should rely on
     kn_workqueue_process_duration_seconds and kn_workqueue_adds_total to monitor controller performance.

  What Remained Compatible
  The following critical metrics have been explicitly preserved to minimize disruption:
   * TaskRun Pod Latency: tekton_pipelines_controller_taskruns_pod_latency_milliseconds remains a Gauge (preserving behavior despite OTel defaults preferring Histograms).
   * Total Counters: tekton_pipelines_controller_pipelinerun_total and tekton_pipelines_controller_taskrun_total retain their original labels (status) and do not include the high-cardinality
     reason label.
   * Bucket Boundaries: Duration histograms (e.g., taskrun_duration_seconds) retain their specific explicit buckets (10s, 30s, 1m, etc.) instead of defaulting to OTel's millisecond-focused
     buckets.

 Upgrade Actions
   1. Update Config: Ensure your config-observability ConfigMap uses metrics-protocol: prometheus (or grpc/http) instead of the old metrics.backend-destination. If you were already using Prometheus, no changes are needed.
   2. Update Dashboards:
       * Replace tekton_pipelines_controller_workqueue_* queries with kn_workqueue_*.
       * Replace tekton_pipelines_controller_go_* queries with standard go_* metrics.
       * Check aggregations on pipelinerun_duration_seconds to account for the new reason label (if necessary, use sum by (...) over the labels you want to keep, which aggregates away the new dimension).
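
The config step above could be checked before upgrading with something like the sketch below. It is hedged heavily: `CheckObservabilityConfig` is a hypothetical helper, the key names and the accepted values (prometheus, grpc, http) are taken from this release note as written, and the exact values accepted by the new configuration should be confirmed against config/config-observability.yaml.

```go
package main

import "fmt"

const (
	legacyKey   = "metrics.backend-destination" // old key named in the release note
	protocolKey = "metrics-protocol"            // new key named in the release note
)

// CheckObservabilityConfig (hypothetical) inspects the data of a
// config-observability ConfigMap and returns human-readable findings.
func CheckObservabilityConfig(data map[string]string) []string {
	var findings []string
	// Per the release note, a Prometheus setup needs no changes, so only a
	// non-Prometheus legacy destination is flagged.
	if v, ok := data[legacyKey]; ok && v != "prometheus" {
		findings = append(findings,
			fmt.Sprintf("%s=%q is a legacy setting; set %s instead", legacyKey, v, protocolKey))
	}
	// Values assumed from the release note text; verify before relying on them.
	if v, ok := data[protocolKey]; ok {
		switch v {
		case "prometheus", "grpc", "http":
		default:
			findings = append(findings,
				fmt.Sprintf("%s=%q is not one of the values named in the release note", protocolKey, v))
		}
	}
	return findings
}

func main() {
	for _, f := range CheckObservabilityConfig(map[string]string{legacyKey: "opencensus"}) {
		fmt.Println(f)
	}
}
```

An empty result means nothing in the map conflicts with the upgrade note; it does not prove the exporter endpoint or other settings are correct.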

/kind feature

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesnt merit a release note. labels Sep 29, 2025
@tekton-robot tekton-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 29, 2025
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pipelinerunmetrics/metrics.go 86.7% 68.8% -17.9
pkg/reconciler/pipelinerun/pipelinerun.go 91.6% 91.6% -0.0
pkg/taskrunmetrics/metrics.go 87.3% 72.8% -14.5

@khrm khrm force-pushed the mig-otel branch 2 times, most recently from bab5a44 to a2ea129 Compare September 29, 2025 10:53
@waveywaves
Member

/retest

@waveywaves
Member

/retest

@vdemeester vdemeester added this to the v1.7.0 milestone Oct 1, 2025
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2025
@vdemeester vdemeester modified the milestones: v1.7.0, v1.8.0 Dec 3, 2025
@enarha

enarha commented Dec 15, 2025

Re knative bump, we discussed multiple times that we want knative/pkg@04fdd0b included in the next bump, even though it only exists in main for now (not even in 1.20). It'll unlock the better finalizer management which proved to be an issue for Tekton deployments with multiple controllers managing the same PR/TR resources. So do we want to bump knative/pkg even higher?

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2025
@khrm
Contributor Author

khrm commented Dec 16, 2025

/assign @vdemeester @waveywaves

@khrm
Contributor Author

khrm commented Dec 16, 2025

@enarha I have updated to the latest knative/pkg for now.

@khrm
Contributor Author

khrm commented Dec 16, 2025

Why are the CI tests being skipped?

@khrm
Contributor Author

khrm commented Dec 16, 2025

/kind feature

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 19, 2026
@khrm
Contributor Author

khrm commented Jan 19, 2026

@twoGiants @afrittoli Can you guys please review this?

@waveywaves
Member

/retest

@waveywaves
Member

/test pull-tekton-pipeline-go-coverage-df

@khrm khrm force-pushed the mig-otel branch 3 times, most recently from 85c3c24 to 7df1705 Compare January 28, 2026 09:02
@divyansh42
Member

@khrm I see in docs/metrics.md that there is another metric named tekton_pipelines_controller_taskrun_duration_seconds. If I understand this correctly, it will also have the reason label as mandatory.
But we have not included it in the Label Changes section of the release notes. Can you please take a look?

@khrm
Contributor Author

khrm commented Jan 29, 2026

@divyansh42 It's there in the release note, isn't it?

@divyansh42
Member

> @divyansh42 It's there in the release note, isn't it?

@khrm I only see two metrics, tekton_pipelines_controller_pipelinerun_duration_seconds and tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds, included in the Label Changes section.

@vdemeester vdemeester modified the milestones: v1.9.0 (LTS), v1.10.0 Jan 29, 2026
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 30, 2026
@vdemeester vdemeester added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jan 30, 2026
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2026
khrm added 7 commits February 4, 2026 15:31
  • This upgrades knative to latest and other dependent dependencies.
  • Replaces the legacy knative.dev/pkg/metrics dependency with knative.dev/pkg/observability/configmap to align with the OpenTelemetry migration and ensure correct configuration loading.
  • This isn't supported after knative otel's migration.
  • The knativetest.Flags.EmitMetrics flag is no longer supported after the migration to OpenTelemetry. This commit removes the conditional initialisation to align with the new metrics system.
  • Run hack/update-codegen.sh to update generated client code, deepcopy functions, and CRD manifests. This aligns the generated code with recent dependency updates, including changes to finalizer management and context usage in informers.
  • Update the expected checksums in Pipeline and Task unit tests to match the new values generated after the knative update:
      - pkg/apis/pipeline/v1
      - pkg/apis/pipeline/v1beta1
  • This commit migrates the metrics for PipelineRuns and TaskRuns from OpenCensus to OpenTelemetry. The following changes are included:
      - Updated the observability config to support OpenTelemetry.
      - Migrated the implementation of PipelineRun and TaskRun metrics to use the OpenTelemetry Go SDK.
      - Updated the tests to work with the new OpenTelemetry-based implementation.
@khrm
Contributor Author

khrm commented Feb 4, 2026

e2e tests were failing due to this:

panic: conflicting Schema URL: https://opentelemetry.io/schemas/1.39.0 and https://opentelemetry.io/schemas/1.37.0

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Drop use of knative.dev/pkg/metrics