Skip to content

Operations

Metrics endpoint

The proxy emits OpenTelemetry metrics on a separate listener (default :9090) so scrape traffic is isolated from the forwarding path: no JWT is attached, no upstream is resolved, and proxy TLS is not applied.

curl -s http://localhost:9090/metrics | head

Metric names and attributes follow the OpenTelemetry HTTP semantic conventions (v1.26.0).

Instruments

http_server_request_duration_seconds

Histogram of inbound request latency. Attributes: http_request_method, http_response_status_code. Use the histogram's _count series to count requests; there is no separate request counter.

http_server_active_requests

Gauge of concurrent in-flight requests.

http_client_request_duration_seconds

Histogram of outbound upstream call latency. Attributes:

  • server_address, server_port (omitted when the port is the scheme default)
  • error_type (only on failures)
  • token_result - ok, fetch_error, or resolver_error - so a single panel can distinguish upstream failures from token-acquisition failures.

token_fetch_count_total

Counter of AWS STS GetWebIdentityToken calls. Attributes: audience (normalized comma-joined set), result (ok or error), and error_class on errors (AWS error code, or transport).

token_cache_hit_count_total / token_cache_miss_count_total

Counters keyed by audience (normalized comma-joined set).

token_cached_audiences

Gauge of distinct audience sets currently held in the token cache. Watch this under dynamic audience mode - the cache has no eviction.

Go runtime metrics

Standard go_* instruments.

Attribute rules

  • http_request_method uses the nine canonical HTTP methods literally; anything else collapses to _OTHER, per semconv.
  • http_response_status_code is the full integer status (no class bucketing); write PromQL like {http_response_status_code=~"5.."} for the class.
  • error_type values: timeout, connection_refused, unknown. The attribute is absent on success - do not query for error_type="success".
  • token_result values: ok on success, fetch_error when STS fails, resolver_error when the audience resolver fails. Always present on http_client_request_duration_seconds.
  • audience on token_* series is the normalized (sorted, deduped, comma-joined) audience set, so a,b and b,a share a time series.
  • Request path is never emitted as an attribute.

Every series carries scope metadata: otel_scope_name, otel_scope_version (populated from runtime/debug.ReadBuildInfo() - module version in release builds, VCS revision in dev, (devel) otherwise), and otel_scope_schema_url.

Failure modes

  • STS token fetch fails → proxy returns 502 Bad Gateway with body token unavailable. Upstream is not called.
  • Audience resolver fails (dynamic mode only, unusual) → proxy returns 502 Bad Gateway. Upstream is not called.
  • Upstream timeout (--upstream-timeout, default 30s) → proxy returns a 5xx and records error_type=timeout on http_client_request_duration_seconds.
  • Upstream connection refused / other transport failureerror_type=connection_refused or unknown.

Token cache behavior

Tokens are cached per audience set and reused until they near expiry. "Near expiry" is controlled by --token-refresh-skew (default 5m) relative to --token-duration (default 1h): a token is proactively refreshed when its remaining life drops below the skew.

Cache hits and misses are tracked by token_cache_hit_count_total and token_cache_miss_count_total. The first request per unique audience set incurs a single STS round-trip; subsequent requests are served from memory.

Endpoint safety

The metrics endpoint carries no authentication. Bind it to loopback or a private CIDR, or gate it with a NetworkPolicy. Set --metrics-enabled=false to disable it entirely.