Operations
Metrics endpoint
The proxy emits OpenTelemetry metrics on a separate listener (default :9090) so scrape traffic is isolated from the forwarding path: no JWT is attached, no upstream is resolved, and proxy TLS is not applied.
Metric names and attributes follow the OpenTelemetry HTTP semantic conventions (v1.26.0).
Instruments
http_server_request_duration_seconds
Histogram of inbound request latency. Attributes: http_request_method, http_response_status_code. Use the histogram's _count series to count requests; there is no separate request counter.
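Why no separate counter is needed can be shown with a toy Python model of a Prometheus-style cumulative histogram. This illustrates the exposition format, not the proxy's code; LatencyHistogram and the bucket bounds are hypothetical. Every observation increments _count, and the +Inf bucket always equals _count. A query such as sum by (http_response_status_code) (rate(http_server_request_duration_seconds_count[5m])) therefore gives the request rate per status code.

```python
import math
from bisect import bisect_left
from typing import List, Tuple


class LatencyHistogram:
    """Toy cumulative histogram in the Prometheus exposition style (hypothetical)."""

    def __init__(self, bounds: List[float]):
        self.bounds = sorted(bounds)                # finite "le" upper bounds
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.sum = 0.0                              # exported as _sum
        self.count = 0                              # exported as _count

    def observe(self, seconds: float) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket (le semantics).
        self.counts[bisect_left(self.bounds, seconds)] += 1
        self.sum += seconds
        self.count += 1  # every request increments _count, whatever its latency

    def cumulative(self) -> List[Tuple[float, int]]:
        # Buckets are exposed cumulatively, so the +Inf bucket always equals _count.
        out, running = [], 0
        for le, n in zip(self.bounds + [math.inf], self.counts):
            running += n
            out.append((le, running))
        return out


h = LatencyHistogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.1, 0.7, 3.0):
    h.observe(latency)
print(h.cumulative())  # [(0.1, 2), (0.5, 2), (1.0, 3), (inf, 4)]
print(h.count)         # 4
```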
http_server_active_requests
Gauge of concurrent in-flight requests.
http_client_request_duration_seconds
Histogram of outbound upstream call latency. Attributes:
- server_address
- server_port (omitted when the port is the scheme default)
- error_type (only on failures)
- token_result: ok, fetch_error, or resolver_error, so a single panel can distinguish upstream failures from token-acquisition failures
token_fetch_count_total
Counter of AWS STS GetWebIdentityToken calls. Attributes: audience (the normalized, comma-joined audience set), result (ok or error), and, on errors only, error_class (the AWS error code, or transport for transport-level failures).
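A minimal Python sketch of the audience normalization follows. It assumes the audience set arrives as a list of strings. normalize_audience is a hypothetical name, and the sketch does not handle audiences that themselves contain commas.

```python
from typing import Iterable


def normalize_audience(audiences: Iterable[str]) -> str:
    """Sorted, deduplicated, comma-joined audience set used as the label value."""
    return ",".join(sorted(set(audiences)))


# Order and duplicates do not matter, so these requests share one time series.
print(normalize_audience(["b", "a"]))       # a,b
print(normalize_audience(["a", "b", "a"]))  # a,b
```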
token_cache_hit_count_total / token_cache_miss_count_total
Counters keyed by audience (normalized comma-joined set).
token_cached_audiences
Gauge of distinct audience sets currently held in the token cache. Watch it in dynamic audience mode: the cache has no eviction, so every new audience set adds an entry that is never removed.
Go runtime metrics
Standard go_* instruments.
Attribute rules
- http_request_method uses the nine canonical HTTP methods (CONNECT, DELETE, GET, HEAD, OPTIONS, PATCH, POST, PUT, TRACE) literally; anything else collapses to _OTHER, per semconv.
- http_response_status_code is the full integer status (no class bucketing); to select a class, write PromQL like {http_response_status_code=~"5.."}.
- error_type values: timeout, connection_refused, unknown. The attribute is absent on success, so do not query for error_type="success".
- token_result values: ok on success, fetch_error when STS fails, resolver_error when the audience resolver fails. It is always present on http_client_request_duration_seconds.
- audience on token_* series is the normalized (sorted, deduplicated, comma-joined) audience set, so a,b and b,a share a time series.
- Request path is never emitted as an attribute.
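The two rules that most often trip up dashboards, method collapsing and status-class matching, can be sketched in Python. method_attribute and matches_status_class are hypothetical helpers that restate the rules above; they are not the proxy's code. Case-sensitive method matching is an assumption of this sketch.

```python
import re

# The nine methods the OpenTelemetry HTTP semantic conventions treat as known.
KNOWN_METHODS = frozenset(
    {"CONNECT", "DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT", "TRACE"}
)


def method_attribute(method: str) -> str:
    """http_request_method value: known methods verbatim, everything else _OTHER.

    Matching is case-sensitive here, which is an assumption of this sketch.
    """
    return method if method in KNOWN_METHODS else "_OTHER"


def matches_status_class(status: int, pattern: str = "5..") -> bool:
    """Mimic the PromQL matcher {http_response_status_code=~"5.."}.

    PromQL anchors regex matchers at both ends, hence re.fullmatch, not re.search.
    """
    return re.fullmatch(pattern, str(status)) is not None


print(method_attribute("PATCH"), method_attribute("PROPFIND"))       # PATCH _OTHER
print([s for s in (200, 404, 502, 503) if matches_status_class(s)])  # [502, 503]
```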
Every series carries scope metadata: otel_scope_name, otel_scope_version, and otel_scope_schema_url. The scope version is populated from runtime/debug.ReadBuildInfo(): the module version in release builds, the VCS revision in development builds, and (devel) otherwise.
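The version precedence can be sketched in Python as follows. main_version and settings are hypothetical inputs standing in for the Main.Version field and the Settings entries (such as vcs.revision) that Go's debug.BuildInfo exposes. The sketch reproduces the precedence described for otel_scope_version, not the proxy's actual code, and v1.4.2 is a made-up version.

```python
from typing import Mapping


def scope_version(main_version: str, settings: Mapping[str, str]) -> str:
    """Pick otel_scope_version using the documented precedence (sketch).

    main_version stands in for debug.BuildInfo.Main.Version and settings for the
    key/value pairs in BuildInfo.Settings; both are inputs to this sketch.
    """
    if main_version and main_version != "(devel)":
        return main_version              # release build: module version
    revision = settings.get("vcs.revision")
    if revision:
        return revision                  # development build stamped with VCS info
    return "(devel)"                     # nothing better available


print(scope_version("v1.4.2", {}))                            # v1.4.2
print(scope_version("(devel)", {"vcs.revision": "3f9c2e1"}))  # 3f9c2e1
```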
Failure modes
- STS token fetch fails → proxy returns 502 Bad Gateway with body "token unavailable". Upstream is not called.
- Audience resolver fails (dynamic mode only, unusual) → proxy returns 502 Bad Gateway. Upstream is not called.
- Upstream timeout (--upstream-timeout, default 30s) → proxy returns a 5xx and records error_type=timeout on http_client_request_duration_seconds.
- Upstream connection refused / other transport failure → error_type=connection_refused or unknown.
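The mapping from failure to status and attributes can be sketched in Python. classify_transport_error, short_circuit_status, and client_attributes are hypothetical helpers. Python's exception classes stand in for whatever transport errors the proxy actually sees, and the attribute dict is partial (server_address and server_port are left out).

```python
import socket
from typing import Dict, Optional


def classify_transport_error(exc: BaseException) -> str:
    """Map an upstream transport failure to an error_type value."""
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "connection_refused"
    return "unknown"  # any other transport failure


def short_circuit_status(token_result: str) -> Optional[int]:
    """Status returned without calling the upstream, or None to forward the request."""
    if token_result in ("fetch_error", "resolver_error"):
        return 502  # Bad Gateway; the fetch_error body is "token unavailable"
    return None


def client_attributes(token_result: str, error_type: Optional[str] = None) -> Dict[str, str]:
    """Partial attributes for one http_client_request_duration_seconds observation."""
    attrs = {"token_result": token_result}  # always present
    if error_type is not None:
        attrs["error_type"] = error_type    # absent on success, never "success"
    return attrs


try:
    raise ConnectionRefusedError()  # simulate a refused upstream dial
except OSError as exc:
    print(client_attributes("ok", classify_transport_error(exc)))
# {'token_result': 'ok', 'error_type': 'connection_refused'}
```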
Token cache behavior
Tokens are cached per audience set and reused until they near expiry. "Near expiry" is controlled by --token-refresh-skew (default 5m) relative to --token-duration (default 1h): a token is proactively refreshed when its remaining life drops below the skew.
Cache hits and misses are tracked by token_cache_hit_count_total and token_cache_miss_count_total. The first request for each unique audience set incurs a single STS round-trip; subsequent requests for that set are served from memory until the token's remaining life drops below the skew.
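With the defaults, a token is refreshed once its remaining life drops below 5m, which is 1h - 5m = 55 minutes after it was fetched. The rules can be sketched in Python as below. TokenCache, fetch, and clock are hypothetical. The sketch also makes three assumptions: it refreshes lazily on the next request (the proxy may refresh in the background), it computes expiry from its own clock rather than the token's expiry claim, and it counts a refresh as a miss.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class CachedToken:
    value: str
    expires_at: float  # seconds on the cache's clock


class TokenCache:
    """Per-audience-set token cache with a refresh skew and no eviction (sketch)."""

    def __init__(
        self,
        fetch: Callable[[str, float], str],  # hypothetical stand-in for the STS call
        duration: float = 3600.0,            # mirrors --token-duration (1h)
        skew: float = 300.0,                 # mirrors --token-refresh-skew (5m)
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._duration = duration
        self._skew = skew
        self._clock = clock
        self._tokens: Dict[str, CachedToken] = {}
        self.hits = 0    # token_cache_hit_count_total
        self.misses = 0  # token_cache_miss_count_total

    def get(self, audiences: Iterable[str]) -> str:
        key = ",".join(sorted(set(audiences)))  # normalized audience set
        now = self._clock()
        cached = self._tokens.get(key)
        # Serve from memory while remaining life is at least the skew.
        if cached is not None and cached.expires_at - now >= self._skew:
            self.hits += 1
            return cached.value
        self.misses += 1  # first use of this set, or inside the refresh window
        value = self._fetch(key, self._duration)
        self._tokens[key] = CachedToken(value, now + self._duration)
        return value

    @property
    def cached_audiences(self) -> int:
        # token_cached_audiences: entries are replaced on refresh, never evicted.
        return len(self._tokens)
```

Driven by a fake clock, a token fetched at t=0 is still served at t=3300 s (remaining life exactly 300 s) and is refetched by the first request after that.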
Endpoint safety
The metrics endpoint carries no authentication. Bind it to loopback or a private CIDR, or gate it with a NetworkPolicy. Set --metrics-enabled=false to disable it entirely.
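A pre-deployment check on the metrics listen address could look like the Python sketch below. listen_host and is_restricted_bind are hypothetical helpers, and the host:port format is assumed from the :9090 default. If the proxy passes this address to Go's net.Listen (an assumption), an empty host, as in :9090, listens on all interfaces, so the default would not count as restricted. Python's is_private covers RFC 1918 and some other reserved ranges, so treat it as an approximation of "private CIDR".

```python
import ipaddress


def listen_host(addr: str) -> str:
    """Host part of a host:port listen address; empty for ":9090"."""
    host, _, _port = addr.rpartition(":")
    return host.strip("[]")  # "[::1]" -> "::1"


def is_restricted_bind(host: str) -> bool:
    """True if the host is loopback or private, False if exposed or unverifiable."""
    if host == "localhost":
        return True  # assumes localhost resolves to loopback
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # empty host (all interfaces) or a hostname we cannot verify
    if ip.is_unspecified:
        return False  # 0.0.0.0 or :: also means all interfaces
    return ip.is_loopback or ip.is_private


for addr in (":9090", "0.0.0.0:9090", "127.0.0.1:9090", "10.0.0.5:9090"):
    print(addr, is_restricted_bind(listen_host(addr)))
```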