| 1 | +## OpenTelemetry API (Tracing) |
| 2 | + |
| 3 | +[OpenTelemetry](https://opentelemetry.io) (OTel for short) provides a vendor-neutral API for capturing traces, logs and metrics. |
| 4 | + |
| 5 | +Agents MAY provide a bridge implementation of the OpenTelemetry Tracing API following this specification. |
| 6 | +When available, the implementation MUST be configurable and should be disabled by default when marked as `experimental`. |
| 7 | + |
| 8 | +The bridge implementation relies on APM Server version 7.16 or later. Agents SHOULD recommend this minimum version to users in bridge documentation. |
| 9 | + |
| 10 | +Bridging here means that for each OTel span created with the API, a native span/transaction will be created and sent to APM server. |
| 11 | + |
| 12 | +### User experience |
| 13 | + |
| 14 | +At a high level, from the perspective of the application code, using the OTel bridge should not differ from using the |
| 15 | +OTel API for tracing. See [limitations](#limitations) below for details on the currently unsupported OTel features. |
| 16 | +For tracing, support should include: |
| 17 | +- creating spans with attributes |
| 18 | +- context propagation |
| 19 | +- capturing errors |
| 20 | + |
| 21 | +The aim of the bridge is to allow any application or library instrumented with the OTel API to capture OTel spans that |
| 22 | +are seamlessly delegated to Elastic APM spans/transactions. It also provides a vendor-neutral alternative to any existing |
| 23 | +manual agent API with similar features. |
| 24 | + |
| 25 | +One major difference, though, is that because the implementation of the OTel API is delegated to the Elastic APM agent, |
| 26 | +any OTel configuration present in the application code (OTel processor pipeline) or deployment |
| 27 | +(environment variables) will be ignored. |
| 28 | + |
| 29 | +### Limitations |
| 30 | + |
| 31 | +The OTel API/specification goes beyond tracing. As a result, the following OTel features are not supported: |
| 32 | +- metrics |
| 33 | +- logs |
| 34 | +- span events |
| 35 | +- span links |
| 36 | + |
| 37 | +### Spans and Transactions |
| 38 | + |
| 39 | +OTel only defines Spans, whereas Elastic APM relies on both Spans and Transactions. |
| 40 | +OTel allows users to provide the _remote context_ when creating a span, which is equivalent to providing a parent to a transaction or span. |
| 41 | +It also allows users to provide a (local) parent span. |
| 42 | + |
| 43 | +As a result, when creating Spans through OTel API with a bridge, agents must implement the following algorithm: |
| 44 | + |
| 45 | +```javascript |
| 46 | +// otel_span contains the properties set through the OTel API |
| 47 | +span_or_transaction = null; |
| 48 | +if (otel_span.remote_context != null) { |
| 49 | + span_or_transaction = createTransactionWithParent(otel_span.remote_context); |
| 50 | +} else if (otel_span.parent == null) { |
| 51 | + span_or_transaction = createRootTransaction(); |
| 52 | +} else { |
| 53 | + span_or_transaction = createSpanWithParent(otel_span.parent); |
| 54 | +} |
| 55 | +``` |
| 56 | + |
| 57 | +### Span Kind |
| 58 | + |
| 59 | +OTel spans have a `SpanKind` property ([specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind)) which is close but not strictly equivalent to our definition of spans and transactions. |
| 60 | + |
| 61 | +For both transactions and spans, an optional `otel.span_kind` property will be provided by agents when set through |
| 62 | +the OTel API. |
| 63 | +This value should be stored into Elasticsearch documents to preserve OTel semantics and help future OTel integration. |
| 64 | + |
| 65 | +Possible values are `CLIENT`, `SERVER`, `PRODUCER`, `CONSUMER` and `INTERNAL`; refer to the [specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#spankind) for details on semantics. |
| 66 | + |
| 67 | +By default, OTel spans have their `SpanKind` set to `INTERNAL` by OTel API implementation, so it is assumed to always be provided when using the bridge. |
| 68 | + |
| 69 | +For existing agents without OTel bridge or for data captured without the bridge, the APM server has to infer the value of `otel.span_kind` with the following algorithm: |
| 70 | + |
| 71 | +```javascript |
| 72 | +span_kind = null; |
| 73 | +if (isTransaction(item)) { |
| 74 | + if (item.type == "messaging") { |
| 75 | + span_kind = "CONSUMER"; |
| 76 | + } else if (item.type == "request") { |
| 77 | + span_kind = "SERVER"; |
| 78 | + } |
| 79 | +} else { |
| 80 | + // span |
| 81 | + if (item.type == "external" || item.type == "storage" || item.type == "db") { |
| 82 | + span_kind = "CLIENT"; |
| 83 | + } |
| 84 | +} |
| 85 | + |
| 86 | +if (span_kind == null) { |
| 87 | + span_kind = "INTERNAL"; |
| 88 | +} |
| 89 | + |
| 90 | +``` |
| 91 | + |
| 92 | +Although this inference is optional, it helps keep the data model closer to the OTel specification, even if the original data was sent using the native agent protocol. |
| 93 | + |
| 94 | +### Span status |
| 95 | + |
| 96 | +OTel spans have a [Status](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status) |
| 97 | +field to indicate the status of the underlying task they represent. |
| 98 | + |
| 99 | +When [Set Status](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status) is used through the OTel API, the status maps directly to `span.outcome`: |
| 100 | +- OK => Success |
| 101 | +- Error => Failure |
| 102 | +- Unset (default) => Unknown |
| 103 | + |
| 104 | +However, when the outcome is not provided explicitly, agents can infer it from the presence of a reported error. |
| 105 | +This inference is not expected with the OTel API, which carries its own status, so bridged spans/transactions should NOT have their outcome |
| 106 | +altered by the reporting (or lack of reporting) of an error. Here the behavior should be identical to when the end-user provides |
| 107 | +the outcome explicitly: the explicit value has higher priority than the inferred one. |
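The priority rule above can be illustrated with a minimal sketch. The function name `resolveOutcome`, the fields `otelStatus`, `explicitOutcome` and `hasReportedError`, and the lowercase outcome strings are assumptions made for illustration; only the status-to-outcome mapping and the priority order come from this section. How a native item infers its outcome without an error is also assumed here.

```javascript
// Status-to-outcome table from the list above (lowercase outcome strings are an assumption).
const OUTCOME_BY_OTEL_STATUS = { OK: 'success', ERROR: 'failure', UNSET: 'unknown' };

// Hypothetical helper: resolves the outcome of a span or transaction.
// `otelStatus` is undefined for items that were not created through the bridge.
function resolveOutcome(item) {
  if (item.otelStatus !== undefined) {
    // Bridged item: the OTel status wins, reported errors do not alter it.
    return OUTCOME_BY_OTEL_STATUS[item.otelStatus] || 'unknown';
  }
  if (item.explicitOutcome !== undefined) {
    // Native item with an outcome set by the end-user.
    return item.explicitOutcome;
  }
  // Native item without explicit outcome: inferred from reported errors (assumed rule).
  return item.hasReportedError ? 'failure' : 'success';
}

// A bridged span with the default status stays 'unknown' even if an error was reported.
console.log(resolveOutcome({ otelStatus: 'UNSET', hasReportedError: true }));
```

In this sketch a bridged item never reaches the inference branch, which mirrors how an explicitly provided outcome is handled.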
| 108 | + |
| 109 | +### Attributes mapping |
| 110 | + |
| 111 | +OTel relies on key-value pairs for span attributes. |
| 112 | +Keys and values are protocol-specific and are defined in the [semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace/semantic_conventions) specification. |
| 113 | + |
| 114 | +In order to minimize the mapping complexity in agents, most of the mapping between OTel attributes and agent protocol will be delegated to APM server: |
| 115 | +- All OTel span attributes should be captured as-is and written to agent protocol. |
| 116 | +- APM server will handle the mapping between OTel attributes and their native transaction/spans equivalents |
| 117 | +- Some native span/transaction attributes will still require mapping within agents for [compatibility with existing features](#compatibility-mapping) |
| 118 | + |
| 119 | +OpenTelemetry attributes should be stored in `otel.attributes` as a flat key-value pair mapping added to `span` and `transaction` objects: |
| 120 | +```json |
| 121 | +{ |
| 122 | + // [...] other span/transaction attributes |
| 123 | + "otel": { |
| 124 | + "span_kind": "CLIENT", |
| 125 | + "attributes": { |
| 126 | + "db.system": "mysql", |
| 127 | + "db.statement": "SELECT * from table_1" |
| 128 | + } |
| 129 | + } |
| 130 | +} |
| 131 | +``` |
| 132 | + |
| 133 | +Starting with version 7.16, APM server must provide a mapping that is equivalent to the native OpenTelemetry Protocol (OTLP) intake for the |
| 134 | +fields provided in `otel.attributes`. |
| 135 | + |
| 136 | +When sending data to an APM server version older than 7.16, agents MAY use span and transaction labels as a fallback to store OTel attributes, to avoid dropping information. |
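One possible shape for this fallback is sketched below. The helper names `supportsOtelAttributes` and `writeOtelData`, and the top-level `labels` field on the item, are assumptions for illustration and not part of the agent protocol definition; label key sanitization is out of scope here.

```javascript
// Hypothetical helper: true when the APM server version is 7.16 or later.
function supportsOtelAttributes(serverVersion) {
  const [major, minor] = serverVersion.split('.').map(Number);
  return major > 7 || (major === 7 && minor >= 16);
}

// Hypothetical serializer step: `item` is a span or transaction about to be sent,
// `otel` holds the span kind and attributes captured through the OTel API.
function writeOtelData(item, otel, serverVersion) {
  if (supportsOtelAttributes(serverVersion)) {
    // The server maps `otel.attributes` itself, so they are written as-is.
    item.otel = { span_kind: otel.span_kind, attributes: { ...otel.attributes } };
  } else {
    // Fallback: keep the attributes as labels rather than dropping them.
    // The `labels` location is a simplification of the real payload layout.
    item.labels = { ...(item.labels || {}), ...otel.attributes };
  }
  return item;
}

const otel = { span_kind: 'CLIENT', attributes: { 'db.system': 'mysql' } };
console.log(writeOtelData({ name: 'query' }, otel, '7.15.2'));
console.log(writeOtelData({ name: 'query' }, otel, '7.16.0'));
```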
| 137 | + |
| 138 | +### Compatibility mapping |
| 139 | + |
| 140 | +Agents should ensure compatibility with the following features: |
| 141 | +- breakdown metrics |
| 142 | +- [dropped spans statistics](handling-huge-traces/tracing-spans-dropped-stats.md) |
| 143 | +- [compressed spans](handling-huge-traces/tracing-spans-compress.md) |
| 144 | + |
| 145 | +As a consequence, agents must provide values for the following attributes: |
| 146 | +- `transaction.name` or `span.name` : value directly provided by OTel API |
| 147 | +- `transaction.type` : see inference algorithm below |
| 148 | +- `span.type` and `span.subtype` : see inference algorithm below |
| 149 | +- `span.destination.service.resource` : see inference algorithm below |
| 150 | + |
| 151 | +#### Transaction type |
| 152 | + |
| 153 | +```javascript |
| 154 | +a = transaction.otel.attributes; |
| 155 | +span_kind = transaction.otel.span_kind; |
| 156 | +isRpc = a['rpc.system'] !== undefined; |
| 157 | +isHttp = a['http.url'] !== undefined || a['http.scheme'] !== undefined; |
| 158 | +isMessaging = a['messaging.system'] !== undefined; |
| 159 | +if (span_kind == 'SERVER' && (isRpc || isHttp)) { |
| 160 | + type = 'request'; |
| 161 | +} else if (span_kind == 'CONSUMER' && isMessaging) { |
| 162 | + type = 'messaging'; |
| 163 | +} else { |
| 164 | + type = 'unknown'; |
| 165 | +} |
| 166 | +``` |
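As a usage illustration, the same inference can be wrapped in a function and applied to sample inputs. The function name `inferTransactionType` and the sample attribute values are hypothetical; the decision logic is the one described above.

```javascript
// Hypothetical wrapper around the transaction type inference described above.
// `otel` is the `otel` object of a transaction: { span_kind, attributes }.
function inferTransactionType(otel) {
  const a = otel.attributes || {};
  const isRpc = a['rpc.system'] !== undefined;
  const isHttp = a['http.url'] !== undefined || a['http.scheme'] !== undefined;
  const isMessaging = a['messaging.system'] !== undefined;
  if (otel.span_kind === 'SERVER' && (isRpc || isHttp)) {
    return 'request';
  }
  if (otel.span_kind === 'CONSUMER' && isMessaging) {
    return 'messaging';
  }
  return 'unknown';
}

// Incoming HTTP request handled by a server.
console.log(inferTransactionType({ span_kind: 'SERVER', attributes: { 'http.scheme': 'https' } })); // request
// Message received from a broker.
console.log(inferTransactionType({ span_kind: 'CONSUMER', attributes: { 'messaging.system': 'kafka' } })); // messaging
// SERVER span without RPC or HTTP attributes.
console.log(inferTransactionType({ span_kind: 'SERVER', attributes: {} })); // unknown
```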
| 167 | + |
| 168 | +#### Span type, sub-type and destination service resource |
| 169 | + |
| 170 | +```javascript |
| 171 | +a = span.otel.attributes; |
| 172 | +type = undefined; |
| 173 | +subtype = undefined; |
| 174 | +resource = undefined; |
| 175 | + |
| 176 | +httpPortFromScheme = function (scheme) { |
| 177 | + if ('http' == scheme) { |
| 178 | + return 80; |
| 179 | + } else if ('https' == scheme) { |
| 180 | + return 443; |
| 181 | + } |
| 182 | + return -1; |
| 183 | +} |
| 184 | + |
| 185 | +// extracts 'host' or 'host:port' from URL |
| 186 | +parseNetName = function (url) { |
| 187 | + var u = new URL(url); // https://developer.mozilla.org/en-US/docs/Web/API/URL |
| 188 | + if (u.port != '') { |
| 189 | + return u.host; // host:port already in URL |
| 190 | + } else { |
| 191 | + var port = httpPortFromScheme(u.protocol.substring(0, u.protocol.length - 1)); |
| 192 | + return port > 0 ? u.host + ':'+ port : u.host; |
| 193 | + } |
| 194 | +} |
| 195 | + |
| 196 | +peerPort = a['net.peer.port']; |
| 197 | +netName = a['net.peer.name'] || a['net.peer.ip']; |
| 198 | + |
| 199 | +if (netName && peerPort > 0) { |
| 200 | + netName += ':'; |
| 201 | + netName += peerPort; |
| 202 | +} |
| 203 | + |
| 204 | +if (a['db.system']) { |
| 205 | + type = 'db'; |
| 206 | + subtype = a['db.system']; |
| 207 | + resource = netName || subtype; |
| 208 | + if (a['db.name']) { |
| 209 | + resource += '/'; |
| 210 | + resource += a['db.name']; |
| 211 | + } |
| 212 | + |
| 213 | +} else if (a['messaging.system']) { |
| 214 | + type = 'messaging'; |
| 215 | + subtype = a['messaging.system']; |
| 216 | + |
| 217 | + if (!netName && a['messaging.url']) { |
| 218 | + netName = parseNetName(a['messaging.url']); |
| 219 | + } |
| 220 | + resource = netName || subtype; |
| 221 | + if (a['messaging.destination']) { |
| 222 | + resource += '/'; |
| 223 | + resource += a['messaging.destination']; |
| 224 | + } |
| 225 | + |
| 226 | +} else if (a['rpc.system']) { |
| 227 | + type = 'external'; |
| 228 | + subtype = a['rpc.system']; |
| 229 | + resource = netName || subtype; |
| 230 | + if (a['rpc.service']) { |
| 231 | + resource += '/'; |
| 232 | + resource += a['rpc.service']; |
| 233 | + } |
| 234 | + |
| 235 | +} else if (a['http.url'] || a['http.scheme']) { |
| 236 | + type = 'external'; |
| 237 | + subtype = 'http'; |
| 238 | + |
| 239 | + if (a['http.host'] && a['http.scheme']) { |
| 240 | + resource = a['http.host'] + ':' + httpPortFromScheme(a['http.scheme']); |
| 241 | + } else if (a['http.url']) { |
| 242 | + resource = parseNetName(a['http.url']); |
| 243 | + } |
| 244 | +} |
| 245 | + |
| 246 | +if (type === undefined) { |
| 247 | + if (span.otel.span_kind == 'INTERNAL') { |
| 248 | + type = 'app'; |
| 249 | + subtype = 'internal'; |
| 250 | + } else { |
| 251 | + type = 'unknown'; |
| 252 | + } |
| 253 | +} |
| 254 | +span.type = type; |
| 255 | +span.subtype = subtype; |
| 256 | +span.destination.service.resource = resource; |
| 257 | +``` |
| 258 | + |
| 259 | +### Active Spans and Context |
| 260 | + |
| 261 | +When possible, the bridge implementation MUST ensure proper interoperability between Elastic transactions/spans and OTel spans when |
| 262 | +used from their respective APIs: |
| 263 | +- After activating an Elastic span via the agent's API, the [`Context`] returned via the [get current context API] should contain that Elastic span. |
| 264 | +- When an OTel context is [attached] (aka activated), the [get current context API] should return the same [`Context`] instance. |
| 265 | +- Starting an OTel span in the scope of an active Elastic span should make the OTel span a child of the Elastic span. |
| 266 | +- Starting an Elastic span in the scope of an active OTel span should make the Elastic span a child of the OTel span. |
| 267 | + |
| 268 | +[`Context`]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md |
| 269 | +[attached]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md#attach-context |
| 270 | +[get current context API]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/context/context.md#get-current-context |
| 271 | + |
| 272 | +Both OTel and our agents have their own definition of what "active context" is, for example: |
| 273 | +- Java Agent: Elastic active context is implemented as a thread-local stack |
| 274 | +- Java OTel API: active context is implemented as a key-value map propagated through a thread-local |
| 275 | + |
| 276 | +In order to avoid potentially complex and tedious synchronization issues between OTel and our existing agent |
| 277 | +implementations, the bridge implementation SHOULD provide an abstraction to have a single "active context" storage. |
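A single "active context" storage could look like the following minimal sketch. Everything here is hypothetical (the `ActiveContextStore` class, the entry shape `{ span, otelContext }`, and the two accessor functions); it is single-threaded and ignores per-thread or async propagation, which a real agent would need.

```javascript
// Hypothetical single storage for the active context, shared by both APIs.
class ActiveContextStore {
  constructor() {
    this.stack = []; // innermost active entry is last
  }

  // Activates an entry and returns a function that deactivates it.
  activate(entry) {
    this.stack.push(entry);
    return () => {
      const index = this.stack.lastIndexOf(entry);
      if (index !== -1) {
        this.stack.splice(index, 1);
      }
    };
  }

  current() {
    return this.stack.length > 0 ? this.stack[this.stack.length - 1] : undefined;
  }
}

const store = new ActiveContextStore();

// Elastic-side view: the active span, whichever API activated it.
function elasticCurrentSpan() {
  const entry = store.current();
  return entry ? entry.span : undefined;
}

// OTel-side view: the attached Context instance when there is one, otherwise
// a context object wrapping the active Elastic span.
function otelCurrentContext() {
  const entry = store.current();
  if (!entry) {
    return undefined;
  }
  if (!entry.otelContext) {
    entry.otelContext = { span: entry.span };
  }
  return entry.otelContext;
}

// Activate an Elastic span, then attach an OTel context on top of it.
const elasticSpan = { name: 'elastic-span' };
const deactivateElastic = store.activate({ span: elasticSpan });
const otelContext = { span: { name: 'otel-span', parent: elasticCurrentSpan() } };
const detachOtel = store.activate({ span: otelContext.span, otelContext: otelContext });
console.log(otelCurrentContext() === otelContext); // true
detachOtel();
console.log(elasticCurrentSpan() === elasticSpan); // true
deactivateElastic();
```

Because both accessors read from the same store, neither API needs to be synchronized with the other: activating through one is immediately visible through the other.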