-
-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Starfish Tracing Model #83
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting!
We should probably add some notes about cardinality (in span name, span tags, extracted metrics, etc.), I feel like that's necessary information to make the best decisions around this.
provide at least the following pieces of information: | ||
|
||
* `op`: defines the core operation that the span represents (eg: `db.query`) | ||
* `description`: the most significant description of what this operation is (eg: the database query). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would heavily be in favor of re-naming this to name
.
For those unfamiliar we currently have an inconsistency in the sentry schema where transactions have a name
attribute and spans have a description
(functionality identical).
I think most users are familiar with name as an attribute (closer to otel and other vendors as well then), and it would require less delta for us to approach.
We even alias span descriptions to names in the UI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we want to get rid of description
as the primary field then we need to rethink how we do the UX around some of those pieces. We currently stash into description
what others for instance stash into http.route
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to get rid of description, simply just call it name
instead of description
. It can work the same exact way.
text/0083-starfish-tracing-model.md
Outdated
* `segment_id`: a span that is part of a segment, always refers back to it. | ||
* `is_segment`: when set to `true` this span is a segment. | ||
* `start_time` and `end_time` to give the span time information. | ||
* `tags`: a key/value pair of arbitrary tags that are set per-span. Conceptionally however spans |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently there are a couple ways to store arbitrary data on the existing schema:
on transactions you have tags, contexts and extras (inherting from error events)
on spans you have tags and data.
As per our spec, tags are restricted to string keys -> string values, and have strict size limits., while data is restricted to string keys, but can have arbitrary values (with more liberal size limits).
This has lead to most of the SDKs storing arbitrary values in span data, and the OpenTelemetry conversion setting span attributes (otels arbitrary span data) in the data field as well.
Span data is now also used for semantic conventions, similar to how opentelemetry semantic conventions use otel span attributes. I would like to keep this behaviour. For example, the span.data
with the key blocked_main_thread
is used in the file io performance issue detector.
I would like an arbitrary place to continue to store data to still exist, so that means we either move to saying this is data
, or we update the limitations on tags
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should be keeping this. I won't have time yet to focus on the span schema so if you want, maybe you can sketch this out directly on the RFC with what the schema should look like?
* `parent_id`: a span optionally points back to a parent trace which could be from a different | ||
service | ||
* `segment_id`: a span that is part of a segment, always refers back to it. | ||
* `is_segment`: when set to `true` this span is a segment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If segment is meant to be either a transaction or an interaction, should we build it into the spec here?
so segment: "transaction" | "interaction" | null
, where null says that the span is not a segment.
Co-authored-by: Mark Story <[email protected]> Co-authored-by: Abhijeet Prasad <[email protected]>
* Enable a path that allows clients to send 100% of spans outwards at least to a local aggregator, | ||
preferrably a remote relay |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which types of aggregation do you have in mind here ?
Is that the usual metrics extraction or something else?
|
||
* `op`: defines the core operation that the span represents (eg: `db.query`) | ||
* `description`: the most significant description of what this operation is (eg: the database query). | ||
The description also gets processed and cleaned in the process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we expect to use this?
When we experimented with span search and when we built suspect spans one of the major issues was that all the useful information was stored unstructured in this field and all queries would have been regexes. It did not end well.
@nikhars , if this is going to be a searchable field you may want to look into ngram indexes.
establishing of a connection it could attach a `socket.connect_time` or similar | ||
numbers. Other examples might be things like `cache.result_size` etc. | ||
|
||
Measurements are ingested broken down by segment and span level tags before dynamic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little concerned about breaking down by all tags and letting these be arbitrary.
We have plenty of unique values in tags today.
Spans represent currently on average 25x the volume of transactions. If the cardinality will be extreme (like unique values in tags to propagate) pre-aggregation will be worthless independently on the storage used, so we need to be careful here.
Ive not been able to review this in depth but I wanted to escalate the convo we're having in Slack. I do not believe a session should be a trace, or vice versa. To me these are different constructs. I, as a user, expect a trace to represent a request cycle. That is, when I take an action, a trace occurs, and it ends when that action fully completes. That said, there are problems in all models. To illustrate what I mean, I want to call out several cases that begin root traces:
When do these traces end?
My pov is all models are trash. You lose data if you make traces behave as expected, which is what I originally describe here. Traces were not designed for long lived applications and frankly speaking they do not work for it. Should we throw it all away, call it something else, and just go with it? e.g. we could just call it Session Tracing, explicitly say a trace generally speaking will live for an entire session? This has its own problems. I load a SPA, or Slack, and keep it open for days. All those requests are a single trace.
I'm sure there are more things I've forgotten to write up here, but the problem does not seem to have a right answer. |
If we're really considering a new model, IMHO this would be a good time to tackle the time synchronization problem. Timestamps aren't enough IMHO, because each clock that generates them is synchronized independently. It's all to easy for a "distributed trace" to show nonsensical timestamps when viewed as a whole, such as server responses that complete before the request has been sent. I have some ideas here we could experiment with. Would this be welcome as part of this new model? |
* capture entire traces | ||
* browser tabs = a trace | ||
* index and extract metrics on a span level | ||
* clients send 100% of metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this is in the TODO, but how does this proposal help with this "100% sending" or "memory-pressure/payload size" issue listed above? what about the current CPU overhead?
It is helpful for me to remember our own sampling limits, even after dynamic sampling has been implemented, are often set extremely low for performance reasons https://github.com/getsentry/sentry/blob/cc47128efd5a00e66b2475acc318ff57860fdf94/src/sentry/utils/sdk.py#L35
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even with a local relay I don't understand how we could ever have metrics sampled enough to be useful for SLAs / Alerts where performance of the process is not severely degraded in medium throughput and above processes. It seems to me the only way to accomplish this is for metrics collection directly in the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah maybe this is what
extract metrics from spans pre-sampling.
is going to address?
Sorry for rather drive-by comments but since I have a vested interest in better UI tracing a couple thoughts. I want to note from experience that most SLO initiatives fail to meaningfully improve reliability in organizations. This is generally because SLOs are measured over individual services and endpoints, and that aggregation doesn't connect meaningfully to user experience. That's really the same problem @dcramer is calling out here. There's a lot of opportunity to build a better model, but I think you'd have to think somewhat platform specific. If I live just in React in my head (though I think conceptually this fits the whole web platform), the meaningful measurements I would want for my users are:
When you think about it - nearly all of these events will have some 'main' component/DOM element associated with them. Figuring out what that main component is is is hard because they nest, but logically:
All mark the start of a new transaction and most likely the end of the previous one. They also likely represent a logical 'feature' in terms an average product manager would understand. Server sent events are a wild-card because they represent an almost parallel thread to your main UI thread - but I do feel they could almost always be looked at separately as they are async and at least shouldn't represent real-time interaction (the chat scenario being the weird counterexample that must always exist - but then each chat exchange can likely be thought of as it's own transaction as well.) This might be possible through auto-instrumentation if you are willing to go into the frameworks (historically y'all are). I think you end up with sort of a 'transaction type' and then a 'transaction boundary' following this schema - where the 'transaction type' is 'route | suspense | button/form' and the boundary is the name of the associated thing (dom ID/name, component name, etc). |
You might find the OpenTelemetry Client Instrumentation SIG group interesting: Quick scan I didn't see Sentry mentioned in it yet. I think it would be great to consider @maggiepint You can have composite SLO and give different weights to the underlying ones, you can set the |
number of spans of a certain category per segment and have these counters be | ||
promoted into the segment. | ||
|
||
## Batches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an interest in a tracing model that actually works for batched workloads.
I'm trying to figure out how a tracing story could work for arroyo, and so far it seems that one would both:
- want a transaction wrapped around a function that operates on a batch
- want to be able to correlate that transaction to the origin of each message in that batch
so that to me implies that one transaction may have multiple parent traces
that's the only solution i could make sense of, but it sucks for a lot of reasons, so I'm curious whether we have a better plan here
What happened to this? I experimented with visuals which wound up similar, but not identical to https://brycemecum.com/2023/03/31/til-mermaid-tracing/ that links here. early 2023 this opened and it seems inactive, but doesn't really point at why. |
This RFC proposes a new tracing model for Sentry's performance product to better
address future product improvements. The goal of this model is to allow
storing entire traces without gaps, support dynamic sampling, indexing
of spans and to extract metrics from spans pre-sampling. This is an
evolution of the current transaction based approach.
Rendered RFC