Application Insights Trace Sampling

Adaptive Sampling

Application Insights for .NET applications will by default choose "Adaptive Sampling", which by and large will do the job. If your application is doing very little, the SDK will send all the telemetry to Application Insights. When activity picks up, the adaptive sampler will scale back and only send roughly one in N items. You can influence this behaviour somewhat using properties like MaxTelemetryItemsPerSecond and MinSamplingPercentage/MaxSamplingPercentage.

Under the hood, the AdaptiveSamplingTelemetryProcessor uses a standard SamplingTelemetryProcessor, but continually modifies the sampling rate, so let's go down the rabbit hole.

Fixed Rate Sampling

As one might guess, this sampler uses a fixed percentage to decide whether a telemetry item is chosen for sampling. If we say 5%, the sampler calculates a hash code for the item, scales it to a value between 0 and 100, and if that value is less than 5, the item is sent to Application Insights. In Application Insights 2.20 this hash value is based on the OperationId. Since the OperationId is effectively random, you will not be able to predict which items are chosen for sampling.

The function in question looks like this:

int hash = 5381;

for (int i = 0; i < input.Length; i++)
{
    hash = ((hash << 5) + hash) + (int)input[i];
}

and in the end, the result is converted to a percentage of int.MaxValue. I haven't tested the distribution of this thing, but on the surface it looks like an even distribution as long as input.Length > 6.
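
To make the conversion step concrete, here is a minimal, self-contained sketch of the whole score calculation. The names SamplingScoreSketch, Hash and GetScore are mine, and folding negative hash values back into the positive range with an absolute value is an assumption about how the score is kept between 0 and 100. Treat it as an illustration, not as the SDK source.

```csharp
using System;

public static class SamplingScoreSketch
{
    // Same loop as above: hash = hash * 33 + character, starting at 5381.
    public static int Hash(string input)
    {
        int hash = 5381;
        unchecked // integer overflow is expected and simply wraps around
        {
            for (int i = 0; i < input.Length; i++)
            {
                hash = ((hash << 5) + hash) + (int)input[i];
            }
        }

        // Assumption: negative values are folded into the positive range.
        // Math.Abs(int.MinValue) would throw, so handle that value first.
        if (hash == int.MinValue)
        {
            return int.MaxValue;
        }
        return Math.Abs(hash);
    }

    // Scales the hash to a score between 0 and 100.
    public static double GetScore(string operationId)
    {
        return (double)Hash(operationId) / int.MaxValue * 100.0;
    }
}
```

With a 5% sampling rate, an item would then be kept when GetScore(operationId) is less than 5. The same OperationId always gives the same score, which is what matters later for correlation.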

ITelemetryProcessor

By now it's necessary to introduce ITelemetryProcessor. It defines the method Process(ITelemetry item), and the processor needs to be able to participate in a processor "chain", i.e. it's instantiated with a reference to the next processor in line. The Process method would then be implemented something like this:

public void Process(ITelemetry item)
{
    if(SamplePercentage > GetTelemetryItemHash(item))
        NextProcessor.Process(item);
}

meaning that if GetTelemetryItemHash is greater than or equal to SamplePercentage, we short-circuit the processing chain and the item is dropped. That's all there is to it.
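
If you want to play with the chain idea without pulling in the SDK, the following sketch might help. SketchTelemetry and ISketchProcessor are simplified, hypothetical stand-ins for the real ITelemetry and ITelemetryProcessor, and the score function is passed in as a delegate so any hash can be plugged in.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for the SDK's ITelemetry and ITelemetryProcessor (hypothetical names).
public sealed class SketchTelemetry
{
    public string OperationId { get; set; } = "";
}

public interface ISketchProcessor
{
    void Process(SketchTelemetry item);
}

// Last link in the chain: records every item that reaches it.
public sealed class CollectingProcessor : ISketchProcessor
{
    public List<SketchTelemetry> Received { get; } = new List<SketchTelemetry>();

    public void Process(SketchTelemetry item) => Received.Add(item);
}

// Sampling link: forwards an item only when its score is below SamplePercentage.
public sealed class FixedRateProcessor : ISketchProcessor
{
    private readonly ISketchProcessor next;
    private readonly Func<string, double> score; // maps an operation id to a value in [0, 100]

    public double SamplePercentage { get; set; } = 100.0;

    public FixedRateProcessor(ISketchProcessor next, Func<string, double> score)
    {
        this.next = next;
        this.score = score;
    }

    public void Process(SketchTelemetry item)
    {
        if (SamplePercentage > score(item.OperationId))
        {
            next.Process(item); // sampled in: hand over to the next processor
        }
        // Otherwise the chain is short-circuited and the item is dropped.
    }
}
```

The chain is built back to front: create the CollectingProcessor first, then hand it to the FixedRateProcessor as its next processor.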

Adaptive Sampling part 2

The AdaptiveSamplingTelemetryProcessor wraps the fixed-rate SamplingTelemetryProcessor and supplies it with a next processor, which is able to "taste" (because "sample" is a loaded word in this context) the current sampling rate and suggest a new one, based on the requirement to only send MaxTelemetryItemsPerSecond items per second. If the actual number of items is too high, the sampling rate is decreased. If it is too low (i.e. not at the maximum), the sampling rate is increased.
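
The feedback loop can be sketched as a simple proportional adjustment. This is my own simplification under stated assumptions: AdaptiveRateSketch and SuggestPercentage are hypothetical names, and the real estimator is more careful (it smooths the measured rate over time and does not change the percentage on every evaluation). The sketch only shows the direction of the adjustment.

```csharp
using System;

public static class AdaptiveRateSketch
{
    // Suggests a new sampling percentage from the rate of items that
    // currently make it through the sampler.
    public static double SuggestPercentage(
        double currentPercentage,
        double observedItemsPerSecond,
        double maxItemsPerSecond,
        double minPercentage,
        double maxPercentage)
    {
        if (observedItemsPerSecond <= 0.0)
        {
            // Nothing is coming through, so there is no reason to hold back.
            return maxPercentage;
        }

        // Too many items: the factor is below 1 and the rate goes down.
        // Too few items: the factor is above 1 and the rate goes up.
        double suggested = currentPercentage * maxItemsPerSecond / observedItemsPerSecond;

        // Keep the result inside the configured bounds.
        return Math.Max(minPercentage, Math.Min(maxPercentage, suggested));
    }
}
```

For example, at 100% with 50 items per second coming through and a target of 5 items per second, the suggestion is 10%.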

What about correlation?

When you cross application boundaries, and the sampling rate of the dependent application (i.e. the server in a client-server setup) is at least as high as the client's, the operations sampled in the client will also be sampled in the server, because the OperationId is propagated and GetTelemetryItemHash(item) will be the same.

If both applications use adaptive sampling, there's a very real risk you won't be able to correlate the traces. A server application serving 100s or 1000s of clients will be forced to scale back its adaptive sampling to 0.1% or maybe even less, while your client will probably sample close to 100%.
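
A bit of arithmetic shows how bad this can get. Both sides compare the same score against their own percentage, so whatever the side with the lower rate keeps is a subset of what the other side keeps. Assuming the scores are evenly distributed, the share of client-sampled operations that also have server telemetry is roughly min(client, server) / client. The helper below (CorrelationSketch is a hypothetical name) just spells that out.

```csharp
using System;

public static class CorrelationSketch
{
    // Estimated fraction (0..1) of operations sampled by the client that
    // are also sampled by the server, assuming evenly distributed scores.
    public static double CorrelatedFraction(double clientPercentage, double serverPercentage)
    {
        if (clientPercentage <= 0.0)
        {
            return 0.0; // the client keeps nothing, so there is nothing to correlate
        }
        return Math.Min(clientPercentage, serverPercentage) / clientPercentage;
    }
}
```

With the client at 100% and the server at 0.1%, only about one in a thousand client traces will have a matching server trace. With the client at 5% and the server at 50%, every client trace has one.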

In Contrast to OpenTelemetry

W3C Trace Context specifies a standard traceparent header, which is something like this: 00-<32 hex chars>-<16 hex chars>-0[01].

00 = version. There is only one version right now (02/22)
<32 hex chars> = trace ID (the Operation Id in Application Insights)
<16 hex chars> = Span Id
00/01 = trace flags. 00 = not sampled or undecided, 01 = sampled
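
Picking the header apart is simple enough to show in a few lines. This is a minimal sketch that only handles version 00 and only checks the field lengths. TraceParentSketch and TryParse are hypothetical names, and a real application would rather rely on System.Diagnostics.Activity than parse the header by hand.

```csharp
using System;
using System.Globalization;

public static class TraceParentSketch
{
    // Splits a version-00 traceparent value into its parts.
    // Note: only lengths are validated, not that the ids are valid hex.
    public static bool TryParse(string header, out string traceId, out string spanId, out bool sampled)
    {
        traceId = "";
        spanId = "";
        sampled = false;

        if (header == null)
        {
            return false;
        }

        string[] parts = header.Split('-');
        if (parts.Length != 4 || parts[0] != "00" ||
            parts[1].Length != 32 || parts[2].Length != 16 || parts[3].Length != 2)
        {
            return false;
        }

        byte flags;
        if (!byte.TryParse(parts[3], NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out flags))
        {
            return false;
        }

        traceId = parts[1];
        spanId = parts[2];
        sampled = (flags & 0x01) != 0; // the lowest bit is the "sampled" flag
        return true;
    }
}
```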

Application Insights is able to understand these trace IDs, but does not use the trace flag. In a distributed scenario, trace flag = 01 should be an indicator to any dependent applications that the caller has decided to sample this operation, and so should you. Application Insights doesn't care, and just considers the hash code of the OperationId. This is in part because the sampling processor is injected very late in the processor chain and is only evaluated once the telemetry item has been created, i.e. after the instrumented operation has completed. Any dependencies will have been called before this completion, so at that point the flag is still undecided.

OpenTelemetry, on the other hand, makes the sample in/out decision when the operation is initiated, and is therefore able to set the flag.

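It should be straightforward to just consider the ITelemetry item and, if the parent context has set the trace flag, process it regardless of the hash. A sketch, assuming the trace flag of the parent is available on the item as item.Context.Flags: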

public void Process(ITelemetry item)
{
    if(item.Context.Flags == 1L || SamplePercentage > GetTelemetryItemHash(item))
        NextProcessor.Process(item);
}

 

Ingestion Sampling

The final sampling option is to do it at the Azure ingestion endpoint. In your Application Insights dashboard you can set the sampling rate in % and Azure will do the sampling (probably using the same hash function) for you. I'm not a fan of this, as you can potentially send a lot of data over the wire only to have it thrown away. Microsoft promises that only the sampled data counts towards your quota/bill but ...

A Few Gotchas

When you use sampling, whether it be adaptive or fixed rate, Application Insights will make assumptions based on your sampling rate. Let's say you are at a 5% sampling rate, and by chance some operation is chosen by the sampler. Azure will assume that, since you sample every 20th operation, 19 other operations have occurred, and that they had the exact same run time as the one sampled. The law of large numbers says that your average response times will converge to the "real" average, but you will probably need to keep this in mind, especially if you have periods of low activity.

Rarely used operations will also suffer. Let's say an endpoint is only hit maybe once or twice a day. At a 5% sampling rate, odds are that you will only see this operation once every 10-20 days, but when you see it, it will count 20x towards your averages, because Application Insights multiplies by 100/sampling rate.

And that's another thing. The Application Insights documentation says that the fixed sampling percentage must be such that Sample% = 100/N, where N is an integer. I guess it's so that Application Insights is able to use a whole-number multiplier.
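
The numbers in this section are easy to check with a few lines of code. SamplingWeightSketch and its methods are hypothetical helpers of mine, not SDK functions. They compute the multiplier, the expected gap between sightings of a rare operation, and whether a percentage has the 100/N form.

```csharp
using System;

public static class SamplingWeightSketch
{
    // How many operations one sampled item stands in for (100 / sampling rate).
    public static double ItemWeight(double samplingPercentage)
    {
        return 100.0 / samplingPercentage;
    }

    // Average number of days between sampled occurrences of an operation
    // that is hit hitsPerDay times a day.
    public static double ExpectedDaysBetweenSamples(double hitsPerDay, double samplingPercentage)
    {
        double sampledPerDay = hitsPerDay * samplingPercentage / 100.0;
        return 1.0 / sampledPerDay;
    }

    // True when the percentage can be written as 100/N for a whole number N.
    public static bool IsHundredOverN(double samplingPercentage)
    {
        if (samplingPercentage <= 0.0 || samplingPercentage > 100.0)
        {
            return false;
        }
        double n = 100.0 / samplingPercentage;
        return Math.Abs(n - Math.Round(n)) < 1e-9;
    }
}
```

At 5% the weight is 20, and an endpoint hit twice a day shows up about once every 10 days. 12.5% passes the 100/N check (N = 8), while 30% does not.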
