Collecting Hardware Performance Counters with BenchmarkDotNet
Introduction
BenchmarkDotNet is a powerful .NET library for benchmarking (more about it). This post describes how you can collect Hardware Performance Counters with BenchmarkDotNet. If you want to learn about the ETW internals behind it then you might find this post useful.
The Story
Some time ago I have noticed that software engineers who are working on the Span<T>
are struggling with some benchmarking issues. With problems like unstable results, which BenchmarkDotNet solves. So I asked why don’t they use BenchmarkDotNet?
And then Krzysztof Cwalina replied:
Yeah, we have. In general I love BenchmarkDotNet and use it in many cases. It does have better front end reporting capabilities and is more mature from reliability and statistical analysis perspective. But it seems like BenchmarkDotNet does not support measuring instructions retired, cache misses, and branch mispredictions that we often find very useful. Would be great to combine the efforts/features :-)
And Tanner Gooding mentioned that xunit-performance
can do it:
@KrzysztofCwalina, for Instructions Retired, you need [assembly: MeasureInstructionsRetired].
I also know cache misses and branch mispredictions are also supported, as they are displayed on the internal benchmark comparison site (I’m not sure how to enable those ones though).
I could finally add a new feature to BenchmarkDotNet that was related to performance, not the MSBuild/project.json stuff ;) So I just downloaded the source codes of xunit-performance and PerfView and dig into the details. Microsoft.Diagnostics.Tracing.TraceEvent
is not OSS yet, so I used my favorite .NET decompiler ILSpy to get the full picture.
Limitations
What we have today comes with following limitations:
- Accuracy is limited by sampling frequency and additional code performed in the benchmarked process by our Engine. It’s good but not perfect. For more accurate results you should use some profilers like Intel VTune Amplifier.
- Windows 8+ only
- No Hyper-V (Virtualization) support
- Requires to run as Admin (ETW Kernel Session)
- No
InProcessToolchain
support (#394) - The initial list of available Counters:
- Timer,
- TotalIssues,
- BranchInstructions,
- CacheMisses,
- BranchMispredictions,
- TotalCycles,
- UnhaltedCoreCycles,
- InstructionRetired,
- UnhaltedReferenceCycles,
- LlcReference,
- LlcMisses,
- BranchInstructionRetired,
- BranchMispredictsRetired
How to use it?
You need to install latest BenchmarkDotNet.Diagnostics.Windows
package.
It can be enabled in three ways:
- Use the new attribute (apply it on a class that contains Benchmarks):
[HardwareCounters(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions)]
public class TheClassThatContainsBenchmarks { /* benchmarks go here */ }
- Tell your custom config to use it:
private class CustomConfig : ManualConfig
{
public CustomConfig()
{
Add(Job.Default);
Add(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions);
}
}
- Passing
--counters
command line argument toBenchmarkSwitcher
. Example:--counters BranchInstructions+BranchMispredictions
Sample
Let’s use our new diagnoser to answer the famous “Why is it faster to process a sorted array than an unsorted array? ”.
class Program
{
static void Main(string[] args) => BenchmarkRunner.Run<Cpu_BranchPredictor>();
}
[HardwareCounters(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions)]
public class Cpu_BranchPredictor
{
private const int N = 32767;
private readonly int[] sorted, unsorted;
public Cpu_BranchPredictor()
{
var random = new Random(0);
unsorted = new int[N];
sorted = new int[N];
for (int i = 0; i < N; i++)
sorted[i] = unsorted[i] = random.Next(256);
Array.Sort(sorted);
}
private static int Branch(int[] data)
{
int sum = 0;
for (int i = 0; i < N; i++)
if (data[i] >= 128)
sum += data[i];
return sum;
}
private static int Branchless(int[] data)
{
int sum = 0;
for (int i = 0; i < N; i++)
{
int t = (data[i] - 128) >> 31;
sum += ~t & data[i];
}
return sum;
}
[Benchmark]
public int SortedBranch() => Branch(sorted);
[Benchmark]
public int UnsortedBranch() => Branch(unsorted);
[Benchmark]
public int SortedBranchless() => Branchless(sorted);
[Benchmark]
public int UnsortedBranchless() => Branchless(unsorted);
}
and the results:
Method | Mean | Mispredict rate | BranchInstructions/Op | BranchMispredictions/Op |
---|---|---|---|---|
SortedBranch | 21.4539 us | 0,04% | 70121 | 24 |
UnsortedBranch | 136.1139 us | 23,70% | 68788 | 16301 |
SortedBranchless | 28.6705 us | 0,06% | 35711 | 22 |
UnsortedBranchless | 28.9336 us | 0,05% | 35578 | 17 |
High Mispredict Rate (23,70%
) is a clear signal that we are victims of branch prediction failure. Which results in 6x
slowdown!!
Note: The results are scaled per operation. Mispredict rate is an extra column that is visible if you use BranchInstructions
and BranchMispredictions
counters.
Sources
- Understanding How General Exploration Works in Intel® VTune™ Amplifier XE by Jackson Marusarz (Intel)
- CPU Performance Counters on Windows by Bruce Dawson
- What is Branch Prediction? by Mysticial