Adam Sitnik

Profiling .NET on Linux with BenchmarkDotNet

2023-01-13T00:00:00+00:00

PerfCollectProfiler

PerfCollectProfiler is a new BenchmarkDotNet diagnoser (plugin) that was released as part of 0.13.3. It can profile the benchmarked .NET code on Linux and export the data to a trace file which can be opened with PerfView, speedscope or any other tool that supports the perf file format.

Demo

The following code is one of the official BenchmarkDotNet samples:

using System.IO;
using BenchmarkDotNet.Attributes;

namespace BenchmarkDotNet.Samples
{
    [PerfCollectProfiler(performExtraBenchmarksRun: false)]
    public class IntroPerfCollectProfiler
    {
        private readonly string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        private readonly string content = new string('a', 100_000);

        [Benchmark]
        public void WriteAllText() => File.WriteAllText(path, content);

        [GlobalCleanup]
        public void Delete() => File.Delete(path);
    }
}

The command:

sudo dotnet run -c Release -f net7.0 --filter '*PerfCollectProfiler*' --profiler perf --job short

The regular output:

BenchmarkDotNet=v0.13.3.20230113-develop, OS=ubuntu 18.04
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.101
  [Host] : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Dry    : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  


// * Diagnostic Output - PerfCollectProfiler *
Exported 1 trace file(s). Example:
/home/adam/projects/BenchmarkDotNet/samples/BenchmarkDotNet.Samples/BenchmarkDotNet.Artifacts/BenchmarkDotNet.Samples.IntroPerfCollectProfiler.WriteAllText-20230113-180354.trace.zip

Method	Mean	Error	StdDev
WriteAllText	96.83 us	51.98 us	2.849 us

And the new trace file opened with speedscope:

The Story

In the middle of 2019 I was working on improving the performance of string methods on Linux (#24889, #24973). When I was done with the issues reported by the customers, it became clear to me that I needed a better understanding of all the Windows vs Unix .NET performance gaps. My goal was to fix the most important issues before the customers hit them. Thanks to previous investments in the performance culture of the .NET Team it was an easy job: all I had to do was run all the dotnet/performance micro-benchmarks on the same hardware for Windows and Unix and compare the results using the ResultsComparer tool. To make it an apples-to-apples comparison I configured my work PC to dual-boot Windows 10 and Ubuntu 18.04. I wanted to include macOS too, as many .NET users develop software on macOS, so I used Boot Camp and installed Windows on my MacBook Pro. The comparison identified multiple gaps: #13628, #31268, #31269, #31270, #31271, #31273, #31275, #13669, #13675, #13676, #13684 and #31396.

Since there were plenty of them, I decided to automate the profiling. I created a new BenchmarkDotNet branch and started working on a wrapper for perfcollect. perfcollect is a very powerful bash script that automates data collection. Internally it uses perf and LTTng. All the credit for perfcollect goes to Brian Robbins, who authored this tool. perfcollect does all the heavy lifting; my BenchmarkDotNet plugin is built upon Brian’s work.

I was not able to get it working quickly and I got stuck, so I pushed my changes and switched to VTune. For this particular investigation, I stopped using perfcollect, as I did not like the fact that I had to copy the produced trace file to Windows to open it with PerfView. VTune gave me the answers I needed and I moved on to other work.

I got back to working on it in 2020, but again with no success. In September 2022 Jan Vorlicek and I started working on adding arm64 support to the Disassembly Diagnoser (another topic for a blog post). We did that as part of a one-week-long internal Microsoft Open Source hackathon. We made great progress quicker than expected and still had some time left, so I asked Jan for help with the perfcollect plugin (BTW Jan is one of the smartest people I’ve ever worked with, while also being very humble and always eager to help). Jan identified the source of my problems, I changed the way of stopping the perfcollect process and got it working. The rest is history.

How it works

PerfCollectProfiler uses the perfcollect bash script for profiling and the dotnet-symbol tool for downloading symbols for the native libraries.

Before the process with the benchmarked code is started, the plugin searches for the perfcollect file in the artifacts folder. If it’s not present, the tool has not been installed yet. In that case, the plugin loads the script file from the library resources (the script is embedded in the .dll to ensure we are using a version that we tested), stores it on the disk, makes it executable and invokes the install command (with the -force option to avoid the need for user confirmation).

Next, it identifies the .NET SDK folder path and searches for missing native symbol files (.so.dbg). When some symbols are missing, it installs the dotnet-symbol tool in a dedicated folder (to always use the latest version and avoid issues with broken existing configs) and commands it to recursively download symbols for all native libraries present in the SDK folder.

Sample log output:

// start dotnet tool install dotnet-symbol --tool-path "/tmp/BenchmarkDotNet/symbols" in 
You can invoke the tool using the following command: dotnet-symbol
Tool 'dotnet-symbol' (version '1.0.406001') was successfully installed.
// command took 2.36s and exited with 0
// start /tmp/BenchmarkDotNet/symbols/dotnet-symbol --recurse-subdirectories --symbols "/usr/share/dotnet/dotnet" "/usr/share/dotnet/lib*.so" in 
Downloading from https://msdl.microsoft.com/download/symbols/
/usr/share/dotnet/dotnet.dbg already exists, file not written
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.Globalization.Native.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libcoreclrtraceptprovider.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.Security.Cryptography.Native.OpenSsl.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libmscordaccore.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.IO.Compression.Native.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.Net.Security.Native.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libcoreclr.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.Native.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libclrjit.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libmscordbi.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libhostpolicy.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libclrgc.so.dbg
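The search for missing symbol files described above can be sketched as follows. This is only a minimal illustration, not the actual BenchmarkDotNet code: the MissingSymbolsFinder class and the FindLibrariesWithoutSymbols method are hypothetical names, and the sketch assumes that a library is "missing symbols" when no file with the same name plus the .dbg suffix sits next to it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of BenchmarkDotNet.
public static class MissingSymbolsFinder
{
    // Returns every lib*.so file under sdkRoot (searched recursively)
    // that has no matching .so.dbg file in the same folder.
    public static IReadOnlyList<string> FindLibrariesWithoutSymbols(string sdkRoot)
        => Directory
            .EnumerateFiles(sdkRoot, "lib*.so", SearchOption.AllDirectories)
            .Where(library => !File.Exists(library + ".dbg"))
            .OrderBy(path => path, StringComparer.Ordinal)
            .ToList();
}
```

If the returned list is empty, there is nothing to download and the dotnet-symbol step could be skipped entirely.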

Once everything is in place, the diagnoser starts the perfcollect process. perfcollect does all the heavy lifting (the creation of LTTng sessions, perf tool usage etc). When BenchmarkDotNet starts the benchmarking process, it sets all the necessary environment variables. By doing that, it ensures that all symbols get resolved and the trace file is complete.

When the benchmarking process quits, the plugin stops the perfcollect process by sending it the SIGINT signal. perfcollect works for a moment, exports the trace file and quits.
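Stopping a process with SIGINT from .NET can be sketched as below. This is a minimal, Linux-only illustration and not the exact BenchmarkDotNet code: SigIntSender and StopGracefully are hypothetical names, and the sketch assumes that calling the libc kill function via P/Invoke is acceptable in your environment.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of BenchmarkDotNet.
public static class SigIntSender
{
    private const int SIGINT = 2; // the signal that Ctrl+C sends on Linux

    [DllImport("libc", SetLastError = true)]
    private static extern int kill(int pid, int sig);

    // Sends SIGINT to the process and waits for it to finish its cleanup.
    // Returns true when the process exited within the timeout.
    public static bool StopGracefully(Process process, TimeSpan timeout)
    {
        if (kill(process.Id, SIGINT) != 0)
        {
            throw new InvalidOperationException(
                $"kill failed with errno {Marshal.GetLastWin32Error()}");
        }

        return process.WaitForExit((int)timeout.TotalMilliseconds);
    }
}
```

The point of SIGINT over Process.Kill is that the script gets a chance to react: perfcollect still has to export the trace file before it quits, so the caller has to wait for the exit instead of assuming it is immediate.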

Limitations

PerfCollectProfiler has the following limitations:

It supports only Linux. On Windows you can use EtwProfiler; on other Unixes, like macOS, use EventPipeProfiler. EventPipeProfiler supports every OS, but it has no information about native methods.
It requires running as root. It’s a PITA, as all the files BDN creates directly and indirectly will be owned by root.
No InProcessToolchain support (and no plans to add it).
Currently the trace file contains no events, but we are working on it.

How to use it?

You need to install BenchmarkDotNet 0.13.3 or newer.

It can be enabled via command line arguments (as long as you pass args to BenchmarkSwitcher or BenchmarkRunner). You won’t need to recompile your code to use it:

--profiler perf

Or you can extend DefaultConfig.Instance with an instance of PerfCollectProfiler and the profiler will work for all benchmarks:

class Program
{
    static void Main(string[] args) 
        => BenchmarkSwitcher
            .FromAssembly(typeof(Program).Assembly)
            .Run(args,
                DefaultConfig.Instance
                    .AddDiagnoser(PerfCollectProfiler.Default)); // HERE
}

Or you can apply the attribute, but it will work only for benchmarks from the given class:

[PerfCollectProfiler]
public class TheClassThatContainsBenchmarks { /* benchmarks go here */ }

Configuration

To configure the new diagnoser you need to create an instance of the PerfCollectProfilerConfig class and pass it to the PerfCollectProfiler constructor. The parameters that the config ctor accepts are:

performExtraBenchmarksRun: when set to true, benchmarks will be executed one more time with the profiler attached. If set to false, there will be no extra run but the results will contain overhead. False by default, as I expect the overhead to be less than 3%.
timeoutInSeconds: how long BenchmarkDotNet should wait for the perfcollect script to finish processing the trace. 300s by default.

The default config should be fine for 99% of users ;)
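For the remaining 1%, a custom configuration could look like the sketch below. The two parameter names come from the list above; the concrete values are arbitrary examples, and I am assuming that both types live in the BenchmarkDotNet.Diagnosers namespace and that the AddDiagnoser config extension is available in your BenchmarkDotNet version.

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;

class Program
{
    static void Main(string[] args)
    {
        // Example values only, not recommendations.
        var profilerConfig = new PerfCollectProfilerConfig(
            performExtraBenchmarksRun: true, // profile in a separate, extra run
            timeoutInSeconds: 600);          // wait up to 10 minutes for perfcollect

        var config = DefaultConfig.Instance
            .AddDiagnoser(new PerfCollectProfiler(profilerConfig));

        BenchmarkSwitcher
            .FromAssembly(typeof(Program).Assembly)
            .Run(args, config);
    }
}
```

Setting performExtraBenchmarksRun to true trades a longer total run time for reported results that are free of the profiler overhead.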

Analyzing the trace files

There are multiple ways to work with the trace files produced by perfcollect:

You can copy them to Windows and open with PerfView.
You can unzip the trace file, take the file produced by the perf utility and open it with any tool that supports the perf file format.

If you are not familiar with speedscope you can read my old blog post about it. The tool is so intuitive that you don’t really need to prepare yourself for using it.

Unzip the trace file, go to https://www.speedscope.app/, select Browse and choose the perf.data.txt file.
perfcollect by default performs machine-wide profiling and speedscope shows only data from one thread at a time. So you need to select the thread from the thread list:

Now you can just choose one of the tabs, depending on what kind of visualization you prefer. In case you like flamegraphs you can go to “Left Heavy”:

By default, BenchmarkDotNet performs Warmup, Pilot and Overhead phases before starting the actual benchmark workload. Just filter the trace file to the actual workload, unless you are interested in cold startup time.

Special Thanks

I wanted to thank:

Brian Robbins for authoring perfcollect and providing ongoing help.
Jan Vorlicek for helping me with the investigation and unblocking me.

No blog posts for the last four years

I have not posted anything on this blog for almost four years. I simply lost the motivation, and I am not comfortable speaking in public about the reasons behind it.

But one of the things that makes me very happy is helping animals. Last year I officially became a volunteer at a local animal shelter. My duties are mainly cleaning the cages, feeding the bunnies, and driving them to/from the vet. But I am also helping the shelter financially.

To optimize my impact, I wanted to kindly ask you for a donation for the bunnies. You can do it online via the https://pomagam.pl/en/nowyrok-staredlugi website. For translation from Polish you can use this Google Translate link.

If you donate please leave a comment on the donation website, here on my blog, send me an email or tag me on Twitter. I am going to donate the same amount and respond back. To optimize even further I am going to fill in the paperwork and ask my current employer (Microsoft) to donate the same amount I’ve donated (yes, MS offers such a perk!). So, for every dollar you donate, the shelter gets three dollars.

I want to verify what is the best way I can help the shelter: cleaning bunnies’ cages or sharing my knowledge online and asking for donations.

If you restore my faith in humanity I am going to blog again. Possible topics:

Cross platform and cross architecture disassembler.
Startup time performance investigation based on System.CommandLine example.
The story and reasoning behind improving Sockets performance on Linux in .NET 5.
Best practices for Fast File IO with .NET.

Thank you, Adam

Profiling .NET Code with PerfView and visualizing it with speedscope.app

2019-03-22T00:00:00+00:00

speedscope.app

According to the official web page, speedscope.app is “a fast, interactive web-based viewer for performance profiles”. But I believe it’s more than that! In my opinion, it’s one of the best visualization tools for performance profiles ever!

Some time ago I implemented SpeedScopeExporter, which allows exporting any .NET trace file to the speedscope JSON file format. It was released as part of the 2.0.34 TraceEvent library a few months ago, but so far it was not available to end users from the PerfView GUI/command line.

Yesterday, a new version of PerfView was released with the ability to export to the speedscope file format. So now PerfView users can use speedscope.app to view their performance profiles and take advantage of all the goodness it offers!

The Story

Vance Morrison, the .NET Performance Architect, emailed me and asked if I would like to implement web-based flame graphs for our non-Windows user story, as I was the guy who implemented them for PerfView a while back.

I knew that an app which does exactly what we needed already existed. I did not remember the name, but I remembered that I had retweeted something about it, so I opened the list of my retweeted posts and found it quickly.

I made sure that the app does not upload the data anywhere, has an MIT license, is actively maintained, has a self-contained version and can handle non-trivial files.

To my surprise, convincing Vance to use it was easy ;)

I did not want to introduce another file format so I decided to export the data to a very simple speedscope JSON-based file format (spec).

I read the file format specification, wrote the tests first and made it work. Handling all edge cases was a lot of fun!

How it works

speedscope.app is a single page application that works with any modern web browser. It supports plenty of file formats, including perf, pprof, chrome and firefox. The profiles are not uploaded anywhere!

Every Trace File contains a huge number of samples. A sample is more or less a call stack captured by the profiler. The job of SpeedScopeExporter is to group the samples by Threads, make sure they don’t overlap in time, translate the frame ids to method names and save them in the format and order expected by speedscope.
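The grouping and translation steps can be sketched as follows. This is a simplified illustration and not the real SpeedScopeExporter code: the Sample and ResolvedSample records and the SampleGrouper class are hypothetical, the overlap handling is left out, and nothing is written in the speedscope JSON format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified sample: when it was taken, on which thread,
// and the call stack expressed as frame ids (root first).
public record Sample(int ThreadId, double TimeMs, int[] FrameIds);

// The same sample with frame ids translated to method names.
public record ResolvedSample(double TimeMs, string[] Stack);

public static class SampleGrouper
{
    // Groups samples per thread, orders each group by time
    // and resolves frame ids to method names.
    public static Dictionary<int, List<ResolvedSample>> GroupByThread(
        IEnumerable<Sample> samples, IReadOnlyDictionary<int, string> methodNames)
        => samples
            .GroupBy(sample => sample.ThreadId)
            .ToDictionary(
                group => group.Key,
                group => group
                    .OrderBy(sample => sample.TimeMs)
                    .Select(sample => new ResolvedSample(
                        sample.TimeMs,
                        sample.FrameIds
                            // unknown ids are kept visible instead of being dropped
                            .Select(id => methodNames.TryGetValue(id, out var name) ? name : $"?{id}")
                            .ToArray()))
                    .ToList());
}
```

Each dictionary entry then corresponds to one thread "tab" in the speedscope UI.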

How to use it?

If you want to export a trace file from PerfView, you need to open it, load the symbols, filter and from the “File” menu choose “Save View As” and then select “Speed Scope Format” from the combo box.

If you want to export a .NET Trace File to the speedscope file format without using PerfView, you either have to use the TraceEvent library yourself or wait until the dotnet collect diagnostic tool merges the support for it.

Once you have the .speedscope.json file you just need to open it with speedscope.app. You can also download a self-contained version from https://github.com/jlfwong/speedscope/releases.

Demo

I have profiled the following C# app with PerfView and exported the trace file to .speedscope.json file. For brevity, I am going to skip the PerfView introduction here. You can find the JSON file here.

class Program
{
    static void Main(string[] args)
    {
        A(); B(); A(); B(); A(); B(); A(); B(); A();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void A()
    {
        for (int i = 0; i < 500_000_000; i++) { }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void B()
    {
        for (int i = 0; i < 500_000_000; i++) { }
    }
}

Time Order View

In the “Time Order” view, call stacks are ordered chronologically. This sets it apart from Flame Graphs, because it allows us to understand the behavior of an application over time.

Left Heavy

The “Left Heavy” is a reverse Flame Graph. The data is aggregated, not over time.

Sandwich

The Sandwich view is a table view with all methods from the profile and their associated times. You can sort by self time (exclusive time) or total time (inclusive time).

When you click on one of the methods, you can see all the callers and callees of it. The app is so intuitive that it almost does not need any docs!

Limitations

It’s not possible to show data from multiple threads running at the same time on a single view, so every Thread has its own “tab” and you can switch between the threads using the arrows:

Moreover, the app normalizes the relative time for every Thread. If we export a profile for Thread A that did some work between 0-200 ms and Thread B that did some work between 100-110 ms, the app will show the start time as 0 ms for both of them and Thread B activity will be represented as between 0 ms and 10 ms (not 100-110 ms). I was thinking about generating a 1e-10 ms long event at time 0 ms for every Thread, but then the app would not scale the UI so nicely.
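The effect of that normalization can be reproduced with a few lines of code. This is just a sketch of the arithmetic described above, not the code speedscope actually runs; TimeNormalizer is a hypothetical name.

```csharp
using System;
using System.Linq;

// Hypothetical helper that mimics the per-thread normalization.
public static class TimeNormalizer
{
    // Shifts all timestamps so that the earliest one becomes 0.
    public static double[] Normalize(double[] timestampsMs)
    {
        if (timestampsMs.Length == 0)
            return Array.Empty<double>();

        double start = timestampsMs.Min();
        return timestampsMs.Select(timestamp => timestamp - start).ToArray();
    }
}
```

For Thread B from the example, Normalize(new[] { 100.0, 105.0, 110.0 }) yields 0, 5 and 10, so the information that the thread started 100 ms after Thread A is lost.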

If we don’t have any profile information for a given period of time, we have nothing to show in the “Time Order View”. It’s important to remember that tracing in .NET captures only the call stacks of active Threads, so any blocking IO will be represented as a blank space. In the future, I might use the data from OS/.NET Runtime events to fill this space.

Sample usage - .NET Core Process Startup Time

Using the new tool we can find out how long it takes to start a .NET Core process and see exactly what happens, and in what order, during the startup.

To do that, we need to create a “Hello World” .NET Core app first.

dotnet new console

Then tell PerfView to run the following command:

dotnet run -c Release

Disable grouping and folding, load all the symbols (see my previous blog post to learn how to do it) and export to speedscope format:

What is the cost of “Hello World” compared to the .NET VM Startup?

What took so long? Let’s zoom in and find out!

Kudos

Jamie Wong is the author of speedscope.app who deserves all the credit! I just wrote a simple exporter which allows us to use his awesome tool!

Profiling Concurrent .NET Code with BenchmarkDotNet and visualizing it with Concurrency Visualizer

2019-01-10T00:00:00+00:00

Concurrency Visualizer Profiler

ConcurrencyVisualizerProfiler is the new diagnoser for BenchmarkDotNet that I implemented some time ago. It was released as part of 0.11.3. It can profile the benchmarked .NET code on Windows and export the data to a trace file which can be opened with Concurrency Visualizer (a plugin for Visual Studio that used to be a part of it).

Again with a single config!

Demo

The following code is a real-world benchmark from the ML.NET repository:

[ConcurrencyVisualizerProfiler] // !!! use the new diagnoser!!
public class MultiClassClassificationTrain
{
    [Benchmark]
    public void CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron()
    {
        string cmd = @"CV k=5  data=" + _dataPath_Wiki +
            " tr=OVA{p=AveragedPerceptron{iter=10}}" +
            " loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}" +
            " xf=Convert{col=logged_in type=R4}" +
            " xf=CategoricalTransform{col=ns}" +
            " xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor=NGramExtractorTransform{ngram=2}}" +
            " xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D}" +
            " xf=Concat{col=Features:FeaturesText,FeaturesWordEmbedding,logged_in,ns}";

        using (var environment = EnvironmentFactory.CreateClassificationEnvironment<TextLoader, CategoricalTransform, AveragedPerceptronTrainer>())
        {
            Maml.MainCore(environment, cmd, alwaysPrintStacktrace: false);
        }
    }
}

The regular output (last two lines are the most important here):

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.107 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-alpha1-009697
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-GAEOXF : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

// * Diagnostic Output - ConcurrencyVisualizerProfiler *
Exported 1 CV trace file(s). Example:
c:\Projects\mldotnet\test\Microsoft.ML.Benchmarks\BenchmarkDotNet.Artifacts\20181120-1225-23644\netcoreapp2.1\Microsoft\ML\Benchmarks\MultiClassClassificationTrain\CV_Multiclass_WikiDetox_WordEmbeddings_SDCAMC.CvTrace
DO remember that this Diagnoser just tries to mimic the CVCollectionCmd.exe and you need to have Visual Studio with Concurrency Visualizer plugin installed to visualize the data.

Method	Mean	StdDev
CV_Multiclass_WikiDetox_WordEmbeddings_SDCAMC	69.84 s	2.608 s

And the new trace file opened with Concurrency Visualizer:

The Story

Recently I have been working on improving the performance of ML.NET (you can read more about it in my previous blog post). I wanted to understand the performance characteristics over time and I knew that ML.NET does most things in parallel. A Flame Graph is an aggregated form, not broken down over time and per CPU, so I could not use it to visualize the data. A few years ago I attended a presentation where Sasha Goldshtein was using Concurrency Visualizer to easily show which Thread was allocating managed memory and triggering Garbage Collection. I remembered that this is the right tool for visualizing concurrent .NET code.

So I just downloaded it from Visual Studio Market Place, read the docs and watched some Channel 9 videos about it. (Personal recommendation: don’t ask for permission to read the docs or watch some training videos. This is part of doing the job right, not an extra task which can be omitted)

I started using it, but I did not like the fact that I had to do it manually every time I wanted to run some benchmark:

With a quick web search I was able to find a command line tool that can do it for me: Concurrency Visualizer command-line utility aka CVCollectionCmd.exe

So I switched from the VS GUI to this command line tool. But again, my process was not fully automated and I was losing time doing all this manually.

And then I asked myself two questions: how does CVCollectionCmd.exe work? Can I create a BenchmarkDotNet diagnoser out of it?

Reverse Engineering

I am now working for Microsoft so I had two options:

send an email to some discussion group and ask who owns the tool and could explain to me how it works
do some Reverse Engineering and find it out on my own

Since I don’t like sending emails (in general) or asking for help when I can find the answer in a short time on my own, I decided to use the debugger to attach to CVCollectionCmd.exe and just step into some methods to see how it works and what it generates.

To my surprise, the code was very clean and very well structured, so finding out how it works was really easy!

So how does CVCollectionCmd.exe work? It creates two ETW sessions (kernel and user), enables some ETW providers and simply collects the data. After the tracing is done, it creates a simple XML file with some basic info for the Concurrency Visualizer: process id, used providers and paths to both trace files.

So what I did next was implement a new BenchmarkDotNet diagnoser that does exactly the same thing ;)

How it works

ConcurrencyVisualizerProfiler uses EtwProfiler, which can be customized to enable requested ETW providers and profile the code.

Before the process with the benchmarked code is started, EtwProfiler starts User and Kernel ETW sessions. Every session writes data to its own file and captures different data. The User session listens for the .NET Runtime events (TPL, ThreadPool, ParallelLinq etc) while the Kernel session gets CPU stacks, context switches, IO events and some more. After this, the process with the benchmarked code is started. During the benchmark execution all the data is captured and written to the trace files. Moreover, the BenchmarkDotNet Engine emits its own events to be able to differentiate jitting, warmup, pilot and actual workload when analyzing the trace file. When the benchmarking is over, both sessions are closed and the two trace files are merged into one.

After both sessions are merged into a single file, ConcurrencyVisualizerProfiler emits an XML file with all the data relevant for Concurrency Visualizer (the Visual Studio plugin). The .CVTrace file name is reported by the diagnoser; you can find it in the BenchmarkDotNet output:

// * Diagnostic Output - ConcurrencyVisualizerProfiler *
Exported 1 CV trace file(s). Example:
Full_path_ommited_for_brevity.CvTrace
DO remember that this Diagnoser just tries to mimic the CVCollectionCmd.exe and you need to have Visual Studio with Concurrency Visualizer plugin installed to visualize the data.

The difference between the trace files produced by CVCollectionCmd.exe and my new Diagnoser is that the trace files produced by the diagnoser can also be opened with PerfView and Windows Performance Analyzer. It was just a matter of correct naming of the file ;)

Limitations

What we have today comes with the following limitations:

ConcurrencyVisualizerProfiler works only on Windows
It requires running as Admin (to create the ETW Kernel Session)
No InProcessToolchain support
To get the best possible managed code symbols you should configure your project in the following way:

<DebugType>pdbonly</DebugType>
<DebugSymbols>true</DebugSymbols>

How to use it?

You need to install the latest BenchmarkDotNet.Diagnostics.Windows package. The diagnoser can be enabled in a few ways, including:

Use the new attribute (apply it on a class that contains Benchmarks):

[ConcurrencyVisualizerProfiler]
public class TheClassThatContainsBenchmarks { /* benchmarks go here */ }

Extend DefaultConfig.Instance with a new instance of ConcurrencyVisualizerProfiler:

class Program
{
    static void Main(string[] args) 
        => BenchmarkSwitcher
            .FromAssembly(typeof(Program).Assembly)
            .Run(args,
                DefaultConfig.Instance
                    .With(new ConcurrencyVisualizerProfiler())); // HERE
}

Passing -p CV or --profiler CV command line argument to BenchmarkSwitcher

How to open the .CVTrace file in Visual Studio

After installing Concurrency Visualizer from Visual Studio Market Place you just need to go to: Analyze -> Concurrency Visualizer -> Open Trace

Sample usages

CPU Utilization

Using the new diagnoser to find out what is the CPU utilization for BinaryTrees_5 benchmark in the dotnet/performance repository.

dotnet run -c Release -f netcoreapp2.1 --filter *BinaryTrees_5* --profiler CV

Note: As you can see, we are using 2 - 2.5 CPUs out of twelve!

Synchronization

Using the new diagnoser to find out what % of time is spent for synchronization for SpectralNorm_3 benchmark in the dotnet/performance repository.

dotnet run -c Release -f netcoreapp2.1 --filter *spectralnorm_3* --profiler CV

Note: 90% of the time is spent for synchronization.

Special Thanks

I wanted to thank Wojciech Nagórski who has fixed two bugs (#962, #958) that previously required all BenchmarkDotNet.Diagnostics.Windows users to use some weird workarounds to get it working. Thanks to Wojciech all you need to do is to install the package!

Summary

Concurrency Visualizer is a very powerful tool that can visualize concurrent code in a user-friendly way. It’s a plugin for Visual Studio which can be downloaded for free from Visual Studio Market Place.

I really encourage you to read the docs and watch some Channel 9 videos and give it a try.

With the new BenchmarkDotNet feature you can get the trace file by running a single command from the console!

Sample performance investigation using BenchmarkDotNet and PerfView

2018-11-14T00:00:00+00:00

Introduction

Part of my job on the .NET Team is to improve the performance of existing .NET libraries. My current goal is to identify performance bottlenecks in ML.NET and recognize common performance issues that should be addressed by .NET framework.

In this blog post, I am describing how I approach a sample performance problem using available free .NET tools and best practices for performance engineering.

Benchmark

The first thing I need is a good benchmark which tests the performance of the feature that I care about. By good benchmark, I mean something that measures only the thing that I am interested in and produces accurate, stable and repeatable results.

ML.NET repository has many benchmarks and it’s already using a very good tool for benchmarking (yes, it’s of course BenchmarkDotNet ;) ).

The first thing I do is run all of the real-life scenario benchmarks, order them by time (descending) and importance (based on the info from the manager) and choose the top one.

Why do I choose only the real-life scenario benchmarks? Because I want to improve the end user experience. I don’t want to improve a micro-benchmark which tests only a piece of the end product.
Why do I take the most time-consuming benchmark? Because the longer it takes to execute some code, the more performance issues it probably has. I don’t have an infinite amount of time and I want to make an impact.
Why do I focus on the scenarios pointed out by the manager? Because the manager knows what is important for our customers. If I solve a perf issue in a scenario that nobody cares about, it’s not worth too much ;)

So in my case the benchmark that I decided to focus on is CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron which looks like this:

[Benchmark]
public void CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron()
{
    string cmd = @"CV k=5  data=" + _dataPath_Wiki +
        " tr=OVA{p=AveragedPerceptron{iter=10}}" +
        " loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}" +
        " xf=Convert{col=logged_in type=R4}" +
        " xf=CategoricalTransform{col=ns}" +
        " xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor=NGramExtractorTransform{ngram=2}}" +
        " xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D}" +
        " xf=Concat{col=Features:FeaturesText,FeaturesWordEmbedding,logged_in,ns}";

    using (var environment = EnvironmentFactory.CreateClassificationEnvironment<TextLoader, CategoricalTransform, AveragedPerceptronTrainer>())
    {
        Maml.MainCore(environment, cmd, alwaysPrintStacktrace: false);
    }
}

Profiler

A benchmark can only tell me how long it takes to execute a given piece of code. What I also need is a profiler to find out which methods are being executed and for how long. In my case, I am going to use EtwProfiler, which is a BenchmarkDotNet plugin that uses ETW for profiling.

Run the benchmark before applying any changes

To choose the benchmark I am using the --filter option; to use EtwProfiler, just --profiler ETW. Moreover, I want to store the results in a dedicated folder to be able to compare them later, after I apply some improvements. For this purpose I am using --artifacts.

I don’t want the benchmark or profile to include any noise, so I close ALL applications except a single command line window!

And run the following command:

dotnet run -c Release -- --filter *WikiDetox_WordEmbeddings_OVAAveragedPerceptron --profiler ETW --artifacts .\BenchmarkDotNet.Artifacts\before

The regular output:

BenchmarkDotNet=v0.11.2, OS=Windows 10.0.17134.345 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
Frequency=3507503 Hz, Resolution=285.1031 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-009697
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-OXDQNP : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp2.1  IterationCount=1  
LaunchCount=3 RunStrategy=ColdStart  

Method	Mean	StdDev
CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron	286.7 s	5.650 s

And the path to the trace file with the profile information:

// * Diagnostic Output - EtwProfiler *
Exported 1 trace file(s). Example:
C:\Projects\mldotnet\test\Microsoft.ML.Benchmarks\BenchmarkDotNet.Artifacts\20181109-0426-20620\netcoreapp2.1\Microsoft\ML\Benchmarks\MultiClassClassificationTrain\CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron.etl

Analysing the Trace File

To analyse the data from the trace file I am using PerfView, which is a free .NET profiler from Microsoft. If you are not familiar with PerfView you should watch these instructional videos first. These videos were recorded by the .NET Performance Architect and PerfView creator, Vance Morrison. Trust me, it’s really worth watching these videos!

The first thing I need to do is to open the trace file in PerfView and choose “dotnet --benchmarkName (…)” from the CPU Stacks window (PerfView sorts the processes descending by CPU consumption):

The trace file contains symbols for the managed methods emitted by CLR during CLR Rundown. However, it does not contain the native method symbols. But don’t worry! PerfView noticed that the .pdb file with native symbols is stored on my disk. I have just built this file and I trust it, so I click Yes.

In my previous blog post, I have described how to use PerfView to filter Trace files produced by BenchmarkDotNet to set the time range to the actual benchmark execution. You can read it here. In this particular benchmark, I don’t set the time range because the benchmark is executed just once and moreover I do care about CLR startup. So I am interested in the entire process lifetime. When the benchmark is executed many times I filter the trace to a single benchmark iteration as described in the previous blog post.

Now I go directly to the FlameGraph tab to get some quick overview:

Is it all I need? No! PerfView by default groups the data by modules. I disable the module grouping (GroupPats = [no grouping])

But where is the missing data? Most probably hidden by Folding. So let’s set Fold%=0

By hovering the mouse over the flame boxes I can see that the code is multi-threaded. And even with Flamegraph, it’s hard to read! So let’s group the data by setting GroupPats = Thread ->AllThreads (mind the spaces!)

And let’s set folding to 1% again to get it human-friendly:

But as you might notice, some FlameBoxes contain ?! in their names. It means that we need to load the symbols for these methods. The easiest way of doing this is to go to the By name tab, and press Ctrl+A (select all) and then Alt+S (load symbols).

IMHO now I have a very good overview and I can start the analysis!

The Bottleneck

By just looking at the Flamegraph of entire process lifetime I could say that Parsing might be one of my bottlenecks (it’s the biggest box). But it’s not enough to identify a bottleneck.

When I switch to the “By name” tab I can see all the methods sorted descending by exclusive time (the ones that do the actual computations).

But the most important information is visible in the simple histogram:

What does this information tell us? That parsing (red block 2) is a performance bottleneck here! Moreover, as you can see, Flamegraph itself gives a great overview but does not tell us about the performance over time. This simple histogram does!

Isolate the bottleneck

The next step is to write a benchmark that isolates the bottleneck.

In my case it’s the following benchmark:

[Benchmark]
public WordEmbeddingsTransform CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse()
{
    string cmd = @"CV k=5  data=" + _dataPath_Wiki +
        " tr=OVA{p=AveragedPerceptron{iter=10}}" +
        " loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}" +
        " xf=Convert{col=logged_in type=R4}" +
        " xf=CategoricalTransform{col=ns}" +
        " xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor=NGramExtractorTransform{ngram=2}}" +
        " xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D}" +
        " xf=Concat{col=Features:FeaturesText,FeaturesWordEmbedding,logged_in,ns}";

    using (var environment = EnvironmentFactory.CreateClassificationEnvironment<TextLoader, CategoricalTransform, AveragedPerceptronTrainer>())
    {
        return new WordEmbeddingsTransform(
            environment,
            modelKind: WordEmbeddingsTransform.PretrainedModelKind.FastTextWikipedia300D,
            new WordEmbeddingsTransform.ColumnInfo("FeaturesText_TransformedText", "FeaturesWordEmbedding"));
    }
}

Now I again turn everything off and run the benchmark with EtwProfiler enabled. The results I get are:

Method	Mean
CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse	153.0 s

Analyse the bottleneck profile

After some filtering in PerfView we can see the following Flamegraph:

Explanation:

float.TryParse - 56% - it’s the cost of parsing a float, there is very little we can do about it (quickly)
NumberFormatInfo.CurrentInfo - 8% - anytime we call float.TryParse and do not provide the NumberFormatInfo, the framework calls NumberFormatInfo.CurrentInfo. We can easily read it once and pass it explicitly to save the 8%.
String.Split - 15% - we should not be using Split when we can do slicing with Span!
StreamReader.ReadLine - 12% - it’s the cost of reading a file, there is very little we can do about it (quickly)
String.Substring - 1% - again we should not be using Substring when we can do slicing with Span!

Make sure the code has unit test coverage

Before I apply any optimizations I need to make sure that I have some unit tests to not introduce any new bugs! I don’t know the product well and I don’t want to waste my time on manual testing. Moreover, having a test committed before the changes will make it more likely for the project maintainers to accept my optimizations in a PR.

So I just write the following tests first:

public class LineParserTests
{
    public static IEnumerable<object[]> ValidInputs()
    {
        foreach (var line in new string[]
        {
            "key 0.1 0.2 0.3", "key 0.1 0.2 0.3 ",
            "key\t0.1\t0.2\t0.3", "key\t0.1\t0.2\t0.3\t" // tab can also be a separator
        })
        {
            yield return new object[] { line, "key", new float[] { 0.1f, 0.2f, 0.3f } };
        }
    }

    [Theory]
    [MemberData(nameof(ValidInputs))]
    public void WhenProvidedAValidInputParserParsesKeyAndValues(string input, string expectedKey, float[] expectedValues)
    {
        var result = Transforms.Text.LineParser.ParseKeyThenNumbers(input);

        Assert.True(result.isSuccess);
        Assert.Equal(expectedKey, result.key);
        Assert.Equal(expectedValues, result.values);
    }

    [Theory]
    [InlineData("")]
    [InlineData("key 0.1 NOT_A_NUMBER")] // invalid number
    public void WhenProvidedAnInvalidInputParserReturnsFailure(string input)
    {
        Assert.False(Transforms.Text.LineParser.ParseKeyThenNumbers(input).isSuccess);
    }
}

Writing a test first is an investment in the future that pays off very quickly! I have never regretted writing a test, but the few times I didn’t write one I had to pay for it later.

Once I have the tests I move the existing parsing logic to a dedicated method:

public static (bool isSuccess, string key, float[] values) ParseKeyThenNumbers(string line)
{
    char[] delimiters = { ' ', '\t' };
    string[] words = line.TrimEnd().Split(delimiters);
    string key = words[0];
    float[] values = words.Skip(1).Select(x => float.TryParse(x, out var tmp) ? tmp : Single.NaN).ToArray();
    if (!values.Contains(Single.NaN))
        return (true, key, values);

    return (false, null, null);
}

Closer look

Let’s analyse the code from perf perspective line by line:

char[] delimiters = { ' ', '\t' }; - the array is allocated every time the method is called. It should be moved to a static readonly field. (In the original code it was allocated once per file, so it was not that bad.)
line.TrimEnd - this method allocates new string if the trimming is required
Split(delimiters) - this method allocates an array of strings and the strings themselves
words.Skip(1).Select(x => float.TryParse(x, out var tmp) ? tmp : Single.NaN).ToArray() - every LINQ method allocates an enumerator. Moreover ToArray allocates entire array. Typically it’s not an issue, but here every cycle matters (we are on a very hot path).
values.Contains(Single.NaN) - contains is O(n), it’s not required here. We should just stop when TryParse returns false.
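
A minimal sketch of the first point, assuming the field is called _delimiters as in the optimized code of the next section. The DelimiterSketch class and the SplitWords helper are hypothetical names used only for illustration, they are not part of ML.NET:

```csharp
using System;

public static class DelimiterSketch
{
    // Allocated once, when the type is initialized, instead of on every call.
    private static readonly char[] _delimiters = { ' ', '\t' };

    // Hypothetical helper: the same trimming and splitting as in the original code,
    // but without allocating a new delimiters array for every parsed line.
    public static string[] SplitWords(string line) => line.TrimEnd().Split(_delimiters);
}
```

TrimEnd and Split still allocate here. This sketch only removes the per-call array allocation, the remaining points from the list need the bigger changes described below.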

Apply the optimizations

If we target .NET Standard 2.0 we can’t use all the methods that accept Span like float.TryParse(ReadOnlySpan) so we just remove the LINQ and move NumberFormatInfo.CurrentInfo outside of the loop:

public static (bool isSuccess, string key, float[] values) ParseKeyThenNumbers(string line)
{
    if (string.IsNullOrWhiteSpace(line))
        return (false, null, null);

    string[] words = line.TrimEnd().Split(_delimiters);

    NumberFormatInfo info = NumberFormatInfo.CurrentInfo; // moved outside the loop to save 8% of the time
    float[] values = new float[words.Length - 1];
    for (int i = 1; i < words.Length; i++)
    {
        if (float.TryParse(words[i], NumberStyles.Float | NumberStyles.AllowThousands, info, out float parsed))
            values[i - 1] = parsed;
        else
            return (false, null, null); // fail as soon as something is wrong
    }

    return (true, words[0], values);
}

Which gives us the following result:

Method	Mean
CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse	141.0 s

Which is almost exactly the 8% we saved by moving NumberFormatInfo.CurrentInfo outside of the loop (153.0 s → 141.0 s is a 7.8% improvement). I am not happy about the fact that I had to use such a trick to make it faster, so I reported an issue in the JIT repo.

However, with .NET Standard 2.1 or just .NET Core 2.1+ we can take full advantage of Span!

public static (bool isSuccess, string key, float[] values) ParseKeyThenNumbers(string line)
{
    if (string.IsNullOrWhiteSpace(line))
        return (false, null, null);

    ReadOnlySpan<char> trimmedLine = line.AsSpan().TrimEnd(); // TrimEnd creates a Span, no allocations

    int firstSeparatorIndex = trimmedLine.IndexOfAny(' ', '\t'); // the first word is the key, we just skip it
    ReadOnlySpan<char> valuesToParse = trimmedLine.Slice(start: firstSeparatorIndex + 1);

    int valuesCount = 0; // we count the number of values first to allocate a single array of the proper size
    for (int i = 0; i < valuesToParse.Length; i++)
        if (valuesToParse[i] == ' ' || valuesToParse[i] == '\t')
            valuesCount++;

    float[] values = new float[valuesCount + 1]; // + 1 because the line is trimmed and there is no whitespace at the end
    int textStart = 0;
    int valueIndex = 0;
    NumberFormatInfo info = NumberFormatInfo.CurrentInfo; // moved outside the loop to save 8% of the time
    for (int i = 0; i <= valuesToParse.Length; i++)
    {
        if (i == valuesToParse.Length || valuesToParse[i] == ' ' || valuesToParse[i] == '\t')
        {
            var toParse = valuesToParse.Slice(textStart, i - textStart);

            if (float.TryParse(toParse, NumberStyles.Float | NumberStyles.AllowThousands, info, out float parsed))
                values[valueIndex++] = parsed;
            else
                return (false, null, null);

            textStart = i + 1;
        }
    }

    return (true, new string(trimmedLine.Slice(0, firstSeparatorIndex)), values);
}

Which gives us the following result:

Method	Mean
CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse	129.1 s

Does the code look cleaner? No! It’s harder to understand what it does. I have sacrificed readability for performance only because the gain was worth it. I don’t do it by default in every place of our app. You also should not.
Does it produce correct results? Yes, I have unit tests which verify the correctness.

Utf8Parser

The code sample with Span looks really complicated. It would be nice if .NET could provide some primitives to help with such scenarios. The good news is that .NET Core 2.1 has introduced a new type called Utf8Parser. Honestly, I forgot about its existence, but my teammate Tanner reminded me of it in code review. Thank you Tanner!

Using Utf8Parser simplifies my code a lot:

internal static (bool isSuccess, string key, float[] values) ParseKeyThenNumbers(ReadOnlySpan<byte> line)
{
    if (line.IsEmpty)
        return (false, null, null);

    int firstSeparatorIndex = line.IndexOfAny((byte)' ', (byte)'\t'); // the first word is the key, we just skip it
    ReadOnlySpan<byte> valuesToParse = line.Slice(start: firstSeparatorIndex + 1);

    float[] values = AllocateFixedSizeArrayToStoreParsedValues(valuesToParse);

    int toParseStartIndex = 0;

    for (int valueIndex = 0; valueIndex < values.Length; valueIndex++)
    {
        if (!Utf8Parser.TryParse(valuesToParse.Slice(start: toParseStartIndex), out float parsed, out int bytesConsumed))
            return (false, null, null);

        values[valueIndex] = parsed;
        toParseStartIndex += bytesConsumed + 1; // + 1 is for the whitespace!
    }

    return (true, Encoding.UTF8.GetString(line.Slice(0, firstSeparatorIndex)), values);
}

/// <summary>
/// we count the number of values first to allocate a single array of the proper size
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static float[] AllocateFixedSizeArrayToStoreParsedValues(ReadOnlySpan<byte> valuesToParse)
{
    int valuesCount = 0;

    for (int i = 0; i < valuesToParse.Length; i++)
        if (valuesToParse[i] == ' ' || valuesToParse[i] == '\t')
            valuesCount++;

    return new float[valuesCount];
}
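
To show how the bytesConsumed value drives the loop above, here is a minimal sketch that parses only the first number of a UTF-8 encoded text. Utf8ParserSketch and ParseFirstFloat are hypothetical names used for illustration, and encoding a string with Encoding.UTF8.GetBytes is only a convenience for the demo (in the real scenario the bytes would come straight from the file):

```csharp
using System;
using System.Buffers.Text;
using System.Text;

public static class Utf8ParserSketch
{
    // Parses a single float from the beginning of the UTF-8 encoded text.
    // Returns the parsed value and the number of bytes that the parser consumed.
    public static (bool isSuccess, float value, int bytesConsumed) ParseFirstFloat(string text)
    {
        // Demo only: this allocates a byte array, real code would already have the bytes.
        ReadOnlySpan<byte> utf8 = Encoding.UTF8.GetBytes(text);

        // The parser stops at the first byte that can't be part of the number
        // and reports how many bytes it has read so far.
        bool isSuccess = Utf8Parser.TryParse(utf8, out float value, out int bytesConsumed);

        return (isSuccess, value, bytesConsumed);
    }
}
```

For an input like "0.25 0.5" the sketch reports 4 consumed bytes, so the next value starts at bytesConsumed + 1, which is what the loop in ParseKeyThenNumbers relies on.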

What’s next?

Let’s take a look at the Flamegraph of our optimized method. We have:

float.TryParse - there is not really much we can do here without big changes
StreamReader.ReadLine - same as above

Are we done here? NO! Two minutes to parse a 6 GB text file is still too much. What can we do when we can’t optimize the single-threaded code any further? We parallelize it!

Parallel

After we squeeze the single-threaded code we can parallelize it. With System.Threading.Tasks.Parallel* and System.Collections.Concurrent* it’s really easy!

Before (some parts omitted for brevity):

using (StreamReader sr = File.OpenText(_modelFileNameWithPath))
{
    while ((line = sr.ReadLine()) != null)
    {
        (bool isSuccess, string key, float[] values) = LineParser.ParseKeyThenNumbers(line);

        if (isSuccess)
            model.AddWordVector(ch, key, values);
    }
}

After (again some parts omitted for brevity):

var parsedData = new ConcurrentBag<(string key, float[] values, long lineNumber)>();

Parallel.ForEach(File.ReadLines(_modelFileNameWithPath),
    (line, parallelState, lineNumber) =>
    {
        (bool isSuccess, string key, float[] values) = LineParser.ParseKeyThenNumbers(line);

        if (isSuccess)
            parsedData.Add((key, values, lineNumber));
    });

foreach (var parsedLine in parsedData.OrderBy(parsedLine => parsedLine.lineNumber))
    model.AddWordVector(ch, parsedLine.key, parsedLine.values);

And the new results, with a 3x speedup:

Method	Mean
CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse	39.89 s

Important:

I have used specialized Concurrent collection here that allows me to add items in a thread-safe way without locks. Adding manual synchronization would ruin the performance. Do use ConcurrentCollections, try to avoid using locks whenever you can!
The order of lines is important so after processing entire file I am ordering the elements by the line number and then adding to the model. I did not know that, but the existing unit test reminded me of that very quickly ;)
I did not want to create any extra memory pressure for the GC so I have used ValueTuple represented as (bool isSuccess, string key, float[] values). ValueTuple is a Value Type, you can read my previous blog post to learn more.
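
The pattern from the points above, reduced to a self-contained sketch. OrderedParallelSketch and ProcessInOrder are hypothetical names, and the parse delegate is just a stand-in for LineParser.ParseKeyThenNumbers, so treat it as an illustration of the idea rather than the actual ML.NET code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class OrderedParallelSketch
{
    // Processes the lines in parallel and returns the results in the original line order.
    public static List<string> ProcessInOrder(IEnumerable<string> lines, Func<string, string> parse)
    {
        // A ValueTuple is stored in the bag, so no extra object is allocated per line.
        var results = new ConcurrentBag<(string parsed, long lineNumber)>();

        // This overload passes the zero-based index of every element to the loop body.
        Parallel.ForEach(lines, (line, loopState, lineNumber) =>
        {
            results.Add((parse(line), lineNumber));
        });

        // ConcurrentBag is unordered, so the input order has to be restored before the results are used.
        return results
            .OrderBy(result => result.lineNumber)
            .Select(result => result.parsed)
            .ToList();
    }
}
```

The sorting at the end is the price for the parallel processing. It is worth paying only when the order matters, as it did for the existing unit test mentioned above.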

Time to send a PR

ML.NET is evolving very quickly. I definitely don’t want to have a long-living branch with many optimizations and solve merge conflicts every day, so I just create one PR per optimization. A small PR is also easier to review. And if I introduce a new bug it’s easier to find the single change that caused it.

Before I create the PR I also remove the temporary benchmark for file parsing bottleneck. Other benchmarks include it in the execution path and it takes a lot of time to run it. I want to keep the benchmark suite focused on ML.NET, without duplicates and with a short time to run entire suite.

If your benchmark suite contains duplicates and it takes a LOT of time to execute all of the benchmarks, the developers stop using it. You need to keep it simple and focused. My personal recommendation is that it should not take longer than a lunch break to run all of the benchmarks. Why? If I apply some changes I just run the benchmarks before I go to lunch and when I am back I have the results.

If it’s hard to run the benchmarks or it takes too long you can’t expect the developers to run the benchmarks and care for performance.

Summary

Do NOT try to guess what is the performance issue.
DO write tests first to save time and avoid introducing new bugs.
DO use a profiler to find out where the issues are.
DO use benchmarks to measure the improvements and compare different approaches.
Do NOT reinvent the wheel, .NET Framework most probably already has what you need.

Before:

Method	Mean
WikiDetox_WordEmbeddings_OVAAveragedPerceptron	286.7 s
WikiDetox_WordEmbeddings_SDCAMC	184.1 s

After:

Method	Mean
WikiDetox_WordEmbeddings_OVAAveragedPerceptron	174.24 s
WikiDetox_WordEmbeddings_SDCAMC	67.82 s

In my next blog post I am going to use Concurrency Visualizer for Visual Studio to continue this investigation until I get 100% CPU consumption on all Cores for processing this huge Utf8 text file.

Profiling .NET Code with BenchmarkDotNet

2018-09-28T00:00:00+00:00

ETW Profiler

EtwProfiler is the new diagnoser for BenchmarkDotNet that I have just finished. It was released as part of 0.11.2. It can profile the benchmarked .NET code on Windows and export the data to a trace file which can be opened with PerfView or Windows Performance Analyzer.

Again with a single config!

Demo

The following code is a real-world benchmark from the ML.NET repository:

[EtwProfiler(performExtraBenchmarksRun: false)] // !!! use the new diagnoser!!
public class RankingTrain
{
    private string _mslrWeb10k_Validate;
    private string _mslrWeb10k_Train;

    [GlobalSetup]
    public void Setup()
    {
        _mslrWeb10k_Validate = Path.GetFullPath(TestDatasets.MSLRWeb.validFilename);
        _mslrWeb10k_Train = Path.GetFullPath(TestDatasets.MSLRWeb.trainFilename);
    }

    [Benchmark]
    public void FastTree()
    {
        string cmd = @"TrainTest test=" + _mslrWeb10k_Validate +
            " eval=RankingEvaluator{t=10}" +
            " data=" + _mslrWeb10k_Train +
            " loader=TextLoader{col=Label:R4:0 col=GroupId:TX:1 col=Features:R4:2-138}" +
            " xf=HashTransform{col=GroupId} xf=NAHandleTransform{col=Features}" +
            " tr=FastTreeRanking{}";

        using (var environment = EnvironmentFactory.CreateRankingEnvironment<RankerEvaluator, TextLoader, HashTransformer, FastTreeRankingTrainer>())
        {
            Maml.MainCore(environment, cmd, alwaysPrintStacktrace: false);
        }
    }
}

The regular output:

BenchmarkDotNet=v0.11.1.755-nightly, OS=Windows 10.0.17134.285 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
Frequency=3507505 Hz, Resolution=285.1029 ns, Timer=TSC
.NET Core SDK=2.2.100-preview2-009404
  [Host] : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT
  Dry    : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT

// * Diagnostic Output - EtwProfiler *
Exported 1 trace file(s). Example:
"C:\Projects\machinelearning\test\Microsoft.ML.Benchmarks\BenchmarkDotNet.Artifacts\Microsoft\ML\Benchmarks\RankingTrain\FastTree.etl"

Method	Mean	Error	StdDev
FastTree	32.48 s	1.347 s	0.0761 s

And the new trace file opened with PerfView:

The Story

Recently I have been working on porting all of the 3 000+ CoreFX and CoreCLR benchmarks from xunit-performance to BenchmarkDotNet. My job was to port all of the benchmarks, compare the results, fix the bugs and last but not least implement missing features. EtwProfiler is one of the things that were present in xunit-performance, but not in BenchmarkDotNet.

Initially I was sceptical about this idea, because profiling a running benchmark is an easy job. However, with the number of benchmarks we have, automating it was a must-have.

And now I am very happy about the outcome!

How it works

EtwProfiler uses the TraceEvent library, which internally uses Event Tracing for Windows (ETW) to capture stack traces and important .NET Runtime events.

Before the process with benchmarked code is started, EtwProfiler starts User and Kernel ETW sessions. Every session writes data to its own file and captures different data. The User session listens for the .NET Runtime events (GC, JIT etc.) while the Kernel session gets CPU stacks and Hardware Counter events. After this, the process with benchmarked code is started. During the benchmark execution all the data is captured and written to a trace file. Moreover, the BenchmarkDotNet Engine emits its own events to be able to differentiate jitting, warmup, pilot and actual workload when analyzing the trace file. When the benchmarking is over, both sessions are closed and the two trace files are merged into one.

Stopping the sessions after process exit was very important because CLR emits all the symbol information as part of the CLR Rundown.

Limitations

What we have today comes with the following limitations:

EtwProfiler works only on Windows (one day I might implement similar thing for Unix using EventPipe)
It requires running as Admin (to create the ETW Kernel Session)
No InProcessToolchain support
To get the best possible managed code symbols you should configure your project in the following way:

<DebugType>pdbonly</DebugType>
<DebugSymbols>true</DebugSymbols>

How to use it?

You need to install the latest BenchmarkDotNet.Diagnostics.Windows package. The profiler can be enabled in a few ways, for example:

Use the new attribute (apply it on a class that contains Benchmarks):

[EtwProfiler]
public class TheClassThatContainsBenchmarks { /* benchmarks go here */ }

Extend the DefaultConfig.Instance with new instance of EtwProfiler:

class Program
{
    static void Main(string[] args) 
        => BenchmarkSwitcher
            .FromAssembly(typeof(Program).Assembly)
            .Run(args,
                DefaultConfig.Instance
                    .With(new EtwProfiler())); // HERE
}

Pass the -p ETW or --profiler ETW command line arguments to BenchmarkSwitcher

Configuration

To configure the new diagnoser you need to create an instance of EtwProfilerConfig class and pass it to the EtwProfiler constructor. The parameters that EtwProfilerConfig ctor takes are:

performExtraBenchmarksRun - if set to true, benchmarks will be executed one more time with the profiler attached. If set to false, there will be no extra run but the results will contain overhead. True by default.
bufferSizeInMb - ETW session buffer size, in MB. 256 by default.
cpuSampleIntervalInMiliseconds - the rate at which CPU samples are collected. By default this is 1 (once a millisecond per CPU). There is a lower bound on this (typically 0.125 ms).
intervalSelectors - interval per hardware counter, if not provided then default values will be used.
kernelKeywords - kernel session keywords, ImageLoad (for native stack frames) and Profile (for CPU Stacks) are the defaults.
providers - providers that should be enabled, if not provided then default values will be used.
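
A minimal sketch of such a configuration, assuming the 0.11.2-era API used in the samples above (the With extension method) and that both types live in the BenchmarkDotNet.Diagnostics.Windows namespace of the package with the same name. Only the two parameters named in the list above are set, the buffer size is an arbitrary illustrative value:

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnostics.Windows;
using BenchmarkDotNet.Running;

class Program
{
    static void Main(string[] args)
    {
        var profilerConfig = new EtwProfilerConfig(
            performExtraBenchmarksRun: false, // no extra run, the reported results include the profiler overhead
            bufferSizeInMb: 512);             // illustrative value, the default is 256

        BenchmarkSwitcher
            .FromAssembly(typeof(Program).Assembly)
            .Run(args, DefaultConfig.Instance.With(new EtwProfiler(profilerConfig)));
    }
}
```

Named arguments keep the call readable and leave all the remaining parameters at their defaults.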

Using PerfView to work with trace files

PerfView is a free .NET profiler from Microsoft. If you don’t know how to use it you should watch these instructional videos first.

If you are familiar with PerfView, then the only thing you need to know is that BenchmarkDotNet performs Jitting by running the code once, a Pilot Experiment to determine how many times the benchmark should be executed per iteration, a non-trivial Warmup and the Actual Workload. This is why when you open your trace file in PerfView you will see your benchmark in a few different places of the StackTrace.

The simplest way to filter the data to the actual benchmark runs is to open the CallTree tab, put “EngineActualStage” in the Find box, press enter and when PerfView selects EngineActualStage in the CallTree press Alt+R to Set Time Range.

If you want to filter the trace to single iteration, then you must go to the Events panel and search for the WorkloadActual/Start and WorkloadActual/Stop events.

Open Events window
Put “WorkloadActual” in the Filter box and hit enter.
Press control or shift and choose the Start and Stop events from the left panel. Hit enter.
Choose iteration that you want to investigate (events are sorted by time).
Select two or more cells from the “Time MSec” column.
Right click, choose “Open Cpu Stacks”.
Choose the process with benchmarks, right-click, choose “Drill Into”

Special Thanks

I wanted to thank:

Jose Rivero who implemented this feature for xunit-performance and reviewed my code. I took a lot from his code.
Brian Robbins for explaining to me how CLR Rundown works.
Vance Morrison for immediate release of TraceEvent with bug fixes in the area that was touching private Windows APIs.
Andrey Akinshin for reviewing the PR and pushing me to write the docs. Without Andrey I would not have written this blog post ;)

My way of Conducting an Interview

2018-09-03T00:00:00+00:00

Interviewing people is not an easy job to do. You want to find the person who is going to get things done, enjoy working on a given project, fit into the team and be happy about the money you can offer.

As an interviewer, you are also being judged by the candidate. You very often create the first impression of the company. So you also need to make a good impression. Nobody wants to work with mean or incompetent people!

In this blog post, I am describing my way of conducting the interview. In my career, I have interviewed a hundred developers and hired over a dozen of them. So my experience is not very rich, it’s limited to “my sample”.

Disclaimer: After joining Microsoft I don’t interview candidates anymore. This post is my personal approach built upon the experience prior to joining MS.

I hope that my experience can help somebody to improve the interviewing process!

Evolution

My interviewing style has evolved over the years. Initially, I was focused on asking very strict technical questions. Some examples:

what are the differences between Value and Reference Types?
how does the GC work?
what is the difference between DROP TABLE and TRUNCATE TABLE?

But I very soon realized that the fact that somebody can answer similar questions does not mean that she or he can get things done.

Also, the fact that somebody does not know the answers does not mean that he or she can’t search for them and learn fast when needed.

Homework

Before I start interviewing I do the homework: I read the candidate’s CV and mark the things that I want to talk about. If I don’t read the CV before the interview, and during the interview I am surprised by things that were stated in the CV, it is just disrespectful.

Interviewer: I did not know that you did not graduate in Computer Science.
Candidate: But I have described my education in the resume.
Next: An awkward moment of silence.

Find out as much as possible about the project that you are interviewing for. Is it some kind of a rocket science? Or maintenance? Or simple CRUD? What technology? Does it require a lot of travel?

You need to find a person that is going to fit the project and the team. If you know too little about the project it’s going to be hard or impossible. Be prepared!

Relax!

Candidates are typically very nervous at the beginning of the interview. If you start asking hard questions to a person who barely breathes and just wants to run away from the room you won’t get good answers.

So as an interviewer I always focus first on chilling out the candidate. I start with some Chit Chat about some positive things. An example:

The weather outside sucks. I need to go for a holiday to recharge. What was your recent holiday destination?

After that, we might talk for a few minutes about holidays. The candidate just needs to start talking!

If the candidate has not been on holidays for years you can say what the company has to offer. An example:

We offer 30 days of fully paid holidays. We develop products for our internal purpose, so there are no super-strict deadlines and you can take a week off anytime you want to.

I keep the conversation positive and informal. I continue to the next stage when I feel that the candidate is not nervous anymore.

And yes my dear US readers, 30 or even 35 days of fully paid holidays is totally possible in Europe. The same goes for unlimited sick days.

Warmup

In the beginning, I ask some simple, but very important questions. I also say it loud and clear that I am searching for a good fit for the project and I am expecting honest answers.

What’s your favorite thing about programming?
What are the things about programming that you don’t like?
Could you describe your dream job?

Some candidates say that they can do anything. In such a case, I ask if they would be happy to debug some old Java scripts in Oracle Bus or migrate a relational database with no documentation to a NoSQL cloud database in two weeks. I need to make sure that they understand that I am asking these questions to avoid putting them on a project they are not going to enjoy.

The answers help me to understand if a given person can be a good fit for the project and the position that I am recruiting for.

If the candidate is a good software engineer but not a good fit for this particular project I offer a different project or just stay in touch. Otherwise, if I hire that person then she or he will not be satisfied and probably just leave. Everybody is going to lose a lot of time, and the company a lot of money. Don’t be selfish, think future-wise.

The funniest answer I ever got:

Me: Could you describe your dream job?
Candidate: I would like to be a Team Leader.
Me: Why?
Candidate: Because I would like to make important decisions.
Another interviewer: Would you also like to keep the team motivated, help with planning and estimations?
Candidate: No. Just making the important decisions.

Learning

It’s very hard to find a perfect candidate who is familiar with a specific product and technology. Moreover, the requirements change over time, so in my opinion the most important thing is the ability to learn. So I ask a LOT about learning.

What is your favorite way of learning new things?
When was the last time you learned something new? What was it? How did you apply it at work?
Do you have some gurus? Who are they and why do you value them?
When was the last time you did not agree with some blog post/book/video? What was that and why did you not agree?
When was the last time you have changed one of your programming habits? What was that and why?
When was the last time when you shared the knowledge with somebody else? Did you like it? Why did you do that?

The answers help me to understand whether a given person enjoys learning new things and is self-motivated, or needs a manager to tell him/her to read a book. I also want to understand the learning process: how they validate information and use it in practice.

This is also the moment when good software engineers are fully relaxed and enjoy the conversation. So I can move on to the next part.

Problem solving and decision making

If I got here it means that the candidate can be a good fit and enjoys learning new things. So what I need to check is problem-solving and decision making. Of course, it’s impossible to fully test that during the interview, but I can at least try.

Here, I start with something like: Could you tell me about your last big assignment that you were working on? What were you supposed to implement, what were the steps that you took and which technology did you choose? Keep the company secrets for yourself, I just want to understand the way you approach and solve problems.

From here I let the candidate talk and later I ask a lot of questions:

Why did you have to do that?
How did you start the task? Did you do some research? Did you talk with the customer?
Which technology and/or components did you choose and why?
When did you write tests?
Was there anything that you could do better?

I ask as many questions as needed. Here I very often go deep into the technical details. I ask deep technical questions about the tools and components used by the candidate to understand whether the candidate knows how they work. If the candidate does not answer some questions I say something like: Don’t be afraid, it’s perfectly fine that you don’t know something. I am just testing your knowledge. I don’t expect you to know everything. Of course, sometimes it’s just a lie, when the person fails to answer a simple question. It’s important not to freak out the candidate!!

I want to hire people who want to understand problems before solving them. Who choose the best tool for a given problem. Who write tests first to make sure everything works fine and the problems don’t come back. Those who understand how the technology they use works.

Very often the initial answer is very promising but the problem-solving skills are bad. One such interview of mine:

Me: Could you tell me about your last big assignment that you were working on?
Candidate: I have been working on improving the performance of our critical system. I have improved it by 20%. (sounds cool!)
Me: How did you start? Did you do some research?
Candidate: I instrumented every method call in the code with stopwatch and logged the output to the logs.
Me: How did you do that?
Candidate: I manually added stopwatch to every public method.
Me: Why didn’t you use a profiler? Did you run out of licenses?
Candidate: What is a profiler?
Me: Nothing important, I was just curious. (lie)
Me: How did you find out which methods were taking most of the time?
Candidate: I just read all of the log files.
Me: In which environment were you testing the performance?
Candidate: I copy pasted the instrumented dlls to a live system of one of our customers who was complaining about perf.
Me: Did you create a backup of the app first?
Candidate: Yes (I guess it was a lie too).

So the candidate somehow managed to solve the problem, but did not use the right tool and, moreover, did not do the right research. It took a lot of time and risked issues in production.

When a smart developer faces a new problem then she or he starts the web browser and performs a search. Chooses the right tool and gets the job done. If the problem is more complex it might require reading a book, discussing it with other teammates, architect or tech lead.

Failure

This is simple. I just ask about the most recent failure. What was that? Why did it happen? What did you learn from it? What did you do to make sure the problem does not occur again?

I want to hire people who can acknowledge that they did something wrong and learn from their own mistakes. Nobody wants to work with narcissists.

People are typically afraid to answer this question so I encourage them by talking about my recent failure. This helps a lot and opens up most of the candidates.

My recent assignment

I never ask the candidates to write code during the interview. Instead, I describe my recent assignment to them and ask how they would solve it. I ask about my real task because I want to check how they would approach a real problem. I also know a lot about possible solutions because I always do thorough research. It also helps me to make every interview unique.

When the candidate answers “I would just google for it”, I give my laptop to the candidate and let her/him search and read for a few minutes. I check what they put into the search engine. Most of the candidates are very surprised when I do that. But I am searching for people who can solve new problems, not artificial tasks from the Cracking the Coding Interview book. And solving new problems typically requires some web searching.

Money

When it comes to the money, I never offer a lowball salary or overpay when compared to the other team members. People talk to each other, do the math or a web search, and once they find out that they earn much less than others on the same level they get angry and leave. The company loses a lot of money, more than a lowball offer could save. It’s just a matter of time!

Summary

Talk about something positive and not related to interviewing to relax the candidate.
Ask about what the candidate really wants to do and is not willing to do. Is the given position and project a good fit?
Ask about learning because learning is a fundamental part of being a software engineer.
Ask many questions about a recent assignment to find out how the candidate approaches problems.
Ask about a failure to make sure you filter out the narcissists.
Ask about your recent assignment to find out how the candidate would solve a real problem related to the job that you are interviewing for.

Good luck with your interviews!!

Disassembling .NET Code with BenchmarkDotNet

2017-08-16T00:00:00+00:00

Disassembly Diagnoser

Disassembly Diagnoser is the new diagnoser for BenchmarkDotNet that I have just finished. It was released as part of 0.10.10. It allows you to disassemble the benchmarked .NET code:

to ASM:
- desktop .NET: LegacyJit (32 & 64 bit), RyuJIT (64 bit)
- .NET Core 1.1+ (including .NET Core 2.0) for RyuJIT (64 bit)
- Mono: 32 & 64 bit, including LLVM
to IL and corresponding C# code:
- desktop .NET: LegacyJit (32 & 64 bit), RyuJIT (64 bit)
- .NET Core: 1.1+ (including .NET Core 2.0)

With a single config!

Demo

[DisassemblyDiagnoser(printAsm: true, printSource: true)] // !!! use the new diagnoser!!
[RyuJitX64Job]
public class Simple
{
    int[] field = Enumerable.Range(0, 100).ToArray();

    [Benchmark]
    public int SumLocal()
    {
        var local = field; // we use local variable that points to the field

        int sum = 0;
        for (int i = 0; i < local.Length; i++)
            sum += local[i];

        return sum;
    }

    [Benchmark]
    public int SumField()
    {
        int sum = 0;
        for (int i = 0; i < field.Length; i++)
            sum += field[i];

        return sum;
    }
}

The regular output:

BenchmarkDotNet=v0.10.9.281-nightly, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338344 Hz, Resolution=427.6531 ns, Timer=TSC
  [Host]    : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
  RyuJitX64 : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0

Job=RyuJitX64  Jit=RyuJit  Platform=X64  

Method	Mean	Error	StdDev
SumLocal	78.27 ns	0.6818 ns	0.6377 ns
SumField	79.24 ns	0.3923 ns	0.3670 ns

And the new disassembly output:

As you can see very similar C# code produces different assembly code which has different performance characteristics. The main goal of disassembly diagnoser is to allow the BenchmarkDotNet users to do an easy comparison of generated assembly code.

The Story

I wanted to develop this feature for a long time. Many people asked for it but I simply did not have any spare time. This was about to change.

I was getting back from an awesome .NET Meetup organized by Karel Zikmund in Prague and I had a few spare hours between checking out of the hotel and my flight. I decided to write a simple PoC and see how it goes.

Initially, I had no idea where to start. Most people use the Disassembly window from Visual Studio. But VS is closed-source so I could only use it for validation of my results. The other option was WinDbg. Getting the disassembly with WinDbg is non-trivial. And it’s closed-source as well.

I have almost forgotten that Matt Warren did something very similar to this for BenchmarkDotNet a long time ago. I have started analysing his code, which led me to msos by Sasha Goldshtein. And msos by Sasha was exactly what I needed. The credit for super smart disassembling goes to Sasha. I just took his code, tweaked it a little and extended. I have found and fixed some bugs in msos and ClrMD. So all sides benefit from being OSS ;)

How it works?

As some of you might know, in BenchmarkDotNet we have the host process (what you run in the console) and the child process, which executes the benchmark and reports results back to the host. The child process is generated, compiled and executed by the host. With such an architecture, we can benchmark the given .NET code for any config desired by the user (any JIT, any .NET framework, any GC configuration). Last but not least, it helps to make the results more stable. GC is self-tuning and JIT can make some extra optimizations, but with a process per benchmark, you always get a clean score.

Desktop .NET

Based on the idea from msos the host is using ClrMD to attach to the child process. ClrMD allows us to get the text representation of assembly code. To get the IL we use the one and only Mono.Cecil. To get the corresponding C# code we once again use ClrMD.

ClrMD can attach only to a process of the same bitness. To support all scenarios (host 32bit, child 64bit and the opposite) I have put the disassembler in a separate process. This is why we have BenchmarkDotNet.Disassembler.x86.exe and BenchmarkDotNet.Disassembler.x64.exe. Both disassemblers are stored in the resources of the BenchmarkDotNet.Core.dll. When the time comes, they are copied from resources to the hard drive and executed accordingly.

.NET Core

The NuGet package of ClrMD implements .NET Core support but targets only desktop .NET. It’s not a problem because we can use our architecture to get it running for .NET Core. Whatever the host is (.NET or .NET Core) it spawns the disassembler process (a desktop .NET process) which uses ClrMD to attach to the child .NET Core process.

This is why we currently support only Windows for our .NET Core disassembler.

Mono

With great help from Miguel de Icaza, I was able to implement a simple disassembler for Mono. We just run:

mono -v -v -v -v --compile $namespace.$typeName:$methodName $exeName

and parse the output. The LLVM is supported and you don’t need to install anything except BenchmarkDotNet. The downside is that as of now the parser can handle only simple benchmarks. I did not have the time to test all edge cases.
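A minimal sketch of how such a command line could be assembled and prepared for output capture. The class and method names are hypothetical and this is not the code BenchmarkDotNet uses; only the argument format comes from the command shown above.

```csharp
using System.Diagnostics;

// Hypothetical helper, for illustration only.
public static class MonoDisassemblerSketch
{
    // Builds the arguments of the command quoted in the post:
    // mono -v -v -v -v --compile $namespace.$typeName:$methodName $exeName
    public static string BuildArguments(string ns, string typeName, string methodName, string exeName)
        => $"-v -v -v -v --compile {ns}.{typeName}:{methodName} {exeName}";

    // Prepares (but does not start) a process whose standard output can be read and parsed.
    public static ProcessStartInfo CreateStartInfo(string monoPath, string arguments)
        => new ProcessStartInfo
        {
            FileName = monoPath,
            Arguments = arguments,
            RedirectStandardOutput = true, // we want to read what mono prints
            UseShellExecute = false,       // required for output redirection
            CreateNoWindow = true
        };
}
```

Passing the result of CreateStartInfo to Process.Start and reading StandardOutput would give the text that has to be parsed.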

Limitations

What we have today comes with the following limitations:

.NET Core disassembler works only on Windows
Mono disassembler does not support recursive disassembling and produces output without IL and C#.
Indirect calls are not tracked.
To be able to compare different platforms, you need to target AnyCPU: <PlatformTarget>AnyCPU</PlatformTarget>
To get the corresponding C#/F# code from the disassembler you need to configure your project in the following way:

<DebugType>pdbonly</DebugType>
<DebugSymbols>true</DebugSymbols>

How to use it?

The first step is to install BenchmarkDotNet version 0.10.10 or newer (always use latest BenchmarkDotNet for your own good!).

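After this you need to apply the following settings to your .csproj file:

<PropertyGroup>
  <PlatformTarget>AnyCPU</PlatformTarget>
  <DebugType>pdbonly</DebugType>
  <DebugSymbols>true</DebugSymbols>
</PropertyGroup>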

Now you can enable it in two ways:

Use the new attribute (apply it on a class that contains Benchmarks):

[DisassemblyDiagnoser(printAsm: true, printSource: true)]
public class TheClassThatContainsBenchmarks { /* benchmarks go here */ }

Tell your custom config to use it:

private class CustomConfig : ManualConfig
{
    public CustomConfig()
    {
        Add(Job.Default);
        Add(DisassemblyDiagnoser.Create(new DisassemblyDiagnoserConfig(printAsm: true, recursiveDepth: 1)));
    }
}

Recursive mode

The new diagnoser supports recursive disassembling. It means that you can configure it to disassemble the benchmark itself and optionally the code that it calls. To do so you need to use the recursiveDepth parameter. Be careful with setting it to int.MaxValue. If you are curious, please try it for the following benchmark:

public void Big()
{
   if(new Random(123).Next(5, 10) > 11)
       throw new InvalidOperationException("Impossible");
}

Spoiler: it produces a 50 MB file ;)
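To see why the depth matters, here is a small sketch of a depth-limited walk over a call graph. The graph, the type name and the exact meaning of depth (0 means only the starting method) are assumptions made for this example; the diagnoser may count depth differently.

```csharp
using System.Collections.Generic;

// Hypothetical model, for illustration only.
public static class RecursiveDisassemblySketch
{
    // callGraph maps a method name to the names of the methods it calls.
    // Returns every method reachable from 'root' within 'maxDepth' levels of calls.
    public static HashSet<string> Collect(
        IReadOnlyDictionary<string, string[]> callGraph, string root, int maxDepth)
    {
        var visited = new HashSet<string>();
        var queue = new Queue<(string Method, int Depth)>();
        queue.Enqueue((root, 0));

        while (queue.Count > 0)
        {
            var (method, depth) = queue.Dequeue();
            if (!visited.Add(method)) continue; // already collected
            if (depth == maxDepth) continue;    // do not follow calls any deeper
            if (!callGraph.TryGetValue(method, out var callees)) continue;

            foreach (var callee in callees)
                queue.Enqueue((callee, depth + 1));
        }

        return visited;
    }
}
```

With an unlimited depth the walk only stops when it runs out of reachable methods, which for code that touches the base class library can be a very large set.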

Single config for ALL JITs

You can use a single config to compare the generated assembly code for ALL JITs.

But to allow benchmarking any target platform architecture the project which defines benchmarks has to target AnyCPU.

<PropertyGroup>
  <PlatformTarget>AnyCPU</PlatformTarget>
</PropertyGroup>

Let’s check the Devirtualization that was introduced recently for .NET Core 2.0:

public class MultipleJits : ManualConfig
{
    public MultipleJits()
    {
        Add(Job.ShortRun.With(new MonoRuntime(name: "Mono x86", customPath: @"C:\Program Files (x86)\Mono\bin\mono.exe")).With(Platform.X86));
        Add(Job.ShortRun.With(new MonoRuntime(name: "Mono x64", customPath: @"C:\Program Files\Mono\bin\mono.exe")).With(Platform.X64));

        Add(Job.ShortRun.With(Jit.LegacyJit).With(Platform.X86).With(Runtime.Clr));
        Add(Job.ShortRun.With(Jit.LegacyJit).With(Platform.X64).With(Runtime.Clr));

        Add(Job.ShortRun.With(Jit.RyuJit).With(Platform.X64).With(Runtime.Clr));

        // RyuJit for .NET Core 1.1
        Add(Job.ShortRun.With(Jit.RyuJit).With(Platform.X64).With(Runtime.Core).With(CsProjCoreToolchain.NetCoreApp11));

        // RyuJit for .NET Core 2.0
        Add(Job.ShortRun.With(Jit.RyuJit).With(Platform.X64).With(Runtime.Core).With(CsProjCoreToolchain.NetCoreApp20));

        Add(DisassemblyDiagnoser.Create(new DisassemblyDiagnoserConfig(printAsm: true, printPrologAndEpilog: true, recursiveDepth: 3)));
    }
}

[Config(typeof(MultipleJits))]
public class Jit_Devirtualization
{
    private Increment increment = new Increment();

    [Benchmark]
    public int CallVirtualMethod() => increment.OperateTwice(10);

    public abstract class Operation  // abstract unary integer operation
    {
        public abstract int Operate(int input);

        public int OperateTwice(int input) => Operate(Operate(input)); // two virtual calls to Operate
    }

    public sealed class Increment : Operation // concrete, sealed operation: increment by fixed amount
    {
        public readonly int Amount;
        public Increment(int amount = 1) { Amount = amount; }

        public override int Operate(int input) => input + Amount;
    }
}

The results:

BenchmarkDotNet=v0.10.9.281-nightly, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338344 Hz, Resolution=427.6531 ns, Timer=TSC
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
  Job-UBMWVM : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit LegacyJIT/clrjit-v4.7.2053.0;compatjit-v4.7.2053.0
  Job-JDGXXX : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2053.0
  Job-PXPXXE : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
  Job-DULNTX : .NET Core 1.1.2 (Framework 4.6.25211.01), 64bit RyuJIT
  Job-GAPDXO : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
  Job-ZXJTYF : Mono 4.4.1 (Visual Studio), 64bit 
  Job-NBVNXQ : Mono 5.2.0 (Visual Studio), 32bit 

LaunchCount=1  TargetCount=3  WarmupCount=3  

Method	Jit	Platform	Runtime	Toolchain	Mean	Error	StdDev
CallVirtualMethod	LegacyJit	X64	Clr	Default	3.222 ns	0.2984 ns	0.0169 ns
CallVirtualMethod	LegacyJit	X86	Clr	Default	3.012 ns	0.3651 ns	0.0206 ns
CallVirtualMethod	RyuJit	X64	Clr	Default	2.928 ns	0.2941 ns	0.0166 ns
CallVirtualMethod	RyuJit	X64	Core	.NET Core 1.1	2.920 ns	0.1688 ns	0.0095 ns
CallVirtualMethod	RyuJit	X64	Core	.NET Core 2.0	2.222 ns	0.6163 ns	0.0348 ns
CallVirtualMethod	RyuJit	X64	Mono x64	Default	5.114 ns	0.5626 ns	0.0318 ns
CallVirtualMethod	RyuJit	X86	Mono x86	Default	9.610 ns	0.2672 ns	0.0151 ns

The disassembly result can be obtained here. The file was too big to embed it in this blog post.

Getting only the Disassembly without running the benchmarks for a long time

Sometimes you might be interested only in the disassembly, not the results of the benchmarks. In that case you can use Job.Dry which runs the benchmark only once.

public class JustDisassembly : ManualConfig
{
    public JustDisassembly()
    {
        Add(Job.Dry.With(Jit.RyuJit).With(Platform.X64).With(Runtime.Core).With(CsProjCoreToolchain.NetCoreApp20));
        Add(Job.Dry.With(Jit.RyuJit).With(Platform.X64).With(Runtime.Core).With(CsProjCoreToolchain.NetCoreApp21));

        Add(DisassemblyDiagnoser.Create(new DisassemblyDiagnoserConfig(printAsm: true, printPrologAndEpilog: true, recursiveDepth: 3)));
    }
}

The Ultimate Combination

Some time ago I implemented the Hardware Counters diagnoser for BenchmarkDotNet. Ever since then I wanted to combine the Instruction Pointers that come with the events with the code.

Now it was finally possible. ClrMD gives me the asm with IPs, ETW gives me hardware counters with IPs. That’s all I need.

Let’s use both diagnosers to answer the famous “Why is it faster to process a sorted array than an unsorted array?”.

class Program
{
    static void Main(string[] args) => BenchmarkRunner.Run<Cpu_BranchPerdictor>();
}

[HardwareCounters(HardwareCounter.BranchMispredictions, HardwareCounter.BranchInstructions)]
[DisassemblyDiagnoser(printAsm: true, printSource: true)]
public class Cpu_BranchPerdictor
{
    private const int N = 32767;
    private readonly int[] sorted, unsorted;

    public Cpu_BranchPerdictor()
    {
        var random = new Random(0);
        unsorted = new int[N];
        sorted = new int[N];
        for (int i = 0; i < N; i++)
            sorted[i] = unsorted[i] = random.Next(256);
        Array.Sort(sorted);
    }

    private static int Branch(int[] data)
    {
        int sum = 0;
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)
                sum += data[i];
        return sum;
    }

    [Benchmark]
    public int SortedBranch() => Branch(sorted);

    [Benchmark]
    public int UnsortedBranch() => Branch(unsorted);
}

The results:

BenchmarkDotNet=v0.10.9.281-nightly, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338344 Hz, Resolution=427.6531 ns, Timer=TSC
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
  DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0

Method	Mean	Error	StdDev	Mispredict rate	BranchInstructions/Op	BranchMispredictions/Op
SortedBranch	21.15 us	0.0550 us	0.0488 us	0,11%	61712	65
UnsortedBranch	135.32 us	0.7503 us	0.7018 us	21,90%	80158	17555
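The Mispredict rate column is consistent with dividing the mispredictions by the branch instructions. A tiny sketch of that arithmetic, assuming that this is how the column is derived (the helper name is made up for this example):

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class MispredictRateSketch
{
    // Percentage of branch instructions that were mispredicted, rounded to two decimals.
    public static double Percent(long branchMispredictions, long branchInstructions)
        => Math.Round(100.0 * branchMispredictions / branchInstructions, 2);
}
```

Percent(65, 61712) gives 0.11 and Percent(17555, 80158) gives 21.9, which matches the two rows above.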

The new report:

How it works

When we attach with ClrMD to the benchmarked process we ask it for the asm instructions at a given address. The address is the Instruction Pointer (IP).

The other diagnoser uses ETW to gather the PMC events. Each event comes with a hardware counter type, interval, Instruction Pointer and process Id.

When we detect that the user is using both diagnosers we enable the Instruction Pointer exporter. It eliminates the noise (events with IPs that don’t belong to the benchmarked code, like the BenchmarkDotNet engine) and aggregates the results.
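The filtering and aggregation step can be sketched as follows. The PmcSample type, the single address range and the method name are assumptions made for this example; they are not BenchmarkDotNet types and the real exporter may work differently.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a PMC event, for illustration only.
public readonly struct PmcSample
{
    public PmcSample(string counter, ulong instructionPointer, int interval)
    {
        Counter = counter;
        InstructionPointer = instructionPointer;
        Interval = interval;
    }

    public string Counter { get; }
    public ulong InstructionPointer { get; }
    public int Interval { get; }
}

public static class InstructionPointerSketch
{
    // Drops samples whose IP is outside [codeStart, codeEnd), i.e. the noise,
    // and sums the intervals per (IP, counter) pair.
    public static Dictionary<(ulong Ip, string Counter), long> Aggregate(
        IEnumerable<PmcSample> samples, ulong codeStart, ulong codeEnd)
        => samples
            .Where(s => s.InstructionPointer >= codeStart && s.InstructionPointer < codeEnd)
            .GroupBy(s => (Ip: s.InstructionPointer, s.Counter))
            .ToDictionary(g => g.Key, g => g.Sum(s => (long)s.Interval));
}
```

The aggregated totals can then be printed next to the asm instruction that lives at the same address.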

Skid

Please keep in mind that we just show what we get. The PMC events are usually delayed. They are collected in Event-Based Sampling (EBS) mode. When the event occurs, the counter increments and when it reaches the max interval value the event is fired with the current Instruction Pointer (good explanation). We try to overcome the side effects of this by running a lot of iterations of the benchmarked code. If your processor supports PEBS it should also help.

As you can see instructions without branches report branching events. I used arrows to show the real instructions for each branch.

If you are interested in learning more about skid testing I encourage you to try the simple but very smart “Processor PMC event skid testing” by Brendan Gregg. In his case it was over 99% skids.

Span

2017-07-13T00:00:00+00:00

tl;dr Use Span to work with ANY kind of memory in a safe and very efficient way. Simplify your APIs and use the full power of unmanaged memory!

Introduction
The Problem
Span is the Solution
Performance
The Limitations
Summary
Sources

Introduction

C# gives us great flexibility when it comes to using different kinds of memory. But the majority of the developers use only the managed one. Let’s take a brief look at what C# has to offer for us:

Stack memory - allocated on the Stack with the stackalloc keyword. Very fast allocation and deallocation. The size of the Stack is very small (usually < 1 MB) and fits well into CPU cache. But when you try to allocate more, you get StackOverflowException, which cannot be handled and immediately kills the entire process. Usage is also limited by the very short lifetime of the stack: when the method ends, the stack gets unwound together with its memory. Stackalloc is commonly used for short operations that must not allocate any managed memory. An example is very fast logging of ETW events in corefx: it has to be as fast as possible and needs very little memory (so the size limitation is not a problem).

internal unsafe void BufferRented(int bufferId, int bufferSize, int poolId, int bucketId)
{
    EventData* payload = stackalloc EventData[4];
    payload[0].Size = sizeof(int);
    payload[0].DataPointer = ((IntPtr)(&bufferId));
    payload[1].Size = sizeof(int);
    payload[1].DataPointer = ((IntPtr)(&bufferSize));
    payload[2].Size = sizeof(int);
    payload[2].DataPointer = ((IntPtr)(&poolId));
    payload[3].Size = sizeof(int);
    payload[3].DataPointer = ((IntPtr)(&bucketId));
    WriteEventCore(1, 4, payload);
}

Unmanaged memory - allocated on the unmanaged heap (invisible to GC) by calling Marshal.AllocHGlobal or Marshal.AllocCoTaskMem methods. This memory must be released by the developer with an explicit call to Marshal.FreeHGlobal or Marshal.FreeCoTaskMem. By using it we don’t add any extra pressure on the GC. It’s most commonly used to avoid GC in scenarios where you would normally allocate huge arrays of value types without pointers. Here you can see some real-life use cases from Kestrel.
Managed memory - We can allocate it with the new operator. It’s called managed because it’s managed by the Garbage Collector (GC). GC decides when to free the memory, the developer doesn’t need to worry about it. As described in one of my previous blog posts, the GC divides managed objects into two categories:
- Small objects (size < 85 000 bytes) - allocated in the generational part of the managed heap. The allocation of small objects is fast. When they are promoted to older generations, their memory is usually copied. The deallocation is non-deterministic and blocking. Short-lived objects are cleaned up in the very fast Gen 0 (or Gen 1) collection. The long-living ones are subject to the Gen 2 collection, which is usually very time-consuming.
- Large objects (size >= 85 000 bytes) - allocated in the Large Object Heap (LOH). Managed with the free list algorithm, which offers slower allocation and can lead to memory fragmentation. The advantage is that large objects are by default never copied. This behavior can be changed on demand. LOH has very expensive deallocation (Full GC) which can be minimized by using ArrayPool.

Note: When I say that a given GC operation is slow I don’t mean that the .NET GC is slow. .NET has a great GC, but no GC is better than any GC.
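A short sketch of the large object case, assuming a runtime where the System.Buffers ArrayPool type is available. The class and method names are made up for this example. The first method shows that a freshly allocated large array is reported as generation 2, the second one rents a buffer from the shared pool instead of allocating a new large array every time.

```csharp
using System;
using System.Buffers;

// Hypothetical helper, for illustration only.
public static class LargeBufferSketch
{
    // Reports the GC generation of a freshly allocated byte array.
    // Arrays that land on the Large Object Heap are reported as generation 2.
    public static int GenerationOfNewArray(int length) => GC.GetGeneration(new byte[length]);

    // Rents a buffer from the shared pool, fills it, sums its content
    // and gives the buffer back so that the next caller can reuse it.
    public static long SumWithPooledBuffer(int size, byte value)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(size); // may be longer than requested
        try
        {
            long sum = 0;
            for (int i = 0; i < size; i++)
            {
                buffer[i] = value;
                sum += buffer[i];
            }
            return sum;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Because the rented array can be longer than requested, the loop uses the requested size and not buffer.Length.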

The Problem

When was the last time you saw a public .NET API that accepted pointers? During my recent NDC Oslo talk, I asked the audience who had ever used stackalloc. One person (Kristian) out of more than one hundred raised a hand. Why is it so uncommon to use the native memory in C#?

Let’s try to answer this question by designing an API for parsing integers for all kinds of memory.

We start with a string which is a managed representation of buffer of characters.

int Parse(string managedMemory); // allows us to parse the whole string

What if we want to parse only selected part of the string?

int Parse(string managedMemory, int startIndex, int length); // allows us to parse part of the string

Ok, so let’s support the unmanaged memory now:

unsafe int Parse(char* pointerToUnmanagedMemory, int length); // allows us to parse characters stored on the unmanaged heap / stack
unsafe int Parse(char* pointerToUnmanagedMemory, int startIndex, int length); // allows us to parse part of the characters stored on the unmanaged heap / stack

It’s already four overloads and I am pretty sure that I have missed something ;)

Now let’s design an API for copying blocks of memory:

void Copy<T>(T[] source, T[] destination); 
void Copy<T>(T[] source, int sourceStartIndex, T[] destination, int destinationStartIndex, int elementsCount);
unsafe void Copy<T>(void* source, void* destination, int elementsCount);
unsafe void Copy<T>(void* source, int sourceStartIndex, void* destination, int destinationStartIndex, int elementsCount);
unsafe void Copy<T>(void* source, int sourceLength, T[] destination);
unsafe void Copy<T>(void* source, int sourceStartIndex, T[] destination, int destinationStartIndex, int elementsCount);

Update: We don’t need to worry about handling long parameters. The Array in .NET has a method GetLongLength but it never returns a value bigger than int.MaxValue.

As you can see supporting any kind of memory was previously hard and problematic.

Span is the Solution

Span (previously called Slice) is a simple value type that allows us to work with any kind of contiguous memory:

Unmanaged memory buffers
Arrays and subarrays
Strings and substrings

It ensures memory and type safety and has almost no overhead.

Prerequisites

To work with Span you need to install the latest System.Memory NuGet package and set LangVersion to C# 7.2 or newer.

<LangVersion>7.2</LangVersion>

Note: Older compilers might give you errors like this:

Error CS8107 Feature 'ref structs' is not available in C# 7.0. Please use language version 7.2 or greater.

Using Span

I would say that you can think of it as an array, which does all the pointer arithmetic for you, but internally can point to any kind of memory.

We can create Span for unmanaged memory:

Span<byte> stackMemory = stackalloc byte[256]; // C# 7.2 

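IntPtr unmanagedHandle = Marshal.AllocHGlobal(256);
Span<byte> unmanaged = new Span<byte>(unmanagedHandle.ToPointer(), 256); // the pointer constructor requires an unsafe context
Marshal.FreeHGlobal(unmanagedHandle); // the Span must not be used after the memory is freed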

There is even an implicit cast operator from T[] to Span:

char[] array = new char[] { 'i', 'm', 'p', 'l', 'i', 'c', 'i', 't' };
Span<char> fromArray = array; // implicit cast

There is also ReadOnlySpan which can be used to work with strings or other immutable types.

ReadOnlySpan<char> fromString = "Strings in .NET are immutable".AsSpan();

Once you create it, you can use it the way you would typically use an array: it has a Length property and an indexer, which allows you to get and set the values.
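A minimal example of that array-like usage. The helper name is made up for this example; it only uses the Span constructor, Length and the indexer.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SpanUsageSketch
{
    // Doubles the elements in [start, start + length) in place, through a Span over the array.
    public static void DoubleRange(int[] array, int start, int length)
    {
        Span<int> slice = new Span<int>(array, start, length); // no copy, just a view over the array

        for (int i = 0; i < slice.Length; i++)
            slice[i] *= 2; // the indexer returns a reference, so this writes to the array
    }
}
```

Calling DoubleRange on { 1, 2, 3, 4, 5 } with start 1 and length 3 leaves the array as { 1, 4, 6, 8, 5 }: the Span never owned a copy of the data.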

The simplified API (full can be found here):

Span(T[] array);
Span(T[] array, int startIndex);
Span(T[] array, int startIndex, int length);
unsafe Span(void* memory, int length);

int Length { get; }
ref T this[int index] { get; } // returns a reference, so it is used for both reading and writing

Span<T> Slice(int start);
Span<T> Slice(int start, int length);

void Clear();
void Fill(T value);

void CopyTo(Span<T> destination);
bool TryCopyTo(Span<T> destination);

API Simplicity

Let’s redesign parsing and copying APIs mentioned earlier and take advantage of Span:

int Parse(ReadOnlySpan<char> anyMemory);
int Copy<T>(ReadOnlySpan<T> source, Span<T> destination);

As simple as it gets! Span abstracts almost everything about memory. It makes using the unmanaged memory much easier for both API producers and consumers.
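To make the single-overload idea concrete, here is a simplified sketch of such a Parse method. It is not the framework parser: it handles only non-negative decimal digits, and the type and helper names are made up for this example.

```csharp
using System;

// Hypothetical implementation, for illustration only.
public static class SpanParsingSketch
{
    // Parses a non-negative decimal integer. No sign, whitespace or culture handling.
    public static int Parse(ReadOnlySpan<char> anyMemory)
    {
        if (anyMemory.IsEmpty) throw new FormatException("Input is empty.");

        int result = 0;
        foreach (char c in anyMemory)
        {
            if (c < '0' || c > '9') throw new FormatException($"Unexpected character '{c}'.");
            result = checked(result * 10 + (c - '0')); // throws OverflowException on overflow
        }
        return result;
    }

    // The same method fed from a string, a part of a string, an array and stack memory.
    public static int[] ParseFromEveryKindOfMemory()
    {
        Span<char> stack = stackalloc char[2];
        stack[0] = '4';
        stack[1] = '2';

        return new[]
        {
            Parse("123".AsSpan()),                   // whole string
            Parse("abc456def".AsSpan().Slice(3, 3)), // part of a string, no Substring call
            Parse(new[] { '7', '8' }),               // array, implicit conversion
            Parse(stack)                             // stack memory, implicit conversion
        };
    }
}
```

ParseFromEveryKindOfMemory returns { 123, 456, 78, 42 }, all produced by the same overload.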

How does it work?

There are two versions of Span:

For the runtimes existing prior to Span.
For the new runtimes, which implement native support for Spans.

The version for the existing runtimes is implemented in corefx. .NET Core 2.0 is so far the only runtime that implements native support for Span.

The Span for existing Runtimes consists of three fields: reference (represented by simple reference type field), byteOffset (IntPtr) and length (int, not long). When we access n-th value, the indexer does the pointer arithmetic for us (pseudocode):

ref T this[int index]
{
    get => ref ((ref reference + byteOffset) + index * sizeOf(T));
}

The new GC knows how to deal with Span: the reference and byteOffset fields got merged into an interior pointer. The new GC is aware that it’s a merged reference and has native support for updating it when needed during the Compact phase of the garbage collection (when the underlying object, like an array, is moved, the reference needs to be updated while the offset needs to remain untouched).

IL and C# do not support ref T fields. .NET Core 2.0+ implements it by representing it via ByReference type, which is just a JIT intrinsic.
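The three-field layout can be modelled with a deliberately simplified, array-only type. ArrayOnlySpan is a hypothetical name and this is not the corefx implementation: the real type stores a byte offset and also handles strings and native memory, while this sketch keeps an element offset into an array.

```csharp
using System;

// Hypothetical, array-only model of the three-field layout. For illustration only.
public readonly struct ArrayOnlySpan<T>
{
    private readonly T[] reference; // the managed object that owns the memory
    private readonly int offset;    // element offset here; the real type stores a byte offset
    private readonly int length;

    public ArrayOnlySpan(T[] array, int start, int length)
    {
        if (array == null) throw new ArgumentNullException(nameof(array));
        if ((uint)start > (uint)array.Length || (uint)length > (uint)(array.Length - start))
            throw new ArgumentOutOfRangeException();

        reference = array;
        offset = start;
        this.length = length;
    }

    public int Length => length;

    // Bounds check against the view, then index arithmetic on the underlying array.
    public ref T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)length) throw new IndexOutOfRangeException();
            return ref reference[offset + index];
        }
    }

    // Slicing creates a new view over the same array; nothing is copied.
    public ArrayOnlySpan<T> Slice(int start, int count)
    {
        if ((uint)start > (uint)length || (uint)count > (uint)(length - start))
            throw new ArgumentOutOfRangeException();
        return new ArrayOnlySpan<T>(reference, offset + start, count);
    }
}
```

Every indexer call pays for the extra offset arithmetic, which is the overhead that goes away once the runtime can store a single interior pointer instead.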

Performance

Slicing without managed heap allocations!

Slicing (taking part of some memory) is a core feature of Span. It does not copy any memory, it simply creates a Span with a different pointer and length (and offset for the old runtimes).

Why is it so important? Because so far in .NET, anytime we wanted to take a substring of a string, .NET allocated a new string for us and copied the desired content to it. Pseudocode:

string Substring(string text, int startIndex, int length)
{
    string result = new string(length); // ALLOCATION!
        
    Memory.Copy(source: text, destination: result, startIndex, length); // COPYING MEMORY
        
    return result;
}

With Span, there is no allocation! Pseudocode:

ReadOnlySpan<char> Slice(string text, int startIndex, int length)
    => new ReadOnlySpan<char>(
        ref text[0] + (startIndex * sizeof(char)), 
        length);

Let’s measure the difference with BenchmarkDotNet:

[MemoryDiagnoser]
[Config(typeof(DontForceGcCollectionsConfig))]
public class SubstringVsSubslice
{
    private string Text;

    [Params(10, 1000)]
    public int CharactersCount { get; set; }

    [GlobalSetup]
    public void Setup() => Text = new string(Enumerable.Repeat('a', CharactersCount).ToArray());

    [Benchmark]
    public string Substring() => Text.Substring(0, Text.Length / 2);

    [Benchmark(Baseline = true)]
    public ReadOnlySpan<char> Slice() => Text.AsSpan().Slice(0, Text.Length / 2);
}

BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 1 (10.0.14393)
Processor=Intel Core i7-6600U CPU 2.60GHz (Skylake), ProcessorCount=4
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0
  Job-NJYLUU : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0

Force=False  

Method	CharactersCount	Mean	StdDev	Scaled	Gen 0	Allocated
Substring	10	8.277 ns	0.1938 ns	4.54	0.0191	40 B
Slice	10	1.822 ns	0.0383 ns	1.00	-	0 B
Substring	1000	85.518 ns	1.3474 ns	47.22	0.4919	1032 B
Slice	1000	1.811 ns	0.0205 ns	1.00	-	0 B

It’s clear that:

Slicing does not allocate any managed heap memory. Substring allocates. (Allocated column)
Slicing is much faster (Mean column)
Slicing has constant cost! Look at the values for Standard Deviation and Mean for CharactersCount = 10 and 1000!

Slow vs Fast Span

Some people call the 3-field Span “slow Span” and the 2-field one “fast Span”. BenchmarkDotNet allows running the same benchmark for multiple runtimes. Let’s use it to compare the indexer for .NET 4.6, .NET Core 1.1 and .NET Core 2.0.

[Config(typeof(MultipleRuntimesConfig))]
public class SpanIndexer
{
    protected const int Loops = 100;
    protected const int Count = 1000;

    protected byte[] arrayField;

    [GlobalSetup]
    public void Setup() => arrayField = Enumerable.Repeat(1, Count).Select((val, index) => (byte)index).ToArray();

    [Benchmark(OperationsPerInvoke = Loops * Count)]
    public byte SpanIndexer_Get()
    {
        Span<byte> local = arrayField; // implicit cast to Span, we can't have Span as a field!
        byte result = 0;
        for (int _ = 0; _ < Loops; _++)
        {
            for (int j = 0; j < local.Length; j++)
            {
                result = local[j];
            }
        }
        return result;
    }

    [Benchmark(OperationsPerInvoke = Loops * Count)]
    public void SpanIndexer_Set()
    {
        Span<byte> local = arrayField; // implicit cast to Span, we can't have Span as a field!
        for (int _ = 0; _ < Loops; _++)
        {
            for (int j = 0; j < local.Length; j++)
            {
                local[j] = byte.MaxValue;
            }
        }
    }
}

public class MultipleRuntimesConfig : ManualConfig
{
    public MultipleRuntimesConfig()
    {
        Add(Job.Default
                .With(CsProjClassicNetToolchain.Net46) // Span NOT supported by Runtime
                .WithId(".NET 4.6"));

        Add(Job.Default
                .With(CsProjCoreToolchain.NetCoreApp11) // Span NOT supported by Runtime
                .WithId(".NET Core 1.1"));

        Add(Job.Default
                .With(CsProjCoreToolchain.NetCoreApp20) // Span SUPPORTED by Runtime
                .WithId(".NET Core 2.0"));
    }
}

BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 1 (10.0.14393)
Processor=Intel Core i7-6600U CPU 2.60GHz (Skylake), ProcessorCount=4
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
  [Host]        : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0
  .NET 4.6      : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0
  .NET Core 1.1 : .NET Core 4.6.25211.01, 64bit RyuJIT
  .NET Core 2.0 : .NET Core 4.6.25302.01, 64bit RyuJIT

Method	Job	Mean	StdDev
SpanIndexer_Get	.NET 4.6	0.6054 ns	0.0007 ns
SpanIndexer_Get	.NET Core 1.1	0.6047 ns	0.0008 ns
SpanIndexer_Get	.NET Core 2.0	0.5333 ns	0.0006 ns
SpanIndexer_Set	.NET 4.6	0.6059 ns	0.0007 ns
SpanIndexer_Set	.NET Core 1.1	0.6042 ns	0.0002 ns
SpanIndexer_Set	.NET Core 2.0	0.5205 ns	0.0003 ns

The difference is around 12-14%. In my opinion, it proves that people should not be afraid of using the “slow” Span on existing runtimes. But there is still room for further improvement in the new runtimes, so the gap might get bigger soon.

Note: Please keep in mind that this is not a perfect indexer benchmark. It relies heavily on the CPU cache. I am not sure if it is even possible to design a perfect benchmark for the indexer.

Span vs Array

When we take a look at the official requirements for Span we can find:

“Performance characteristics on par with arrays.”

Let’s measure it ;)

public class SpanVsArray_Indexer : SpanIndexer
{
    [Benchmark(OperationsPerInvoke = Loops * Count)]
    public byte ArrayIndexer_Get()
    {
        var local = arrayField;
        byte result = 0;
        for (int _ = 0; _ < Loops; _++)
        {
            for (int j = 0; j < local.Length; j++)
            {
                result = local[j];
            }
        }
        return result;
    }

    [Benchmark(OperationsPerInvoke = Loops * Count)]
    public void ArrayIndexer_Set()
    {
        var local = arrayField;
        for (int _ = 0; _ < Loops; _++)
        {
            for (int j = 0; j < local.Length; j++)
            {
                local[j] = byte.MaxValue;
            }
        }
    }
}

BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 1 (10.0.14393)
Processor=Intel Core i7-6600U CPU 2.60GHz (Skylake), ProcessorCount=4
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
  [Host]        : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0
  .NET 4.6      : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0
  .NET Core 1.1 : .NET Core 4.6.25211.01, 64bit RyuJIT
  .NET Core 2.0 : .NET Core 4.6.25302.01, 64bit RyuJIT

Method	Job	Mean	StdDev
ArrayIndexer_Get	.NET 4.6	0.5499 ns	0.0009 ns
SpanIndexer_Get	.NET 4.6	0.6073 ns	0.0016 ns
ArrayIndexer_Get	.NET Core 1.1	0.5455 ns	0.0006 ns
SpanIndexer_Get	.NET Core 1.1	0.6062 ns	0.0008 ns
ArrayIndexer_Get	.NET Core 2.0	0.5401 ns	0.0019 ns
SpanIndexer_Get	.NET Core 2.0	0.5357 ns	0.0010 ns
ArrayIndexer_Set	.NET 4.6	0.5028 ns	0.0010 ns
SpanIndexer_Set	.NET 4.6	0.6057 ns	0.0005 ns
ArrayIndexer_Set	.NET Core 1.1	0.5074 ns	0.0013 ns
SpanIndexer_Set	.NET Core 1.1	0.6056 ns	0.0008 ns
ArrayIndexer_Set	.NET Core 2.0	0.5069 ns	0.0010 ns
SpanIndexer_Set	.NET Core 2.0	0.5219 ns	0.0005 ns

The requirement is met only for the new runtime with native Span support, .NET Core 2.0. Personally, I don’t believe that the existing runtimes can ever meet this requirement, because there is no way to add features like array bounds check elimination to them.

The Limitations

Span supports any kind of memory. It means that it should have the same restrictions as the most demanding type of memory.

In our case, it’s stack memory. A pointer to the stack must not be stored on the managed heap. When the method ends, the stack gets unwound and the pointer becomes invalid. If we somehow store it on the heap, bad things are going to happen. Anything else that contains a pointer to the stack also must not be stored on the managed heap. It means that Span must not be stored on the managed heap.

Moreover, Span, as a value type with more than one field, suffers from struct tearing. Span is supposed to be very fast, so we cannot solve the struct tearing issue by introducing access synchronization.

Stack-only

Span is a stack-only type:

Span instances can reside only on the stack.
Stacks are not shared across multiple threads, so a single stack is accessed by only one thread at a time. It ensures thread safety for Span!
Stacks are short-lived, which means that the GC will track fewer pointers. If we let them live long (on the heap), we could get into a situation where Span creates a big overhead for the GC.

No Heap

Because Span is a stack-only type, it must not be stored on the heap, which leads us to a long list of limitations.

Span must not be a field in a non-stack-only type

If you make Span a field in a class it will be stored on the heap. This is prohibited!

class Impossible
{
    Span<byte> field;
}

Since C# 7.2 it is possible to have a Span field in another stack-only type.

ref struct TwoSpans<T>
{
	// can have ref-like instance fields
	public Span<T> first;
	public Span<T> second;
} 

Span must not implement any existing interface

Let’s consider the following C# code:

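void NonConstrained<T>(IEnumerable<T> collection) { }

struct SomeValueType<T> : IEnumerable<T>
{
    public IEnumerator<T> GetEnumerator() => Enumerable.Empty<T>().GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

void Demo()
{
    var value = new SomeValueType<int>();
    NonConstrained(value); // the struct is converted to IEnumerable<int>, so it gets boxed
}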

The value got boxed: the compiled IL for Demo contains a box instruction. It means the value is stored on the heap, which is prohibited for Span. You can read more about boxing in my previous blog post.

The point is that, to prevent boxing, Span must not implement any existing interface like IEnumerable. If in the future C# allows defining an interface that can be implemented only by a struct, then it might become possible.
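
Here is a minimal sketch of why boxing matters, using a hypothetical ICounter interface and Counter struct that are not part of the example above. Converting the struct to the interface copies it to the heap, so the mutation lands on the boxed copy and the original stays untouched. A generic method with a struct constraint can call the same interface method without boxing:

```csharp
using System;

// Hypothetical types, introduced only for this illustration.
public interface ICounter
{
    int Value { get; }
    void Increment();
}

public struct Counter : ICounter
{
    private int count;
    public int Value => count;
    public void Increment() => count++;
}

public static class BoxingDemo
{
    // Returns the value seen through the original struct and through the boxed copy.
    public static (int Original, int Boxed) MutateThroughInterface()
    {
        var counter = new Counter();
        ICounter boxed = counter; // boxing: the struct is copied to the heap
        boxed.Increment();        // mutates the heap copy only
        return (counter.Value, boxed.Value);
    }

    // T is known to be a struct, so the interface method is called without boxing.
    private static void IncrementInPlace<T>(ref T counter) where T : struct, ICounter
        => counter.Increment();

    public static int MutateThroughConstraint()
    {
        var counter = new Counter();
        IncrementInPlace(ref counter);
        return counter.Value;
    }
}
```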

Span must not be a parameter for async method

async and await are awesome C# features. They make our life easier by solving a lot of problems and hiding a lot of complexity from us.

But whenever async and await are used, an AsyncMethodBuilder is created. The builder creates an asynchronous state machine, which at some point might put the parameters of the method on the heap. This is why Span must not be an argument for an async method.

Span must not be a generic type argument

Let’s consider the following C# code:

Span<byte> Allocate() => new Span<byte>(new byte[256]);

void CallAndPrint<T>(Func<T> valueProvider) // no generic requirements for T
{
    object value = valueProvider.Invoke(); // boxing!

    Console.WriteLine(value.ToString());
}

void Demo()
{
    Func<Span<byte>> spanProvider = Allocate;
    CallAndPrint<Span<byte>>(spanProvider);
}

As you can see, the non-boxing requirement cannot be ensured today if we allow Span to be a generic type argument. One possible solution could be to introduce a new generic constraint: stackonly. But then all the managed compilers would have to respect it and ensure the lack of boxing and other restrictions. This is why it was decided to simply forbid using Span as a generic argument.

Initially, this requirement was verified at runtime by .NET Core 2.0. Since C# 7.2 it’s also enforced by the compiler at compile time.

Memory

Memory is a new type which can point only to managed memory, so it does not have stack-only limitation. It can be created out of a managed array, string or IOwnedMemory, passed to async method(s) or stored in the field of a class. When you need Span, you just call the .Span property, which creates Span on demand. Then you use it within the current scope.

public readonly struct Memory<T>
{
    private readonly object _object; // String, Array or OwnedMemory
    private readonly int _index;
    private readonly int _length;

    public Span<T> Span { get; }

    public Memory<T> Slice(int start)
    public Memory<T> Slice(int start, int length)
    public MemoryHandle Pin()
}

Sample usage:

byte[] buffer = ArrayPool<byte>.Shared.Rent(16000 * 8); // we use an Array here, not Span
int bytesRead;

while ((bytesRead = await fileStream.ReadAsync(buffer, 0, buffer.Length)) > 0) // AWAIT! writing to array
{
    ParseBlock(new ReadOnlyMemory<byte>(buffer, start: 0, length: bytesRead)); // creating ReadOnlyMemory which points to managed array
}

void ParseBlock(ReadOnlyMemory<byte> memory)
{
    ReadOnlySpan<byte> slice = memory.Span; // using Span from here
}
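
Because Memory is not stack-only, it can also live in a field of a class. The sketch below assumes a hypothetical BufferOwner class (not part of any library) that stores Memory<byte> and creates a Span only for the duration of a single method call:

```csharp
using System;

public sealed class BufferOwner
{
    // Allowed: Memory<T> is a regular struct, so it can be stored on the heap.
    private readonly Memory<byte> memory;

    public BufferOwner(byte[] array) => memory = array; // implicit conversion from byte[]

    public int SumOfSlice(int start, int length)
    {
        // The Span is created on demand and never leaves this method.
        ReadOnlySpan<byte> span = memory.Slice(start, length).Span;

        int sum = 0;
        foreach (byte value in span)
        {
            sum += value;
        }
        return sum;
    }
}
```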

Summary

Span allows us to work with any type of memory.
System.Memory package, C# 7.2.
It makes working with native memory much easier.
Simple abstraction over Pointer Arithmetic.
Avoid allocation and copying of memory with Slicing.
Supports .NET Standard 1.1+
Its performance is on par with Array for new runtimes.
It’s limited due to stack only requirements.

Sources

Span design document
Compile time enforcement of safety for ref-like types C# 7.2 feature description by Vladimir Sadov
.NET Standard article by MSDN
Span issue in coreclr repo
Span issue in C# language repo

ref returns and ref locals

2017-07-04T00:00:00+00:00

tl;dr Pass and return by reference to avoid large struct copying. It’s type and memory safe. It can be even faster than unsafe!

Introduction

Since C# 1.0 we could pass arguments to methods by reference. It means that instead of copying value types every time we pass them to a method we can just pass them by reference. It allows us to overcome one of the very few disadvantages of value types which I described in my previous blog post “Value Types vs Reference Types”.

Passing is not enough to cover all scenarios. C# 7.0 adds new possibilities: declaring references to local variables and returning by reference from methods.

Note: I want to focus on the performance aspect here. If you want to learn more about ref returns and ref locals you should read these awesome blog posts from Vladimir Sadov. He is the software engineer who has implemented this feature for C# compiler. So you can get it straight from the horse’s mouth!

Reminder

Let’s analyse some simple C# examples to make sure that we have a good common understanding of the syntax.

void method(ref int argument) - The argument is passed to the method by reference.
```
int localVariable = 123;
ref int localReference = ref localVariable;
```
ref localVariable - Taking a reference to a local variable. If you have a C++ background you can think of it as &localVariable.
ref int localReference - Defining a local reference. localReference is an alias of an existing variable.
ref array[0] - Taking a reference to the array’s first element.
ref int method() - The result of the method is passed by reference. The method still returns an int. Not a pointer!
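
The syntax above can be put together in one small sketch. The RefSyntaxDemo class and its members are hypothetical names used only for illustration:

```csharp
public static class RefSyntaxDemo
{
    private static readonly int[] storage = { 1, 2, 3 };

    // ref parameter: the method works on the caller's variable, not on a copy
    public static void Increment(ref int argument) => argument++;

    // ref return: the caller receives an alias to storage[index], not a copy
    public static ref int ElementAt(int index) => ref storage[index];

    public static int Run()
    {
        int localVariable = 123;
        ref int localReference = ref localVariable; // alias of localVariable

        localReference = 456;         // writes to localVariable
        Increment(ref localVariable); // localVariable is now 457

        ref int first = ref ElementAt(0);
        first = 10;                   // writes to storage[0]

        return localVariable + storage[0];
    }
}
```

Run returns 467: the local was changed through its alias and through the ref parameter, and the array element was changed through the returned reference.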

Passing arguments to methods by reference

In C# all parameters are passed to methods by value by default. It means that the Value Type instance is copied every time we pass it to a method. Or when we return it from a method. The bigger the Value Type is, the more expensive it is to copy it. For small value types, the JIT compiler might optimize the copying (inline the method, use registers for copying & more).

We can pass arguments to methods by reference. It’s not a new feature, it was part of C# 1.0. Anyway, I am going to measure it to make sure that it actually improves the performance. Once again I am using BenchmarkDotNet for benchmarking.

Benchmarks

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job] // run the benchmarks for all available jits
public class PassingByReference
{
    struct BigStruct
    {
        public int Int1, Int2, Int3, Int4, Int5;
    }

    private BigStruct field = new BigStruct();

    [Benchmark(OperationsPerInvoke = 16, Baseline = true)]
    public void PassByValue()
    {
        var copy = field; // access the field only once to not influence the benchmark too much
        Method(copy); Method(copy); Method(copy); Method(copy);
        Method(copy); Method(copy); Method(copy); Method(copy);
        Method(copy); Method(copy); Method(copy); Method(copy);
        Method(copy); Method(copy); Method(copy); Method(copy);
    }

    [Benchmark(OperationsPerInvoke = 16)]
    public void PassByReference()
    {
        ref var local = ref field; // access the field only once to not influence the benchmark too much
        Method(ref local); Method(ref local); Method(ref local); Method(ref local);
        Method(ref local); Method(ref local); Method(ref local); Method(ref local);
        Method(ref local); Method(ref local); Method(ref local); Method(ref local);
        Method(ref local); Method(ref local); Method(ref local); Method(ref local);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    void Method(BigStruct value) { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    void Method(ref BigStruct value) { }
}

Results

BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338342 Hz, Resolution=427.6534 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0
  LegacyJitX64 : Clr 4.0.30319.42000, 64bit LegacyJIT/clrjit-v4.7.2053.0;compatjit-v4.7.2053.0
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.7.2053.0
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0

Runtime=Clr  

Method	Jit	Platform	Mean	Scaled
PassByValue	LegacyJit	X86	2.868 ns	1.00
PassByReference	LegacyJit	X86	1.434 ns	0.50

As you can see the 32bit JIT is struggling with copying large value types. Passing them by reference gave us a 2x speed-up in this scenario. But we have measured only the time required to pass an argument to a method. If a method is complex and time-consuming itself, the performance improvement might be very small. The smaller the method, the bigger the improvement you will see.

Method	Jit	Platform	Mean	Scaled
PassByValue	LegacyJit	X64	2.062 ns	1.00
PassByReference	LegacyJit	X64	1.470 ns	0.71
PassByValue	RyuJit	X64	2.098 ns	1.00
PassByReference	RyuJit	X64	1.593 ns	0.76

For 64 bit the difference is smaller but still noticeable.

We can say that passing arguments by reference can bring you some benefits, but you should not expect a 10x improvement. The code also gets more complex. Measure your scenario, and use it only if you can prove that it brings a worthwhile performance improvement.

Local references

Using local references is another way to avoid copying of memory. Let’s try to initialize an array of large value types and see how fast we can get with ref locals.

Benchmarks

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job]
[RPlotExporter] // use R to get nice charts!
[CsvMeasurementsExporter] // use R to get nice charts!
public class InitializingBigStructs
{
    struct BigStruct
    {
        public int Int1, Int2, Int3, Int4, Int5;
    }

    private BigStruct[] array;

    [GlobalSetup]
    public void Setup() => array = new BigStruct[1000];

    [Benchmark]
    public void ByValue()
    {
        BigStruct[] variable = array;
        for (int i = 0; i < variable.Length; i++)
        {
            BigStruct value = variable[i]; // copy the value 1st time

            value.Int1 = 1;
            value.Int2 = 2;
            value.Int3 = 3;
            value.Int4 = 4;
            value.Int5 = 5;

            variable[i] = value; // copy the value 2nd time
        }
    }

    [Benchmark(Baseline = true)]
    public void ByReference()
    {
        BigStruct[] variable = array;
        for (int i = 0; i < variable.Length; i++)
        {
            ref BigStruct reference = ref variable[i]; // create local alias to array storage

            reference.Int1 = 1;
            reference.Int2 = 2;
            reference.Int3 = 3;
            reference.Int4 = 4;
            reference.Int5 = 5;
        }
    }
}

Note: This scenario could have been handled without ref locals. We could simply pass the argument by reference like this:

public void ByReferenceOldWay()
{
    for (int i = 0; i < array.Length; i++)
    {
        Init(ref array[i]);
    }
}

// try it with: [MethodImpl(MethodImplOptions.NoInlining)]
private void Init(ref BigStruct reference)
{
    reference.Int1 = 1;
    reference.Int2 = 2;
    reference.Int3 = 3;
    reference.Int4 = 4;
    reference.Int5 = 5;
}

Results

This time I have used RPlotExporter which produces some fancy charts if you have R installed and R_HOME environment variable configured. More info can be obtained here.

As you can see the difference is HUGE. In this simple scenario, we got a 4.51x performance improvement for RyuJit, 5.58x for LegacyJitX64 and 5.05x for LegacyJitX86. It’s clear that this feature can be very useful in similar scenarios.

But how is it possible that the new ref locals feature works with the legacy JITs? Has Microsoft released a Windows patch with .NET Framework improvements? No! C# features like ref parameters, locals and returns just use an existing CLR feature called managed pointers. So to use them you just need an IDE with Roslyn 2.0 (Visual Studio 2017 or Rider), and you can deploy the code to your client’s old virtual machine which might be running some very old .NET Framework ;)

Returning references

Returning results by reference can bring us performance gains similar to passing arguments by reference. This is why I am not going to run any benchmarks here. The very important thing is that they can help us to make things like Span come true. I’ll describe Span in my next blog post. Stay tuned!
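
To give an idea of what returning by reference looks like in practice, here is a minimal sketch of a ref-returning indexer. BigStructBuffer is a hypothetical wrapper, and BigStruct is redeclared here only to keep the example self-contained:

```csharp
public struct BigStruct
{
    public int Int1, Int2, Int3, Int4, Int5;
}

public sealed class BigStructBuffer
{
    private readonly BigStruct[] items;

    public BigStructBuffer(int length) => items = new BigStruct[length];

    // ref return: hands out an alias to the array element, nothing is copied
    public ref BigStruct this[int index] => ref items[index];

    // regular return: the caller gets its own copy of the whole struct
    public BigStruct GetCopy(int index) => items[index];
}
```

With the ref-returning indexer, buffer[1].Int3 = 7 modifies the element in place, while changing the result of GetCopy(1) affects only the local copy.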

Safety

As you most probably know, C# allows us to use unsafe C++-like pointers. Unsafe code cannot pass IL verification (PEVerify), which is one of the CLR mechanisms that ensure type and memory safety. This is why code that uses unsafe requires FullTrust to be executed. It might not be an option in some environments. I remember that a long time ago the default settings in Azure did not allow unsafe code to run. This is why a lot of common high-performance .NET libraries are not using unsafe. mscorlib.dll is using unsafe, but it is exceptionally not verified during runtime ;)

Let’s compare the performance of safe vs unsafe in the scenario described previously.

Benchmarks

[Benchmark]
public unsafe void ByReferenceUnsafe()
{
    BigStruct[] variable = array;
    fixed (BigStruct* pinned = variable)
    {
        for (int i = 0; i < variable.Length; i++)
        {
            BigStruct* pointer = &pinned[i];
            (*pointer).Int1 = 1;
            (*pointer).Int2 = 2;
            (*pointer).Int3 = 3;
            (*pointer).Int4 = 4;
            (*pointer).Int5 = 5;
        }
    }
}

Results

Method	Jit	Platform	Mean	Scaled
ByReference	LegacyJit	X64	1.649 us	1.00
ByReferenceUnsafe	LegacyJit	X64	1.721 us	1.04
ByReference	LegacyJit	X86	1.666 us	1.00
ByReferenceUnsafe	LegacyJit	X86	1.673 us	1.00
ByReference	RyuJit	X64	1.684 us	1.00
ByReferenceUnsafe	RyuJit	X64	1.709 us	1.02

To our surprise, the safe way is not slower than unsafe, and in two of the three configurations it is slightly faster! Why is that?

When we are using safe references, we don’t need to pin objects in memory. The GC understands managed pointers and knows how to update them when it’s compacting the memory. With unsafe this is not true: the managed memory needs to be pinned before it can be used. As mentioned by Victor Baybekov on Twitter, using the fixed keyword prevents inlining.

Note: This micro-benchmark is not including the side effects of pinning memory. If you pin many managed arrays in memory, then the GC has a lot of extra work to do when it’s compacting the memory. Once again I will redirect you to Pro .NET Performance book by Sasha Goldshtein, Dima Zurbalev, Ido Flatow which has a whole chapter dedicated to Garbage Collection in .NET.

Managed pointers arithmetic

C# 7.0 is not exposing managed pointers arithmetic, which is a part of the IL language. But the System.Runtime.CompilerServices.Unsafe class does. You can use it whenever you want to compare the references, move them by given offset etc.

You can find it in the System.Runtime.CompilerServices.Unsafe NuGet package. It targets .NET Standard 1.0, so you can use it in both .NET 4.5+ and .NET Core 1.0 apps, not to mention other frameworks that implement the standard. The API is as follows:

namespace System.Runtime.CompilerServices
{
    public static partial class Unsafe
    {
        public static ref T AddByteOffset<T>(ref T source, System.IntPtr byteOffset) 
        public static ref T Add<T>(ref T source, int elementOffset)
        public static ref T Add<T>(ref T source, System.IntPtr elementOffset) 
        public static bool AreSame<T>(ref T left, ref T right)
        public unsafe static void* AsPointer<T>(ref T value)
        public unsafe static ref T AsRef<T>(void* source)
        public static T As<T>(object o) where T : class
        public static ref TTo As<TFrom, TTo>(ref TFrom source)
        public static System.IntPtr ByteOffset<T>(ref T origin, ref T target) 
        public static int SizeOf<T>()
        public static ref T SubtractByteOffset<T>(ref T source, System.IntPtr byteOffset) 
        public static ref T Subtract<T>(ref T source, int elementOffset)
        public static ref T Subtract<T>(ref T source, System.IntPtr elementOffset) 
    }
}

Note: this is not the full API. Some methods were removed for brevity. You can find the single .il file that contains the implementation in the corefx repo.
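
A minimal sketch of how two of these methods can be used. The ManagedPointerArithmetic class is a made-up name, and the code assumes the System.Runtime.CompilerServices.Unsafe package (or a runtime that ships the Unsafe class) is available. Keep in mind that Unsafe.Add performs no bounds checks, so the caller is responsible for staying inside the array:

```csharp
using System.Runtime.CompilerServices;

public static class ManagedPointerArithmetic
{
    // Moves a managed pointer two elements forward and reads the value.
    // No bounds check is performed: values must have at least 3 elements.
    public static int ReadThird(int[] values)
    {
        ref int first = ref values[0];
        ref int third = ref Unsafe.Add(ref first, 2);
        return third;
    }

    // Compares two managed pointers obtained in different ways.
    // values must have at least 2 elements.
    public static bool PointToSameElement(int[] values)
    {
        ref int viaIndexer = ref values[1];
        ref int viaOffset = ref Unsafe.Add(ref values[0], 1);
        return Unsafe.AreSame(ref viaIndexer, ref viaOffset);
    }
}
```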

Current limitations

C# 7.0, which is the current version of the C# language as of today (3rd of July 2017), does not allow us to:

use readonly references (you can not dereference a readonly field)
treat this as readonly references for readonly structs
define by-ref fields
define by-ref extension methods. It’s possible today with Visual Basic!
use conditional operator with refs (condition ? ref left : ref right)

Hopefully all these limitations are going to be addressed by C# 7.2. You can follow the issues today on GitHub:

Don’t forget to support the new C# 7.2 features! Thumbs up!

Summary

C# features like ref parameters, locals and returns can help us to avoid copying the memory.
They are using an existing CLR feature called “managed pointers”. You need a modern IDE to use them, but they will run anywhere.
Using them is safe and can be faster than using unsafe pointers.
We need to measure and prove that using them is beneficial before we increase the complexity of our code.
There are many limitations as of today. Some of them will be addressed by C# 7.2. Some are already solved by the System.Runtime.CompilerServices.Unsafe api.

Sources

ref returns are not pointers blog post by Vladimir Sadov
Managed pointers blog post by Vladimir Sadov
Local variables cannot be returned by reference blog post by Vladimir Sadov
Safe to return rules for ref returns. blog post by Vladimir Sadov
Why ref locals allow only a single binding? blog post by Vladimir Sadov
Spans and ref part 1 : ref blog post by Marc Gravell
What are the implications of using unsafe code? Stack Overflow answer by Jared Par

Value Types vs Reference Types

2017-06-26T00:00:00+00:00

tl;dr structs have better data locality. Value types add much less pressure for the GC than reference types. But big value types are expensive to copy and you can accidentally box them which is bad.

Introduction

The .NET framework implements Reference Types and Value Types. C# allows us to define custom value types by using struct and enum keywords. class, delegate and interface are for reference types. Primitive types, like byte, char, short, int and long are value types, but developers can’t define custom primitive types. In Java primitive types are also value types, but Java does not expose a possibility to define custom value types for developers ;)

Value Types and Reference Types are very different in terms of performance characteristics. In my next blog posts, I am going to describe ref returns and locals, ValueTask and Span. But I need to clarify this matter first, so the readers can understand the benefits.

Note: To keep my comparison simple I am going to use ValueTuple and Tuple as the examples.

Memory Layout

Every instance of a reference type has two extra fields that are used internally by the CLR.

ObjectHeader is a bitmask, which is used by CLR to store some additional information. For example: if you take a lock on a given object instance, this information is stored in ObjectHeader.
MethodTable is a pointer to the Method Table, which is a set of metadata about given type. If you call a virtual method, then CLR jumps to the Method Table and obtains the address of the actual implementation and performs the actual call.

Each hidden field has the size of a pointer. So on a 32-bit architecture we have 8 bytes of overhead, and on a 64-bit architecture 16 bytes.

Value Types don’t have any additional overhead members. What you see is what you get. This is why they are more limited in terms of features. You cannot derive from a struct, lock on it or write a finalizer for it.

RAM is very cheap. So, what’s all the fuss about?

CPU Cache

CPU implements numerous performance optimizations. One of them is cache, which is just a memory with the most recently used data.

Note: Multithreading affects CPU cache performance. In order to make it easier to understand, the following description assumes single core.

Whenever you try to read a value, the CPU checks the first level of cache (L1). If it’s a hit, the value is returned. Otherwise, it checks the second level of cache (L2). If the value is there, it’s copied to L1 and returned. Otherwise, it checks L3 (if it’s present).

If the data is not in the cache, the CPU goes to main memory and copies it to the cache. This is called a cache miss.

Latency Numbers Every Programmer Should Know

According to Latency Numbers Every Programmer Should Know going to main memory is really expensive when compared to referencing cache.

Operation	Time
L1 cache reference	1 ns
L2 cache reference	4 ns
Main memory reference	100 ns

So how can we reduce the ratio of cache misses?

Data Locality

CPU is smart, it’s aware of the following data locality principles:

Spatial

If a particular storage location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future.

Temporal

If at one point a particular memory location is referenced, then it is likely that the same location will be referenced again in the near future.

CPU is taking advantage of this knowledge. Whenever CPU copies a value from main memory to cache, it is copying whole cache line, not just the value. A cache line is usually 64 bytes. So it is well prepared in case you ask for the nearby memory location.

The .NET Story

How do the two extra fields in every reference type instance affect data locality? Let’s take a look at the following diagram, which shows how many instances of ValueTuple and Tuple can fit into a single cache line on a 64-bit architecture.

For this simple example, the difference is really huge. In our case, we could fit 8 instances of the value type, but only 2.66 instances of the reference type.
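
The numbers above can be reproduced with a bit of arithmetic. The sketch below follows the same simplified model (two ints of payload, two hidden pointer-sized fields per reference type instance, a 64-byte cache line). CacheLineMath is a hypothetical helper, and the 64-byte line size is an assumption that holds for typical x64 CPUs:

```csharp
using System;
using System.Runtime.CompilerServices;

public static class CacheLineMath
{
    public const int CacheLineSize = 64; // assumption: typical x64 cache line size in bytes

    // ValueTuple<int, int> has no hidden fields, so it is just two ints.
    public static int ValueTupleSize() => Unsafe.SizeOf<ValueTuple<int, int>>();

    // Simplified model of a Tuple<int, int> instance on the heap:
    // ObjectHeader + MethodTable pointer + two ints.
    public static int TupleInstanceSize(int pointerSize) => 2 * pointerSize + 2 * sizeof(int);

    public static double InstancesPerCacheLine(int instanceSize)
        => (double)CacheLineSize / instanceSize;
}
```

For a 64-bit process this gives 64 / 8 = 8 value type instances and 64 / 24, so less than 3, reference type instances per cache line.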

Benchmarks!

It’s important to know the theory, but we need to run some benchmarks to measure the performance difference. Once again I am using BenchmarkDotNet and its feature called HardwareCounters, which allows me to track CPU cache misses. Here you can find my blog post about Collecting Hardware Performance Counters with BenchmarkDotNet. The benchmark is a simple loop with a read access in every iteration. I would say that it’s just a CPU cache benchmark.

Note: This benchmark is not a real life scenario. In real life, your struct would most probably be bigger (usually two fields is not enough). Hence the extra overhead of two fields for reference types would have a smaller performance impact. Smaller but still significant in high-performance scenarios!

class Program
{
    static void Main(string[] args) => BenchmarkRunner.Run<DataLocality>();
}

[HardwareCounters(HardwareCounter.CacheMisses)]
[RyuJitX64Job, LegacyJitX86Job]
public class DataLocality
{
    [Params(
        100,
        1000000,
        10000000,
        100000000)]
    public int Count { get; set; } // for smaller arrays we don't get enough of Cache Miss events

    Tuple<int, int>[] arrayOfRef;
    ValueTuple<int, int>[] arrayOfVal;

    [GlobalSetup]
    public void Setup()
    {
        arrayOfRef = Enumerable.Repeat(1, Count).Select((val, index) => Tuple.Create(val, index)).ToArray();
        arrayOfVal = Enumerable.Repeat(1, Count).Select((val, index) => new ValueTuple<int, int>(val, index)).ToArray();
    }

    [Benchmark(Baseline = true)]
    public int IterateValueTypes()
    {
        int item1Sum = 0, item2Sum = 0;

        var array = arrayOfVal;
        for (int i = 0; i < array.Length; i++)
        {
            ref ValueTuple<int, int> reference = ref array[i];
            item1Sum += reference.Item1;
            item2Sum += reference.Item2;
        }

        return item1Sum + item2Sum;
    }

    [Benchmark]
    public int IterateReferenceTypes()
    {
        int item1Sum = 0, item2Sum = 0;

        var array = arrayOfRef;
        for (int i = 0; i < array.Length; i++)
        {
            ref Tuple<int, int> reference = ref array[i];
            item1Sum += reference.Item1;
            item2Sum += reference.Item2;
        }

        return item1Sum + item2Sum;
    }
}

The Results

BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Runtime=Clr  

Method	Jit	Platform	Count	Mean	Scaled
IterateValueTypes	LegacyJit	X86	100	68.96 ns	1.00
IterateReferenceTypes	LegacyJit	X86	100	317.49 ns	4.60
IterateValueTypes	RyuJit	X64	100	76.56 ns	1.00
IterateReferenceTypes	RyuJit	X64	100	252.23 ns	3.29

As you can see the difference (Scaled column) is really significant!

But the CacheMisses/Op column is empty?!? What does it mean? In this case, it means that I ran too few loop iterations (just 100).

An explanation for the curious: BenchmarkDotNet is using ETW to collect hardware counters. ETW simply exposes what the hardware has to offer. Each Performance Monitoring Unit (PMU) register is configured to count a specific event and is given a sample-after value (SAV). For my PC the minimum Cache Miss HC sampling interval is 4000. In the value type benchmark I should get a cache miss once every 8 loop iterations (cacheLineSize / sizeOf(ValueTuple) = 64 / 8 = 8). I have 100 iterations here, so it should be about 12 cache misses per benchmark invocation (100 / 8). But the PMU will notify ETW, which will notify BenchmarkDotNet, only once every 4 000 events, so once every 333 (4 000 / 12) benchmark invocations. BenchmarkDotNet implements a heuristic which decides how many times the benchmarked method should be invoked. In this example the method was executed too few times to capture enough events. So if you want to capture some hardware counters with BenchmarkDotNet you need to perform plenty of iterations! For more info about PMU you can refer to this article by Jackson Marusarz (Intel).
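
The back-of-the-envelope calculation from the explanation above, written down as code. SamplingMath is a hypothetical helper, and integer division is used on purpose to match the rounded numbers in the text:

```csharp
public static class SamplingMath
{
    // One cache miss per cache line: count / (cacheLineSize / elementSize).
    public static int ExpectedMissesPerInvocation(int count, int cacheLineSize, int elementSize)
    {
        int elementsPerCacheLine = cacheLineSize / elementSize;
        return count / elementsPerCacheLine;
    }

    // How many benchmark invocations are needed before the PMU reports a single sample.
    public static int InvocationsPerSample(int samplingInterval, int missesPerInvocation)
        => samplingInterval / missesPerInvocation;
}
```

For Count = 100, a 64-byte cache line and an 8-byte ValueTuple this gives 12 misses per invocation, and with a sampling interval of 4000 one sample every 333 invocations.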

Method	Jit	Platform	Count	Mean	Scaled	CacheMisses/Op
IterateValueTypes	RyuJit	X64	100 000 000	88,735,182.11 ns	1.00	3545088
IterateReferenceTypes	RyuJit	X64	100 000 000	280,721,189.70 ns	3.16	8456940

The more loop iterations (Count column), the more Cache Misses events we get. For the iteration of reference types cache misses were 2.38 times more common (8456940 / 3545088).

Note: The accuracy of the Hardware Counters diagnoser in BenchmarkDotNet is limited by the sampling frequency and by the additional code executed in the benchmarked process by our Engine. It’s good but not perfect. For more accurate results you should use a dedicated profiler like Intel VTune Amplifier.

GC Impact

Reference Types are always allocated on the managed heap (it may change in the future). The heap is managed by the Garbage Collector (GC). The allocation of heap memory is fast. The problem is that the deallocation is performed by the non-deterministic GC. The GC implements its own heuristic which allows it to decide when to perform the cleanup. The cleanup itself takes some time. It means that you cannot predict when the cleanup will take place, and it adds extra overhead.

Value Types can be allocated both on the stack and the heap. The stack is not managed by the GC. Anytime you declare a local value type variable, it’s allocated on the stack. When the method ends, the stack is unwound and the value is gone. This deallocation is super fast. And overall we put less pressure on the GC! The pressure is not equal to zero because the GC traverses stacks anyway, so the deeper the stack, the more work it might have.

But Value Types can also be allocated on the managed heap. If you allocate an array of bytes, then the array is allocated on the managed heap. Its content is transparent to the GC: the elements are not reference type instances, so the GC does not track them in any way. But when a small array of value types gets promoted to an older GC generation, its content is copied by the GC.

Benchmarks

Let’s run a benchmark that includes the cost of allocation and deallocation for Value Types and Reference Types.

[Config(typeof(AllocationsConfig))]
public class NoGC
{
    [Benchmark(Baseline = true)]
    public ValueTuple<int, int> CreateValueTuple() => ValueTuple.Create(0, 0);

    [Benchmark]
    public Tuple<int, int> CreateTuple() => Tuple.Create(0, 0);
}

public class AllocationsConfig : ManualConfig
{
    public AllocationsConfig()
    {
        var gcSettings = new GcMode
        {
            Force = false // tell BenchmarkDotNet not to force GC collections after every iteration
        };

        const int invocationCount = 1 << 20; // let's run it very fast, we are here only for the GC stats

        Add(Job
            .RyuJitX64 // 64 bit
            .WithInvocationCount(invocationCount)
            .With(gcSettings.UnfreezeCopy()));
        Add(Job
            .LegacyJitX86 // 32 bit
            .WithInvocationCount(invocationCount)
            .With(gcSettings.UnfreezeCopy()));

        Add(MemoryDiagnoser.Default);
    }
}

The Results

If you are not familiar with the output produced by BenchmarkDotNet with Memory Diagnoser enabled, you can read my dedicated blog post to find out how to read these results.

BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  Job-QZDRYZ : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  Job-XFJRTH : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Runtime=Clr  Force=False  InvocationCount=1048576  

Method	Jit	Platform	Gen 0	Allocated
CreateValueTuple	LegacyJit	X86	-	0 B
CreateTuple	LegacyJit	X86	0.0050	16 B
CreateValueTuple	RyuJit	X64	-	0 B
CreateTuple	RyuJit	X64	0.0076	24 B

As you can see, creating Value Types means No GC (- in Gen 0 column).

Note: If a value type contains references, the JIT will emit write barriers for writes to the reference fields. So No GC is not 100% true for value types that contain references.
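If you want to double-check the Allocated column without running BenchmarkDotNet, a rough sketch like the one below can help. It assumes a runtime that exposes GC.GetAllocatedBytesForCurrentThread (modern .NET, or .NET Framework 4.8). The class, the method names and the static fields are hypothetical, and the exact byte counts depend on the runtime and the bitness, so I am not quoting any output here.

```csharp
using System;

public static class AllocationDemo
{
    // Results are stored in static fields so the created instances
    // escape the method and the JIT cannot optimize them away.
    private static ValueTuple<int, int> lastValueTuple;
    private static Tuple<int, int> lastTuple;

    // Bytes allocated on the managed heap by the current thread
    // while creating `count` value tuples.
    public static long MeasureValueTuples(int count)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < count; i++)
        {
            lastValueTuple = ValueTuple.Create(i, i);
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    // The same measurement for the reference type counterpart.
    public static long MeasureTuples(int count)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < count; i++)
        {
            lastTuple = Tuple.Create(i, i);
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    public static void Main()
    {
        Console.WriteLine($"ValueTuple: {MeasureValueTuples(1000)} bytes");
        Console.WriteLine($"Tuple:      {MeasureTuples(1000)} bytes");
    }
}
```

It is much less precise than MemoryDiagnoser, but the Tuple loop should report clearly more allocated bytes than the ValueTuple one.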

Boxing

Whenever a reference is required, value types are boxed. When the CLR boxes a value type, it wraps the value inside a System.Object and stores it on the managed heap. The GC tracks references to boxed Value Types! This is something you definitely want to avoid.

Obvious boxing example:

string CallToString(object input) => input.ToString();

int value = 123;
var text = CallToString(value);

CallToString accepts object. CLR needs to box the value before passing it to this method. It’s clear when you analyse the IL code:

Note: You can use ReSharper’s Heap Allocation Viewer plugin to detect boxing in your code.
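Boxing can also be observed from plain C#, without looking at the IL. The sketch below (all names are made up for this example) shows two consequences of what was described above: every boxing operation creates a new object on the heap, and the box holds a copy of the value.

```csharp
using System;

public static class BoxingDemo
{
    // Every conversion of a value type to object creates a new box.
    public static bool BoxesAreDistinctObjects(int value)
    {
        object first = value;  // boxing
        object second = value; // boxing again, a second object
        return !ReferenceEquals(first, second);
    }

    // The box contains a copy, so changing the variable later does not affect it.
    public static int BoxKeepsCopy()
    {
        int value = 123;
        object boxed = value; // boxing copies 123 to the heap
        value = 456;          // modifies only the local variable
        return (int)boxed;    // unboxing
    }

    public static void Main()
    {
        Console.WriteLine(BoxesAreDistinctObjects(123)); // True
        Console.WriteLine(BoxKeepsCopy());               // 123
    }
}
```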

Invoking interface methods with Value Types

The previous example was obvious. But what happens when we try to pass a struct to a method that accepts interface instance? Let’s take a look.

[MemoryDiagnoser]
[RyuJitX64Job, LegacyJitX86Job]
public class ValueTypeInvokingInterfaceMethod
{
    interface IInterface
    {
        void DoNothing();
    }

    class ReferenceTypeImplementingInterface : IInterface
    {
        public void DoNothing() { }
    }

    struct ValueTypeImplementingInterface : IInterface
    {
        public void DoNothing() { }
    }

    private ReferenceTypeImplementingInterface reference = new ReferenceTypeImplementingInterface();
    private ValueTypeImplementingInterface value = new ValueTypeImplementingInterface();

    [Benchmark(Baseline = true)]
    public void ValueType() => AcceptingInterface(value);

    [Benchmark]
    public void ReferenceType() => AcceptingInterface(reference);

    void AcceptingInterface(IInterface instance) => instance.DoNothing();
}

BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Runtime=Clr  

Method	Jit	Platform	Mean	Scaled	Gen 0	Allocated
ValueType	LegacyJit	X86	5.738 ns	1.00	0.0038	12 B
ReferenceType	LegacyJit	X86	1.910 ns	0.33	-	0 B
ValueType	RyuJit	X64	5.754 ns	1.00	0.0076	24 B
ReferenceType	RyuJit	X64	1.845 ns	0.32	-	0 B

Once again we got into boxing. Did you expect it?!

How to avoid boxing with value types that implement interfaces?

We need to use generic constraints. The method should not accept IInterface but T which implements IInterface.

void Trick<T>(T instance)
    where T : IInterface
{
    instance.DoNothing();
}

Benchmarks

[MemoryDiagnoser]
[RyuJitX64Job]
public class ValueTypeInvokingInterfaceMethodSmart
{
    // IInterface, ReferenceTypeImplementingInterface, ValueTypeImplementingInterface and fields are declared in previous benchmark

    [Benchmark(Baseline = true, OperationsPerInvoke = 16)]
    public void ValueType()
    {
        AcceptingInterface(value); AcceptingInterface(value); AcceptingInterface(value); AcceptingInterface(value);
        AcceptingInterface(value); AcceptingInterface(value); AcceptingInterface(value); AcceptingInterface(value);
        AcceptingInterface(value); AcceptingInterface(value); AcceptingInterface(value); AcceptingInterface(value);
        AcceptingInterface(value); AcceptingInterface(value); AcceptingInterface(value); AcceptingInterface(value);
    }

    [Benchmark(OperationsPerInvoke = 16)]
    public void ValueTypeSmart()
    {
        AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value);
        AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value);
        AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value);
        AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value); AcceptingSomethingThatImplementsInterface(value);
    }

    [Benchmark(OperationsPerInvoke = 16)]
    public void ReferenceType()
    {
        AcceptingInterface(reference); AcceptingInterface(reference); AcceptingInterface(reference); AcceptingInterface(reference);
        AcceptingInterface(reference); AcceptingInterface(reference); AcceptingInterface(reference); AcceptingInterface(reference);
        AcceptingInterface(reference); AcceptingInterface(reference); AcceptingInterface(reference); AcceptingInterface(reference);
        AcceptingInterface(reference); AcceptingInterface(reference); AcceptingInterface(reference); AcceptingInterface(reference);
    } 

    void AcceptingInterface(IInterface instance) => instance.DoNothing();

    void AcceptingSomethingThatImplementsInterface<T>(T instance)
        where T : IInterface
    {
        instance.DoNothing();
    }
}

Note: I have used the OperationsPerInvoke feature of BenchmarkDotNet, which is very useful for nano-benchmarks.

BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]    : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Job=RyuJitX64  Jit=RyuJit  Platform=X64  

Method	Mean	Error	StdDev	Scaled	Gen 0	Allocated
ValueType	5.572 ns	0.0322 ns	0.0252 ns	1.00	0.0076	24 B
ValueTypeSmart	1.145 ns	0.0101 ns	0.0094 ns	0.21	-	0 B
ReferenceType	2.212 ns	0.0096 ns	0.0081 ns	0.40	-	0 B

By applying this simple trick we were able to not only avoid boxing but also outperform reference type interface method invocation! It was possible due to the optimization performed by the JIT. I am going to call it method de-virtualization because I don’t have a better name for it. How does it work? Let’s consider the following example:

Note: Previous version of this blog post had a bug, which was spotted by Fons Sonnemans. There is no need for extra struct constraint to avoid boxing. Thank you Fons!

public void Method<T>(T instance)
        where T : IDisposable
{
        instance.Dispose();
}

When T is constrained with where T : INameOfTheInterface, the C# compiler emits an additional IL instruction called constrained (Docs).

 .method public hidebysig 
    instance void Method<([mscorlib]System.IDisposable) T> (
        !!T 'instance'
    ) cil managed 
{
    .maxstack 8

    IL_0000: ldarga.s 'instance'
    IL_0002: constrained. !!T
    IL_0008: callvirt instance void [mscorlib]System.IDisposable::Dispose()
    IL_000d: ret
} // end of method C::Method
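The difference between the boxing call and the constrained call can be made visible with a mutable struct. This is only an illustrative sketch: ICounter, Counter and the helper methods are hypothetical names, and the generic method takes its argument by ref just to let us observe the mutation in the caller.

```csharp
using System;

public interface ICounter
{
    void Increment();
}

public struct Counter : ICounter
{
    public int Count;
    public void Increment() => Count++;
}

public static class ConstrainedCallDemo
{
    // The struct is boxed at the call site, so Increment mutates the box,
    // not the caller's variable.
    public static void ViaInterface(ICounter counter) => counter.Increment();

    // Constrained call: no box is created, Increment runs on the caller's struct.
    public static void ViaGeneric<T>(ref T counter) where T : ICounter => counter.Increment();

    public static int CountAfterInterfaceCall()
    {
        var counter = new Counter();
        ViaInterface(counter); // boxing happens here
        return counter.Count;
    }

    public static int CountAfterGenericCall()
    {
        var counter = new Counter();
        ViaGeneric(ref counter); // no boxing
        return counter.Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountAfterInterfaceCall()); // 0
        Console.WriteLine(CountAfterGenericCall());   // 1
    }
}
```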

If the method is not generic, there is no constraint and the instance can be anything: a value or a reference type. In case it’s a value type, the JIT performs boxing. When the method is generic, the JIT compiles a separate version of it for every value type, which prevents boxing! How does it work?

The JIT handles value types in a different way than reference types. Operations like passing to a method or returning from it are the same for all reference types. We always deal with pointers, which have the same size for all reference types. So the JIT reuses the compiled generic code for reference types because it can treat them in the same way. Imagine an array of objects or strings. From the JIT’s perspective, it is just an array of pointers. So the array’s indexer implementation will be the same for all reference types.

Value Types are different. Each of them can have a different size. For example, passing an integer and passing a custom struct with two integer fields to a method have different native implementations. In one case we push a single int to the stack; in the other, we might need to move two fields to the registers and then push them to the stack. So it’s different for every value type.

This is why the JIT compiles every generic method/type separately for each value type argument.

Method<object>(); // JIT compiled code is common for all reference types
Method<string>(); // JIT compiled code is common for all reference types
Method<int>(); // dedicated version for int
Method<long>(); // dedicated version for long
Method<DateTime>(); // dedicated version for DateTime
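The size argument is easy to verify. The sketch below uses Unsafe.SizeOf<T>() from System.Runtime.CompilerServices (on .NET Framework this assumes the System.Runtime.CompilerServices.Unsafe package is referenced). TwoInts and SizeDemo are hypothetical names used only for this example.

```csharp
using System;
using System.Runtime.CompilerServices;

public struct TwoInts
{
    public int X, Y;
}

public static class SizeDemo
{
    // For value types: the size of the value itself.
    // For reference types: the size of the reference (pointer).
    public static int SizeOf<T>() => Unsafe.SizeOf<T>();

    public static void Main()
    {
        Console.WriteLine(SizeOf<int>());     // 4
        Console.WriteLine(SizeOf<long>());    // 8
        Console.WriteLine(SizeOf<TwoInts>()); // 8

        // Every reference is pointer-sized, no matter how big the object is.
        Console.WriteLine(SizeOf<object>() == IntPtr.Size); // True
        Console.WriteLine(SizeOf<string>() == IntPtr.Size); // True
    }
}
```

All reference type arguments look the same to the generated code, while every value type can need a different amount of space.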

It might lead to generic Code Bloat. But the great thing is that at this point in time the JIT can compile tailored code per type. And since the type is known, it can replace the virtual call with a direct call. As Victor Baybekov mentioned in the comments, it can even remove the unnecessary null check for the call. It’s a value type, so it cannot be null. Inlining is also possible. For small methods which are executed very often, like .Equals() in a custom Dictionary implementation, it can be a very big performance gain.

We can see the effect of inlining if we run the same benchmarks for .NET 4.7, where RyuJit got improved and inlines all calls to AcceptingSomethingThatImplementsInterface.

BenchmarkDotNet=v0.10.9.313-nightly, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338348 Hz, Resolution=427.6523 ns, Timer=TSC
  [Host]       : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2106.0
  LegacyJitX64 : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit LegacyJIT/clrjit-v4.7.2106.0;compatjit-v4.7.2106.0
  LegacyJitX86 : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2106.0
  RyuJitX64    : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2106.0

Method	Job	Jit	Platform	Mean	Error	StdDev
ValueTypeSmart	LegacyJitX64	LegacyJit	X64	1.2906 ns	0.0217 ns	0.0182 ns
ValueTypeSmart	LegacyJitX86	LegacyJit	X86	0.3367 ns	0.0064 ns	0.0060 ns
ValueTypeSmart	RyuJitX64	RyuJit	X64	0.0004 ns	0.0006 ns	0.0005 ns

Note: If you would like to play with generated IL code you can use the awesome SharpLab.

Copying

In C# by default Value Types are passed to methods by value. It means that the Value Type instance is copied every time we pass it to a method. Or when we return it from a method. The bigger the Value Type is, the more expensive it is to copy it. For small value types, the JIT compiler might optimize the copying (inline the method, use registers for copying & more).

[RyuJitX64Job, LegacyJitX86Job]
public class CopyingValueTypes
{
    class ReferenceType1Field { int X; }
    class ReferenceType2Fields { int X, Y; }
    class ReferenceType3Fields { int X, Y, Z; }

    struct ValueType1Field { int X; }
    struct ValueType2Fields { int X, Y; }
    struct ValueType3Fields { int X, Y, Z; }

    ReferenceType1Field fieldReferenceType1Field = new ReferenceType1Field();
    ReferenceType2Fields fieldReferenceType2Fields = new ReferenceType2Fields();
    ReferenceType3Fields fieldReferenceType3Fields = new ReferenceType3Fields();

    ValueType1Field fieldValueType1Field = new ValueType1Field();
    ValueType2Fields fieldValueType2Fields = new ValueType2Fields();
    ValueType3Fields fieldValueType3Fields = new ValueType3Fields();

    [MethodImpl(MethodImplOptions.NoInlining)] ReferenceType1Field Return(ReferenceType1Field instance) => instance;
    [MethodImpl(MethodImplOptions.NoInlining)] ReferenceType2Fields Return(ReferenceType2Fields instance) => instance;
    [MethodImpl(MethodImplOptions.NoInlining)] ReferenceType3Fields Return(ReferenceType3Fields instance) => instance;

    [MethodImpl(MethodImplOptions.NoInlining)] ValueType1Field Return(ValueType1Field instance) => instance;
    [MethodImpl(MethodImplOptions.NoInlining)] ValueType2Fields Return(ValueType2Fields instance) => instance;
    [MethodImpl(MethodImplOptions.NoInlining)] ValueType3Fields Return(ValueType3Fields instance) => instance;

    [Benchmark(OperationsPerInvoke = 16)]
    public void TestReferenceType1Field()
    {
        var instance = fieldReferenceType1Field;
        instance = Return(instance); instance = Return(instance); instance = Return(instance); instance = Return(instance);
        instance = Return(instance); instance = Return(instance); instance = Return(instance); instance = Return(instance);
        instance = Return(instance); instance = Return(instance); instance = Return(instance); instance = Return(instance);
        instance = Return(instance); instance = Return(instance); instance = Return(instance); instance = Return(instance);
    }

    // removed
}

The rest of the code was removed for brevity. You can find full code here.

BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Runtime=Clr  

Method	Jit	Platform	Mean
TestReferenceType1Field	LegacyJit	X86	1.399 ns
TestReferenceType2Fields	LegacyJit	X86	1.392 ns
TestReferenceType3Fields	LegacyJit	X86	1.388 ns
TestReferenceType1Field	RyuJit	X64	1.737 ns
TestReferenceType2Fields	RyuJit	X64	1.770 ns
TestReferenceType3Fields	RyuJit	X64	1.711 ns

Passing and returning Reference Types is size-independent. Only a copy of the pointer is passed. And a pointer can always fit into a CPU register.

Method	Jit	Platform	Mean
TestValueType1Field	LegacyJit	X86	1.410 ns
TestValueType2Fields	LegacyJit	X86	6.859 ns
TestValueType3Fields	LegacyJit	X86	6.837 ns
TestValueType1Field	RyuJit	X64	1.465 ns
TestValueType2Fields	RyuJit	X64	8.403 ns
TestValueType3Fields	RyuJit	X64	2.627 ns

The bigger the Value Type is, the more expensive copying is. Have you noticed that TestValueType3Fields was faster than TestValueType2Fields for RyuJit? To answer the question why, we would need to analyse the generated native assembly code.

How can we avoid copying big Value Types? We should pass and return them by Reference! I am going to leave it here, and continue with my ref returns and locals blog post next week.
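Just to give a taste of it, here is a minimal sketch of passing and returning by reference with the C# 7.0 syntax. BigValueType, ByReferenceDemo and the storage array are hypothetical names made up for this example. It only shows the semantics (who works on a copy and who works on the original), not the performance.

```csharp
using System;

public struct BigValueType
{
    public long A, B, C, D; // 32 bytes to copy on every pass by value
}

public static class ByReferenceDemo
{
    private static readonly BigValueType[] storage = new BigValueType[1];

    // By value: the method gets its own copy, the caller's instance is untouched.
    public static void IncrementCopy(BigValueType instance) => instance.A++;

    // By reference: only the address is passed, the struct is not copied.
    public static void IncrementInPlace(ref BigValueType instance) => instance.A++;

    // ref return: hands out a reference to the array element instead of a copy.
    public static ref BigValueType GetFirst() => ref storage[0];

    public static long AfterIncrementCopy()
    {
        var value = new BigValueType();
        IncrementCopy(value);
        return value.A;
    }

    public static long AfterIncrementInPlace()
    {
        var value = new BigValueType();
        IncrementInPlace(ref value);
        return value.A;
    }

    public static long AfterRefReturn()
    {
        storage[0] = default(BigValueType);
        ref BigValueType first = ref GetFirst(); // ref local, no copy
        first.A = 42;                            // writes directly to storage[0]
        return storage[0].A;
    }

    public static void Main()
    {
        Console.WriteLine(AfterIncrementCopy());    // 0
        Console.WriteLine(AfterIncrementInPlace()); // 1
        Console.WriteLine(AfterRefReturn());        // 42
    }
}
```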

Summary

Every instance of a reference type has two extra fields used internally by CLR.
Value Types have no hidden overhead, so they have better data locality.
Reference Types are managed by GC. It tracks the references, offers fast allocation and expensive, non-deterministic deallocation.
Value Types are not managed by the GC. Value Types = No GC. And No GC is better than any GC!
Whenever a reference is required, value types are boxed. Boxing is expensive and adds extra pressure for the GC. You should avoid boxing if you can.
By using generic constraints we can avoid boxing and even de-virtualize interface method calls for Value Types!
Value Types are passed to and returned from methods by Value. So by default, they are copied all the time.

VERY Important!! Pro .NET Performance book by Sasha Goldshtein, Dima Zurbalev, Ido Flatow has a whole chapter dedicated to Type Internals. If you want to learn more about it, you should definitely read it. It’s the best source available, my blog post is just an overview!

Sources

Pro .NET Performance book by Sasha Goldshtein, Dima Zurbalev, Ido Flatow
How does Object.GetType() really work? blog post by Konrad Kokosa
Safe Systems Programming in C# and .NET video by Joe Duffy
Memory Systems article by University Of Mary Washington
Latency Numbers Every Programmer Should Know article by Berkeley University
Types of locality definition by Wikipedia
Understanding How General Exploration Works in Intel® VTune™ Amplifier XE by Jackson Marusarz (Intel)
A new stackalloc operator for reference types with CoreCLR and Roslyn blog post by Alexandre Mutel
Boxing and Unboxing article by MSDN
Heap Allocations Viewer plugin blog post by Matt Ellis (JetBrains)
SharpLab.io
OpCodes.Constrained Field article by MSDN
.NET Generics and Code Bloat article by MSDN
What happens with a generic constraint that removes this requirement? Stack Overflow answer by Eric Lippert