<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://adamsitnik.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://adamsitnik.com/" rel="alternate" type="text/html" /><updated>2025-01-13T15:54:36+00:00</updated><id>https://adamsitnik.com/feed.xml</id><title type="html">Adam Sitnik</title><subtitle>.NET Performance and Reliability</subtitle><entry><title type="html">Profiling .NET on Linux with BenchmarkDotNet</title><link href="https://adamsitnik.com/PerfCollectProfiler/" rel="alternate" type="text/html" title="Profiling .NET on Linux with BenchmarkDotNet" /><published>2023-01-13T00:00:00+00:00</published><updated>2023-01-13T00:00:00+00:00</updated><id>https://adamsitnik.com/PerfCollectProfiler</id><content type="html" xml:base="https://adamsitnik.com/PerfCollectProfiler/"><![CDATA[<h1 id="perfcollectprofiler">PerfCollectProfiler</h1>

<p><code class="language-plaintext highlighter-rouge">PerfCollectProfiler</code> is a new BenchmarkDotNet diagnoser (plugin) that was released as part of <a href="https://benchmarkdotnet.org/changelog/v0.13.3.html">0.13.3</a>. It can profile the benchmarked .NET code on Linux and export the data to a trace file which can be opened with <a href="https://github.com/Microsoft/perfview">PerfView</a>, <a href="https://www.speedscope.app/">speedscope</a> or any other tool that supports the <a href="https://en.wikipedia.org/wiki/Perf_%28Linux%29">perf</a> file format.</p>

<!--more-->

<h2 id="demo">Demo</h2>

<p>The following code is one of the official BenchmarkDotNet <a href="https://github.com/dotnet/BenchmarkDotNet/blob/12bf220e11fddc8e65b066eb1f300b63bfde7e9b/samples/BenchmarkDotNet.Samples/IntroPerfCollectProfiler.cs">samples</a>:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="nn">System.IO</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">BenchmarkDotNet.Attributes</span><span class="p">;</span>

<span class="k">namespace</span> <span class="nn">BenchmarkDotNet.Samples</span>
<span class="p">{</span>
    <span class="p">[</span><span class="nf">PerfCollectProfiler</span><span class="p">(</span><span class="n">performExtraBenchmarksRun</span><span class="p">:</span> <span class="k">false</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">class</span> <span class="nc">IntroPerfCollectProfiler</span>
    <span class="p">{</span>
        <span class="k">private</span> <span class="k">readonly</span> <span class="kt">string</span> <span class="n">path</span> <span class="p">=</span> <span class="n">Path</span><span class="p">.</span><span class="nf">Combine</span><span class="p">(</span><span class="n">Path</span><span class="p">.</span><span class="nf">GetTempPath</span><span class="p">(),</span> <span class="n">Path</span><span class="p">.</span><span class="nf">GetRandomFileName</span><span class="p">());</span>
        <span class="k">private</span> <span class="k">readonly</span> <span class="kt">string</span> <span class="n">content</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">string</span><span class="p">(</span><span class="sc">'a'</span><span class="p">,</span> <span class="m">100</span><span class="n">_000</span><span class="p">);</span>

        <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
        <span class="k">public</span> <span class="k">void</span> <span class="nf">WriteAllText</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">File</span><span class="p">.</span><span class="nf">WriteAllText</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">content</span><span class="p">);</span>

        <span class="p">[</span><span class="n">GlobalCleanup</span><span class="p">]</span>
        <span class="k">public</span> <span class="k">void</span> <span class="nf">Delete</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">File</span><span class="p">.</span><span class="nf">Delete</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The command:</p>

<pre><code class="language-cmd">sudo dotnet run -c Release -f net7.0 --filter '*PerfCollectProfiler*' --profiler perf --job short
</code></pre>

<p>The regular output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.13.3.20230113-develop, OS=ubuntu 18.04
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.101
  [Host] : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2
  Dry    : .NET 7.0.1 (7.0.122.56804), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  


// * Diagnostic Output - PerfCollectProfiler *
Exported 1 trace file(s). Example:
/home/adam/projects/BenchmarkDotNet/samples/BenchmarkDotNet.Samples/BenchmarkDotNet.Artifacts/BenchmarkDotNet.Samples.IntroPerfCollectProfiler.WriteAllText-20230113-180354.trace.zip
</code></pre></div></div>

<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Error</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>WriteAllText</td>
        <td style="text-align: right">96.83 us</td>
        <td style="text-align: right">51.98 us</td>
        <td style="text-align: right">2.849 us</td>
      </tr>
    </tbody>
  </table>

</div>

<p>And the new trace file opened with speedscope:</p>

<p class="center"><img src="/images/perfcollectprofiler/arm64.png" alt="speedscope" /></p>

<h2 id="the-story">The Story</h2>

<p>In the middle of 2019 I was working on improving the performance of <code class="language-plaintext highlighter-rouge">string</code> methods on Linux (<a href="https://github.com/dotnet/coreclr/pull/24889">#24889</a>, <a href="https://github.com/dotnet/coreclr/pull/24973">#24973</a>). When I was done with the issues reported by the customers, it became clear to me that I needed to get a better understanding of all Windows vs Unix .NET performance gaps. My goal was to fix the most important issues before the customers hit them. Thanks to previous investments in the performance culture of the .NET Team it was an easy job, as all I had to do was run all <a href="https://github.com/dotnet/performance">dotnet/performance</a> micro-benchmarks on the same hardware for Windows and Unix and compare the results using the <a href="https://github.com/dotnet/performance/tree/main/src/tools/ResultsComparer">ResultsComparer</a> tool. To make it an apples-to-apples comparison I configured my work PC to <a href="https://en.wikipedia.org/wiki/Multi-booting">dual-boot</a> Windows 10 and Ubuntu 18.04. I wanted to include macOS too, as many .NET users develop software on macOS, so I used <a href="https://en.wikipedia.org/wiki/Boot_Camp_(software)">Boot Camp</a> and installed Windows on my MacBook Pro.
The comparison has identified multiple gaps: <a href="https://github.com/dotnet/runtime/issues/13628">#13628</a>, <a href="https://github.com/dotnet/runtime/issues/31268">#31268</a>, <a href="https://github.com/dotnet/runtime/issues/31269">#31269</a>, <a href="https://github.com/dotnet/runtime/issues/31270">#31270</a>, <a href="https://github.com/dotnet/runtime/issues/31271">#31271</a>, <a href="https://github.com/dotnet/runtime/issues/31273">#31273</a>, <a href="https://github.com/dotnet/runtime/issues/31275">#31275</a>, <a href="https://github.com/dotnet/runtime/issues/13669">#13669</a>, <a href="https://github.com/dotnet/runtime/issues/13675">#13675</a>, <a href="https://github.com/dotnet/runtime/issues/13676">#13676</a>, <a href="https://github.com/dotnet/runtime/issues/13684">#13684</a> and <a href="https://github.com/dotnet/runtime/issues/31396">#31396</a>.</p>

<p>Since there were plenty of them, I decided to automate the profiling. I’ve created a new BenchmarkDotNet branch and started working on a wrapper for <a href="https://github.com/dotnet/runtime/blob/main/docs/project/linux-performance-tracing.md">perfcollect</a>. perfcollect is a very powerful <a href="https://github.com/microsoft/perfview/blob/main/src/perfcollect/perfcollect">bash script</a> that automates data collection. Internally, it uses perf and LTTng. All the credit for perfcollect goes to <a href="https://github.com/brianrob">Brian Robbins</a>, who authored this tool. perfcollect does all the heavy lifting, my <strong>BenchmarkDotNet plugin is built upon the work of Brian</strong>.</p>

<p>I was not able to get it working quickly and I got stuck, so I pushed my changes and just switched to <a href="https://github.com/dotnet/performance/blob/main/docs/profiling-workflow-dotnet-runtime.md#vtune">VTune</a>. For this particular investigation, I stopped using perfcollect as I did not like the fact that I had to copy the produced trace file to Windows to open it with PerfView. VTune provided me the answers I needed and I moved on to some other work.</p>

<p class="center"><img src="/images/perfcollectprofiler/vtune.png" alt="VTune" /></p>

<p>I got back to working on it in 2020, but again with no success. In September 2022, together with <a href="https://github.com/janvorli">Jan Vorlicek</a>, we started working on adding arm64 support to the <a href="https://adamsitnik.com/Disassembly-Diagnoser/">Disassembly Diagnoser</a> (another topic for a blog post). We did that as a part of a one-week internal Microsoft Open Source hackathon. We made great progress quicker than expected and still had some time left, so I asked Jan for help with the perfcollect plugin (BTW Jan is one of the smartest people I’ve ever got to work with, at the same time being very humble and always eager to help). Jan identified the source of my problems, I changed the way the perfcollect process is stopped and got it working. The rest is history.</p>

<h2 id="how-it-works">How it works</h2>

<p><code class="language-plaintext highlighter-rouge">PerfCollectProfiler</code> uses the <a href="https://github.com/microsoft/perfview/blob/main/src/perfcollect/perfcollect">perfcollect</a> bash script for profiling and <a href="https://learn.microsoft.com/dotnet/core/diagnostics/dotnet-symbol">dotnet-symbol</a> for downloading symbols for the native libraries.</p>

<p>Before the process with the benchmarked code is started, the plugin searches for the perfcollect file stored in the artifacts folder. If it’s not present, it means that the tool has not been installed yet. In that case, it loads the script file from library resources (the script is embedded in the <code class="language-plaintext highlighter-rouge">.dll</code> to ensure we are using a version that we tested), stores it on disk, makes it executable and invokes the install command (with the <code class="language-plaintext highlighter-rouge">-force</code> option to avoid the need for user confirmation).</p>
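
<p>The steps above can be sketched roughly as follows. This is only an illustration and not the plugin’s actual code: the <code class="language-plaintext highlighter-rouge">PerfCollectInstallSketch</code> class, its method names and the <code class="language-plaintext highlighter-rouge">openEmbeddedScript</code> callback (standing in for reading the embedded resource) are all hypothetical, and <code class="language-plaintext highlighter-rouge">File.SetUnixFileMode</code> assumes .NET 7 or newer running on a Unix system.</p>

<pre><code class="language-cs">using System;
using System.Diagnostics;
using System.IO;

static class PerfCollectInstallSketch
{
    // Returns true when the script had to be extracted, which means "install" still needs to run.
    public static bool EnsureScriptOnDisk(string artifactsDir, Func&lt;Stream&gt; openEmbeddedScript, out string scriptPath)
    {
        scriptPath = Path.Combine(artifactsDir, "perfcollect");
        if (File.Exists(scriptPath))
            return false; // a previous run has already stored the script

        Directory.CreateDirectory(artifactsDir);
        using (Stream source = openEmbeddedScript())
        using (FileStream target = File.Create(scriptPath))
            source.CopyTo(target);

        // Equivalent of "chmod 755": the owner may write, everybody may read and execute.
        File.SetUnixFileMode(scriptPath,
            UnixFileMode.UserRead | UnixFileMode.UserWrite | UnixFileMode.UserExecute |
            UnixFileMode.GroupRead | UnixFileMode.GroupExecute |
            UnixFileMode.OtherRead | UnixFileMode.OtherExecute);
        return true;
    }

    // Runs "perfcollect install -force"; needs root, because it installs system packages.
    public static int RunInstall(string scriptPath)
    {
        var startInfo = new ProcessStartInfo(scriptPath, "install -force") { UseShellExecute = false };
        using (Process install = Process.Start(startInfo))
        {
            install.WaitForExit();
            return install.ExitCode;
        }
    }
}
</code></pre>

<p>Splitting extraction from installation keeps the first step cheap to repeat: on every later run the file check succeeds and nothing else happens.</p>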

<p>Next, it identifies the .NET SDK folder path and searches for missing native symbol files (<code class="language-plaintext highlighter-rouge">.so.dbg</code>). When some symbols are missing, it installs the <code class="language-plaintext highlighter-rouge">dotnet-symbol</code> tool in a dedicated folder (to always use the latest version and avoid issues with broken existing configs) and commands it to recursively download symbols for all native libraries present in the SDK folder.</p>

<p>Sample log output:</p>

<pre><code class="language-log">// start dotnet tool install dotnet-symbol --tool-path "/tmp/BenchmarkDotNet/symbols" in 
You can invoke the tool using the following command: dotnet-symbol
Tool 'dotnet-symbol' (version '1.0.406001') was successfully installed.
// command took 2.36s and exited with 0
// start /tmp/BenchmarkDotNet/symbols/dotnet-symbol --recurse-subdirectories --symbols "/usr/share/dotnet/dotnet" "/usr/share/dotnet/lib*.so" in 
Downloading from https://msdl.microsoft.com/download/symbols/
/usr/share/dotnet/dotnet.dbg already exists, file not written
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.Globalization.Native.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libcoreclrtraceptprovider.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.Security.Cryptography.Native.OpenSsl.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libmscordaccore.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.IO.Compression.Native.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.Net.Security.Native.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libcoreclr.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libSystem.Native.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libclrjit.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libmscordbi.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libhostpolicy.so.dbg
Writing: /usr/share/dotnet/shared/Microsoft.NETCore.App/7.0.1/libclrgc.so.dbg
</code></pre>

<p>Once everything is in place, the diagnoser starts the perfcollect process. perfcollect does all the heavy lifting (the creation of LTTng sessions, perf tool usage etc.). When BenchmarkDotNet starts the benchmarking process, it <a href="https://github.com/dotnet/BenchmarkDotNet/blob/12bf220e11fddc8e65b066eb1f300b63bfde7e9b/src/BenchmarkDotNet/Extensions/ProcessExtensions.cs#L133-L142">sets all the necessary environment variables</a>. By doing that, it ensures that <strong>all</strong> symbols will get resolved and the trace file will be complete.</p>

<p>When the benchmarking process quits, the plugin stops the perfcollect process by sending it the <code class="language-plaintext highlighter-rouge">SIGINT</code> signal. perfcollect works for a moment, exports the trace file and quits.</p>
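
<p>.NET has no built-in API for sending <code class="language-plaintext highlighter-rouge">SIGINT</code> to another process, so one way to do it is to call the libc <code class="language-plaintext highlighter-rouge">kill</code> function via P/Invoke. The sketch below is a minimal illustration under that assumption; the <code class="language-plaintext highlighter-rouge">SigIntSketch</code> class is hypothetical and the plugin itself may send the signal in a different way.</p>

<pre><code class="language-cs">using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class SigIntSketch
{
    private const int SIGINT = 2; // signal number on Linux

    [DllImport("libc", SetLastError = true)]
    private static extern int kill(int pid, int sig);

    // Asks the process to stop (like pressing Ctrl+C) and waits for it to finish its work.
    // Returns false when the process is still running after the timeout.
    public static bool StopGracefully(Process process, TimeSpan timeout)
    {
        if (kill(process.Id, SIGINT) != 0)
            throw new InvalidOperationException($"kill failed with errno {Marshal.GetLastWin32Error()}");

        return process.WaitForExit((int)timeout.TotalMilliseconds);
    }
}
</code></pre>

<p>Unlike <code class="language-plaintext highlighter-rouge">Process.Kill</code>, which terminates the process immediately, <code class="language-plaintext highlighter-rouge">SIGINT</code> gives the script a chance to finish processing and export the trace, which is why a timeout is needed.</p>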

<h2 id="limitations">Limitations</h2>

<p><code class="language-plaintext highlighter-rouge">PerfCollectProfiler</code> has the following limitations:</p>

<ul>
  <li>It supports only Linux. For Windows you can use <a href="https://adamsitnik.com/ETW-Profiler/">EtwProfiler</a>, for other Unixes like macOS the <a href="https://wojciechnagorski.com/2020/04/cross-platform-profiling-.net-code-with-benchmarkdotnet/">EventPipeProfiler</a>. EventPipeProfiler supports every OS, but it has no information about native methods.</li>
  <li>It requires running as root. It’s a PITA as all the files BDN creates directly and indirectly will be owned by root.</li>
  <li>No <code class="language-plaintext highlighter-rouge">InProcessToolchain</code> support (and no plans to add it).</li>
  <li>Currently the trace file contains no events, but <a href="https://github.com/microsoft/perfview/issues/1718">we are working on it</a>.</li>
</ul>

<h2 id="how-to-use-it">How to use it?</h2>

<p>You need to install <code class="language-plaintext highlighter-rouge">BenchmarkDotNet</code> 0.13.3 or newer.</p>

<p>It can be enabled via command line arguments (as long as you pass <code class="language-plaintext highlighter-rouge">args</code> to <code class="language-plaintext highlighter-rouge">BenchmarkSwitcher</code> or <code class="language-plaintext highlighter-rouge">BenchmarkRunner</code>). You won’t need to recompile your code to use it:</p>

<pre><code class="language-cmd">--profiler perf
</code></pre>

<p>Or you can extend the <code class="language-plaintext highlighter-rouge">DefaultConfig.Instance</code> with a new instance of <code class="language-plaintext highlighter-rouge">PerfCollectProfiler</code> and the profiler will work for all benchmarks:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="nn">BenchmarkDotNet.Configs</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">BenchmarkDotNet.Diagnosers</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">BenchmarkDotNet.Running</span><span class="p">;</span>

<span class="k">class</span> <span class="nc">Program</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">Main</span><span class="p">(</span><span class="kt">string</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span>
        <span class="p">=&gt;</span> <span class="n">BenchmarkSwitcher</span>
            <span class="p">.</span><span class="nf">FromAssembly</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">Program</span><span class="p">).</span><span class="n">Assembly</span><span class="p">)</span>
            <span class="p">.</span><span class="nf">Run</span><span class="p">(</span><span class="n">args</span><span class="p">,</span>
                <span class="n">DefaultConfig</span><span class="p">.</span><span class="n">Instance</span>
                    <span class="p">.</span><span class="nf">AddDiagnoser</span><span class="p">(</span><span class="n">PerfCollectProfiler</span><span class="p">.</span><span class="n">Default</span><span class="p">));</span> <span class="c1">// HERE</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Or you can apply the attribute, but it will work only for benchmarks from the given <code class="language-plaintext highlighter-rouge">class</code>:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">PerfCollectProfiler</span><span class="p">]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">TheClassThatContainsBenchmarks</span> <span class="p">{</span> <span class="cm">/* benchmarks go here */</span> <span class="p">}</span>
</code></pre></div></div>

<h2 id="configuration">Configuration</h2>

<p>To configure the new diagnoser you need to create an instance of the <code class="language-plaintext highlighter-rouge">PerfCollectProfilerConfig</code> class and pass it to the <code class="language-plaintext highlighter-rouge">PerfCollectProfiler</code> constructor. The parameters that the config constructor accepts are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">performExtraBenchmarksRun</code>: when set to true, benchmarks will be executed one more time with the profiler attached. If set to false, there will be no extra run but the results will contain overhead. False by default, as I expect the overhead to be less than 3%.</li>
  <li><code class="language-plaintext highlighter-rouge">timeoutInSeconds</code>: how long BenchmarkDotNet should wait for the perfcollect script to finish processing the trace. 300s by default.</li>
</ul>
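
<p>For illustration, a custom configuration could be wired up as in the sketch below. The argument values are arbitrary examples rather than recommendations, and the <code class="language-plaintext highlighter-rouge">BenchmarkDotNet.Diagnosers</code> namespace for both types is my assumption.</p>

<pre><code class="language-cs">using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;

class Program
{
    static void Main(string[] args)
    {
        var profilerConfig = new PerfCollectProfilerConfig(
            performExtraBenchmarksRun: true, // measure first, then profile in a separate run
            timeoutInSeconds: 600);          // give perfcollect more time to process the trace

        BenchmarkSwitcher
            .FromAssembly(typeof(Program).Assembly)
            .Run(args, DefaultConfig.Instance.AddDiagnoser(new PerfCollectProfiler(profilerConfig)));
    }
}
</code></pre>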

<p>The default config should be fine for 99% of users ;)</p>

<h2 id="analyzing-the-trace-files">Analyzing the trace files</h2>

<p>There are multiple ways to work with the trace files produced by perfcollect:</p>
<ul>
  <li>You can copy them to Windows and open with <a href="https://learn.microsoft.com/en-us/shows/perfview-tutorial/">PerfView</a>.</li>
  <li>You can unzip the trace file, take the file produced by the perf utility and open it with any tool that supports the perf file format.</li>
</ul>

<p class="center"><img src="/images/perfcollectprofiler/tracefilezip.png" alt="tracefile" /></p>

<p>If you are not familiar with speedscope you can read my <a href="https://adamsitnik.com/speedscope/#demo">old blog post</a> about it. The tool is so intuitive that you don’t really need to prepare yourself for using it.</p>

<ol>
  <li>Unzip the trace file, go to <a href="https://www.speedscope.app/">https://www.speedscope.app/</a>, select Browse and choose the <code class="language-plaintext highlighter-rouge">perf.data.txt</code> file.</li>
  <li>perfcollect by default performs machine-wide profiling and speedscope shows only data from one thread at a time. So you need to select the thread from the thread list:</li>
</ol>

<p class="center"><img src="/images/perfcollectprofiler/choosethread.png" alt="tracefile" /></p>

<ol start="3">
  <li>Now you can just choose one of the tabs, depending on what kind of visualization you prefer. In case you like flamegraphs you can go to “Left Heavy”:</li>
</ol>

<p class="center"><img src="/images/perfcollectprofiler/leftheavy.png" alt="LeftHeavy" /></p>

<p>By default, BenchmarkDotNet performs Warmup, Pilot and Overhead phases before starting the actual benchmark workload. Just filter the trace file to the actual workload, unless you are interested in cold startup time.</p>

<h3 id="special-thanks">Special Thanks</h3>

<p>I wanted to thank:</p>

<ul>
  <li>Brian Robbins for authoring perfcollect and providing ongoing help.</li>
  <li>Jan Vorlicek for helping me with the investigation and unblocking me.</li>
</ul>

<h2 id="no-blog-posts-for-the-last-four-years">No blog posts for the last four years</h2>

<p>I have not posted anything on this blog for almost four years. I simply lost the motivation, and I am not comfortable speaking in public about the reasons behind it.</p>

<p>But one of the things that makes me very happy is helping animals. Last year I officially became a volunteer in a local animal shelter. My duties are mainly cleaning the cages, feeding the bunnies, and driving them to/from the vet. But I am also helping the shelter financially.</p>

<p>To optimize my impact, I wanted to kindly ask you for a donation for the bunnies. You can do it online via the <a href="https://pomagam.pl/en/nowyrok-staredlugi">https://pomagam.pl/en/nowyrok-staredlugi</a> website. For translation from Polish you can use <a href="https://pomagam-pl.translate.goog/en/nowyrok-staredlugi?_x_tr_sl=pl&amp;_x_tr_tl=en&amp;_x_tr_hl=en-US&amp;_x_tr_pto=wapp">this</a> Google Translate link.</p>

<p>If you donate please leave a comment on the donation website, here on my blog, send me an email or tag me on Twitter. I am going to donate the same amount and respond back. To optimize even further I am going to fill in the paperwork and ask my current employer (Microsoft) to donate the same amount I’ve donated (yes, MS offers such a perk!). So, for every dollar you donate, the shelter gets three dollars.</p>

<p>I want to verify what is the best way I can help the shelter: cleaning bunnies’ cages or sharing my knowledge online and asking for donations.</p>

<p>If you restore my faith in humanity I am going to blog again. Possible topics:</p>

<ul>
  <li>Cross platform and cross architecture disassembler.</li>
  <li>Startup time performance investigation based on System.CommandLine example.</li>
  <li>The story and reasoning behind improving Sockets performance on Linux in .NET 5.</li>
  <li>Best practices for Fast File IO with .NET.</li>
</ul>

<p>Thank you,
Adam</p>]]></content><author><name></name></author><summary type="html"><![CDATA[PerfCollectProfiler PerfCollectProfiler is a new BenchmarkDotNet diagnoser (plugin) that was released as part of 0.13.3. It can profile the benchmarked .NET code on Linux and export the data to a trace file which can be opened with PerfView, speedscope or any other tool that supports perf file format.]]></summary></entry><entry><title type="html">Profiling .NET Code with PerfView and visualizing it with speedscope.app</title><link href="https://adamsitnik.com/speedscope/" rel="alternate" type="text/html" title="Profiling .NET Code with PerfView and visualizing it with speedscope.app" /><published>2019-03-22T00:00:00+00:00</published><updated>2019-03-22T00:00:00+00:00</updated><id>https://adamsitnik.com/speedscope</id><content type="html" xml:base="https://adamsitnik.com/speedscope/"><![CDATA[<h1 id="speedscopeapp">speedscope.app</h1>

<p>According to the <a href="https://github.com/jlfwong/speedscope">official web page</a>, <a href="https://www.speedscope.app/">speedscope.app</a> is “a fast, interactive web-based viewer for performance profiles”. But I believe it’s more than that! In my opinion, it’s one of the best visualization tools for performance profiles ever!</p>

<p>Some time ago I implemented <a href="https://github.com/Microsoft/perfview/pull/842">SpeedScopeExporter</a>, which allows exporting any .NET Trace file to the speedscope JSON file format. It was released as part of the <code class="language-plaintext highlighter-rouge">2.0.34</code> <a href="https://www.nuget.org/packages/Microsoft.Diagnostics.Tracing.TraceEvent/">TraceEvent</a> library a few months ago, but so far it was not available to end users from the PerfView GUI or command line.</p>

<p>Yesterday, a new version of PerfView got <a href="https://github.com/Microsoft/perfview/releases/tag/P2.0.39">released</a> with the new possibility to export to the speedscope file format. So now the PerfView users can use <a href="https://www.speedscope.app/">speedscope.app</a> to view their performance profiles and take advantage of all the goodness it offers!</p>

<p class="center"><img src="https://user-images.githubusercontent.com/150329/40900669-86eced80-6781-11e8-92c1-dc667b651e72.gif" alt="Demo" /></p>

<!--more-->

<h2 id="the-story">The Story</h2>

<p>Vance Morrison, the .NET Performance Architect, emailed me and asked if I would like to implement web-based flame graphs for our non-Windows user story, as I was the guy who <a href="https://github.com/Microsoft/perfview/pull/502">implemented</a> them for PerfView a while back.</p>

<p>I knew that an app that did exactly what we needed already existed. I did not remember the name but I remembered that I had retweeted something about it, so I opened the list of my retweeted posts and found it quickly.</p>

<p>I made sure that the app does not upload the data anywhere, has an MIT license, is actively maintained, has a self-contained version and can handle non-trivial files.</p>

<p>To my surprise, convincing Vance to use it was easy ;)</p>

<p>I did not want to introduce another file format so I decided to export the data to a very simple speedscope JSON-based file format (<a href="https://github.com/jlfwong/speedscope/blob/master/src/lib/file-format-spec.ts">spec</a>).</p>

<p>I read the file format specification, wrote the tests first and made it work. Handling all edge cases was a lot of fun!</p>

<h2 id="how-it-works">How it works</h2>

<p><a href="https://www.speedscope.app/">speedscope.app</a> is a single page application that works with any modern web browser. It supports plenty of file formats, including <a href="https://github.com/jlfwong/speedscope/wiki/Importing-from-perf-(linux)">perf</a>, <a href="https://github.com/jlfwong/speedscope/wiki/Importing-from-pprof-(go)">pprof</a>, <a href="https://github.com/jlfwong/speedscope/wiki/Importing-from-Chrome">chrome</a> and <a href="https://github.com/jlfwong/speedscope/wiki/Importing-from-Firefox">firefox</a>. The profiles are not uploaded anywhere!</p>

<p>Every Trace File contains a huge number of samples. A sample is more or less a call stack captured by the profiler. The job of <a href="https://github.com/Microsoft/perfview/blob/master/src/TraceEvent/Stacks/SpeedScopeStackSourceWriter.cs">SpeedScopeExporter</a> is to group the samples by Threads, make sure they don’t overlap in time, translate the frame ids to method names and save them in the format and order expected by speedscope.</p>
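
<p>To make the idea more concrete, here is a heavily simplified sketch of the grouping and translation steps. It is not the exporter’s real code: the <code class="language-plaintext highlighter-rouge">Sample</code> record and the <code class="language-plaintext highlighter-rouge">ExporterSketch</code> class are hypothetical, and overlap handling and the JSON writing are left out entirely.</p>

<pre><code class="language-cs">using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified sample: a thread id, a timestamp and a call stack of frame ids.
record Sample(int ThreadId, double TimeMs, int[] FrameIds);

static class ExporterSketch
{
    // Groups samples per thread, orders them by time and replaces frame ids with method names.
    public static Dictionary&lt;int, List&lt;(double TimeMs, string[] Stack)&gt;&gt; GroupByThread(
        IEnumerable&lt;Sample&gt; samples, IReadOnlyDictionary&lt;int, string&gt; frameNames)
        =&gt; samples
            .GroupBy(sample =&gt; sample.ThreadId)
            .ToDictionary(
                group =&gt; group.Key,
                group =&gt; group
                    .OrderBy(sample =&gt; sample.TimeMs)
                    .Select(sample =&gt; (
                        TimeMs: sample.TimeMs,
                        Stack: sample.FrameIds.Select(id =&gt; frameNames[id]).ToArray()))
                    .ToList());
}
</code></pre>

<p>Each dictionary entry corresponds to one thread, which matches the way speedscope presents every thread as a separate profile.</p>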

<h2 id="how-to-use-it">How to use it?</h2>

<p>If you want to export a trace file from PerfView, you need to open it, load the symbols, filter and from the “File” menu choose “Save View As” and then select “Speed Scope Format” from the combo box.</p>

<p class="center"><img src="./../images/speedscope/saveas.png" alt="SaveAs" /></p>

<p>If you want to export a .NET Trace File to the speedscope file format without using PerfView, you either have to use the TraceEvent library yourself or wait until the dotnet collect diagnostic tool <a href="https://github.com/dotnet/diagnostics/pull/114">merges</a> the support for it.</p>

<p>Once you have the <code class="language-plaintext highlighter-rouge">.speedscope.json</code> file you just need to open it with <a href="https://www.speedscope.app/">speedscope.app</a>. You can also download a self-contained version from <a href="https://github.com/jlfwong/speedscope/releases">https://github.com/jlfwong/speedscope/releases</a>.</p>

<p class="center"><img src="./../images/speedscope/open.gif" alt="Open" /></p>

<h2 id="demo">Demo</h2>

<p>I have profiled the following C# app with PerfView and exported the trace file to <code class="language-plaintext highlighter-rouge">.speedscope.json</code> file. For brevity, I am going to skip the PerfView introduction here. You can find the JSON file <a href="https://gist.github.com/adamsitnik/1b34626c20b86b48e0aca593567023f5">here</a>.</p>

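<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="nn">System.Runtime.CompilerServices</span><span class="p">;</span>

<span class="k">class</span> <span class="nc">Program</span>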
<span class="p">{</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">Main</span><span class="p">(</span><span class="kt">string</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="nf">A</span><span class="p">();</span> <span class="nf">B</span><span class="p">();</span> <span class="nf">A</span><span class="p">();</span> <span class="nf">B</span><span class="p">();</span> <span class="nf">A</span><span class="p">();</span> <span class="nf">B</span><span class="p">();</span> <span class="nf">A</span><span class="p">();</span> <span class="nf">B</span><span class="p">();</span> <span class="nf">A</span><span class="p">();</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">A</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="m">500</span><span class="n">_000_000</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="p">}</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">B</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="m">500</span><span class="n">_000_000</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span> <span class="p">{</span> <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="time-order-view">Time Order View</h3>

<p class="center"><img src="./../images/speedscope/TimeOrder.png" alt="TimeOrder" /></p>

<p>In the “Time Order” view, call stacks are shown in chronological order. This sets it apart from Flame Graphs, because it allows us to understand the behavior of an application over time.</p>

<h3 id="left-heavy">Left Heavy</h3>

<p class="center"><img src="./../images/speedscope/LeftHeavy.png" alt="LeftHeavy" /></p>

<p>The “Left Heavy” view is an inverted Flame Graph. The data is aggregated rather than shown over time.</p>

<h3 id="sandwitch">Sandwich</h3>

<p class="center"><img src="./../images/speedscope/Sandwitch.png" alt="Sandwich" /></p>

<p>The Sandwich view is a table view with all methods from the profile and their associated times. You can sort by self time (exclusive time) or total time (inclusive time).</p>

<p>When you click on one of the methods, you can see all of its callers and callees. The app is so intuitive that it almost does not need any docs!</p>

<h2 id="limitations">Limitations</h2>

<p>It’s not possible to show data from multiple threads running at the same time on a single view, so every Thread has its own “tab” and you can switch between the threads using the arrows:</p>

<p class="center"><img src="./../images/speedscope/Threads.png" alt="Threads" /></p>

<p>Moreover, the app normalizes the relative time for every Thread. If we export a profile for Thread A that did some work between 0 and 200 ms and Thread B that did some work between 100 and 110 ms, the app will show the start time as 0 ms for both of them, and Thread B activity will be represented as between 0 ms and 10 ms (not 100-110 ms). I was thinking about generating a 1e-10 ms long event at time 0 ms for every Thread, but then the app would not scale the UI so nicely.</p>

<p class="center"><img src="./../images/speedscope/Finalizer.png" alt="Finalizer" /></p>
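<p>The normalization described above boils down to subtracting the earliest start time of a given Thread from all of its events. The following is only a sketch of that arithmetic (the type and method names are hypothetical, this is not the code of the app or of the exporter):</p>

<pre><code class="language-cs">using System;
using System.Linq;

public static class TimeNormalizationSketch
{
    // Shifts the events of a single thread so that the earliest one starts at 0 ms.
    public static (double Start, double End)[] Normalize((double Start, double End)[] events)
    {
        if (events.Length == 0) return events;

        double offset = events.Min(e =&gt; e.Start);
        return events.Select(e =&gt; (e.Start - offset, e.End - offset)).ToArray();
    }

    public static void Main()
    {
        // Thread B from the example above: active between 100 ms and 110 ms.
        var threadB = new[] { (Start: 100.0, End: 110.0) };

        foreach (var (start, end) in Normalize(threadB))
            Console.WriteLine($"{start}-{end} ms");
    }
}
</code></pre>

<p>Because every Thread is shifted by its own offset, the timelines of two threads can no longer be compared directly after the normalization.</p>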

<p>If we don’t have any profile information for a given period of time, we have nothing to show in the “Time Order View”. It’s important to remember that tracing in .NET captures only the call stacks of active Threads, so any blocking IO will be represented as a blank space. In the future, I might use the data from OS/.NET Runtime events to fill this space.</p>

<p class="center"><img src="./../images/speedscope/Empty.png" alt="Empty" /></p>

<h2 id="sample-usage---net-core-process-startup-time">Sample usage - .NET Core Process Startup Time</h2>

<p>Using the new tool we can find out how long it takes to start a .NET Core process and see exactly what happens, and in what order, during the startup.</p>

<p>To do that, we need to create a “Hello World” .NET Core app first.</p>

<pre><code class="language-cmd">dotnet new console
</code></pre>

<p>Then tell PerfView to run the following command:</p>

<pre><code class="language-cmd">dotnet run -c Release
</code></pre>

<p class="center"><img src="./../images/speedscope/Run.png" alt="Empty" /></p>

<p>Disable grouping and folding, load all the symbols (see my <a href="https://adamsitnik.com/Sample-Perf-Investigation/#analysing-the-trace-file">previous blog post</a> to learn how to do it) and export to speedscope format.</p>

<p>What is the cost of “Hello World” compared to the .NET VM Startup?</p>

<p class="center"><img src="./../images/speedscope/HelloWorld.png" alt="HelloWorld" /></p>

<p>What took so long? Let’s zoom in and find out!</p>

<p class="center"><img src="./../images/speedscope/Startup.gif" alt="Startup" /></p>

<h3 id="kudos">Kudos</h3>

<p><a href="https://github.com/jlfwong">Jamie Wong</a> is the author of <a href="https://www.speedscope.app/">speedscope.app</a> who deserves all the credit! I just wrote a simple exporter which allows us to use his awesome tool!</p>]]></content><author><name></name></author><summary type="html"><![CDATA[speedscope.app According to the official web page, speedscope.app is “a fast, interactive web-based viewer for performance profiles”. But I believe it’s more than that! In my opinion, it’s one of the best visualization tools for performance profiles ever! Some time ago I have implemented SpeedScopeExporter which allows exporting any .NET Trace file to a speedscope json file format. It was released as part of 2.0.34 TraceEvent library a few months ago, but so far it was not available for the end users from PerfView GUI/command line level. Yesterday, a new version of PerfView got released with the new possibility to export to speed scope file format. So now the PerfView users can use speedscope.app to view their performance profiles and take advantage of all the goodness it offers!]]></summary></entry><entry><title type="html">Profiling Concurrent .NET Code with BenchmarkDotNet and visualizing it with Concurrency Visualizer</title><link href="https://adamsitnik.com/ConcurrencyVisualizer-Profiler/" rel="alternate" type="text/html" title="Profiling Concurrent .NET Code with BenchmarkDotNet and visualizing it with Concurrency Visualizer" /><published>2019-01-10T00:00:00+00:00</published><updated>2019-01-10T00:00:00+00:00</updated><id>https://adamsitnik.com/ConcurrencyVisualizer-Profiler</id><content type="html" xml:base="https://adamsitnik.com/ConcurrencyVisualizer-Profiler/"><![CDATA[<h1 id="concurrency-visualizer-profiler">Concurrency Visualizer Profiler</h1>

<p>ConcurrencyVisualizerProfiler is a new diagnoser for BenchmarkDotNet that I implemented some time ago. It was released as part of <code class="language-plaintext highlighter-rouge">0.11.3</code>. It profiles the benchmarked .NET code on Windows and exports the data to a trace file which can be opened with <a href="https://marketplace.visualstudio.com/items?itemName=Diagnostics.ConcurrencyVisualizer2017">Concurrency Visualizer</a> (a plugin for Visual Studio, which used to be a part of it).</p>

<p><strong>Again with a single config!</strong>
<!--more--></p>

<h2 id="demo">Demo</h2>

<p>The following code is a real-world benchmark from the <a href="https://github.com/dotnet/machinelearning/blob/b888db28972307a2792b40692591ec67ac08cff0/test/Microsoft.ML.Benchmarks/Text/MultiClassClassification.cs#L66">ML.NET repository</a>:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">ConcurrencyVisualizerProfiler</span><span class="p">]</span> <span class="c1">// !!! use the new diagnoser!!</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">MultiClassClassificationTrain</span>
<span class="p">{</span>
    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">string</span> <span class="n">cmd</span> <span class="p">=</span> <span class="s">@"CV k=5  data="</span> <span class="p">+</span> <span class="n">_dataPath_Wiki</span> <span class="p">+</span>
            <span class="s">" tr=OVA{p=AveragedPerceptron{iter=10}}"</span> <span class="p">+</span>
            <span class="s">" loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}"</span> <span class="p">+</span>
            <span class="s">" xf=Convert{col=logged_in type=R4}"</span> <span class="p">+</span>
            <span class="s">" xf=CategoricalTransform{col=ns}"</span> <span class="p">+</span>
            <span class="s">" xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor=NGramExtractorTransform{ngram=2}}"</span> <span class="p">+</span>
            <span class="s">" xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D}"</span> <span class="p">+</span>
            <span class="s">" xf=Concat{col=Features:FeaturesText,FeaturesWordEmbedding,logged_in,ns}"</span><span class="p">;</span>

        <span class="k">using</span> <span class="p">(</span><span class="kt">var</span> <span class="n">environment</span> <span class="p">=</span> <span class="n">EnvironmentFactory</span><span class="p">.</span><span class="n">CreateClassificationEnvironment</span><span class="p">&lt;</span><span class="n">TextLoader</span><span class="p">,</span> <span class="n">CategoricalTransform</span><span class="p">,</span> <span class="n">AveragedPerceptronTrainer</span><span class="p">&gt;())</span>
        <span class="p">{</span>
            <span class="n">Maml</span><span class="p">.</span><span class="nf">MainCore</span><span class="p">(</span><span class="n">environment</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="n">alwaysPrintStacktrace</span><span class="p">:</span> <span class="k">false</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The regular output (last two lines are the most important here):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.107 (1809/October2018Update/Redstone5)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-alpha1-009697
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-GAEOXF : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

// * Diagnostic Output - ConcurrencyVisualizerProfiler *
Exported 1 CV trace file(s). Example:
c:\Projects\mldotnet\test\Microsoft.ML.Benchmarks\BenchmarkDotNet.Artifacts\20181120-1225-23644\netcoreapp2.1\Microsoft\ML\Benchmarks\MultiClassClassificationTrain\CV_Multiclass_WikiDetox_WordEmbeddings_SDCAMC.CvTrace
DO remember that this Diagnoser just tries to mimic the CVCollectionCmd.exe and you need to have Visual Studio with Concurrency Visualizer plugin installed to visualize the data.
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>CV_Multiclass_WikiDetox_WordEmbeddings_SDCAMC</td>
        <td style="text-align: right">69.84 s</td>
        <td style="text-align: right">2.608 s</td>
      </tr>
    </tbody>
  </table>

</div>

<p>And the new trace file opened with Concurrency Visualizer:</p>

<p class="center"><img src="/images/cvprofiler/utilization_before.png" alt="Utilization" /></p>

<p class="center"><img src="/images/cvprofiler/cores_before.png" alt="Cores" /></p>

<p class="center"><img src="/images/cvprofiler/visible_timeline_profile_before.png" alt="Visible Timeline Profile" /></p>

<h2 id="the-story">The Story</h2>

<p>Recently I have been working on improving the performance of ML.NET (you can read more about it in my <a href="https://adamsitnik.com/Sample-Perf-Investigation/">previous blog post</a>). I wanted to understand its performance characteristics over time, and I knew that ML.NET does most of its work in parallel. A FlameGraph is an aggregated view that shows neither time nor per-CPU activity, so I could not use it to visualize the data. A few years ago I attended a presentation where Sasha Goldshtein used Concurrency Visualizer to easily show which Thread was allocating managed memory and triggering Garbage Collection. I remembered that this is the right tool for visualizing concurrent .NET code.</p>

<p>So I just downloaded it from the <a href="https://marketplace.visualstudio.com/items?itemName=Diagnostics.ConcurrencyVisualizer2017">Visual Studio Marketplace</a>, read the <a href="https://docs.microsoft.com/en-us/visualstudio/profiling/concurrency-visualizer">docs</a> and watched some <a href="https://channel9.msdn.com/Search?term=Concurrency%20Visualizer">Channel 9 videos</a> about it. (Personal recommendation: don’t ask for permission to read the docs or watch some training videos. This is part of doing the job right, not an extra task which can be omitted.)</p>

<p>I started using it, but I did not like the fact that I had to do it manually every time I wanted to run some benchmark:</p>

<p class="center"><img src="/images/cvprofiler/manually_from_vs.png" alt="From Visual Studio" /></p>

<p>With a quick web search I was able to find a command line tool that can do it for me: <a href="https://docs.microsoft.com/en-us/visualstudio/profiling/concurrency-visualizer-command-line-utility-cvcollectioncmd">Concurrency Visualizer command-line utility aka CVCollectionCmd.exe</a></p>

<p>So I switched from the VS GUI to this command line tool. But again, my process was not fully automated and I was losing time doing all this manually.</p>

<p>And then I asked myself two questions: how does CVCollectionCmd.exe work? Can I create a BenchmarkDotNet diagnoser out of it?</p>

<h2 id="reverse-engineering">Reverse Engineering</h2>

<p>I am now working for Microsoft so I had two options:</p>

<ul>
  <li>send an email to some discussion group and ask who owns the tool and could explain to me how it works</li>
  <li>do some Reverse Engineering and find it out on my own</li>
</ul>

<p>Since I don’t like sending emails (in general) or asking for help when I can find the answer quickly on my own, I decided to attach the debugger to CVCollectionCmd.exe and step into some methods to see how it works and what it generates.</p>

<p>To my surprise, the <strong>code was very clean and very well structured</strong> so finding out how it works was really easy!</p>

<p>So how does CVCollectionCmd.exe work? It creates two ETW sessions (kernel and user), enables some ETW providers and simply collects the data. After the tracing is done, it creates a simple XML file with some basic info for the Concurrency Visualizer: process id, used providers and paths to both trace files.</p>

<p>So what I did next was implement a new BenchmarkDotNet diagnoser that does exactly the same thing ;)</p>

<h2 id="how-it-works">How it works</h2>

<p><code class="language-plaintext highlighter-rouge">ConcurrencyVisualizerProfiler</code> uses <a href="https://adamsitnik.com/ETW-Profiler/">EtwProfiler</a>, which can be customized to enable requested ETW providers and profile the code.</p>

<p>Before the process with benchmarked code is started, EtwProfiler starts User and Kernel ETW sessions. Every session writes data to its own file and captures different data. The User session listens for the .NET Runtime events (TPL, ThreadPool, ParallelLinq etc.) while the Kernel session gets CPU stacks, context switches, IO events and some more. After this, the process with benchmarked code is started. During the benchmark execution all the data is captured and written to the trace files. Moreover, the BenchmarkDotNet Engine emits its own events to make it possible to differentiate jitting, warmup, pilot and actual workload when analyzing the trace file. When the benchmarking is over, both sessions are closed and the two trace files are merged into one.</p>
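<p>The two-session approach can be sketched with the <code class="language-plaintext highlighter-rouge">Microsoft.Diagnostics.Tracing.TraceEvent</code> library. This is a simplified illustration, not the actual code of the diagnoser: the session name, the file names, the <code class="language-plaintext highlighter-rouge">Profile</code> helper and the chosen set of providers and keywords are my assumptions for this example. It needs Windows and Admin rights.</p>

<pre><code class="language-cs">using System;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Session;

public static class TwoSessionsSketch
{
    // Runs the workload while a Kernel and a User ETW session are recording,
    // then merges both trace files into a single one.
    public static void Profile(Action workload, string mergedEtlPath)
    {
        const string kernelFile = "sketch.kernel.etl"; // hypothetical file names
        const string userFile = "sketch.user.etl";

        using (var kernel = new TraceEventSession(KernelTraceEventParser.KernelSessionName, kernelFile))
        using (var user = new TraceEventSession("SketchUserSession", userFile))
        {
            // Kernel session: CPU samples, context switches and image loads (illustrative subset).
            kernel.EnableKernelProvider(
                KernelTraceEventParser.Keywords.Profile |
                KernelTraceEventParser.Keywords.ContextSwitch |
                KernelTraceEventParser.Keywords.ImageLoad);

            // User session: .NET Runtime events (illustrative subset).
            user.EnableProvider(
                ClrTraceEventParser.ProviderGuid,
                TraceEventLevel.Verbose,
                (ulong)ClrTraceEventParser.Keywords.Default);

            workload();

            // Stop both sessions explicitly so the files are complete before merging.
            user.Stop();
            kernel.Stop();
        }

        TraceEventSession.Merge(new[] { kernelFile, userFile }, mergedEtlPath);
    }
}
</code></pre>

<p>In the real diagnoser the workload is a separate process with the benchmarked code, so the sessions have to be started before that process is launched.</p>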

<p>After both sessions are merged into a single file, <code class="language-plaintext highlighter-rouge">ConcurrencyVisualizerProfiler</code> emits an XML file with all the data relevant for Concurrency Visualizer (the Visual Studio plugin). The <code class="language-plaintext highlighter-rouge">.CVTrace</code> file name is reported by the diagnoser; you can find it in the BenchmarkDotNet output:</p>

<pre><code class="language-log">// * Diagnostic Output - ConcurrencyVisualizerProfiler *
Exported 1 CV trace file(s). Example:
Full_path_omitted_for_brevity.CvTrace
DO remember that this Diagnoser just tries to mimic the CVCollectionCmd.exe and you need to have Visual Studio with Concurrency Visualizer plugin installed to visualize the data.
</code></pre>

<p>The difference between the trace files produced by CVCollectionCmd.exe and my new Diagnoser is that the trace files produced by the diagnoser can <strong>also be opened</strong> with <a href="https://github.com/Microsoft/perfview">PerfView</a> and <a href="https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer">Windows Performance Analyzer</a>. It was just a matter of naming the file correctly ;)</p>

<h2 id="limitations">Limitations</h2>

<p>What we have today comes with the following limitations:</p>

<ul>
  <li>ConcurrencyVisualizerProfiler works only on Windows</li>
  <li>It must be run as Admin (to create the ETW Kernel Session)</li>
  <li>No <code class="language-plaintext highlighter-rouge">InProcessToolchain</code> support</li>
  <li>To get the best possible managed code symbols you should configure your project in the following way:</li>
</ul>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;DebugType&gt;</span>pdbonly<span class="nt">&lt;/DebugType&gt;</span>
<span class="nt">&lt;DebugSymbols&gt;</span>true<span class="nt">&lt;/DebugSymbols&gt;</span>
</code></pre></div></div>

<h2 id="how-to-use-it">How to use it?</h2>

<p>You need to install the latest <code class="language-plaintext highlighter-rouge">BenchmarkDotNet.Diagnostics.Windows</code> package. The diagnoser can be enabled in a few ways, including:</p>

<ul>
  <li>Use the new attribute (apply it on a class that contains Benchmarks):</li>
</ul>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">ConcurrencyVisualizerProfiler</span><span class="p">]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">TheClassThatContainsBenchmarks</span> <span class="p">{</span> <span class="cm">/* benchmarks go here */</span> <span class="p">}</span>
</code></pre></div></div>

<ul>
  <li>Extend the <code class="language-plaintext highlighter-rouge">DefaultConfig.Instance</code> with a new instance of <code class="language-plaintext highlighter-rouge">ConcurrencyVisualizerProfiler</code>:</li>
</ul>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Program</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">Main</span><span class="p">(</span><span class="kt">string</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span> 
        <span class="p">=&gt;</span> <span class="n">BenchmarkSwitcher</span>
            <span class="p">.</span><span class="nf">FromAssembly</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">Program</span><span class="p">).</span><span class="n">Assembly</span><span class="p">)</span>
            <span class="p">.</span><span class="nf">Run</span><span class="p">(</span><span class="n">args</span><span class="p">,</span>
                <span class="n">DefaultConfig</span><span class="p">.</span><span class="n">Instance</span>
                    <span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="k">new</span> <span class="nf">ConcurrencyVisualizerProfiler</span><span class="p">()));</span> <span class="c1">// HERE</span>
<span class="p">}</span>
</code></pre></div></div>

<ul>
  <li>Pass the <code class="language-plaintext highlighter-rouge">-p CV</code> or <code class="language-plaintext highlighter-rouge">--profiler CV</code> command line argument to <code class="language-plaintext highlighter-rouge">BenchmarkSwitcher</code></li>
</ul>

<h2 id="how-to-open-the-cvtrace-file-in-visual-studio">How to open the .CVTrace file in Visual Studio</h2>

<p>After installing Concurrency Visualizer from the <a href="https://marketplace.visualstudio.com/items?itemName=Diagnostics.ConcurrencyVisualizer2017">Visual Studio Marketplace</a> you just need to go to: Analyze -&gt; Concurrency Visualizer -&gt; Open Trace</p>

<p class="center"><img src="/images/cvprofiler/open_trace.png" alt="Open Trace In Visual Studio" /></p>

<h2 id="sample-usages">Sample usages</h2>

<h3 id="cpu-utilization">CPU Utilization</h3>

<p>Let’s use the new diagnoser to find out the CPU utilization of the <code class="language-plaintext highlighter-rouge">BinaryTrees_5</code> benchmark from the <a href="https://github.com/dotnet/performance/blob/master/src/benchmarks/micro/coreclr/BenchmarksGame/binarytrees-5.cs">dotnet/performance</a> repository.</p>

<pre><code class="language-cmd">dotnet run -c Release -f netcoreapp2.1 --filter *BinaryTrees_5* --profiler CV
</code></pre>

<p class="center"><img src="/images/cvprofiler/binary_trees_5.png" alt="Binary Trees 5" /></p>

<p>Note: As you can see, we are using 2 - 2.5 CPUs out of twelve!</p>

<h3 id="synchronization">Synchronization</h3>

<p>Let’s use the new diagnoser to find out what percentage of time is spent on synchronization in the <code class="language-plaintext highlighter-rouge">SpectralNorm_3</code> benchmark from the <a href="https://github.com/dotnet/performance/blob/master/src/benchmarks/micro/coreclr/BenchmarksGame/spectralnorm-3.cs">dotnet/performance</a> repository.</p>

<pre><code class="language-cmd">dotnet run -c Release -f netcoreapp2.1 --filter *spectralnorm_3* --profiler CV
</code></pre>

<p class="center"><img src="/images/cvprofiler/spectral_norm_3.png" alt="Spectral Norm 3" /></p>

<p>Note: 90% of the time is spent on synchronization.</p>

<h3 id="special-thanks">Special Thanks</h3>

<p>I want to thank <a href="https://wojciechnagorski.com/">Wojciech Nagórski</a>, who fixed two bugs (<a href="https://github.com/dotnet/BenchmarkDotNet/pull/962">#962</a>, <a href="https://github.com/dotnet/BenchmarkDotNet/pull/958">#958</a>) that previously required all <code class="language-plaintext highlighter-rouge">BenchmarkDotNet.Diagnostics.Windows</code> users to use some weird workarounds to get it working. Thanks to Wojciech, all you need to do is install the package!</p>

<h2 id="summary">Summary</h2>

<p>Concurrency Visualizer is a very powerful tool that can visualize concurrent code in a user-friendly way. It’s a plugin for Visual Studio which can be downloaded for free from the <a href="https://marketplace.visualstudio.com/items?itemName=Diagnostics.ConcurrencyVisualizer2017">Visual Studio Marketplace</a>.</p>

<p>I really encourage you to read the <a href="https://docs.microsoft.com/en-us/visualstudio/profiling/concurrency-visualizer">docs</a> and watch some <a href="https://channel9.msdn.com/Search?term=Concurrency%20Visualizer">Channel 9 videos</a> and give it a try.</p>

<p>With the new BenchmarkDotNet feature you can get the trace file by running a single command from the console!</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Concurrency Visualizer Profiler ConcurrencyVisualizerProfiler is the new diagnoser for BenchmarkDotNet that I have implemented some time ago. It was released as part of 0.11.3. It allows to profile the benchmarked .NET code on Windows and exports the data to a trace file which can be opened with Concurrency Visualizer (plugin for Visual Studio, used to be a part of it). Again with a single config!]]></summary></entry><entry><title type="html">Sample performance investigation using BenchmarkDotNet and PerfView</title><link href="https://adamsitnik.com/Sample-Perf-Investigation/" rel="alternate" type="text/html" title="Sample performance investigation using BenchmarkDotNet and PerfView" /><published>2018-11-14T00:00:00+00:00</published><updated>2018-11-14T00:00:00+00:00</updated><id>https://adamsitnik.com/Sample-Perf-Investigation</id><content type="html" xml:base="https://adamsitnik.com/Sample-Perf-Investigation/"><![CDATA[<h1 id="introduction">Introduction</h1>

<p>Part of my job on the .NET Team is to improve the performance of existing .NET libraries. My current goal is to identify performance bottlenecks in ML.NET and recognize common performance issues that should be addressed by the .NET framework.</p>

<p>In this blog post, I describe how I approach a sample performance problem using freely available .NET tools and best practices for performance engineering.</p>

<!--more-->

<h2 id="benchmark">Benchmark</h2>

<p>The first thing I need is a good benchmark which tests the performance of the feature that I care about. By good benchmark, I mean something that measures only the thing that I am interested in and produces accurate, stable and repeatable results.</p>

<p><a href="https://github.com/dotnet/machinelearning/tree/master/test/Microsoft.ML.Benchmarks">ML.NET repository</a> has many benchmarks and it’s already using a very good tool for benchmarking (yes, it’s of course BenchmarkDotNet ;) ).</p>

<p>The first thing I do is run all of the real-life scenario benchmarks, order them by time (descending) and importance (based on the info from the manager) and choose the top one.</p>

<ul>
  <li>Why do I choose only the real-life scenario benchmarks? Because I want to improve the end user experience. I don’t want to improve a micro-benchmark which tests only a piece of the end product.</li>
  <li>Why do I take the most time-consuming benchmark? Because the longer it takes to execute some code, the more performance issues it probably has. I don’t have an infinite amount of time and I want to make an impact.</li>
  <li>Why do I focus on the scenarios pointed out by the manager? Because the manager knows what is important for our customers. If I solve a perf issue in a scenario that nobody cares about, it’s not worth much ;)</li>
</ul>

<p>So in my case the benchmark that I decided to focus on is <a href="https://github.com/dotnet/machinelearning/blob/b888db28972307a2792b40692591ec67ac08cff0/test/Microsoft.ML.Benchmarks/Text/MultiClassClassification.cs#L66">CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron</a> which looks like this:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
<span class="k">public</span> <span class="k">void</span> <span class="nf">CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">string</span> <span class="n">cmd</span> <span class="p">=</span> <span class="s">@"CV k=5  data="</span> <span class="p">+</span> <span class="n">_dataPath_Wiki</span> <span class="p">+</span>
        <span class="s">" tr=OVA{p=AveragedPerceptron{iter=10}}"</span> <span class="p">+</span>
        <span class="s">" loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}"</span> <span class="p">+</span>
        <span class="s">" xf=Convert{col=logged_in type=R4}"</span> <span class="p">+</span>
        <span class="s">" xf=CategoricalTransform{col=ns}"</span> <span class="p">+</span>
        <span class="s">" xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor=NGramExtractorTransform{ngram=2}}"</span> <span class="p">+</span>
        <span class="s">" xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D}"</span> <span class="p">+</span>
        <span class="s">" xf=Concat{col=Features:FeaturesText,FeaturesWordEmbedding,logged_in,ns}"</span><span class="p">;</span>

    <span class="k">using</span> <span class="p">(</span><span class="kt">var</span> <span class="n">environment</span> <span class="p">=</span> <span class="n">EnvironmentFactory</span><span class="p">.</span><span class="n">CreateClassificationEnvironment</span><span class="p">&lt;</span><span class="n">TextLoader</span><span class="p">,</span> <span class="n">CategoricalTransform</span><span class="p">,</span> <span class="n">AveragedPerceptronTrainer</span><span class="p">&gt;())</span>
    <span class="p">{</span>
        <span class="n">Maml</span><span class="p">.</span><span class="nf">MainCore</span><span class="p">(</span><span class="n">environment</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="n">alwaysPrintStacktrace</span><span class="p">:</span> <span class="k">false</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="profiler">Profiler</h2>

<p>A benchmark can only tell me how long it takes to execute a given piece of code. What I also need is a Profiler to find out which methods are being executed and for how long. In my case, I am going to use <a href="https://adamsitnik.com/ETW-Profiler/">EtwProfiler</a>, which is a BenchmarkDotNet plugin that uses ETW for profiling.</p>

<h2 id="run-the-benchmark-before-applying-any-changes">Run the benchmark before applying any changes</h2>

<p>To choose the benchmark I use the <code class="language-plaintext highlighter-rouge">--filter</code> option, and to enable EtwProfiler I pass <code class="language-plaintext highlighter-rouge">--profiler ETW</code>. Moreover, I want to store the results in a dedicated folder to be able to compare them later, after I apply some improvements. For this purpose I use <code class="language-plaintext highlighter-rouge">--artifacts</code>.</p>

<p>I don’t want the benchmark or profile to include any noise, so I <strong>close ALL applications except a single command line window!</strong></p>

<p>Then I run the following command:</p>

<pre><code class="language-log">dotnet run -c Release -- --filter *WikiDetox_WordEmbeddings_OVAAveragedPerceptron --profiler ETW --artifacts .\BenchmarkDotNet.Artifacts\before
</code></pre>
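<p>Everything after the standalone <code class="language-plaintext highlighter-rouge">--</code> is passed by <code class="language-plaintext highlighter-rouge">dotnet run</code> to the benchmark application itself, so the application has to hand its arguments over to BenchmarkDotNet. A minimal entry point that does this could look as follows (a sketch; the ML.NET benchmarks project may set it up differently):</p>

<pre><code class="language-cs">using BenchmarkDotNet.Running;

public class Program
{
    // Forwarding args lets BenchmarkDotNet parse --filter, --profiler and --artifacts.
    public static void Main(string[] args)
        =&gt; BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}
</code></pre>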

<p>The regular output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.11.2, OS=Windows 10.0.17134.345 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
Frequency=3507503 Hz, Resolution=285.1031 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-009697
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-OXDQNP : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp2.1  IterationCount=1  
LaunchCount=3 RunStrategy=ColdStart  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron</td>
        <td style="text-align: right">286.7 s</td>
        <td style="text-align: right">5.650 s</td>
      </tr>
    </tbody>
  </table>

</div>

<p>And the path to the trace file with the profile information:</p>

<pre><code class="language-log">// * Diagnostic Output - EtwProfiler *
Exported 1 trace file(s). Example:
C:\Projects\mldotnet\test\Microsoft.ML.Benchmarks\BenchmarkDotNet.Artifacts\20181109-0426-20620\netcoreapp2.1\Microsoft\ML\Benchmarks\MultiClassClassificationTrain\CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron.etl
</code></pre>

<h2 id="analysing-the-trace-file">Analysing the Trace File</h2>

<p>To analyse the data from the Trace file I am using PerfView, which is a free .NET profiler from Microsoft. If you are not familiar with PerfView you should watch <a href="https://channel9.msdn.com/Series/PerfView-Tutorial">these instructional videos</a> first. These videos were recorded by the .NET Performance Architect and PerfView creator, Vance Morrison. Trust me, it’s really worth watching these videos!!!</p>

<p class="center"><img src="/images/profiling_ml_1/one_does_not.jpg" alt="PerfViewUX" /></p>

<p>The first thing I need to do is to open the trace file in PerfView and choose “dotnet --benchmarkName (…)” from the CPU Stacks window (PerfView sorts the processes by CPU consumption, descending):</p>

<p class="center"><img src="/images/profiling_ml_1/select_process.png" alt="Select process" /></p>

<p>The trace file contains symbols for the managed methods emitted by CLR during CLR Rundown. However, it does not contain the native method symbols. But don’t worry! PerfView noticed that the <code class="language-plaintext highlighter-rouge">.pdb</code> file with native symbols is stored on my disk. I have just built this file and I trust it, so I click <code class="language-plaintext highlighter-rouge">Yes</code>.</p>

<p class="center"><img src="/images/profiling_ml_1/security_check.png" alt="Security check" /></p>

<p>In my previous blog post, I described how to use PerfView to filter trace files produced by BenchmarkDotNet so that the time range covers only the actual benchmark execution. You can read it <a href="https://adamsitnik.com/ETW-Profiler/#using-perfview-to-work-with-trace-files">here</a>. In this particular benchmark, I don’t set the time range because the benchmark is executed just once and I also care about CLR startup, so I am interested in the entire process lifetime. When the benchmark is executed many times, I filter the trace to a single benchmark iteration as described in the previous blog post.</p>

<p>Now I go directly to the FlameGraph tab to get a quick overview:</p>

<p class="center"><img src="/images/profiling_ml_1/flame_default_filters.png" alt="Default filters" /></p>

<p>Is that all I need? No! By default, PerfView groups the data by modules. I disable the module grouping (<code class="language-plaintext highlighter-rouge">GroupPats = [no grouping]</code>).</p>

<p class="center"><img src="/images/profiling_ml_1/flame_no_group.png" alt="No grouping" /></p>

<p>But where is the missing data? Most probably hidden by Folding. So let’s set <code class="language-plaintext highlighter-rouge">Fold%=0</code></p>

<p class="center"><img src="/images/profiling_ml_1/flame_no_fold.png" alt="No fold" /></p>

<p>By hovering the mouse over the flame boxes I can see that the code is multi-threaded. And even with Flamegraph, it’s hard to read! So let’s group the data by setting <code class="language-plaintext highlighter-rouge">GroupPats = Thread -&gt;AllThreads</code> (mind the spaces!)</p>

<p class="center"><img src="/images/profiling_ml_1/flame_group_thread.png" alt="AllThreads" /></p>

<p>And let’s set folding to 1% again to make it human-friendly:</p>

<p class="center"><img src="/images/profiling_ml_1/flame_human.png" alt="Folding" /></p>

<p>But as you might notice, some FlameBoxes contain <code class="language-plaintext highlighter-rouge">?!</code> in their names. It means that the symbols for these methods have not been loaded yet. The easiest way of loading them is to go to the <code class="language-plaintext highlighter-rouge">By name</code> tab, press <code class="language-plaintext highlighter-rouge">Ctrl+A</code> (select all) and then <code class="language-plaintext highlighter-rouge">Alt+S</code> (load symbols).</p>

<p class="center"><img src="/images/profiling_ml_1/flames_symbols.png" alt="Symbols loaded" /></p>

<p>IMHO now I have a very good overview and I can start the analysis!</p>

<h2 id="the-bottleneck">The Bottleneck</h2>

<p>By just looking at the Flamegraph of the entire process lifetime I could say that <code class="language-plaintext highlighter-rouge">Parsing</code> might be one of my bottlenecks (it’s the biggest box). But that alone is not enough to identify a bottleneck.</p>

<p class="center"><img src="/images/profiling_ml_1/flame_biggest_block.png" alt="Biggest block" /></p>

<p>When I switch to the “By name” tab I can see all the methods sorted in descending order of exclusive time (the ones that do the actual computations).</p>

<p class="center"><img src="/images/profiling_ml_1/by_name.png" alt="By name" /></p>

<p>But the most important information is visible in the simple histogram:</p>

<p class="center"><img src="/images/profiling_ml_1/histogram.png" alt="Histogram" /></p>

<p>What does this information tell us? That parsing (red block 2) is a performance bottleneck here! Moreover, as you can see, Flamegraph itself gives a great overview but does not tell us about the performance over time. This simple histogram does!</p>

<h2 id="isolate-the-bottleneck">Isolate the bottleneck</h2>

<p>The next step is to write a benchmark that isolates the bottleneck.</p>

<p>In my case it’s the following benchmark:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
<span class="k">public</span> <span class="n">WordEmbeddingsTransform</span> <span class="nf">CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">string</span> <span class="n">cmd</span> <span class="p">=</span> <span class="s">@"CV k=5  data="</span> <span class="p">+</span> <span class="n">_dataPath_Wiki</span> <span class="p">+</span>
        <span class="s">" tr=OVA{p=AveragedPerceptron{iter=10}}"</span> <span class="p">+</span>
        <span class="s">" loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+}"</span> <span class="p">+</span>
        <span class="s">" xf=Convert{col=logged_in type=R4}"</span> <span class="p">+</span>
        <span class="s">" xf=CategoricalTransform{col=ns}"</span> <span class="p">+</span>
        <span class="s">" xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor=NGramExtractorTransform{ngram=2}}"</span> <span class="p">+</span>
        <span class="s">" xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D}"</span> <span class="p">+</span>
        <span class="s">" xf=Concat{col=Features:FeaturesText,FeaturesWordEmbedding,logged_in,ns}"</span><span class="p">;</span>

    <span class="k">using</span> <span class="p">(</span><span class="kt">var</span> <span class="n">environment</span> <span class="p">=</span> <span class="n">EnvironmentFactory</span><span class="p">.</span><span class="n">CreateClassificationEnvironment</span><span class="p">&lt;</span><span class="n">TextLoader</span><span class="p">,</span> <span class="n">CategoricalTransform</span><span class="p">,</span> <span class="n">AveragedPerceptronTrainer</span><span class="p">&gt;())</span>
    <span class="p">{</span>
        <span class="k">return</span> <span class="k">new</span> <span class="nf">WordEmbeddingsTransform</span><span class="p">(</span>
            <span class="n">environment</span><span class="p">,</span>
            <span class="n">modelKind</span><span class="p">:</span> <span class="n">WordEmbeddingsTransform</span><span class="p">.</span><span class="n">PretrainedModelKind</span><span class="p">.</span><span class="n">FastTextWikipedia300D</span><span class="p">,</span>
            <span class="k">new</span> <span class="n">WordEmbeddingsTransform</span><span class="p">.</span><span class="nf">ColumnInfo</span><span class="p">(</span><span class="s">"FeaturesText_TransformedText"</span><span class="p">,</span> <span class="s">"FeaturesWordEmbedding"</span><span class="p">));</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Now I turn everything off again and run the benchmark with <code class="language-plaintext highlighter-rouge">EtwProfiler</code> enabled. The results I get are:</p>

<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse</td>
        <td style="text-align: right">153.0 s</td>
      </tr>
    </tbody>
  </table>

</div>

<h2 id="analyse-the-bottlneck-profile">Analyse the bottleneck profile</h2>

<p>After some filtering in PerfView we can see the following Flamegraph:</p>

<p class="center"><img src="/images/profiling_ml_1/flame_bottleneck.png" alt="Bottleneck flame graph" /></p>

<p>Explanation:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">float.TryParse</code> - 56% - it’s the cost of parsing a float, there is very little we can do about it (quickly)</li>
  <li><code class="language-plaintext highlighter-rouge">NumberFormatInfo.CurrentInfo</code> - 8% - any time we call <code class="language-plaintext highlighter-rouge">float.TryParse</code> without providing a <code class="language-plaintext highlighter-rouge">NumberFormatInfo</code>, the framework calls <code class="language-plaintext highlighter-rouge">NumberFormatInfo.CurrentInfo</code>. We can easily read it once and pass it explicitly to save the 8%.</li>
  <li><code class="language-plaintext highlighter-rouge">String.Split</code> - 15% - we should not be using <code class="language-plaintext highlighter-rouge">Split</code> when we can do slicing with <a href="https://adamsitnik.com/Span/#slicing-without-managed-heap-allocations">Span!</a></li>
  <li><code class="language-plaintext highlighter-rouge">StreamReader.ReadLine</code> - 12% - it’s the cost of reading a file, there is very little we can do about it (quickly)</li>
  <li><code class="language-plaintext highlighter-rouge">String.Substring</code> - 1% - again we should not be using <code class="language-plaintext highlighter-rouge">Substring</code> when we can do slicing with <a href="https://adamsitnik.com/Span/#slicing-without-managed-heap-allocations">Span!</a></li>
</ol>
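
<p>To make points 2, 3 and 5 more concrete, here is a minimal sketch of both ideas in isolation. It is not the ML.NET code: the names <code class="language-plaintext highlighter-rouge">HotPathSketch</code> and <code class="language-plaintext highlighter-rouge">TryParseSecondToken</code> are made up for this example, it assumes a runtime where <code class="language-plaintext highlighter-rouge">float.TryParse</code> accepts <code class="language-plaintext highlighter-rouge">ReadOnlySpan&lt;char&gt;</code> (.NET Core 2.1 or newer), and it uses <code class="language-plaintext highlighter-rouge">NumberFormatInfo.InvariantInfo</code> only to keep the sample culture-independent.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Globalization;

public static class HotPathSketch
{
    // Hypothetical helper: parses the number that follows the first space in "key value".
    public static bool TryParseSecondToken(string line, NumberFormatInfo info, out float value)
    {
        int separator = line.IndexOf(' ');
        if (separator &lt; 0)
        {
            value = 0f;
            return false;
        }

        // AsSpan + Slice create a view over the existing string,
        // so unlike Substring no new string is allocated.
        ReadOnlySpan&lt;char&gt; token = line.AsSpan().Slice(separator + 1);

        // Passing the NumberFormatInfo explicitly means TryParse
        // does not have to look up NumberFormatInfo.CurrentInfo on every call.
        return float.TryParse(token, NumberStyles.Float, info, out value);
    }

    public static void Main()
    {
        // Read once, outside of any loop (assumption: invariant culture for the sample).
        NumberFormatInfo info = NumberFormatInfo.InvariantInfo;

        bool ok = TryParseSecondToken("key 0.5", info, out float parsed);
        Console.WriteLine($"{ok} {parsed.ToString(info)}");
    }
}
</code></pre></div></div>

<p>The caller owns the <code class="language-plaintext highlighter-rouge">NumberFormatInfo</code> instance, so a loop over millions of lines pays for the lookup only once.</p>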

<h2 id="make-sure-the-code-has-unit-test-coverage">Make sure the code has unit test coverage</h2>

<p>Before I apply any optimizations I need to make sure that I have some unit tests, so that I don’t introduce any new bugs! I don’t know the product well and I don’t want to waste my time on manual testing. Moreover, having a test committed before the changes will make it more likely for the project maintainers to accept my optimizations in a PR.</p>

<p>So I just write the following test first:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">LineParserTests</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="k">static</span> <span class="n">IEnumerable</span><span class="p">&lt;</span><span class="kt">object</span><span class="p">[</span><span class="k">]&gt;</span> <span class="nf">ValidInputs</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">line</span> <span class="k">in</span> <span class="k">new</span> <span class="kt">string</span><span class="p">[]</span>
        <span class="p">{</span>
            <span class="s">"key 0.1 0.2 0.3"</span><span class="p">,</span> <span class="s">"key 0.1 0.2 0.3 "</span><span class="p">,</span>
            <span class="s">"key\t0.1\t0.2\t0.3"</span><span class="p">,</span> <span class="s">"key\t0.1\t0.2\t0.3\t"</span> <span class="c1">// tab can also be a separator</span>
        <span class="p">})</span>
        <span class="p">{</span>
            <span class="k">yield</span> <span class="k">return</span> <span class="k">new</span> <span class="kt">object</span><span class="p">[]</span> <span class="p">{</span> <span class="n">line</span><span class="p">,</span> <span class="s">"key"</span><span class="p">,</span> <span class="k">new</span> <span class="kt">float</span><span class="p">[]</span> <span class="p">{</span> <span class="m">0.1f</span><span class="p">,</span> <span class="m">0.2f</span><span class="p">,</span> <span class="m">0.3f</span> <span class="p">}</span> <span class="p">};</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="n">Theory</span><span class="p">]</span>
    <span class="p">[</span><span class="nf">MemberData</span><span class="p">(</span><span class="k">nameof</span><span class="p">(</span><span class="n">ValidInputs</span><span class="p">))]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">WhenProvidedAValidInputParserParsesKeyAndValues</span><span class="p">(</span><span class="kt">string</span> <span class="n">input</span><span class="p">,</span> <span class="kt">string</span> <span class="n">expectedKey</span><span class="p">,</span> <span class="kt">float</span><span class="p">[]</span> <span class="n">expectedValues</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">result</span> <span class="p">=</span> <span class="n">Transforms</span><span class="p">.</span><span class="n">Text</span><span class="p">.</span><span class="n">LineParser</span><span class="p">.</span><span class="nf">ParseKeyThenNumbers</span><span class="p">(</span><span class="n">input</span><span class="p">);</span>

        <span class="n">Assert</span><span class="p">.</span><span class="nf">True</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">isSuccess</span><span class="p">);</span>
        <span class="n">Assert</span><span class="p">.</span><span class="nf">Equal</span><span class="p">(</span><span class="n">expectedKey</span><span class="p">,</span> <span class="n">result</span><span class="p">.</span><span class="n">key</span><span class="p">);</span>
        <span class="n">Assert</span><span class="p">.</span><span class="nf">Equal</span><span class="p">(</span><span class="n">expectedValues</span><span class="p">,</span> <span class="n">result</span><span class="p">.</span><span class="n">values</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="n">Theory</span><span class="p">]</span>
    <span class="p">[</span><span class="nf">InlineData</span><span class="p">(</span><span class="s">""</span><span class="p">)]</span>
    <span class="p">[</span><span class="nf">InlineData</span><span class="p">(</span><span class="s">"key 0.1 NOT_A_NUMBER"</span><span class="p">)]</span> <span class="c1">// invalid number</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">WhenProvidedAnInvalidInputParserReturnsFailure</span><span class="p">(</span><span class="kt">string</span> <span class="n">input</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">Assert</span><span class="p">.</span><span class="nf">False</span><span class="p">(</span><span class="n">Transforms</span><span class="p">.</span><span class="n">Text</span><span class="p">.</span><span class="n">LineParser</span><span class="p">.</span><span class="nf">ParseKeyThenNumbers</span><span class="p">(</span><span class="n">input</span><span class="p">).</span><span class="n">isSuccess</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Writing a test first is an investment that pays off very quickly! I have never regretted writing a test, but the few times I didn’t write one I had to pay for it later.</p>

<p>Once I have the tests I move the existing parsing logic to a dedicated method:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">static</span> <span class="p">(</span><span class="kt">bool</span> <span class="n">isSuccess</span><span class="p">,</span> <span class="kt">string</span> <span class="n">key</span><span class="p">,</span> <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span><span class="p">)</span> <span class="nf">ParseKeyThenNumbers</span><span class="p">(</span><span class="kt">string</span> <span class="n">line</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">char</span><span class="p">[]</span> <span class="n">delimiters</span> <span class="p">=</span> <span class="p">{</span> <span class="sc">' '</span><span class="p">,</span> <span class="sc">'\t'</span> <span class="p">};</span>
    <span class="kt">string</span><span class="p">[]</span> <span class="n">words</span> <span class="p">=</span> <span class="n">line</span><span class="p">.</span><span class="nf">TrimEnd</span><span class="p">().</span><span class="nf">Split</span><span class="p">(</span><span class="n">delimiters</span><span class="p">);</span>
    <span class="kt">string</span> <span class="n">key</span> <span class="p">=</span> <span class="n">words</span><span class="p">[</span><span class="m">0</span><span class="p">];</span>
    <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span> <span class="p">=</span> <span class="n">words</span><span class="p">.</span><span class="nf">Skip</span><span class="p">(</span><span class="m">1</span><span class="p">).</span><span class="nf">Select</span><span class="p">(</span><span class="n">x</span> <span class="p">=&gt;</span> <span class="kt">float</span><span class="p">.</span><span class="nf">TryParse</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="k">out</span> <span class="kt">var</span> <span class="n">tmp</span><span class="p">)</span> <span class="p">?</span> <span class="n">tmp</span> <span class="p">:</span> <span class="n">Single</span><span class="p">.</span><span class="n">NaN</span><span class="p">).</span><span class="nf">ToArray</span><span class="p">();</span>
    <span class="k">if</span> <span class="p">(!</span><span class="n">values</span><span class="p">.</span><span class="nf">Contains</span><span class="p">(</span><span class="n">Single</span><span class="p">.</span><span class="n">NaN</span><span class="p">))</span>
        <span class="k">return</span> <span class="p">(</span><span class="k">true</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">values</span><span class="p">);</span>

    <span class="k">return</span> <span class="p">(</span><span class="k">false</span><span class="p">,</span> <span class="k">null</span><span class="p">,</span> <span class="k">null</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="closer-look">Closer look</h2>

<p>Let’s analyse the code from a perf perspective, line by line:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">char[] delimiters = { ' ', '\t' };</code> - the array is allocated every time the method is called. It should be moved to a static readonly field. (In the original code it was allocated once per file, so it was not that bad.)</li>
  <li><code class="language-plaintext highlighter-rouge">line.TrimEnd</code> - this method allocates a new string if trimming is required</li>
  <li><code class="language-plaintext highlighter-rouge">Split(delimiters)</code> - this method allocates an array of strings and the strings themselves</li>
  <li><code class="language-plaintext highlighter-rouge">words.Skip(1).Select(x =&gt; float.TryParse(x, out var tmp) ? tmp : Single.NaN).ToArray()</code> - every LINQ method allocates an enumerator. Moreover, <code class="language-plaintext highlighter-rouge">ToArray</code> allocates an entire array. Typically it’s not an issue, but here every cycle matters (we are on a very hot path).</li>
  <li><code class="language-plaintext highlighter-rouge">values.Contains(Single.NaN)</code> - <code class="language-plaintext highlighter-rouge">Contains</code> is <code class="language-plaintext highlighter-rouge">O(n)</code> and it’s not required here. We should just stop when <code class="language-plaintext highlighter-rouge">TryParse</code> returns false.</li>
</ul>
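
<p>The first bullet can be sketched as follows. This is only an illustration: the class name <code class="language-plaintext highlighter-rouge">DelimitersSketch</code> and the method <code class="language-plaintext highlighter-rouge">SplitLine</code> are made up for this example, while the field name <code class="language-plaintext highlighter-rouge">_delimiters</code> matches the one used by the optimized method.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class DelimitersSketch
{
    // Allocated once, when the type is initialized, instead of on every call.
    private static readonly char[] _delimiters = { ' ', '\t' };

    // Hypothetical helper used only to show the field in action.
    public static string[] SplitLine(string line)
    {
        // Split still allocates the result array and the substrings,
        // but it no longer needs a freshly allocated delimiters array.
        return line.Split(_delimiters);
    }

    public static void Main()
    {
        string[] words = SplitLine("key\t0.1 0.2");
        Console.WriteLine(string.Join("|", words));
    }
}
</code></pre></div></div>

<p>The array must never be mutated after initialization, because every caller shares the same instance.</p>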

<h2 id="apply-the-optimizations">Apply the optimizations</h2>

<p>If we target .NET Standard 2.0, we can’t use the methods that accept <a href="http://adamsitnik.com/Span/">Span</a>, such as <code class="language-plaintext highlighter-rouge">float.TryParse(ReadOnlySpan&lt;char&gt;)</code>, so we just remove the LINQ and read <code class="language-plaintext highlighter-rouge">NumberFormatInfo.CurrentInfo</code> once, outside of the loop:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">static</span> <span class="p">(</span><span class="kt">bool</span> <span class="n">isSuccess</span><span class="p">,</span> <span class="kt">string</span> <span class="n">key</span><span class="p">,</span> <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span><span class="p">)</span> <span class="nf">ParseKeyThenNumbers</span><span class="p">(</span><span class="kt">string</span> <span class="n">line</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="kt">string</span><span class="p">.</span><span class="nf">IsNullOrWhiteSpace</span><span class="p">(</span><span class="n">line</span><span class="p">))</span>
        <span class="k">return</span> <span class="p">(</span><span class="k">false</span><span class="p">,</span> <span class="k">null</span><span class="p">,</span> <span class="k">null</span><span class="p">);</span>

    <span class="kt">string</span><span class="p">[]</span> <span class="n">words</span> <span class="p">=</span> <span class="n">line</span><span class="p">.</span><span class="nf">TrimEnd</span><span class="p">().</span><span class="nf">Split</span><span class="p">(</span><span class="n">_delimiters</span><span class="p">);</span>

    <span class="n">NumberFormatInfo</span> <span class="n">info</span> <span class="p">=</span> <span class="n">NumberFormatInfo</span><span class="p">.</span><span class="n">CurrentInfo</span><span class="p">;</span> <span class="c1">// moved outside the loop to save 8% of the time</span>
    <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">float</span><span class="p">[</span><span class="n">words</span><span class="p">.</span><span class="n">Length</span> <span class="p">-</span> <span class="m">1</span><span class="p">];</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">words</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
    <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="kt">float</span><span class="p">.</span><span class="nf">TryParse</span><span class="p">(</span><span class="n">words</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">NumberStyles</span><span class="p">.</span><span class="n">Float</span> <span class="p">|</span> <span class="n">NumberStyles</span><span class="p">.</span><span class="n">AllowThousands</span><span class="p">,</span> <span class="n">info</span><span class="p">,</span> <span class="k">out</span> <span class="kt">float</span> <span class="n">parsed</span><span class="p">))</span>
            <span class="n">values</span><span class="p">[</span><span class="n">i</span> <span class="p">-</span> <span class="m">1</span><span class="p">]</span> <span class="p">=</span> <span class="n">parsed</span><span class="p">;</span>
        <span class="k">else</span>
            <span class="k">return</span> <span class="p">(</span><span class="k">false</span><span class="p">,</span> <span class="k">null</span><span class="p">,</span> <span class="k">null</span><span class="p">);</span> <span class="c1">// fail as soon as something is wrong</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="p">(</span><span class="k">true</span><span class="p">,</span> <span class="n">words</span><span class="p">[</span><span class="m">0</span><span class="p">],</span> <span class="n">values</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which gives us the following result:</p>

<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse</td>
        <td style="text-align: right">141.0 s</td>
      </tr>
    </tbody>
  </table>

</div>

<p>That is close to the 8% we expected to save by moving <code class="language-plaintext highlighter-rouge">NumberFormatInfo.CurrentInfo</code> outside of the loop (going from 153.0 s to 141.0 s is a reduction of roughly 7.8%). I am not happy about the fact that I had to use such a trick to make it faster, so I reported an <a href="https://github.com/dotnet/coreclr/issues/20938">issue</a> in the JIT repo.</p>

<p>However, with .NET Standard 2.1 or just .NET Core 2.1+ we can take full advantage of Span!</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">static</span> <span class="p">(</span><span class="kt">bool</span> <span class="n">isSuccess</span><span class="p">,</span> <span class="kt">string</span> <span class="n">key</span><span class="p">,</span> <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span><span class="p">)</span> <span class="nf">ParseKeyThenNumbers</span><span class="p">(</span><span class="kt">string</span> <span class="n">line</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="kt">string</span><span class="p">.</span><span class="nf">IsNullOrWhiteSpace</span><span class="p">(</span><span class="n">line</span><span class="p">))</span>
        <span class="k">return</span> <span class="p">(</span><span class="k">false</span><span class="p">,</span> <span class="k">null</span><span class="p">,</span> <span class="k">null</span><span class="p">);</span>

    <span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">char</span><span class="p">&gt;</span> <span class="n">trimmedLine</span> <span class="p">=</span> <span class="n">line</span><span class="p">.</span><span class="nf">AsSpan</span><span class="p">().</span><span class="nf">TrimEnd</span><span class="p">();</span> <span class="c1">// TrimEnd creates a Span, no allocations</span>

    <span class="kt">int</span> <span class="n">firstSeparatorIndex</span> <span class="p">=</span> <span class="n">trimmedLine</span><span class="p">.</span><span class="nf">IndexOfAny</span><span class="p">(</span><span class="sc">' '</span><span class="p">,</span> <span class="sc">'\t'</span><span class="p">);</span> <span class="c1">// the first word is the key, we just skip it</span>
    <span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">char</span><span class="p">&gt;</span> <span class="n">valuesToParse</span> <span class="p">=</span> <span class="n">trimmedLine</span><span class="p">.</span><span class="nf">Slice</span><span class="p">(</span><span class="n">start</span><span class="p">:</span> <span class="n">firstSeparatorIndex</span> <span class="p">+</span> <span class="m">1</span><span class="p">);</span>

    <span class="kt">int</span> <span class="n">valuesCount</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="c1">// we count the number of values first to allocate a single array of the proper size</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">valuesToParse</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">valuesToParse</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">==</span> <span class="sc">' '</span> <span class="p">||</span> <span class="n">valuesToParse</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">==</span> <span class="sc">'\t'</span><span class="p">)</span>
            <span class="n">valuesCount</span><span class="p">++;</span>

    <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">float</span><span class="p">[</span><span class="n">valuesCount</span> <span class="p">+</span> <span class="m">1</span><span class="p">];</span> <span class="c1">// + 1 because the line is trimmed and there is no whitespace at the end</span>
    <span class="kt">int</span> <span class="n">textStart</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">valueIndex</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
    <span class="n">NumberFormatInfo</span> <span class="n">info</span> <span class="p">=</span> <span class="n">NumberFormatInfo</span><span class="p">.</span><span class="n">CurrentInfo</span><span class="p">;</span> <span class="c1">// moved outside the loop to save 8% of the time</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;=</span> <span class="n">valuesToParse</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
    <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="p">==</span> <span class="n">valuesToParse</span><span class="p">.</span><span class="n">Length</span> <span class="p">||</span> <span class="n">valuesToParse</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">==</span> <span class="sc">' '</span> <span class="p">||</span> <span class="n">valuesToParse</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">==</span> <span class="sc">'\t'</span><span class="p">)</span>
        <span class="p">{</span>
            <span class="kt">var</span> <span class="n">toParse</span> <span class="p">=</span> <span class="n">valuesToParse</span><span class="p">.</span><span class="nf">Slice</span><span class="p">(</span><span class="n">textStart</span><span class="p">,</span> <span class="n">i</span> <span class="p">-</span> <span class="n">textStart</span><span class="p">);</span>

            <span class="k">if</span> <span class="p">(</span><span class="kt">float</span><span class="p">.</span><span class="nf">TryParse</span><span class="p">(</span><span class="n">toParse</span><span class="p">,</span> <span class="n">NumberStyles</span><span class="p">.</span><span class="n">Float</span> <span class="p">|</span> <span class="n">NumberStyles</span><span class="p">.</span><span class="n">AllowThousands</span><span class="p">,</span> <span class="n">info</span><span class="p">,</span> <span class="k">out</span> <span class="kt">float</span> <span class="n">parsed</span><span class="p">))</span>
                <span class="n">values</span><span class="p">[</span><span class="n">valueIndex</span><span class="p">++]</span> <span class="p">=</span> <span class="n">parsed</span><span class="p">;</span>
            <span class="k">else</span>
                <span class="k">return</span> <span class="p">(</span><span class="k">false</span><span class="p">,</span> <span class="k">null</span><span class="p">,</span> <span class="k">null</span><span class="p">);</span>

            <span class="n">textStart</span> <span class="p">=</span> <span class="n">i</span> <span class="p">+</span> <span class="m">1</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="p">(</span><span class="k">true</span><span class="p">,</span> <span class="k">new</span> <span class="kt">string</span><span class="p">(</span><span class="n">trimmedLine</span><span class="p">.</span><span class="nf">Slice</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">firstSeparatorIndex</span><span class="p">)),</span> <span class="n">values</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Which gives us the following result:</p>

<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse</td>
        <td style="text-align: right">129.1 s</td>
      </tr>
    </tbody>
  </table>

</div>

<ul>
  <li>Does the code look cleaner? No! It’s harder to understand what it does. I have sacrificed readability for performance only because the gain was worth it. I don’t do it by default everywhere in our app, and neither should you.</li>
  <li>Does it produce correct results? Yes, I have unit tests that verify the correctness.</li>
</ul>

<h2 id="utf8parser">Utf8Parser</h2>

<p>The code sample with Span looks really complicated. It would be nice if .NET could provide some primitives to help with such scenarios. The good news is that .NET Core 2.1 introduced a new type called <code class="language-plaintext highlighter-rouge">Utf8Parser</code>. Honestly, I forgot about its existence, but my teammate Tanner reminded me of it in <a href="https://github.com/dotnet/machinelearning/pull/1599#issuecomment-438141518">code review</a>. Thank you Tanner!</p>

<p>Using <code class="language-plaintext highlighter-rouge">Utf8Parser</code> simplifies my code a lot:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">internal</span> <span class="k">static</span> <span class="p">(</span><span class="kt">bool</span> <span class="n">isSuccess</span><span class="p">,</span> <span class="kt">string</span> <span class="n">key</span><span class="p">,</span> <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span><span class="p">)</span> <span class="nf">ParseKeyThenNumbers</span><span class="p">(</span><span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">line</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">line</span><span class="p">.</span><span class="n">IsEmpty</span><span class="p">)</span>
        <span class="k">return</span> <span class="p">(</span><span class="k">false</span><span class="p">,</span> <span class="k">null</span><span class="p">,</span> <span class="k">null</span><span class="p">);</span>

    <span class="kt">int</span> <span class="n">firstSeparatorIndex</span> <span class="p">=</span> <span class="n">line</span><span class="p">.</span><span class="nf">IndexOfAny</span><span class="p">((</span><span class="kt">byte</span><span class="p">)</span><span class="sc">' '</span><span class="p">,</span> <span class="p">(</span><span class="kt">byte</span><span class="p">)</span><span class="sc">'\t'</span><span class="p">);</span> <span class="c1">// the first word is the key, we just skip it</span>
    <span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">valuesToParse</span> <span class="p">=</span> <span class="n">line</span><span class="p">.</span><span class="nf">Slice</span><span class="p">(</span><span class="n">start</span><span class="p">:</span> <span class="n">firstSeparatorIndex</span> <span class="p">+</span> <span class="m">1</span><span class="p">);</span>

    <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span> <span class="p">=</span> <span class="nf">AllocateFixedSizeArrayToStoreParsedValues</span><span class="p">(</span><span class="n">valuesToParse</span><span class="p">);</span>

    <span class="kt">int</span> <span class="n">toParseStartIndex</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>

    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">valueIndex</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">valueIndex</span> <span class="p">&lt;</span> <span class="n">values</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">valueIndex</span><span class="p">++)</span>
    <span class="p">{</span>
        <span class="k">if</span> <span class="p">(!</span><span class="n">Utf8Parser</span><span class="p">.</span><span class="nf">TryParse</span><span class="p">(</span><span class="n">valuesToParse</span><span class="p">.</span><span class="nf">Slice</span><span class="p">(</span><span class="n">start</span><span class="p">:</span> <span class="n">toParseStartIndex</span><span class="p">),</span> <span class="k">out</span> <span class="kt">float</span> <span class="n">parsed</span><span class="p">,</span> <span class="k">out</span> <span class="kt">int</span> <span class="n">bytesConsumed</span><span class="p">))</span>
            <span class="k">return</span> <span class="p">(</span><span class="k">false</span><span class="p">,</span> <span class="k">null</span><span class="p">,</span> <span class="k">null</span><span class="p">);</span>

        <span class="n">values</span><span class="p">[</span><span class="n">valueIndex</span><span class="p">]</span> <span class="p">=</span> <span class="n">parsed</span><span class="p">;</span>
        <span class="n">toParseStartIndex</span> <span class="p">+=</span> <span class="n">bytesConsumed</span> <span class="p">+</span> <span class="m">1</span><span class="p">;</span> <span class="c1">// + 1 is for the whitespace!</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="p">(</span><span class="k">true</span><span class="p">,</span> <span class="n">Encoding</span><span class="p">.</span><span class="n">UTF8</span><span class="p">.</span><span class="nf">GetString</span><span class="p">(</span><span class="n">line</span><span class="p">.</span><span class="nf">Slice</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">firstSeparatorIndex</span><span class="p">)),</span> <span class="n">values</span><span class="p">);</span>
<span class="p">}</span>

<span class="c1">/// &lt;summary&gt;</span>
<span class="c1">/// we count the number of values first to allocate a single array of the proper size</span>
<span class="c1">/// &lt;/summary&gt;</span>
<span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">AggressiveInlining</span><span class="p">)]</span>
<span class="k">private</span> <span class="k">static</span> <span class="kt">float</span><span class="p">[]</span> <span class="nf">AllocateFixedSizeArrayToStoreParsedValues</span><span class="p">(</span><span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">valuesToParse</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">int</span> <span class="n">valuesCount</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>

    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">valuesToParse</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">valuesToParse</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">==</span> <span class="sc">' '</span> <span class="p">||</span> <span class="n">valuesToParse</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">==</span> <span class="sc">'\t'</span><span class="p">)</span>
            <span class="n">valuesCount</span><span class="p">++;</span>

    <span class="k">return</span> <span class="k">new</span> <span class="kt">float</span><span class="p">[</span><span class="n">valuesCount</span><span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>
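
<p>To see the <code class="language-plaintext highlighter-rouge">bytesConsumed</code> mechanics in isolation, here is a minimal, self-contained sketch. It is not the ML.NET code: the <code class="language-plaintext highlighter-rouge">Utf8ParserDemo</code> class and its <code class="language-plaintext highlighter-rouge">ParseAll</code> helper are hypothetical names used only for illustration, and the sketch assumes the values are separated by exactly one byte.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Buffers.Text;
using System.Text;

public static class Utf8ParserDemo
{
    // Parses separator-delimited floats straight from UTF-8 bytes.
    // Returns the number of values written to the destination span.
    public static int ParseAll(ReadOnlySpan&lt;byte&gt; utf8, Span&lt;float&gt; destination)
    {
        int count = 0;

        while (!utf8.IsEmpty &amp;&amp; count &lt; destination.Length)
        {
            // TryParse reads the number at the start of the span and reports its length in bytes
            if (!Utf8Parser.TryParse(utf8, out float value, out int bytesConsumed))
                break;

            destination[count++] = value;

            // skip the parsed number and, if present, the single separator byte after it
            int next = Math.Min(bytesConsumed + 1, utf8.Length);
            utf8 = utf8.Slice(next);
        }

        return count;
    }

    public static void Main()
    {
        byte[] line = Encoding.UTF8.GetBytes("1.5 2.25 -3");
        Span&lt;float&gt; parsed = stackalloc float[3];

        int count = ParseAll(line, parsed);

        Console.WriteLine(count);
        Console.WriteLine(string.Join("; ", parsed.Slice(0, count).ToArray()));
    }
}
</code></pre></div></div>

<p>No intermediate <code class="language-plaintext highlighter-rouge">string</code> is created for any of the values; the only allocation in <code class="language-plaintext highlighter-rouge">Main</code> besides the input bytes is the array used for printing.</p>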

<h2 id="whats-next">What’s next?</h2>

<p>Let’s take a look at the Flamegraph of our optimized method. We have:</p>

<ol>
  <li><code class="language-plaintext highlighter-rouge">float.TryParse</code> - there is not really much we can do here without big changes</li>
  <li><code class="language-plaintext highlighter-rouge">StreamReader.ReadLine</code> - same as above</li>
</ol>

<p class="center"><img src="/images/profiling_ml_1/flame_optimized_span.png" alt="Optimized" /></p>

<p>Are we done here? NO! Two minutes to parse a 6 GB text file is still too much. What can we do when we can’t optimize the single-threaded code any further? We parallelize it!</p>

<h2 id="parallel">Parallel</h2>

<p>Once we have squeezed the single-threaded code as far as it will go, we can parallelize it. With <code class="language-plaintext highlighter-rouge">System.Threading.Tasks.Parallel*</code> and <code class="language-plaintext highlighter-rouge">System.Collections.Concurrent*</code> it’s really easy!</p>

<p>Before (some parts omitted for brevity):</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="p">(</span><span class="n">StreamReader</span> <span class="n">sr</span> <span class="p">=</span> <span class="n">File</span><span class="p">.</span><span class="nf">OpenText</span><span class="p">(</span><span class="n">_modelFileNameWithPath</span><span class="p">))</span>
<span class="p">{</span>
    <span class="k">while</span> <span class="p">((</span><span class="n">line</span> <span class="p">=</span> <span class="n">sr</span><span class="p">.</span><span class="nf">ReadLine</span><span class="p">())</span> <span class="p">!=</span> <span class="k">null</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="p">(</span><span class="kt">bool</span> <span class="n">isSuccess</span><span class="p">,</span> <span class="kt">string</span> <span class="n">key</span><span class="p">,</span> <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span><span class="p">)</span> <span class="p">=</span> <span class="n">LineParser</span><span class="p">.</span><span class="nf">ParseKeyThenNumbers</span><span class="p">(</span><span class="n">line</span><span class="p">);</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">isSuccess</span><span class="p">)</span>
            <span class="n">model</span><span class="p">.</span><span class="nf">AddWordVector</span><span class="p">(</span><span class="n">ch</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">values</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>After (again some parts omitted for brevity):</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">parsedData</span> <span class="p">=</span> <span class="k">new</span> <span class="n">ConcurrentBag</span><span class="p">&lt;(</span><span class="kt">string</span> <span class="n">key</span><span class="p">,</span> <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span><span class="p">,</span> <span class="kt">long</span> <span class="n">lineNumber</span><span class="p">)&gt;();</span>

<span class="n">Parallel</span><span class="p">.</span><span class="nf">ForEach</span><span class="p">(</span><span class="n">File</span><span class="p">.</span><span class="nf">ReadLines</span><span class="p">(</span><span class="n">_modelFileNameWithPath</span><span class="p">),</span>
    <span class="p">(</span><span class="n">line</span><span class="p">,</span> <span class="n">parallelState</span><span class="p">,</span> <span class="n">lineNumber</span><span class="p">)</span> <span class="p">=&gt;</span>
    <span class="p">{</span>
        <span class="p">(</span><span class="kt">bool</span> <span class="n">isSuccess</span><span class="p">,</span> <span class="kt">string</span> <span class="n">key</span><span class="p">,</span> <span class="kt">float</span><span class="p">[]</span> <span class="n">values</span><span class="p">)</span> <span class="p">=</span> <span class="n">LineParser</span><span class="p">.</span><span class="nf">ParseKeyThenNumbers</span><span class="p">(</span><span class="n">line</span><span class="p">);</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">isSuccess</span><span class="p">)</span>
            <span class="n">parsedData</span><span class="p">.</span><span class="nf">Add</span><span class="p">((</span><span class="n">key</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">lineNumber</span><span class="p">));</span>
    <span class="p">});</span>

<span class="k">foreach</span> <span class="p">(</span><span class="kt">var</span> <span class="n">parsedLine</span> <span class="k">in</span> <span class="n">parsedData</span><span class="p">.</span><span class="nf">OrderBy</span><span class="p">(</span><span class="n">parsedLine</span> <span class="p">=&gt;</span> <span class="n">parsedLine</span><span class="p">.</span><span class="n">lineNumber</span><span class="p">))</span>
    <span class="n">model</span><span class="p">.</span><span class="nf">AddWordVector</span><span class="p">(</span><span class="n">ch</span><span class="p">,</span> <span class="n">parsedLine</span><span class="p">.</span><span class="n">key</span><span class="p">,</span> <span class="n">parsedLine</span><span class="p">.</span><span class="n">values</span><span class="p">);</span>
</code></pre></div></div>

<p>And the new results with a <strong>3x</strong> speedup:</p>

<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>CV_Multiclass_WikiDetox_WordEmbeddings_OVAAveragedPerceptron_JustParse</td>
        <td style="text-align: right">39.89 s</td>
      </tr>
    </tbody>
  </table>

</div>

<p><strong>Important:</strong></p>

<ul>
  <li>I have used a specialized concurrent collection here that allows me to add items in a thread-safe way without taking locks myself. Adding manual synchronization would ruin the performance. Do use the concurrent collections, and try to avoid using locks whenever you can!</li>
  <li>The order of lines is important, so after processing the entire file I order the elements by line number and only then add them to the model. I did not know that, but the existing unit test reminded me of it very quickly ;)</li>
  <li>I did not want to create any extra memory pressure for the GC, so I have used <code class="language-plaintext highlighter-rouge">ValueTuple</code> represented as <code class="language-plaintext highlighter-rouge">(bool isSuccess, string key, float[] values)</code>. <code class="language-plaintext highlighter-rouge">ValueTuple</code> is a Value Type, you can read my <a href="https://adamsitnik.com/Value-Types-vs-Reference-Types/">previous blog post</a> to learn more.</li>
</ul>
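
<p>The points above can be tried out on a toy input. The following is a minimal sketch, not the ML.NET code: <code class="language-plaintext highlighter-rouge">OrderedParallelDemo</code> and <code class="language-plaintext highlighter-rouge">ParseInParallel</code> are hypothetical names, and parsing an <code class="language-plaintext highlighter-rouge">int</code> stands in for the real line parser. It uses the <code class="language-plaintext highlighter-rouge">Parallel.ForEach</code> overload that passes the element index to the body, stores value tuples in a <code class="language-plaintext highlighter-rouge">ConcurrentBag</code>, and restores the input order at the end.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class OrderedParallelDemo
{
    public static List&lt;int&gt; ParseInParallel(IEnumerable&lt;string&gt; lines)
    {
        // value tuples keep the parsed value together with its position in the input
        var parsed = new ConcurrentBag&lt;(int value, long lineNumber)&gt;();

        // the third lambda argument is the zero-based index of the element in the source
        Parallel.ForEach(lines, (line, parallelState, lineNumber) =&gt;
        {
            if (int.TryParse(line, out int value))
                parsed.Add((value, lineNumber)); // thread-safe, no manual locking
        });

        // the bag is unordered, so sort by line number to restore the input order
        return parsed
            .OrderBy(item =&gt; item.lineNumber)
            .Select(item =&gt; item.value)
            .ToList();
    }

    public static void Main()
    {
        List&lt;int&gt; result = ParseInParallel(new[] { "3", "oops", "1", "2" });

        Console.WriteLine(string.Join(", ", result)); // lines that fail to parse are skipped
    }
}
</code></pre></div></div>

<p>The final sort is single-threaded and needs all parsed items in memory at once, which is the price paid for keeping the original order.</p>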

<h2 id="time-to-send-a-pr">Time to send a PR</h2>

<p>ML.NET is evolving very quickly. I definitely don’t want to have a long-lived branch with many optimizations and solve merge conflicts every day, so I create one PR per optimization. A small PR is also easier to review, and if I introduce a new bug it’s easier to find the single change that caused it.</p>

<p>Before I create the PR I also remove the temporary benchmark for the file parsing bottleneck. Other benchmarks include it in their execution path and it takes a lot of time to run. I want to keep the benchmark suite focused on ML.NET, without duplicates and with a short time to run the entire suite.</p>

<p>If your benchmark suite contains duplicates and takes a LOT of time to execute, developers stop using it. You need to keep it simple and focused. My personal recommendation is that running all of the benchmarks should not take longer than a lunch break. Why? If I apply some changes, I just start the benchmarks before I go to lunch and when I am back I have the results.</p>

<p>If it’s hard to run the benchmarks or it takes too long, you can’t expect the developers to run them and care about performance.</p>

<h2 id="summary">Summary</h2>

<ol>
  <li>Do NOT try to guess what the performance issue is.</li>
  <li>DO write tests first to save time and avoid introducing new bugs.</li>
  <li>DO use a profiler to find out where the issues are.</li>
  <li>DO use benchmarks to measure the improvements and compare different approaches.</li>
  <li>Do NOT reinvent the wheel, the .NET Framework most probably already has what you need.</li>
</ol>

<p>Before:</p>

<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>WikiDetox_WordEmbeddings_OVAAveragedPerceptron</td>
        <td style="text-align: right">286.7 s</td>
      </tr>
      <tr>
        <td>WikiDetox_WordEmbeddings_SDCAMC</td>
        <td style="text-align: right">184.1 s</td>
      </tr>
    </tbody>
  </table>

</div>

<p>After:</p>

<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>WikiDetox_WordEmbeddings_OVAAveragedPerceptron</td>
        <td style="text-align: right">174.24 s</td>
      </tr>
      <tr>
        <td>WikiDetox_WordEmbeddings_SDCAMC</td>
        <td style="text-align: right">67.82 s</td>
      </tr>
    </tbody>
  </table>

</div>

<p>In my next blog post I am going to use Concurrency Visualizer for Visual Studio to continue this investigation until I get 100% CPU consumption on all Cores for processing this huge Utf8 text file.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Introduction Part of my job on the .NET Team is to improve the performance of existing .NET libraries. My current goal is to identify performance bottlenecks in ML.NET and recognize common performance issues that should be addressed by .NET framework. In this blog post, I am describing how I approach sample performance problem using available free .NET tools and best practices for performance engineering.]]></summary></entry><entry><title type="html">Profiling .NET Code with BenchmarkDotNet</title><link href="https://adamsitnik.com/ETW-Profiler/" rel="alternate" type="text/html" title="Profiling .NET Code with BenchmarkDotNet" /><published>2018-09-28T00:00:00+00:00</published><updated>2018-09-28T00:00:00+00:00</updated><id>https://adamsitnik.com/ETW-Profiler</id><content type="html" xml:base="https://adamsitnik.com/ETW-Profiler/"><![CDATA[<h1 id="etw-profiler">ETW Profiler</h1>

<p>EtwProfiler is the new diagnoser for BenchmarkDotNet that I have just finished. It was released as part of <code class="language-plaintext highlighter-rouge">0.11.2</code>. It lets you profile the benchmarked .NET code on Windows and exports the data to a trace file which can be opened with <a href="https://github.com/Microsoft/perfview">PerfView</a> or <a href="https://docs.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer">Windows Performance Analyzer</a>.</p>

<p><strong>Again with a single config!</strong>
<!--more--></p>

<h2 id="demo">Demo</h2>

<p>The following code is a real-world benchmark from the <a href="https://github.com/dotnet/machinelearning/blob/17ee205e585beb62777475af6d59cba816675eeb/test/Microsoft.ML.Benchmarks/Numeric/Ranking.cs#L36">ML.NET repository</a>:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="nf">EtwProfiler</span><span class="p">(</span><span class="n">performExtraBenchmarksRun</span><span class="p">:</span> <span class="k">false</span><span class="p">)]</span> <span class="c1">// !!! use the new diagnoser!!</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">RankingTrain</span>
<span class="p">{</span>
    <span class="k">private</span> <span class="kt">string</span> <span class="n">_mslrWeb10k_Validate</span><span class="p">;</span>
    <span class="k">private</span> <span class="kt">string</span> <span class="n">_mslrWeb10k_Train</span><span class="p">;</span>

    <span class="p">[</span><span class="n">GlobalSetup</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">Setup</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="n">_mslrWeb10k_Validate</span> <span class="p">=</span> <span class="n">Path</span><span class="p">.</span><span class="nf">GetFullPath</span><span class="p">(</span><span class="n">TestDatasets</span><span class="p">.</span><span class="n">MSLRWeb</span><span class="p">.</span><span class="n">validFilename</span><span class="p">);</span>
        <span class="n">_mslrWeb10k_Train</span> <span class="p">=</span> <span class="n">Path</span><span class="p">.</span><span class="nf">GetFullPath</span><span class="p">(</span><span class="n">TestDatasets</span><span class="p">.</span><span class="n">MSLRWeb</span><span class="p">.</span><span class="n">trainFilename</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">FastTree</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">string</span> <span class="n">cmd</span> <span class="p">=</span> <span class="s">@"TrainTest test="</span> <span class="p">+</span> <span class="n">_mslrWeb10k_Validate</span> <span class="p">+</span>
            <span class="s">" eval=RankingEvaluator{t=10}"</span> <span class="p">+</span>
            <span class="s">" data="</span> <span class="p">+</span> <span class="n">_mslrWeb10k_Train</span> <span class="p">+</span>
            <span class="s">" loader=TextLoader{col=Label:R4:0 col=GroupId:TX:1 col=Features:R4:2-138}"</span> <span class="p">+</span>
            <span class="s">" xf=HashTransform{col=GroupId} xf=NAHandleTransform{col=Features}"</span> <span class="p">+</span>
            <span class="s">" tr=FastTreeRanking{}"</span><span class="p">;</span>

        <span class="k">using</span> <span class="p">(</span><span class="kt">var</span> <span class="n">environment</span> <span class="p">=</span> <span class="n">EnvironmentFactory</span><span class="p">.</span><span class="n">CreateRankingEnvironment</span><span class="p">&lt;</span><span class="n">RankerEvaluator</span><span class="p">,</span> <span class="n">TextLoader</span><span class="p">,</span> <span class="n">HashTransformer</span><span class="p">,</span> <span class="n">FastTreeRankingTrainer</span><span class="p">&gt;())</span>
        <span class="p">{</span>
            <span class="n">Maml</span><span class="p">.</span><span class="nf">MainCore</span><span class="p">(</span><span class="n">environment</span><span class="p">,</span> <span class="n">cmd</span><span class="p">,</span> <span class="n">alwaysPrintStacktrace</span><span class="p">:</span> <span class="k">false</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The regular output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.11.1.755-nightly, OS=Windows 10.0.17134.285 (1803/April2018Update/Redstone4)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
Frequency=3507505 Hz, Resolution=285.1029 ns, Timer=TSC
.NET Core SDK=2.2.100-preview2-009404
  [Host] : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT
  Dry    : .NET Core 2.1.4 (CoreCLR 4.6.26814.03, CoreFX 4.6.26814.02), 64bit RyuJIT

// * Diagnostic Output - EtwProfiler *
Exported 1 trace file(s). Example:
"C:\Projects\machinelearning\test\Microsoft.ML.Benchmarks\BenchmarkDotNet.Artifacts\Microsoft\ML\Benchmarks\RankingTrain\FastTree.etl"
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Error</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>FastTree</td>
        <td style="text-align: right">32.48 s</td>
        <td style="text-align: right">1.347 s</td>
        <td style="text-align: right">0.0761 s</td>
      </tr>
    </tbody>
  </table>

</div>

<p>And the new trace file opened with PerfView:</p>

<p class="center"><img src="/images/etwprofiler/flamegraph.png" alt="Flamegraph" /></p>

<h2 id="the-story">The Story</h2>

<p>Recently I have been working on porting all of the 3 000+ CoreFX and CoreCLR benchmarks from <a href="https://github.com/Microsoft/xunit-performance">xunit-performance</a> to BenchmarkDotNet. My job was to port all of the benchmarks, compare the results, fix the bugs and last but not least implement missing features. EtwProfiler is one of the things that were present in xunit-performance, but not in BenchmarkDotNet.</p>

<p>Initially I was sceptical about this idea because profiling a running benchmark is an easy job. However, with the number of benchmarks we have, automating it was a must-have.</p>

<p>And now I am very happy about the outcome!</p>

<h2 id="how-it-works">How it works</h2>

<p><code class="language-plaintext highlighter-rouge">EtwProfiler</code> uses the <code class="language-plaintext highlighter-rouge">TraceEvent</code> library, which internally uses Event Tracing for Windows (ETW) to capture stack traces and important .NET Runtime events.</p>

<p>Before the process with benchmarked code is started, EtwProfiler starts User and Kernel ETW sessions. Every session writes data to its own file and captures different data. The User session listens for the .NET Runtime events (GC, JIT etc.) while the Kernel session gets CPU stacks and Hardware Counter events. After this, the process with benchmarked code is started. During the benchmark execution all the data is captured and written to a trace file. Moreover, the BenchmarkDotNet Engine emits its own events to be able to differentiate jitting, warmup, pilot and actual workload when analyzing the trace file. When the benchmarking is over, both sessions are closed and the two trace files are merged into one.</p>

<p>Stopping the sessions after the process exits was very important because the CLR emits all the symbol information as part of the CLR Rundown.</p>

<h2 id="limitations">Limitations</h2>

<p>What we have today comes with the following limitations:</p>

<ul>
  <li>EtwProfiler works only on Windows (one day I might implement a similar thing for Unix using EventPipe)</li>
  <li>Requires running as Admin (to create the ETW Kernel Session)</li>
  <li>No <code class="language-plaintext highlighter-rouge">InProcessToolchain</code> support</li>
  <li>To get the best possible managed code symbols you should configure your project in the following way:</li>
</ul>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;DebugType&gt;</span>pdbonly<span class="nt">&lt;/DebugType&gt;</span>
<span class="nt">&lt;DebugSymbols&gt;</span>true<span class="nt">&lt;/DebugSymbols&gt;</span>
</code></pre></div></div>

<h2 id="how-to-use-it">How to use it?</h2>

<p>You need to install the latest <code class="language-plaintext highlighter-rouge">BenchmarkDotNet.Diagnostics.Windows</code> package. The profiler can be enabled in a few ways, some of them:</p>

<ul>
  <li>Use the new attribute (apply it on a class that contains Benchmarks):</li>
</ul>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">EtwProfiler</span><span class="p">]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">TheClassThatContainsBenchmarks</span> <span class="p">{</span> <span class="cm">/* benchmarks go here */</span> <span class="p">}</span>
</code></pre></div></div>

<ul>
  <li>Extend the <code class="language-plaintext highlighter-rouge">DefaultConfig.Instance</code> with a new instance of <code class="language-plaintext highlighter-rouge">EtwProfiler</code>:</li>
</ul>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Program</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">Main</span><span class="p">(</span><span class="kt">string</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span> 
        <span class="p">=&gt;</span> <span class="n">BenchmarkSwitcher</span>
            <span class="p">.</span><span class="nf">FromAssembly</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">Program</span><span class="p">).</span><span class="n">Assembly</span><span class="p">)</span>
            <span class="p">.</span><span class="nf">Run</span><span class="p">(</span><span class="n">args</span><span class="p">,</span>
                <span class="n">DefaultConfig</span><span class="p">.</span><span class="n">Instance</span>
                    <span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="k">new</span> <span class="nf">EtwProfiler</span><span class="p">()));</span> <span class="c1">// HERE</span>
<span class="p">}</span>
</code></pre></div></div>

<ul>
  <li>Passing <code class="language-plaintext highlighter-rouge">-p ETW</code> or <code class="language-plaintext highlighter-rouge">--profiler ETW</code> command line arguments to <code class="language-plaintext highlighter-rouge">BenchmarkSwitcher</code></li>
</ul>

<h2 id="configuration">Configuration</h2>

<p>To configure the new diagnoser you need to create an instance of <code class="language-plaintext highlighter-rouge">EtwProfilerConfig</code> class and pass it to the <code class="language-plaintext highlighter-rouge">EtwProfiler</code> constructor. The parameters that <code class="language-plaintext highlighter-rouge">EtwProfilerConfig</code> ctor takes are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">performExtraBenchmarksRun</code> - if set to true, benchmarks will be executed one more time with the profiler attached. If set to false, there will be no extra run but the results will contain overhead. True by default.</li>
  <li><code class="language-plaintext highlighter-rouge">bufferSizeInMb</code> - ETW session buffer size, in MB. 256 by default.</li>
  <li><code class="language-plaintext highlighter-rouge">cpuSampleIntervalInMiliseconds</code> - the rate at which CPU samples are collected. By default this is 1 (once a millisecond per CPU). There is a lower bound on this (typically 0.125 ms).</li>
  <li><code class="language-plaintext highlighter-rouge">intervalSelectors</code> - interval per hardware counter, if not provided then default values will be used.</li>
  <li><code class="language-plaintext highlighter-rouge">kernelKeywords</code> - kernel session keywords, ImageLoad (for native stack frames) and Profile (for CPU Stacks) are the defaults.</li>
  <li><code class="language-plaintext highlighter-rouge">providers</code> - providers that should be enabled, if not provided then default values will be used.</li>
</ul>
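
<p>As a minimal sketch of how these parameters fit together, the snippet below passes a customized <code class="language-plaintext highlighter-rouge">EtwProfilerConfig</code> to <code class="language-plaintext highlighter-rouge">EtwProfiler</code> through the same <code class="language-plaintext highlighter-rouge">With</code> call that is used above. The <code class="language-plaintext highlighter-rouge">ProfilerConfigExample</code> class, the <code class="language-plaintext highlighter-rouge">CreateConfig</code> helper and the chosen values are illustrative assumptions only, and the exact constructor signature may differ between BenchmarkDotNet versions:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnostics.Windows;

public static class ProfilerConfigExample
{
    // Hypothetical helper (not part of BenchmarkDotNet): builds a config
    // that attaches EtwProfiler with non-default settings.
    public static IConfig CreateConfig()
        =&gt; DefaultConfig.Instance
            .With(new EtwProfiler(new EtwProfilerConfig(
                performExtraBenchmarksRun: false, // no extra run, so the results contain profiler overhead
                bufferSizeInMb: 512)));           // assumed value, twice the 256 MB default
}
</code></pre></div></div>

<p>Parameters that are not listed keep their default values, so it should be enough to name only the ones you want to change.</p>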

<h2 id="using-perfview-to-work-with-trace-files">Using PerfView to work with trace files</h2>

<p>PerfView is a free .NET profiler from Microsoft. If you don’t know how to use it you should watch <a href="https://channel9.msdn.com/Series/PerfView-Tutorial">these instructional videos</a> first.</p>

<p>If you are familiar with PerfView, then the only thing you need to know is that BenchmarkDotNet runs every benchmark in several stages: Jitting (running the code once), a Pilot Experiment to determine how many times the benchmark should be executed per iteration, a non-trivial Warmup, and the Actual Workload. This is why when you open your trace file in PerfView you will see your benchmark in a few different places of the StackTrace.</p>

<p class="center"><img src="/images/etwprofiler/flamegraph_not_filtered.png" alt="Nofilters" /></p>

<p>The simplest way to filter the data to the actual benchmark runs is to open the <code class="language-plaintext highlighter-rouge">CallTree</code> tab, put “EngineActualStage” in the Find box and press enter. When PerfView selects <code class="language-plaintext highlighter-rouge">EngineActualStage</code> in the <code class="language-plaintext highlighter-rouge">CallTree</code>, press <code class="language-plaintext highlighter-rouge">Alt+R</code> to Set Time Range.</p>

<p class="center"><img src="/images/etwprofiler/perfview.gif" alt="Filter" /></p>

<p>If you want to filter the trace to a single iteration, then you must go to the Events panel and search for the <code class="language-plaintext highlighter-rouge">WorkloadActual/Start</code> and <code class="language-plaintext highlighter-rouge">WorkloadActual/Stop</code> events.</p>

<ol>
  <li>Open the Events window.</li>
  <li>Put “WorkloadActual” in the Filter box and hit enter.</li>
  <li>Press control or shift and choose the Start and Stop events from the left panel. Hit enter.</li>
  <li>Choose the iteration that you want to investigate (events are sorted by time).</li>
  <li>Select two or more cells from the “Time MSec” column.</li>
  <li>Right-click, choose “Open Cpu Stacks”.</li>
  <li>Choose the process with benchmarks, right-click, choose “Drill Into”.</li>
</ol>

<p class="center"><img src="/images/etwprofiler/perfview_events.gif" alt="Filter" /></p>

<h3 id="special-thanks">Special Thanks</h3>

<p>I wanted to thank:</p>

<ul>
  <li>Jose Rivero who implemented this feature for xunit-performance and reviewed my code. I took a lot from his code.</li>
  <li>Brian Robbins for explaining to me how CLR Rundown works.</li>
  <li>Vance Morrison for immediate release of TraceEvent with bug fixes in the area that was touching private Windows APIs.</li>
  <li>Andrey Akinshin for reviewing the PR and pushing me to write the docs. Without Andrey I would not have written this blog post ;)</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[ETW Profiler EtwProfiler is the new diagnoser for BenchmarkDotNet that I have just finished. It was released as part of 0.11.2. It lets you profile the benchmarked .NET code on Windows and export the data to a trace file which can be opened with PerfView or Windows Performance Analyzer. Again with a single config!]]></summary></entry><entry><title type="html">My way of Conducting an Interview</title><link href="https://adamsitnik.com/Conducting-Interview/" rel="alternate" type="text/html" title="My way of Conducting an Interview" /><published>2018-09-03T00:00:00+00:00</published><updated>2018-09-03T00:00:00+00:00</updated><id>https://adamsitnik.com/Conducting-Interview</id><content type="html" xml:base="https://adamsitnik.com/Conducting-Interview/"><![CDATA[<p>Interviewing people is not an easy job to do. You want to find the person who is going to get things done, enjoy working on the given project, fit into the team and be happy about the money you can offer.</p>

<p>As an interviewer, you are also being judged by the candidate. You very often create the first impression of the company. So you also need to make a good impression. Nobody wants to work with mean or incompetent people!</p>

<p>In this blog post, I describe my way of conducting the interview. In my career, I have interviewed a hundred developers and hired over a dozen of them. So my experience is not very rich, it’s limited to “my sample”.</p>

<p><strong>Disclaimer</strong>: After joining Microsoft I don’t interview candidates anymore. This post is my personal approach built upon the experience prior to joining MS.</p>

<p>I hope that my experience can help somebody to improve the interviewing process!</p>

<!--more-->

<h2 id="evolution">Evolution</h2>

<p>My interviewing style has evolved over the years. Initially, I was focused on asking very strict technical questions. Some examples:</p>

<ul>
  <li>what are the differences between Value and Reference Types?</li>
  <li>how does GC work?</li>
  <li>what is the difference between <code class="language-plaintext highlighter-rouge">DROP TABLE</code> and <code class="language-plaintext highlighter-rouge">TRUNCATE TABLE</code>?</li>
</ul>

<p><strong>But I very soon realized that the fact that somebody can answer similar questions does not mean that she or he can get things done.</strong></p>

<p><strong>Also, the fact that somebody does not know the answers does not mean that he or she can’t search for them and learn fast when needed.</strong></p>

<h2 id="homework">Homework</h2>

<p>Before I start interviewing I do the homework: <strong>I read the candidate’s CV</strong> and mark the things that I want to talk about. If I don’t read the CV beforehand and am then surprised during the interview by things that were stated in it, that is simply <strong>disrespectful</strong>.</p>

<blockquote>
  <p>Interviewer: I did not know that you did not graduate in Computer Science.<br />
Candidate: But I have described my education in the resume.<br />
Next: An awkward moment of silence.</p>
</blockquote>

<p><strong>Find out as much as possible about the project that you are interviewing for.</strong>
Is it some kind of rocket science? Or maintenance? Or simple CRUD?
What technology?
Does it require a lot of travel?</p>

<p>You need to find a person that is going to fit the project and the team. If you know too little about the project it’s going to be hard or impossible. <strong>Be prepared!</strong></p>

<h2 id="relax">Relax!</h2>

<p>Candidates are typically very nervous at the beginning of the interview. If you start asking hard questions to a person who barely breathes and just wants to run away from the room, you won’t get good answers.</p>

<p>So as an interviewer I always focus first on helping the candidate relax. I start with some chit-chat about positive things. An example:</p>

<blockquote>
  <p>The weather outside sucks. I need to go for a holiday to recharge. What was your recent holiday destination?</p>
</blockquote>

<p>After that, we might talk for a few minutes about holidays. The candidate just needs to start talking!</p>

<p>If the candidate has not been on holidays for years you can say what the company has to offer. An example:</p>

<blockquote>
  <p>We offer 30 days of fully paid holidays. We develop products for our internal purpose, so there are no super-strict deadlines and you can take a week off anytime you want to.</p>
</blockquote>

<p>I keep the conversation positive and informal. I continue to the next stage when I feel that the candidate is not nervous anymore.</p>

<p><em>And yes my dear US readers, 30 or even 35 days of fully paid holidays is totally possible in Europe. The same goes for unlimited sick days.</em></p>

<h2 id="warmup">Warmup</h2>

<p>In the beginning, I ask some simple, but very important questions. I also say it loud and clear that I am searching for a good fit for the project and I am expecting honest answers.</p>

<ol>
  <li>What’s your favorite thing about programming?</li>
  <li>What are the things about programming that you don’t like?</li>
  <li>Could you describe your dream job?</li>
</ol>

<p>Some candidates say that they can do anything. In such a case, I ask if they would be happy to debug some old Java scripts in Oracle Bus or migrate a relational database with no documentation to a NoSQL cloud database in two weeks. I need to make sure that they understand that I am asking these questions to avoid putting them on a project they are not going to enjoy.</p>

<p>The answers help me to understand if a given person can be a good fit for the project and the position that I am recruiting for.</p>

<p>If the candidate is a good software engineer but not a good fit for this particular project, I offer a different project or just stay in touch. If I hired that person anyway, she or he would not be satisfied and would probably just leave. Everybody would lose a lot of time, and the company a lot of money. <strong>Don’t be selfish, think about the future.</strong></p>

<p>The funniest answer I ever got:</p>

<blockquote>
  <p>Me: Could you describe your dream job?<br />
Candidate: I would like to be a Team Leader.<br />
Me: Why?<br />
Candidate: Because I would like to make important decisions.<br />
Another interviewer: Would you also like to keep the team motivated, help with planning and estimations?<br />
Candidate: No. Just making the important decisions.</p>
</blockquote>

<h2 id="learning">Learning</h2>

<p>It’s very hard to find a perfect candidate that is familiar with a specific product and technology. Moreover, the requirements change over time, so in my opinion the most important thing is the ability to learn. So I ask a LOT about learning.</p>

<ol>
  <li>What is your favorite way of learning new things?</li>
  <li>When was the last time you learned something new? What was it? How did you apply it at work?</li>
  <li>Do you have some gurus? Who are they and why do you value them?</li>
  <li>When was the last time you did not agree with some blog post/book/video? What was that and why did you not agree?</li>
  <li>When was the last time you changed one of your programming habits? What was that and why?</li>
  <li>When was the last time you shared knowledge with somebody else? Did you like it? Why did you do that?</li>
</ol>

<p>The answers help me to understand if a given person enjoys learning new things and is self-motivated or needs a manager to tell him/her to read a book. I also want to understand how the candidate learns, validates the information and uses it in practice.</p>

<p>This is also the moment when good software engineers are fully relaxed and enjoy the conversation. So I can move on to the next part.</p>

<h2 id="problem-solving-and-decision-making">Problem solving and decision making</h2>

<p>If I got here it means that the candidate can be a good fit and enjoys learning new things. So what I need to check is problem-solving and decision making. Of course, it’s impossible to fully test that during the interview, but I can at least try.</p>

<p>Here, I start with something like: <em>Could you tell me about your last big assignment that you were working on? What were you supposed to implement, what were the steps that you took and which technology did you choose? Keep the company secrets to yourself, I just want to understand the way you approach and solve problems.</em></p>

<p>From here I let the candidate talk and later I ask a lot of questions:</p>

<ol>
  <li>Why did you have to do that?</li>
  <li>How did you start the task? Did you do some research? Did you talk with the customer?</li>
  <li>Which technology and/or components did you choose and why?</li>
  <li>When did you write tests?</li>
  <li>Was there anything that you could do better?</li>
</ol>

<p><strong>I ask as many questions as needed. Here I very often go deep into the technical details.</strong> I ask deep technical questions related to the tools and components used by the candidate to understand if the candidate knows how they work. If the candidate does not answer some questions I say something like: <em>Don’t be afraid, it’s perfectly fine that you don’t know something. I am just testing your knowledge. I don’t expect you to know everything.</em> Of course, sometimes that is just a lie, for example when the person fails to answer a simple question. It’s important not to freak out the candidate!!</p>

<p>I want to hire people who want to understand problems before solving them. Who choose the best tool for the given problem. Who write tests first to make sure everything works fine and the problems don’t come back. Those who understand how the technology they use works.</p>

<p>Very often the initial answer is very promising but the problem-solving skills are bad. One such interview:</p>

<blockquote>
  <p>Me: Could you tell me about your last big assignment that you were working on?<br />
Candidate: I have been working on improving the performance of our critical system. I have improved it by 20%. (sounds cool!)<br />
Me: How did you start? Did you do some research?<br />
Candidate: I instrumented every method call in the code with stopwatch and logged the output to the logs.<br />
Me: How did you do that?<br />
Candidate: I manually added stopwatch to every public method.<br />
Me: Why did you not use a profiler? Did you run out of licenses?<br />
Candidate: What is a profiler?<br />
Me: Nothing important, I was just curious. (lie)<br />
Me: How did you find out which methods were taking most of the time?<br />
Candidate: I just read all of the log files.<br />
Me: In which environment were you testing the performance?<br />
Candidate: I copy-pasted the instrumented dlls to a live system of one of our customers who was complaining about perf.<br />
Me: Did you create a backup of the app first?<br />
Candidate: Yes (I guess it was a lie too).</p>
</blockquote>

<p>So this candidate somehow managed to solve the problem, but did not use the right tool and, moreover, did not do the right research. It took a lot of time and risked issues in production.</p>

<p>When a smart developer faces a new problem then she or he starts the web browser and performs a search. Chooses the right tool and gets the job done. If the problem is more complex it might require reading a book, discussing it with other teammates, architect or tech lead.</p>

<h2 id="failure">Failure</h2>

<p>This is simple. I just ask about the most recent failure. What was that? Why did it happen? What did you learn from it? What did you do to make sure the problem does not occur again?</p>

<p>I want to hire people who can acknowledge that they did something wrong and learn from their own mistakes. Nobody wants to work with narcissists.</p>

<p>People are typically afraid to answer this question so I encourage them by talking about my recent failure. This helps a lot and opens up most of the candidates.</p>

<h2 id="my-recent-assignment">My recent assignment</h2>

<p>I never ask the candidates to write code during the interview. Instead, I describe my recent assignment to them and ask how they would solve it. I ask about my real task because I want to check how they would approach a real problem. I also know a lot about possible solutions because I always do thorough research. It also helps me to make every interview unique.</p>

<p>When the candidate answers: <em>I would just google for it</em> I give my laptop to the candidate and let her/him search and read for a few minutes. I check what they put into the search engine. Most of the candidates are very surprised when I do that. But I am searching for people who can solve new problems, not artificial tasks from the <em>Cracking the Coding Interview</em> book. And solving new problems typically requires performing some web search.</p>

<h2 id="money">Money</h2>

<p>When it comes to the money, I never offer a lowball salary or overpay when compared to the other team members. People talk to each other, do the math or web search and once they find out that they earn much less than others on the same level they get angry and leave. The company then loses a lot of money, more than a lowball offer could save. It’s just a matter of time!</p>

<h2 id="summary">Summary</h2>

<ol>
  <li>Talk about something positive and not related to interviewing to relax the candidate.</li>
  <li>Ask about what the candidate really wants to do and is not willing to do. Is the given position and project a good fit?</li>
  <li>Ask about learning because learning is the fundamental part of being a software engineer.</li>
  <li>Ask many questions about a recent assignment to find out how the candidate approaches problems.</li>
  <li>Ask about a failure to make sure you filter out the narcissists.</li>
  <li>Ask about your recent assignment to find out how the candidate would solve a real problem related to the job that you are interviewing for.</li>
</ol>

<p>Good luck with your interviews!!</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Interviewing people is not an easy job to do. You want to find the person who is going to get things done, enjoy working on the given project, fit into the team and be happy about the money you can offer. As an interviewer, you are also being judged by the candidate. You very often create the first impression of the company. So you also need to make a good impression. Nobody wants to work with mean or incompetent people! In this blog post, I describe my way of conducting the interview. In my career, I have interviewed a hundred developers and hired over a dozen of them. So my experience is not very rich, it’s limited to “my sample”. Disclaimer: After joining Microsoft I don’t interview candidates anymore. This post is my personal approach built upon the experience prior to joining MS. I hope that my experience can help somebody to improve the interviewing process!]]></summary></entry><entry><title type="html">Disassembling .NET Code with BenchmarkDotNet</title><link href="https://adamsitnik.com/Disassembly-Diagnoser/" rel="alternate" type="text/html" title="Disassembling .NET Code with BenchmarkDotNet" /><published>2017-08-16T00:00:00+00:00</published><updated>2017-08-16T00:00:00+00:00</updated><id>https://adamsitnik.com/Disassembly-Diagnoser</id><content type="html" xml:base="https://adamsitnik.com/Disassembly-Diagnoser/"><![CDATA[<h1 id="disassembly-diagnoser">Disassembly Diagnoser</h1>

<p>Disassembly Diagnoser is the new diagnoser for BenchmarkDotNet that I have just finished. It was released as part of <code class="language-plaintext highlighter-rouge">0.10.10</code>. It lets you disassemble the benchmarked .NET code:</p>

<ul>
  <li>to ASM:
    <ul>
      <li>desktop .NET: LegacyJit (32 &amp; 64 bit), RyuJIT (64 bit)</li>
      <li>.NET Core 1.1+ (<strong>including .NET Core 2.0</strong>) for RyuJIT (64 bit)</li>
      <li>Mono: 32 &amp; 64 bit, <strong>including LLVM</strong></li>
    </ul>
  </li>
  <li>to IL and corresponding C# code:
    <ul>
      <li>desktop .NET: LegacyJit (32 &amp; 64 bit), RyuJIT (64 bit)</li>
      <li>.NET Core: 1.1+ (including .NET Core 2.0)</li>
    </ul>
  </li>
</ul>

<p><strong>With a single config!</strong>
<!--more--></p>

<h2 id="demo">Demo</h2>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="nf">DisassemblyDiagnoser</span><span class="p">(</span><span class="n">printAsm</span><span class="p">:</span> <span class="k">true</span><span class="p">,</span> <span class="n">printSource</span><span class="p">:</span> <span class="k">true</span><span class="p">)]</span> <span class="c1">// !!! use the new diagnoser!!</span>
<span class="p">[</span><span class="n">RyuJitX64Job</span><span class="p">]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">Simple</span>
<span class="p">{</span>
    <span class="kt">int</span><span class="p">[]</span> <span class="n">field</span> <span class="p">=</span> <span class="n">Enumerable</span><span class="p">.</span><span class="nf">Range</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">100</span><span class="p">).</span><span class="nf">ToArray</span><span class="p">();</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="nf">SumLocal</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">local</span> <span class="p">=</span> <span class="n">field</span><span class="p">;</span> <span class="c1">// we use local variable that points to the field</span>

        <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
            <span class="n">sum</span> <span class="p">+=</span> <span class="n">local</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>

        <span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="nf">SumField</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">field</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
            <span class="n">sum</span> <span class="p">+=</span> <span class="n">field</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>

        <span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The regular output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.9.281-nightly, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338344 Hz, Resolution=427.6531 ns, Timer=TSC
  [Host]    : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
  RyuJitX64 : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0

Job=RyuJitX64  Jit=RyuJit  Platform=X64  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Error</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>SumLocal</td>
        <td style="text-align: right"><strong>78.27</strong> ns</td>
        <td style="text-align: right">0.6818 ns</td>
        <td style="text-align: right">0.6377 ns</td>
      </tr>
      <tr>
        <td>SumField</td>
        <td style="text-align: right">79.24 ns</td>
        <td style="text-align: right">0.3923 ns</td>
        <td style="text-align: right">0.3670 ns</td>
      </tr>
    </tbody>
  </table>

</div>

<p>And the new disassembly output:</p>

<p class="center"><img src="/images/disasm/simpleDemo.png" alt="Disassembly" /></p>

<p>As you can see, very similar C# code produces different assembly code which has different performance characteristics. The main goal of the disassembly diagnoser is to allow BenchmarkDotNet users to easily compare the generated assembly code.</p>

<h2 id="the-story">The Story</h2>

<p>I wanted to develop this feature for a long time. Many people asked for it but I simply did not have any spare time. This was about to change.</p>

<p>I was getting back from an awesome <a href="https://www.wug.cz/praha/akce/951--Net-TechTalks">.NET Meetup</a> organized by Karel Zikmund in Prague and I had a few spare hours between the checkout from the hotel and my flight. I decided to write a simple PoC and see how it goes.</p>

<p>Initially, I had no idea where to start. Most people use the Disassembly window from Visual Studio. But VS is closed-source so I could only use it for validation of my results. The other option was WinDbg. Getting the disassembly with WinDbg is <a href="https://bret.codes/net-core-and-windbg/">non-trivial</a>. And it’s closed-source as well.</p>

<p>I had almost forgotten that Matt Warren did something <a href="https://github.com/dotnet/BenchmarkDotNet/issues/53">very similar</a> to this for BenchmarkDotNet a long time ago. I started analysing his code, which <a href="https://github.com/Microsoft/clrmd/issues/34#issuecomment-256304015">led</a> me to <a href="https://github.com/goldshtn/msos">msos</a> by Sasha Goldshtein. And msos by Sasha was exactly what I needed. The credit for super smart disassembling goes to Sasha. I just took his code, tweaked it a little and extended it. I have found and fixed some bugs in <a href="https://github.com/goldshtn/msos/pull/71">msos</a> and <a href="https://github.com/Microsoft/clrmd/pull/83">ClrMD</a>. So all sides benefit from being OSS ;)</p>

<h1 id="how-it-works">How does it work?</h1>

<p>As some of you might know, in BenchmarkDotNet we have the host process (what you run in the console) and the child process, which executes the benchmark and reports results back to the host. The child process is generated, compiled and executed by the host. With such an architecture, we can benchmark the given .NET code for any config desired by the user (any JIT, any .NET framework, any GC configuration). Last but not least, it helps to make the results more stable. GC is self-tuning and JIT can make some extra optimizations, but with a process per benchmark, you always get a clean score.</p>

<h2 id="desktop-net">Desktop .NET</h2>

<p>Based on the idea from msos the host is using ClrMD to attach to the child process. ClrMD allows us to get the text representation of assembly code. To get the IL we use the one and only Mono.Cecil. To get the corresponding C# code we once again use ClrMD.</p>

<p>ClrMD can only attach to a process of the same bitness. To support all scenarios (host 32bit, child 64bit and the opposite) I have put the disassembler in a separate process. This is why we have <code class="language-plaintext highlighter-rouge">BenchmarkDotNet.Disassembler.x86.exe</code> and <code class="language-plaintext highlighter-rouge">BenchmarkDotNet.Disassembler.x64.exe</code>. Both disassemblers are stored in the resources of the <code class="language-plaintext highlighter-rouge">BenchmarkDotNet.Core.dll</code>. When the time comes, they are copied from resources to the hard drive and executed accordingly.</p>
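
<p>The selection step can be sketched roughly as follows. This is not BenchmarkDotNet’s actual code: the <code class="language-plaintext highlighter-rouge">DisassemblerPicker</code> class and the <code class="language-plaintext highlighter-rouge">GetDisassemblerFileName</code> helper are hypothetical, and only the two executable names come from the text above.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class DisassemblerPicker
{
    // Hypothetical helper: the disassembler must have the same bitness
    // as the child (benchmark) process, no matter what the host is.
    public static string GetDisassemblerFileName(bool childIs64Bit)
        =&gt; childIs64Bit
            ? "BenchmarkDotNet.Disassembler.x64.exe"
            : "BenchmarkDotNet.Disassembler.x86.exe";

    public static void Main()
    {
        // Environment.Is64BitProcess describes the current (host) process only,
        // which is why the child bitness is passed in explicitly.
        Console.WriteLine($"Host is 64-bit: {Environment.Is64BitProcess}");
        Console.WriteLine(GetDisassemblerFileName(childIs64Bit: true));
        Console.WriteLine(GetDisassemblerFileName(childIs64Bit: false));
    }
}
</code></pre></div></div>

<p>Because the decision depends only on the child process, a 32-bit host can still get the disassembly of a 64-bit benchmark and the other way around.</p>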

<h2 id="net-core">.NET Core</h2>

<p>The NuGet package of <a href="https://www.nuget.org/packages/Microsoft.Diagnostics.Runtime/">ClrMD</a> implements .NET Core support but targets only desktop .NET. It’s not a problem because we can use our architecture to get it running for .NET Core. Whatever the host is (.NET or .NET Core) it spawns the disassembler process (a desktop .NET process) which uses ClrMD to attach to the child .NET Core process.</p>

<p>This is why we currently support only Windows for our .NET Core disassembler.</p>

<h2 id="mono">Mono</h2>

<p>With great help from Miguel de Icaza, I was able to implement a simple disassembler for Mono. We just run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mono -v -v -v -v --compile $namespace.$typeName:$methodName $exeName
</code></pre></div></div>

<p>and parse the output. LLVM is supported and you don’t need to install anything except BenchmarkDotNet. The downside is that as of now the parser can handle only simple benchmarks. I did not have the time to test all edge cases.</p>
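
<p>To make the placeholders in that command concrete, here is a small sketch that fills them in. The <code class="language-plaintext highlighter-rouge">MonoCommand</code> class, the <code class="language-plaintext highlighter-rouge">BuildArguments</code> helper and the sample names are hypothetical and used only for illustration; BenchmarkDotNet’s own implementation may build the arguments differently.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class MonoCommand
{
    // Hypothetical helper: mirrors the
    // "-v -v -v -v --compile $namespace.$typeName:$methodName $exeName" template.
    public static string BuildArguments(string ns, string typeName, string methodName, string exeName)
        =&gt; $"-v -v -v -v --compile {ns}.{typeName}:{methodName} {exeName}";

    public static void Main()
    {
        // Sample values are made up for this example.
        string arguments = BuildArguments("MyBenchmarks", "Simple", "SumLocal", "MyBenchmarks.exe");

        // prints: mono -v -v -v -v --compile MyBenchmarks.Simple:SumLocal MyBenchmarks.exe
        Console.WriteLine("mono " + arguments);
    }
}
</code></pre></div></div>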

<h2 id="limitations">Limitations</h2>

<p>What we have today comes with the following limitations:</p>

<ul>
  <li>.NET Core disassembler works only on Windows</li>
  <li>Mono disassembler does not support recursive disassembling and produces output without IL and C#.</li>
  <li>Indirect calls are not tracked.</li>
  <li>To be able to compare different platforms, you need to target AnyCPU <code class="language-plaintext highlighter-rouge">&lt;PlatformTarget&gt;AnyCPU&lt;/PlatformTarget&gt;</code></li>
  <li>To get the corresponding C#/F# code from the disassembler you need to configure your project in the following way:</li>
</ul>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;DebugType&gt;</span>pdbonly<span class="nt">&lt;/DebugType&gt;</span>
<span class="nt">&lt;DebugSymbols&gt;</span>true<span class="nt">&lt;/DebugSymbols&gt;</span>
</code></pre></div></div>

<h1 id="how-to-use-it">How to use it</h1>

<p>The first step is to install <code class="language-plaintext highlighter-rouge">BenchmarkDotNet</code> version <code class="language-plaintext highlighter-rouge">0.10.10</code> or newer (always use latest BenchmarkDotNet for your own good!).</p>

<p>After this you need to apply the following settings to your <code class="language-plaintext highlighter-rouge">.csproj</code> file:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;PropertyGroup&gt;</span>
  <span class="nt">&lt;PlatformTarget&gt;</span>AnyCPU<span class="nt">&lt;/PlatformTarget&gt;</span>
  <span class="nt">&lt;DebugType&gt;</span>pdbonly<span class="nt">&lt;/DebugType&gt;</span>
  <span class="nt">&lt;DebugSymbols&gt;</span>true<span class="nt">&lt;/DebugSymbols&gt;</span>
<span class="nt">&lt;/PropertyGroup&gt;</span>
</code></pre></div></div>

<p>Now you can enable it in two ways:</p>

<ul>
  <li>Use the new attribute (apply it on a class that contains Benchmarks):</li>
</ul>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="nf">DisassemblyDiagnoser</span><span class="p">(</span><span class="n">printAsm</span><span class="p">:</span> <span class="k">true</span><span class="p">,</span> <span class="n">printSource</span><span class="p">:</span> <span class="k">true</span><span class="p">)]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">TheClassThatContainsBenchmarks</span> <span class="p">{</span> <span class="cm">/* benchmarks go here */</span> <span class="p">}</span>
</code></pre></div></div>

<ul>
  <li>Tell your custom config to use it:</li>
</ul>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">private</span> <span class="k">class</span> <span class="nc">CustomConfig</span> <span class="p">:</span> <span class="n">ManualConfig</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="nf">CustomConfig</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">Default</span><span class="p">);</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">DisassemblyDiagnoser</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="k">new</span> <span class="nf">DisassemblyDiagnoserConfig</span><span class="p">(</span><span class="n">printAsm</span><span class="p">:</span> <span class="k">true</span><span class="p">,</span> <span class="n">recursiveDepth</span><span class="p">:</span> <span class="m">1</span><span class="p">)));</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="recursive-mode">Recursive mode</h2>

<p>The new diagnoser supports recursive disassembling. This means that you can configure it to disassemble the benchmark itself and, optionally, the code that it calls. To do so you need to use the <code class="language-plaintext highlighter-rouge">recursiveDepth</code> parameter. Be careful with setting it to <code class="language-plaintext highlighter-rouge">int.MaxValue</code>. If you are curious, please try it for the following benchmark:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">void</span> <span class="nf">Big</span><span class="p">()</span>
<span class="p">{</span>
   <span class="k">if</span><span class="p">(</span><span class="k">new</span> <span class="nf">Random</span><span class="p">(</span><span class="m">123</span><span class="p">).</span><span class="nf">Next</span><span class="p">(</span><span class="m">5</span><span class="p">,</span> <span class="m">10</span><span class="p">)</span> <span class="p">&gt;</span> <span class="m">11</span><span class="p">)</span>
       <span class="k">throw</span> <span class="k">new</span> <span class="nf">InvalidOperationException</span><span class="p">(</span><span class="s">"Impossible"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Spoiler: it produces a 50 MB file ;)</p>
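<p>To get a feeling for why an unbounded depth grows so quickly, here is a minimal sketch of a depth-limited walk over a call graph. It is not the diagnoser's implementation: the <code class="language-plaintext highlighter-rouge">Calls</code> graph, the method names and the convention that depth 1 means "the benchmark only" are all assumptions made for illustration.</p>

```csharp
using System;
using System.Collections.Generic;

public static class RecursiveDepthSketch
{
    // Hypothetical call graph: method name -> methods it calls directly.
    public static readonly Dictionary<string, string[]> Calls = new Dictionary<string, string[]>
    {
        ["Benchmark"] = new[] { "Helper", "Validate" },
        ["Helper"]    = new[] { "Format", "Validate" },
        ["Validate"]  = new[] { "Throw" },
        ["Format"]    = new string[0],
        ["Throw"]     = new string[0],
    };

    // Returns every method reachable from root within maxDepth levels.
    // Assumption of this sketch: the root itself is level 1.
    public static List<string> MethodsToDisassemble(string root, int maxDepth)
    {
        var visited = new HashSet<string> { root };
        var result = new List<string> { root };
        var queue = new Queue<(string Method, int Depth)>();
        queue.Enqueue((root, 1));

        while (queue.Count > 0)
        {
            var (method, depth) = queue.Dequeue();
            if (depth >= maxDepth) continue; // depth budget used up, do not follow calls

            foreach (var callee in Calls[method])
            {
                if (visited.Add(callee)) // each method is listed only once
                {
                    result.Add(callee);
                    queue.Enqueue((callee, depth + 1));
                }
            }
        }
        return result;
    }
}
```

<p>With this toy graph a depth of 1 yields only <code class="language-plaintext highlighter-rouge">Benchmark</code>, a depth of 2 adds its two direct callees, and <code class="language-plaintext highlighter-rouge">int.MaxValue</code> yields all five methods. In a real application the reachable set includes large parts of the framework, which is where the huge output comes from.</p>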

<h2 id="single-config-for-all-jits">Single config for ALL JITs</h2>

<p>You can use a single config to compare the generated assembly code for ALL JITs.</p>

<p>However, to allow benchmarking for any target platform architecture, the project that defines the benchmarks has to target <strong>AnyCPU</strong>.</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;PropertyGroup&gt;</span>
  <span class="nt">&lt;PlatformTarget&gt;</span>AnyCPU<span class="nt">&lt;/PlatformTarget&gt;</span>
<span class="nt">&lt;/PropertyGroup&gt;</span>
</code></pre></div></div>

<p>Let’s check the Devirtualization that was <a href="https://blogs.msdn.microsoft.com/dotnet/2017/06/29/performance-improvements-in-ryujit-in-net-core-and-net-framework/">introduced recently</a> for .NET Core 2.0:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">MultipleJits</span> <span class="p">:</span> <span class="n">ManualConfig</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="nf">MultipleJits</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">ShortRun</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="k">new</span> <span class="nf">MonoRuntime</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="s">"Mono x86"</span><span class="p">,</span> <span class="n">customPath</span><span class="p">:</span> <span class="s">@"C:\Program Files (x86)\Mono\bin\mono.exe"</span><span class="p">)).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X86</span><span class="p">));</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">ShortRun</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="k">new</span> <span class="nf">MonoRuntime</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="s">"Mono x64"</span><span class="p">,</span> <span class="n">customPath</span><span class="p">:</span> <span class="s">@"C:\Program Files\Mono\bin\mono.exe"</span><span class="p">)).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X64</span><span class="p">));</span>

        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">ShortRun</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">Jit</span><span class="p">.</span><span class="n">LegacyJit</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X86</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Clr</span><span class="p">));</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">ShortRun</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">Jit</span><span class="p">.</span><span class="n">LegacyJit</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X64</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Clr</span><span class="p">));</span>

        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">ShortRun</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">Jit</span><span class="p">.</span><span class="n">RyuJit</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X64</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Clr</span><span class="p">));</span>

        <span class="c1">// RyuJit for .NET Core 1.1</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">ShortRun</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">Jit</span><span class="p">.</span><span class="n">RyuJit</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X64</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Core</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">CsProjCoreToolchain</span><span class="p">.</span><span class="n">NetCoreApp11</span><span class="p">));</span>

        <span class="c1">// RyuJit for .NET Core 2.0</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">ShortRun</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">Jit</span><span class="p">.</span><span class="n">RyuJit</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X64</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Core</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">CsProjCoreToolchain</span><span class="p">.</span><span class="n">NetCoreApp20</span><span class="p">));</span>

        <span class="nf">Add</span><span class="p">(</span><span class="n">DisassemblyDiagnoser</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="k">new</span> <span class="nf">DisassemblyDiagnoserConfig</span><span class="p">(</span><span class="n">printAsm</span><span class="p">:</span> <span class="k">true</span><span class="p">,</span> <span class="n">printPrologAndEpilog</span><span class="p">:</span> <span class="k">true</span><span class="p">,</span> <span class="n">recursiveDepth</span><span class="p">:</span> <span class="m">3</span><span class="p">)));</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="p">[</span><span class="nf">Config</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">MultipleJits</span><span class="p">))]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">Jit_Devirtualization</span>
<span class="p">{</span>
    <span class="k">private</span> <span class="n">Increment</span> <span class="n">increment</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Increment</span><span class="p">();</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="nf">CallVirtualMethod</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">increment</span><span class="p">.</span><span class="nf">OperateTwice</span><span class="p">(</span><span class="m">10</span><span class="p">);</span>

    <span class="k">public</span> <span class="k">abstract</span> <span class="k">class</span> <span class="nc">Operation</span>  <span class="c1">// abstract unary integer operation</span>
    <span class="p">{</span>
        <span class="k">public</span> <span class="k">abstract</span> <span class="kt">int</span> <span class="nf">Operate</span><span class="p">(</span><span class="kt">int</span> <span class="n">input</span><span class="p">);</span>

        <span class="k">public</span> <span class="kt">int</span> <span class="nf">OperateTwice</span><span class="p">(</span><span class="kt">int</span> <span class="n">input</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="nf">Operate</span><span class="p">(</span><span class="nf">Operate</span><span class="p">(</span><span class="n">input</span><span class="p">));</span> <span class="c1">// two virtual calls to Operate</span>
    <span class="p">}</span>

    <span class="k">public</span> <span class="k">sealed</span> <span class="k">class</span> <span class="nc">Increment</span> <span class="p">:</span> <span class="n">Operation</span> <span class="c1">// concrete, sealed operation: increment by fixed amount</span>
    <span class="p">{</span>
        <span class="k">public</span> <span class="k">readonly</span> <span class="kt">int</span> <span class="n">Amount</span><span class="p">;</span>
        <span class="k">public</span> <span class="nf">Increment</span><span class="p">(</span><span class="kt">int</span> <span class="n">amount</span> <span class="p">=</span> <span class="m">1</span><span class="p">)</span> <span class="p">{</span> <span class="n">Amount</span> <span class="p">=</span> <span class="n">amount</span><span class="p">;</span> <span class="p">}</span>

        <span class="k">public</span> <span class="k">override</span> <span class="kt">int</span> <span class="nf">Operate</span><span class="p">(</span><span class="kt">int</span> <span class="n">input</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">input</span> <span class="p">+</span> <span class="n">Amount</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The results:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.9.281-nightly, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338344 Hz, Resolution=427.6531 ns, Timer=TSC
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
  Job-UBMWVM : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit LegacyJIT/clrjit-v4.7.2053.0;compatjit-v4.7.2053.0
  Job-JDGXXX : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2053.0
  Job-PXPXXE : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
  Job-DULNTX : .NET Core 1.1.2 (Framework 4.6.25211.01), 64bit RyuJIT
  Job-GAPDXO : .NET Core 2.0.0 (Framework 4.6.00001.0), 64bit RyuJIT
  Job-ZXJTYF : Mono 4.4.1 (Visual Studio), 64bit 
  Job-NBVNXQ : Mono 5.2.0 (Visual Studio), 32bit 

LaunchCount=1  TargetCount=3  WarmupCount=3  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Jit</th>
        <th>Platform</th>
        <th>Runtime</th>
        <th>Toolchain</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Error</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>CallVirtualMethod</td>
        <td>LegacyJit</td>
        <td>X64</td>
        <td>Clr</td>
        <td>Default</td>
        <td style="text-align: right">3.222 ns</td>
        <td style="text-align: right">0.2984 ns</td>
        <td style="text-align: right">0.0169 ns</td>
      </tr>
      <tr>
        <td>CallVirtualMethod</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td>Clr</td>
        <td>Default</td>
        <td style="text-align: right">3.012 ns</td>
        <td style="text-align: right">0.3651 ns</td>
        <td style="text-align: right">0.0206 ns</td>
      </tr>
      <tr>
        <td>CallVirtualMethod</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td>Clr</td>
        <td>Default</td>
        <td style="text-align: right">2.928 ns</td>
        <td style="text-align: right">0.2941 ns</td>
        <td style="text-align: right">0.0166 ns</td>
      </tr>
      <tr>
        <td>CallVirtualMethod</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td>Core</td>
        <td>.NET Core 1.1</td>
        <td style="text-align: right">2.920 ns</td>
        <td style="text-align: right">0.1688 ns</td>
        <td style="text-align: right">0.0095 ns</td>
      </tr>
      <tr>
        <td>CallVirtualMethod</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td>Core</td>
        <td>.NET Core 2.0</td>
        <td style="text-align: right"><strong>2.222</strong> ns</td>
        <td style="text-align: right">0.6163 ns</td>
        <td style="text-align: right">0.0348 ns</td>
      </tr>
      <tr>
        <td>CallVirtualMethod</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td>Mono x64</td>
        <td>Default</td>
        <td style="text-align: right">5.114 ns</td>
        <td style="text-align: right">0.5626 ns</td>
        <td style="text-align: right">0.0318 ns</td>
      </tr>
      <tr>
        <td>CallVirtualMethod</td>
        <td>RyuJit</td>
        <td>X86</td>
        <td>Mono x86</td>
        <td>Default</td>
        <td style="text-align: right">9.610 ns</td>
        <td style="text-align: right">0.2672 ns</td>
        <td style="text-align: right">0.0151 ns</td>
      </tr>
    </tbody>
  </table>

</div>

<p>The disassembly result can be found <a href="https://adamsitnik.com/files/disasm/Jit_Devirtualization-disassembly-report.html">here</a>. The file was too big to embed in this blog post.</p>

<h2 id="getting-only-the-disassembly-without-running-the-benchmarks-for-a-long-time">Getting only the Disassembly without running the benchmarks for a long time</h2>

<p>Sometimes you might be interested only in the disassembly, not the results of the benchmarks. In that case you can use <strong>Job.Dry</strong> which runs the benchmark only <strong>once</strong>.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">JustDisassembly</span> <span class="p">:</span> <span class="n">ManualConfig</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="nf">JustDisassembly</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">Dry</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">Jit</span><span class="p">.</span><span class="n">RyuJit</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X64</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Core</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">CsProjCoreToolchain</span><span class="p">.</span><span class="n">NetCoreApp20</span><span class="p">));</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">Dry</span><span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">Jit</span><span class="p">.</span><span class="n">RyuJit</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Platform</span><span class="p">.</span><span class="n">X64</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">Runtime</span><span class="p">.</span><span class="n">Core</span><span class="p">).</span><span class="nf">With</span><span class="p">(</span><span class="n">CsProjCoreToolchain</span><span class="p">.</span><span class="n">NetCoreApp21</span><span class="p">));</span>

        <span class="nf">Add</span><span class="p">(</span><span class="n">DisassemblyDiagnoser</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="k">new</span> <span class="nf">DisassemblyDiagnoserConfig</span><span class="p">(</span><span class="n">printAsm</span><span class="p">:</span> <span class="k">true</span><span class="p">,</span> <span class="n">printPrologAndEpilog</span><span class="p">:</span> <span class="k">true</span><span class="p">,</span> <span class="n">recursiveDepth</span><span class="p">:</span> <span class="m">3</span><span class="p">)));</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h1 id="the-ultimate-combination">The Ultimate Combination</h1>

<p>Some time ago I implemented the <a href="https://adamsitnik.com/Hardware-Counters-Diagnoser/">Hardware Counters</a> diagnoser for BenchmarkDotNet. Ever since then I have wanted to combine the Instruction Pointers that come with the events with the code.</p>

<p>Now it is finally possible. ClrMD gives me the asm with IPs, ETW gives me hardware counters with IPs. That’s all I need.</p>

<p>Let’s use both diagnosers to answer the famous <em>“<a href="https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array">Why is it faster to process a sorted array than an unsorted array?</a></em>”.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Program</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">Main</span><span class="p">(</span><span class="kt">string</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">BenchmarkRunner</span><span class="p">.</span><span class="n">Run</span><span class="p">&lt;</span><span class="n">Cpu_BranchPerdictor</span><span class="p">&gt;();</span>
<span class="p">}</span>

<span class="p">[</span><span class="nf">HardwareCounters</span><span class="p">(</span><span class="n">HardwareCounter</span><span class="p">.</span><span class="n">BranchMispredictions</span><span class="p">,</span> <span class="n">HardwareCounter</span><span class="p">.</span><span class="n">BranchInstructions</span><span class="p">)]</span>
<span class="p">[</span><span class="nf">DisassemblyDiagnoser</span><span class="p">(</span><span class="n">printAsm</span><span class="p">:</span> <span class="k">true</span><span class="p">,</span> <span class="n">printSource</span><span class="p">:</span> <span class="k">true</span><span class="p">)]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">Cpu_BranchPerdictor</span>
<span class="p">{</span>
    <span class="k">private</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">N</span> <span class="p">=</span> <span class="m">32767</span><span class="p">;</span>
    <span class="k">private</span> <span class="k">readonly</span> <span class="kt">int</span><span class="p">[]</span> <span class="n">sorted</span><span class="p">,</span> <span class="n">unsorted</span><span class="p">;</span>

    <span class="k">public</span> <span class="nf">Cpu_BranchPerdictor</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">random</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Random</span><span class="p">(</span><span class="m">0</span><span class="p">);</span>
        <span class="n">unsorted</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">int</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
        <span class="n">sorted</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">int</span><span class="p">[</span><span class="n">N</span><span class="p">];</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
            <span class="n">sorted</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">=</span> <span class="n">unsorted</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">=</span> <span class="n">random</span><span class="p">.</span><span class="nf">Next</span><span class="p">(</span><span class="m">256</span><span class="p">);</span>
        <span class="n">Array</span><span class="p">.</span><span class="nf">Sort</span><span class="p">(</span><span class="n">sorted</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">private</span> <span class="k">static</span> <span class="kt">int</span> <span class="nf">Branch</span><span class="p">(</span><span class="kt">int</span><span class="p">[]</span> <span class="n">data</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="kt">int</span> <span class="n">sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">N</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">&gt;=</span> <span class="m">128</span><span class="p">)</span>
                <span class="n">sum</span> <span class="p">+=</span> <span class="n">data</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
        <span class="k">return</span> <span class="n">sum</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="nf">SortedBranch</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="nf">Branch</span><span class="p">(</span><span class="n">sorted</span><span class="p">);</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="nf">UnsortedBranch</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="nf">Branch</span><span class="p">(</span><span class="n">unsorted</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The results:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.9.281-nightly, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338344 Hz, Resolution=427.6531 ns, Timer=TSC
  [Host]     : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
  DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2053.0
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Error</th>
        <th style="text-align: right">StdDev</th>
        <th style="text-align: right">Mispredict rate</th>
        <th style="text-align: right">BranchInstructions/Op</th>
        <th style="text-align: right">BranchMispredictions/Op</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>SortedBranch</td>
        <td style="text-align: right">21.15 us</td>
        <td style="text-align: right">0.0550 us</td>
        <td style="text-align: right">0.0488 us</td>
        <td style="text-align: right">0.11%</td>
        <td style="text-align: right">61712</td>
        <td style="text-align: right">65</td>
      </tr>
      <tr>
        <td>UnsortedBranch</td>
        <td style="text-align: right">135.32 us</td>
        <td style="text-align: right">0.7503 us</td>
        <td style="text-align: right">0.7018 us</td>
        <td style="text-align: right">21.90%</td>
        <td style="text-align: right">80158</td>
        <td style="text-align: right">17555</td>
      </tr>
    </tbody>
  </table>

</div>
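<p>The mispredict rate column is simply the ratio of the two hardware counters. The short sketch below reproduces the arithmetic for the two rows above; the <code class="language-plaintext highlighter-rouge">MispredictRate</code> helper is a hypothetical name used only for this illustration, not a BenchmarkDotNet API.</p>

```csharp
using System;
using System.Globalization;

public static class MispredictRate
{
    // Fraction of executed branch instructions that were mispredicted.
    public static double Compute(long branchMispredictions, long branchInstructions)
        => (double)branchMispredictions / branchInstructions;

    // Formats the fraction as a percentage with two decimal places.
    public static string Format(double rate)
        => (rate * 100).ToString("F2", CultureInfo.InvariantCulture) + "%";
}
```

<p>For the sorted array, <code class="language-plaintext highlighter-rouge">MispredictRate.Format(MispredictRate.Compute(65, 61712))</code> gives 0.11%. For the unsorted one, <code class="language-plaintext highlighter-rouge">MispredictRate.Format(MispredictRate.Compute(17555, 80158))</code> gives 21.90%. Roughly every fifth branch is mispredicted, which explains the difference in the mean time.</p>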

<p>The new report:</p>

<p class="center"><img src="/images/disasm/hardwareCounters.png" alt="Hardware Counters" /></p>

<h2 id="how-it-works-1">How it works</h2>

<p>When we attach to the benchmarked process with ClrMD, we ask it for the asm instructions at a given address. That address is the Instruction Pointer (IP).</p>

<p>The other diagnoser <a href="https://adamsitnik.com/Hardware-Counters-ETW/">uses ETW</a> to gather the PMC events. Each event comes with a hardware counter type, an interval, an Instruction Pointer and a process ID.</p>

<p>When we detect that the user has enabled both diagnosers, we enable the <a href="https://github.com/dotnet/BenchmarkDotNet/blob/master/src/BenchmarkDotNet.Core/Exporters/InstructionPointerExporter.cs">Instruction Pointer exporter</a>. It eliminates the noise (events with IPs that don’t belong to the benchmarked code, such as the BenchmarkDotNet engine) and aggregates the results.</p>
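<p>The filtering and aggregation step can be sketched roughly as follows. This is not the code of the exporter: the <code class="language-plaintext highlighter-rouge">PmcSample</code> type, the <code class="language-plaintext highlighter-rouge">Aggregate</code> method and the idea of summing the interval of every sample are assumptions made for the sake of the example.</p>

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of a single PMC event, based on the fields listed above.
public sealed class PmcSample
{
    public PmcSample(string counter, ulong interval, ulong instructionPointer, int processId)
    {
        Counter = counter;
        Interval = interval;
        InstructionPointer = instructionPointer;
        ProcessId = processId;
    }

    public string Counter { get; }
    public ulong Interval { get; }
    public ulong InstructionPointer { get; }
    public int ProcessId { get; }
}

public static class InstructionPointerSketch
{
    // Sums the sampled intervals per (counter, IP) pair. Samples from other processes
    // and samples whose IP is not part of the disassembled code are treated as noise.
    public static Dictionary<(string Counter, ulong Ip), ulong> Aggregate(
        IEnumerable<PmcSample> samples, int benchmarkProcessId, ISet<ulong> disassembledIps)
    {
        var totals = new Dictionary<(string Counter, ulong Ip), ulong>();
        foreach (var sample in samples)
        {
            if (sample.ProcessId != benchmarkProcessId) continue;
            if (!disassembledIps.Contains(sample.InstructionPointer)) continue;

            var key = (Counter: sample.Counter, Ip: sample.InstructionPointer);
            totals.TryGetValue(key, out ulong current); // current stays 0 for a new key
            totals[key] = current + sample.Interval;
        }
        return totals;
    }
}
```

<p>The resulting totals can then be printed next to the asm instruction that lives at the given IP, which is the idea behind the combined report.</p>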

<h2 id="skid">Skid</h2>

<p>Please keep in mind that we just show what we get. The PMC events are usually delayed. They are collected in Event-Based Sampling (EBS) mode: every occurrence of the event increments the counter, and when the counter reaches the max interval value the event is fired with the current Instruction Pointer (<a href="https://openlab.web.cern.ch/sites/openlab.web.cern.ch/files/technical_documents/TheOverheadOfProfilingUsingPMUhardwareCounters.pdf">good explanation</a>). We try to overcome the side effects of this by running a lot of iterations of the benchmarked code. If your processor supports PEBS (Precise Event-Based Sampling), it should also help.</p>

<p class="center"><img src="/images/disasm/hardwareCounters_skid.png" alt="Skid" /></p>

<p>As you can see, instructions without branches report branching events. I used arrows to show the real instructions for each branch.</p>
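<p>The effect can be reproduced with a toy simulation of event-based sampling. Everything here is a simplification and an assumption: the loop body is modeled as an array of flags, the skid is a fixed number of instructions (on real hardware it varies), and <code class="language-plaintext highlighter-rouge">SkidSimulation</code> is a made-up name.</p>

```csharp
using System;

public static class SkidSimulation
{
    // Executes a loop body `iterations` times. isBranch[i] tells whether instruction i
    // triggers the counted event. Every `interval` events one sample is recorded, but it is
    // attributed to the instruction that is `skid` positions after the one that caused it.
    public static int[] Sample(bool[] isBranch, int iterations, int interval, int skid)
    {
        var samplesPerInstruction = new int[isBranch.Length];
        int counter = 0;

        for (int iteration = 0; iteration < iterations; iteration++)
        {
            for (int ip = 0; ip < isBranch.Length; ip++)
            {
                if (!isBranch[ip]) continue; // only branches increment the counter

                counter++;
                if (counter == interval)
                {
                    counter = 0;
                    // The sample lands on a later instruction (wrapping around the loop).
                    samplesPerInstruction[(ip + skid) % isBranch.Length]++;
                }
            }
        }
        return samplesPerInstruction;
    }
}
```

<p>For a four-instruction loop where only the third instruction is a branch, 1000 iterations with an interval of 10 and a skid of 1 produce 100 samples, and all of them are attributed to the fourth instruction, which is not a branch at all. With a skid of 0 the same samples land on the branch itself.</p>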

<p>If you are interested in learning more about skid testing, I encourage you to try the simple but very smart “<a href="https://github.com/brendangregg/skid-testing">Processor PMC event skid testing</a>” by Brendan Gregg. In his case it was over 99% skids.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Disassembly Diagnoser Disassembly Diagnoser is the new diagnoser for BenchmarkDotNet that I have just finished. It was released as part of 0.10.10. It allows to disassemble the benchmarked .NET code: to ASM: desktop .NET: LegacyJit (32 &amp; 64 bit), RyuJIT (64 bit) .NET Core 1.1+ (including .NET Core 2.0) for RyuJIT (64 bit) Mono: 32 &amp; 64 bit, including LLVM to IL and corresponding C# code: desktop .NET: LegacyJit (32 &amp; 64 bit), RyuJIT (64 bit) .NET Core: 1.1+ (including .NET Core 2.0) With a single config!]]></summary></entry><entry><title type="html">Span</title><link href="https://adamsitnik.com/Span/" rel="alternate" type="text/html" title="Span" /><published>2017-07-13T00:00:00+00:00</published><updated>2017-07-13T00:00:00+00:00</updated><id>https://adamsitnik.com/Span</id><content type="html" xml:base="https://adamsitnik.com/Span/"><![CDATA[<p>tl;dr Use Span&lt;T&gt; to work with ANY kind of memory in a safe and very efficient way. Simplify your APIs and use the full power of unmanaged memory!</p>

<h1 class="no_toc" id="contents">Contents</h1>

<ul id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#the-problem" id="markdown-toc-the-problem">The Problem</a></li>
  <li><a href="#span-is-the-solution" id="markdown-toc-span-is-the-solution">Span is the Solution</a>    <ul>
      <li><a href="#prerequisites" id="markdown-toc-prerequisites">Prerequisites</a></li>
      <li><a href="#using-span" id="markdown-toc-using-span">Using Span</a></li>
      <li><a href="#api-simplicity" id="markdown-toc-api-simplicity">API Simplicity</a></li>
      <li><a href="#how-does-it-work" id="markdown-toc-how-does-it-work">How does it work?</a></li>
    </ul>
  </li>
  <li><a href="#performance" id="markdown-toc-performance">Performance</a>    <ul>
      <li><a href="#slicing-without-managed-heap-allocations" id="markdown-toc-slicing-without-managed-heap-allocations">Slicing without managed heap allocations!</a></li>
      <li><a href="#slow-vs-fast-span" id="markdown-toc-slow-vs-fast-span">Slow vs Fast Span</a></li>
      <li><a href="#span-vs-array" id="markdown-toc-span-vs-array">Span vs Array</a></li>
    </ul>
  </li>
  <li><a href="#the-limitations" id="markdown-toc-the-limitations">The Limitations</a>    <ul>
      <li><a href="#stack-only" id="markdown-toc-stack-only">Stack-only</a></li>
      <li><a href="#no-heap" id="markdown-toc-no-heap">No Heap</a>        <ul>
          <li><a href="#span-must-not-be-a-field-in-non-stackonly-type" id="markdown-toc-span-must-not-be-a-field-in-non-stackonly-type">Span must not be a field in non-stackonly type</a></li>
          <li><a href="#span-must-not-implement-any-existing-interface" id="markdown-toc-span-must-not-implement-any-existing-interface">Span must not implement any existing interface</a></li>
          <li><a href="#span-must-not-be-a-parameter-for-async-method" id="markdown-toc-span-must-not-be-a-parameter-for-async-method">Span must not be a parameter for async method</a></li>
          <li><a href="#span-must-not-be-a-generic-type-argument" id="markdown-toc-span-must-not-be-a-generic-type-argument">Span must not be a generic type argument</a></li>
        </ul>
      </li>
      <li><a href="#memory" id="markdown-toc-memory">Memory</a></li>
    </ul>
  </li>
  <li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
  <li><a href="#sources" id="markdown-toc-sources">Sources</a></li>
</ul>

<h1 id="introduction">Introduction</h1>

<p>C# gives us great flexibility when it comes to using different kinds of memory, but the majority of developers use only the managed kind. Let’s take a brief look at what C# has to offer:</p>

<ul>
  <li>Stack memory - allocated on the Stack with the <code class="language-plaintext highlighter-rouge">stackalloc</code> keyword. Allocation and deallocation are very fast. The Stack is very small (usually &lt; 1 MB) and fits well into the CPU cache. If you try to allocate more than it can hold, you get a <code class="language-plaintext highlighter-rouge">StackOverflowException</code>, which cannot be caught and immediately kills the entire process. Usage is also limited by the very short lifetime of the stack - when the method ends, the stack is unwound together with its memory. <code class="language-plaintext highlighter-rouge">stackalloc</code> is commonly used for short operations that must not allocate any managed memory. An example is the very fast logging of ETW events in <a href="https://github.com/dotnet/corefx/blob/master/src/System.Buffers/src/System/Buffers/ArrayPoolEventSource.cs#L34">corefx</a>: it has to be as fast as possible and needs very little memory (so the size limitation is not a problem).</li>
</ul>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">internal</span> <span class="k">unsafe</span> <span class="k">void</span> <span class="nf">BufferRented</span><span class="p">(</span><span class="kt">int</span> <span class="n">bufferId</span><span class="p">,</span> <span class="kt">int</span> <span class="n">bufferSize</span><span class="p">,</span> <span class="kt">int</span> <span class="n">poolId</span><span class="p">,</span> <span class="kt">int</span> <span class="n">bucketId</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">EventData</span><span class="p">*</span> <span class="n">payload</span> <span class="p">=</span> <span class="k">stackalloc</span> <span class="n">EventData</span><span class="p">[</span><span class="m">4</span><span class="p">];</span>
    <span class="n">payload</span><span class="p">[</span><span class="m">0</span><span class="p">].</span><span class="n">Size</span> <span class="p">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>
    <span class="n">payload</span><span class="p">[</span><span class="m">0</span><span class="p">].</span><span class="n">DataPointer</span> <span class="p">=</span> <span class="p">((</span><span class="n">IntPtr</span><span class="p">)(&amp;</span><span class="n">bufferId</span><span class="p">));</span>
    <span class="n">payload</span><span class="p">[</span><span class="m">1</span><span class="p">].</span><span class="n">Size</span> <span class="p">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>
    <span class="n">payload</span><span class="p">[</span><span class="m">1</span><span class="p">].</span><span class="n">DataPointer</span> <span class="p">=</span> <span class="p">((</span><span class="n">IntPtr</span><span class="p">)(&amp;</span><span class="n">bufferSize</span><span class="p">));</span>
    <span class="n">payload</span><span class="p">[</span><span class="m">2</span><span class="p">].</span><span class="n">Size</span> <span class="p">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>
    <span class="n">payload</span><span class="p">[</span><span class="m">2</span><span class="p">].</span><span class="n">DataPointer</span> <span class="p">=</span> <span class="p">((</span><span class="n">IntPtr</span><span class="p">)(&amp;</span><span class="n">poolId</span><span class="p">));</span>
    <span class="n">payload</span><span class="p">[</span><span class="m">3</span><span class="p">].</span><span class="n">Size</span> <span class="p">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">int</span><span class="p">);</span>
    <span class="n">payload</span><span class="p">[</span><span class="m">3</span><span class="p">].</span><span class="n">DataPointer</span> <span class="p">=</span> <span class="p">((</span><span class="n">IntPtr</span><span class="p">)(&amp;</span><span class="n">bucketId</span><span class="p">));</span>
    <span class="nf">WriteEventCore</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="n">payload</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<ul>
  <li>Unmanaged memory - allocated on the unmanaged heap (invisible to the GC) by calling the <code class="language-plaintext highlighter-rouge">Marshal.AllocHGlobal</code> or <code class="language-plaintext highlighter-rouge">Marshal.AllocCoTaskMem</code> methods. This memory must be released by the developer with an explicit call to <code class="language-plaintext highlighter-rouge">Marshal.FreeHGlobal</code> or <code class="language-plaintext highlighter-rouge">Marshal.FreeCoTaskMem</code>. Using it does not add any extra pressure on the GC. It’s most commonly used to avoid GC work in scenarios where you would normally allocate huge arrays of value types without pointers. <a href="https://github.com/aspnet/KestrelHttpServer/search?utf8=%E2%9C%93&amp;q=Marshal.Alloc&amp;type=">Here</a> you can see some real-life use cases from Kestrel.</li>
  <li>Managed memory - allocated with the <code class="language-plaintext highlighter-rouge">new</code> operator. It’s called managed because it’s managed by the Garbage Collector (GC). The GC decides when to free the memory, so the developer doesn’t need to worry about it. As described in one of my <a href="https://adamsitnik.com/Array-Pool/#large-object-heap-loh">previous</a> blog posts, the GC divides managed objects into two categories:
    <ul>
      <li>Small objects (<code class="language-plaintext highlighter-rouge">size &lt; 85 000 bytes</code>) - allocated in the generational part of the managed heap. The allocation of small objects is fast. When they are promoted to older generations, their memory is usually <strong>copied</strong>. The <strong>deallocation is non-deterministic and blocking</strong>. Short-lived objects are cleaned up in the very fast Gen 0 (or Gen 1) collection. Long-lived ones are subject to the Gen 2 collection, which is usually very time-consuming.</li>
      <li>Large objects (<code class="language-plaintext highlighter-rouge">size &gt;= 85 000 bytes</code>) - allocated in the Large Object Heap (LOH). Managed with the free list algorithm, which offers slower allocation and can lead to memory fragmentation. The advantage is that large objects are by default never copied. This behavior can be changed <a href="https://blogs.msdn.microsoft.com/mariohewardt/2013/06/26/no-more-memory-fragmentation-on-the-net-large-object-heap/">on demand</a>. LOH has very expensive deallocation (<a href="https://adamsitnik.com/Array-Pool/#the-problem">Full GC</a>) which can be minimized by using <a href="https://adamsitnik.com/Array-Pool/#the-solution">ArrayPool</a>. 
<!--more--></li>
    </ul>
  </li>
</ul>
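<p>To make the three kinds of memory concrete, here is a minimal sketch that touches each of them in one method. The class and method names are made up for illustration, and the project is assumed to allow <code class="language-plaintext highlighter-rouge">unsafe</code> code (<code class="language-plaintext highlighter-rouge">AllowUnsafeBlocks</code>).</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Runtime.InteropServices;

public static class MemoryKinds // hypothetical name, for illustration only
{
    public static unsafe int SumOfAllThree()
    {
        // Stack memory: released automatically when the method returns.
        int* onStack = stackalloc int[1];
        onStack[0] = 1;

        // Managed memory: allocated with new, released by the GC at some point.
        int[] onManagedHeap = new int[1];
        onManagedHeap[0] = 2;

        // Unmanaged memory: invisible to the GC, must be freed explicitly.
        IntPtr unmanaged = Marshal.AllocHGlobal(sizeof(int));
        try
        {
            Marshal.WriteInt32(unmanaged, 3);
            return onStack[0] + onManagedHeap[0] + Marshal.ReadInt32(unmanaged);
        }
        finally
        {
            Marshal.FreeHGlobal(unmanaged);
        }
    }
}
</code></pre></div></div>

<p>Note that each kind of memory needs a different type to represent it (<code class="language-plaintext highlighter-rouge">int*</code>, <code class="language-plaintext highlighter-rouge">int[]</code>, <code class="language-plaintext highlighter-rouge">IntPtr</code>), which is exactly what makes writing a single API for all of them awkward.</p>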

<p><strong>Note</strong>: When I say that a given GC operation is slow, I don’t mean that the .NET GC is slow. <a href="https://blogs.msdn.microsoft.com/maoni/2017/02/18/how-to-evaluate-info-you-read-on-garbage-collectors/">.NET has a great GC</a>, but no GC is better than any GC.</p>

<h1 id="the-problem">The Problem</h1>

<p>When was the last time you saw a public .NET API that accepted pointers? During my recent NDC Oslo talk, I <a href="https://www.youtube.com/watch?v=CSPSvBeqJ9c&amp;feature=youtu.be&amp;t=741">asked</a> the audience who had ever used <code class="language-plaintext highlighter-rouge">stackalloc</code>. One person (Kristian) out of more than one hundred raised a hand. Why is it so uncommon to use native memory in C#?</p>

<p>Let’s try to answer this question by designing an API for parsing integers for all kinds of memory.</p>

<p>We start with a <code class="language-plaintext highlighter-rouge">string</code>, which is a managed representation of a buffer of characters.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">Parse</span><span class="p">(</span><span class="kt">string</span> <span class="n">managedMemory</span><span class="p">);</span> <span class="c1">// allows us to parse the whole string</span>
</code></pre></div></div>

<p>What if we want to parse only a selected part of the string?</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">Parse</span><span class="p">(</span><span class="kt">string</span> <span class="n">managedMemory</span><span class="p">,</span> <span class="kt">int</span> <span class="n">startIndex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">);</span> <span class="c1">// allows us to parse part of the string</span>
</code></pre></div></div>

<p>Ok, so let’s support the unmanaged memory now:</p>
<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">Parse</span><span class="p">(</span><span class="kt">char</span><span class="p">*</span> <span class="n">pointerToUnmanagedMemory</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">);</span> <span class="c1">// allows us to parse characters stored on the unmanaged heap / stack</span>
<span class="k">unsafe</span> <span class="kt">int</span> <span class="nf">Parse</span><span class="p">(</span><span class="kt">char</span><span class="p">*</span> <span class="n">pointerToUnmanagedMemory</span><span class="p">,</span> <span class="kt">int</span> <span class="n">startIndex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">);</span> <span class="c1">// allows us to parse part of the characters stored on the unmanaged heap / stack</span>
</code></pre></div></div>

<p>It’s already four overloads and I am pretty sure that I have missed something ;)</p>

<p>Now let’s design an API for copying blocks of memory:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">void</span> <span class="n">Copy</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">T</span><span class="p">[]</span> <span class="n">source</span><span class="p">,</span> <span class="n">T</span><span class="p">[]</span> <span class="n">destination</span><span class="p">);</span> 
<span class="k">void</span> <span class="n">Copy</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">T</span><span class="p">[]</span> <span class="n">source</span><span class="p">,</span> <span class="kt">int</span> <span class="n">sourceStartIndex</span><span class="p">,</span> <span class="n">T</span><span class="p">[]</span> <span class="n">destination</span><span class="p">,</span> <span class="kt">int</span> <span class="n">destinationStartIndex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">elementsCount</span><span class="p">);</span>
<span class="k">unsafe</span> <span class="k">void</span> <span class="n">Copy</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">void</span><span class="p">*</span> <span class="n">source</span><span class="p">,</span> <span class="k">void</span><span class="p">*</span> <span class="n">destination</span><span class="p">,</span> <span class="kt">int</span> <span class="n">elementsCount</span><span class="p">);</span>
<span class="k">unsafe</span> <span class="k">void</span> <span class="n">Copy</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">void</span><span class="p">*</span> <span class="n">source</span><span class="p">,</span> <span class="kt">int</span> <span class="n">sourceStartIndex</span><span class="p">,</span> <span class="k">void</span><span class="p">*</span> <span class="n">destination</span><span class="p">,</span> <span class="kt">int</span> <span class="n">destinationStartIndex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">elementsCount</span><span class="p">);</span>
<span class="k">unsafe</span> <span class="k">void</span> <span class="n">Copy</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">void</span><span class="p">*</span> <span class="n">source</span><span class="p">,</span> <span class="kt">int</span> <span class="n">sourceLength</span><span class="p">,</span> <span class="n">T</span><span class="p">[]</span> <span class="n">destination</span><span class="p">);</span>
<span class="k">unsafe</span> <span class="k">void</span> <span class="n">Copy</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">void</span><span class="p">*</span> <span class="n">source</span><span class="p">,</span> <span class="kt">int</span> <span class="n">sourceStartIndex</span><span class="p">,</span> <span class="n">T</span><span class="p">[]</span> <span class="n">destination</span><span class="p">,</span> <span class="kt">int</span> <span class="n">destinationStartIndex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">elementsCount</span><span class="p">);</span>
</code></pre></div></div>

<p><strong>Update:</strong> We don’t need to worry about handling <code class="language-plaintext highlighter-rouge">long</code> parameters. The Array in .NET has a method <code class="language-plaintext highlighter-rouge">GetLongLength</code>, but it never returns a value bigger than <code class="language-plaintext highlighter-rouge">int.MaxValue</code>.</p>

<p><strong>As you can see, supporting every kind of memory used to be hard and problematic.</strong></p>

<h1 id="span-is-the-solution">Span is the Solution</h1>

<p><code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code> (<a href="https://github.com/joeduffy/slice.net">previously</a> called <code class="language-plaintext highlighter-rouge">Slice</code>) is a simple <a href="https://adamsitnik.com/Value-Types-vs-Reference-Types/">value type</a> that allows us to work with any kind of contiguous memory:</p>
<ul>
  <li>Unmanaged memory buffers</li>
  <li>Arrays and subarrays</li>
  <li>Strings and substrings</li>
</ul>

<p>It <strong>ensures memory and type safety</strong> and has almost no overhead.</p>

<h2 id="prerequisites">Prerequisites</h2>

<p>To work with Span you need to install the latest <a href="https://www.nuget.org/packages/System.Memory/">System.Memory</a> NuGet package and set LangVersion to C# 7.2 or newer.</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;PropertyGroup&gt;</span>
  <span class="nt">&lt;LangVersion&gt;</span>7.2<span class="nt">&lt;/LangVersion&gt;</span>
<span class="nt">&lt;/PropertyGroup&gt;</span>
</code></pre></div></div>

<p><strong>Note:</strong> Older compilers might give you errors like this:</p>

<pre><code class="language-log">Error CS8107 Feature 'ref structs' is not available in C# 7.0. Please use language version 7.2 or greater.
</code></pre>

<h2 id="using-span">Using Span</h2>

<p>You can think of it as an array that does all the pointer arithmetic for you, but internally can point to any kind of memory.</p>

<p>We can create <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code> for unmanaged memory:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">stackMemory</span> <span class="p">=</span> <span class="k">stackalloc</span> <span class="kt">byte</span><span class="p">[</span><span class="m">256</span><span class="p">];</span> <span class="c1">// C# 7.2 </span>

<span class="n">IntPtr</span> <span class="n">unmanagedHandle</span> <span class="p">=</span> <span class="n">Marshal</span><span class="p">.</span><span class="nf">AllocHGlobal</span><span class="p">(</span><span class="m">256</span><span class="p">);</span>
<span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">unmanaged</span> <span class="p">=</span> <span class="k">new</span> <span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;(</span><span class="n">unmanagedHandle</span><span class="p">.</span><span class="nf">ToPointer</span><span class="p">(),</span> <span class="m">256</span><span class="p">);</span> 
<span class="n">Marshal</span><span class="p">.</span><span class="nf">FreeHGlobal</span><span class="p">(</span><span class="n">unmanagedHandle</span><span class="p">);</span>
</code></pre></div></div>
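<p>The snippet above frees the unmanaged block right after creating the span, so the span must not be used after that last line. A minimal sketch of a safer shape, assuming a project that allows <code class="language-plaintext highlighter-rouge">unsafe</code> code, keeps all work with the span inside a <code class="language-plaintext highlighter-rouge">try</code> block and frees the memory in <code class="language-plaintext highlighter-rouge">finally</code>. The class and method names are hypothetical.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Runtime.InteropServices;

public static class UnmanagedSpanExample // hypothetical name, for illustration only
{
    public static unsafe int FillAndSum()
    {
        const int length = 256;
        IntPtr unmanagedHandle = Marshal.AllocHGlobal(length);
        try
        {
            // The span is only valid while the unmanaged block is alive.
            Span&lt;byte&gt; unmanaged = new Span&lt;byte&gt;(unmanagedHandle.ToPointer(), length);
            unmanaged.Fill(1);

            int sum = 0;
            for (int i = 0; i &lt; unmanaged.Length; i++)
            {
                sum += unmanaged[i];
            }
            return sum;
        }
        finally
        {
            // Free the memory only after the last use of the span.
            Marshal.FreeHGlobal(unmanagedHandle);
        }
    }
}
</code></pre></div></div>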

<p>There is even an implicit cast operator from <code class="language-plaintext highlighter-rouge">T[]</code> to <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code>:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">char</span><span class="p">[]</span> <span class="n">array</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">char</span><span class="p">[]</span> <span class="p">{</span> <span class="sc">'i'</span><span class="p">,</span> <span class="sc">'m'</span><span class="p">,</span> <span class="sc">'p'</span><span class="p">,</span> <span class="sc">'l'</span><span class="p">,</span> <span class="sc">'i'</span><span class="p">,</span> <span class="sc">'c'</span><span class="p">,</span> <span class="sc">'i'</span><span class="p">,</span> <span class="sc">'t'</span> <span class="p">};</span>
<span class="n">Span</span><span class="p">&lt;</span><span class="kt">char</span><span class="p">&gt;</span> <span class="n">fromArray</span> <span class="p">=</span> <span class="n">array</span><span class="p">;</span> <span class="c1">// implicit cast</span>
</code></pre></div></div>

<p>There is also <code class="language-plaintext highlighter-rouge">ReadOnlySpan&lt;T&gt;</code> which can be used to work with strings or other immutable types.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">char</span><span class="p">&gt;</span> <span class="n">fromString</span> <span class="p">=</span> <span class="s">"Strings in .NET are immutable"</span><span class="p">.</span><span class="nf">AsSpan</span><span class="p">();</span>
</code></pre></div></div>

<p>Once you create it, you can use it the way you would typically use an array - it has a <code class="language-plaintext highlighter-rouge">Length</code> property and an <code class="language-plaintext highlighter-rouge">[indexer]</code>, which allows you to get and set the values.</p>

<p>The simplified API (full can be found <a href="https://github.com/dotnet/corefx/blob/master/src/System.Memory/ref/System.Memory.cs">here</a>):</p>
<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">Span</span><span class="p">(</span><span class="n">T</span><span class="p">[]</span> <span class="n">array</span><span class="p">);</span>
<span class="nf">Span</span><span class="p">(</span><span class="n">T</span><span class="p">[]</span> <span class="n">array</span><span class="p">,</span> <span class="kt">int</span> <span class="n">startIndex</span><span class="p">);</span>
<span class="nf">Span</span><span class="p">(</span><span class="n">T</span><span class="p">[]</span> <span class="n">array</span><span class="p">,</span> <span class="kt">int</span> <span class="n">startIndex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">);</span>
<span class="k">unsafe</span> <span class="nf">Span</span><span class="p">(</span><span class="k">void</span><span class="p">*</span> <span class="n">memory</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">);</span>

<span class="kt">int</span> <span class="n">Length</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="p">}</span>
<span class="k">ref</span> <span class="n">T</span> <span class="k">this</span><span class="p">[</span><span class="kt">int</span> <span class="n">index</span><span class="p">]</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="p">}</span> <span class="c1">// returns a reference, so it can be used for both reading and writing</span>

<span class="n">Span</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="nf">Slice</span><span class="p">(</span><span class="kt">int</span> <span class="n">start</span><span class="p">);</span>
<span class="n">Span</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="nf">Slice</span><span class="p">(</span><span class="kt">int</span> <span class="n">start</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">);</span>

<span class="k">void</span> <span class="nf">Clear</span><span class="p">();</span>
<span class="k">void</span> <span class="nf">Fill</span><span class="p">(</span><span class="n">T</span> <span class="k">value</span><span class="p">);</span>

<span class="k">void</span> <span class="nf">CopyTo</span><span class="p">(</span><span class="n">Span</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">destination</span><span class="p">);</span>
<span class="kt">bool</span> <span class="nf">TryCopyTo</span><span class="p">(</span><span class="n">Span</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">destination</span><span class="p">);</span>
</code></pre></div></div>
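<p>A short sketch of how these members fit together, using a managed array as the backing memory. The class name is made up for illustration; the comments describe the state of the array after each call.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class SpanApiExample // hypothetical name, for illustration only
{
    public static int[] Run(out bool copied)
    {
        int[] array = { 1, 2, 3, 4, 5, 6 };
        Span&lt;int&gt; span = array;                 // implicit cast, no copy

        span.Slice(4).Fill(9);                  // array: 1, 2, 3, 4, 9, 9
        span.Slice(0, 2).Clear();               // array: 0, 0, 3, 4, 9, 9

        int[] destination = new int[2];
        span.Slice(2, 2).CopyTo(destination);   // destination: 3, 4

        // TryCopyTo returns false instead of throwing
        // when the destination is too small (6 elements do not fit into 2).
        copied = span.TryCopyTo(destination);

        span[0] = 7;                            // the indexer writes through to the array
        return array;                           // 7, 0, 3, 4, 9, 9
    }
}
</code></pre></div></div>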

<h2 id="api-simplicity">API Simplicity</h2>

<p>Let’s redesign parsing and copying APIs mentioned earlier and take advantage of <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code>:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">Parse</span><span class="p">(</span><span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">char</span><span class="p">&gt;</span> <span class="n">anyMemory</span><span class="p">);</span>
<span class="k">void</span> <span class="n">Copy</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">source</span><span class="p">,</span> <span class="n">Span</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">destination</span><span class="p">);</span>
</code></pre></div></div>

<p><strong>As simple as it gets! Span abstracts almost everything about memory. It makes using unmanaged memory much easier for both API producers and consumers.</strong></p>
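<p>To show that one signature really covers every case, here is a minimal sketch of such a <code class="language-plaintext highlighter-rouge">Parse</code> method. It is a deliberately simplified, hypothetical implementation: it accepts only non-negative decimal digits and does not check for overflow.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class IntParser // hypothetical name, for illustration only
{
    // Simplified: digits only, no sign, no whitespace, no overflow checks.
    public static int Parse(ReadOnlySpan&lt;char&gt; anyMemory)
    {
        if (anyMemory.IsEmpty)
            throw new FormatException("Input is empty.");

        int result = 0;
        for (int i = 0; i &lt; anyMemory.Length; i++)
        {
            int digit = anyMemory[i] - '0';
            if (digit &lt; 0 || digit &gt; 9)
                throw new FormatException("Input contains a non-digit character.");

            result = result * 10 + digit;
        }
        return result;
    }

    public static int[] Demo()
    {
        int fromString = Parse("12345".AsSpan());                   // the whole string
        int fromSubstring = Parse("id=678;".AsSpan().Slice(3, 3));  // part of a string, no Substring call

        char[] array = { '4', '2' };
        int fromArray = Parse(array);                               // managed array, implicit cast

        Span&lt;char&gt; onStack = stackalloc char[1];                    // stack memory, no unsafe needed
        onStack[0] = '7';
        int fromStack = Parse(onStack);

        return new[] { fromString, fromSubstring, fromArray, fromStack };
    }
}
</code></pre></div></div>

<p>The same method body serves strings, substrings, arrays and stack memory; the caller decides where the characters live.</p>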

<h2 id="how-does-it-work">How does it work?</h2>

<p>There are two versions of Span:</p>

<ul>
  <li>For the runtimes existing prior to Span.</li>
  <li>For the new runtimes, which implement native support for Spans.</li>
</ul>

<p>The version for the existing runtimes is implemented in <a href="https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/Span.cs">corefx</a>. .NET Core 2.0 is so far the only runtime that <a href="https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Span.Fast.cs">implements</a> native support for Span.</p>

<p>The Span for existing runtimes consists of three fields: reference (represented by a simple reference type field), byteOffset (IntPtr) and length (int, not long). When we access the n-th value, the indexer does the pointer arithmetic for us (pseudocode):</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ref</span> <span class="n">T</span> <span class="k">this</span><span class="p">[</span><span class="kt">int</span> <span class="n">index</span><span class="p">]</span>
<span class="p">{</span>
    <span class="k">get</span> <span class="p">=&gt;</span> <span class="k">ref</span> <span class="p">((</span><span class="k">ref</span> <span class="n">reference</span> <span class="p">+</span> <span class="n">byteOffset</span><span class="p">)</span> <span class="p">+</span> <span class="n">index</span> <span class="p">*</span> <span class="nf">sizeOf</span><span class="p">(</span><span class="n">T</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>

<p class="center"><img src="/images/span/existingRuntimes.png" alt="Span for existing runtimes" /></p>
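<p>The pointer arithmetic from the pseudocode can be approximated with public APIs. The sketch below is not how the library implements the indexer; it only mimics the idea with <code class="language-plaintext highlighter-rouge">MemoryMarshal.GetReference</code> and <code class="language-plaintext highlighter-rouge">Unsafe.Add</code>, assuming the <code class="language-plaintext highlighter-rouge">System.Runtime.CompilerServices.Unsafe</code> package is available on your target. The class and method names are hypothetical.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class IndexerSketch // hypothetical name, for illustration only
{
    // Roughly what span[index] does: a bounds check,
    // then a move of "index" elements away from the first element.
    public static ref T ElementAt&lt;T&gt;(Span&lt;T&gt; span, int index)
    {
        if ((uint)index &gt;= (uint)span.Length)
            throw new IndexOutOfRangeException();

        ref T first = ref MemoryMarshal.GetReference(span);
        return ref Unsafe.Add(ref first, index); // advances by index * sizeof(T) bytes
    }
}
</code></pre></div></div>

<p>Because the method returns a reference, <code class="language-plaintext highlighter-rouge">IndexerSketch.ElementAt(span, 2) = 42;</code> writes directly into the memory the span points to.</p>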

<p>The new GC knows how to deal with Span: the reference and byteOffset fields are merged into a single interior pointer. The GC is aware that it is a merged reference and has native support for updating it during the Compact phase of a garbage collection (when the underlying object, such as an array, is moved, the reference needs to be updated while the offset within the object stays the same).</p>

<p class="center"><img src="/images/span/newRuntimes.png" alt="Span for new runtimes" /></p>

<p>IL and C# do not support <code class="language-plaintext highlighter-rouge">ref T</code> fields. .NET Core 2.0+ implements it by representing it via <a href="https://github.com/dotnet/coreclr/blob/3b6d990eef152bac4134318940b08b050982c0f0/src/System.Private.CoreLib/shared/System/ByReference.cs">ByReference</a> type, which is just a JIT intrinsic.</p>

<h1 id="performance">Performance</h1>

<h2 id="slicing-without-managed-heap-allocations">Slicing without managed heap allocations!</h2>

<p>Slicing (taking a part of some memory) is a core feature of Span. It does not copy any memory; it simply creates a new Span with a different pointer and length (and offset for the old runtimes).</p>

<p>Why is it so important? Because so far, any time we wanted a substring of a string, .NET allocated a new string for us and copied the desired content into it. Pseudocode:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">string</span> <span class="nf">Substring</span><span class="p">(</span><span class="kt">string</span> <span class="n">text</span><span class="p">,</span> <span class="kt">int</span> <span class="n">startIndex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">)</span>
<span class="p">{</span>
    <span class="kt">string</span> <span class="n">result</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">string</span><span class="p">(</span><span class="n">length</span><span class="p">);</span> <span class="c1">// ALLOCATION!</span>
        
    <span class="n">Memory</span><span class="p">.</span><span class="nf">Copy</span><span class="p">(</span><span class="n">source</span><span class="p">:</span> <span class="n">text</span><span class="p">,</span> <span class="n">destination</span><span class="p">:</span> <span class="n">result</span><span class="p">,</span> <span class="n">startIndex</span><span class="p">,</span> <span class="n">length</span><span class="p">);</span> <span class="c1">// COPYING MEMORY</span>
        
    <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>With Span, there is no allocation! Pseudocode:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">char</span><span class="p">&gt;</span> <span class="nf">Slice</span><span class="p">(</span><span class="kt">string</span> <span class="n">text</span><span class="p">,</span> <span class="kt">int</span> <span class="n">startIndex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">)</span>
    <span class="p">=&gt;</span> <span class="k">new</span> <span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">char</span><span class="p">&gt;(</span>
        <span class="k">ref</span> <span class="n">text</span><span class="p">[</span><span class="m">0</span><span class="p">]</span> <span class="p">+</span> <span class="p">(</span><span class="n">startIndex</span> <span class="p">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">char</span><span class="p">)),</span> 
        <span class="n">length</span><span class="p">);</span>
</code></pre></div></div>
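<p>A small sketch using only the public API shows that a slice is a view over the original memory, not a copy. The class name is made up for illustration.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class SlicingExample // hypothetical name, for illustration only
{
    public static char[] Demo(out string word)
    {
        string text = "Hello, Span!";
        ReadOnlySpan&lt;char&gt; slice = text.AsSpan().Slice(7, 4); // "Span", still backed by text
        word = slice.ToString();                               // a new string is allocated only here

        char[] buffer = { 'a', 'b', 'c', 'd' };
        Span&lt;char&gt; middle = new Span&lt;char&gt;(buffer, 1, 2);      // a view over 'b' and 'c'
        middle[0] = 'X';                                       // writes through to buffer[1]

        return buffer;                                         // a, X, c, d
    }
}
</code></pre></div></div>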

<p>Let’s measure the difference with <a href="https://benchmarkdotnet.org/">BenchmarkDotNet</a>:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">MemoryDiagnoser</span><span class="p">]</span>
<span class="p">[</span><span class="nf">Config</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">DontForceGcCollectionsConfig</span><span class="p">))]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">SubstringVsSubslice</span>
<span class="p">{</span>
    <span class="k">private</span> <span class="kt">string</span> <span class="n">Text</span><span class="p">;</span>

    <span class="p">[</span><span class="nf">Params</span><span class="p">(</span><span class="m">10</span><span class="p">,</span> <span class="m">1000</span><span class="p">)]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">CharactersCount</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>

    <span class="p">[</span><span class="n">GlobalSetup</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">Setup</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">Text</span> <span class="p">=</span> <span class="k">new</span> <span class="kt">string</span><span class="p">(</span><span class="n">Enumerable</span><span class="p">.</span><span class="nf">Repeat</span><span class="p">(</span><span class="sc">'a'</span><span class="p">,</span> <span class="n">CharactersCount</span><span class="p">).</span><span class="nf">ToArray</span><span class="p">());</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="kt">string</span> <span class="nf">Substring</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">Text</span><span class="p">.</span><span class="nf">Substring</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">Text</span><span class="p">.</span><span class="n">Length</span> <span class="p">/</span> <span class="m">2</span><span class="p">);</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">Baseline</span> <span class="p">=</span> <span class="k">true</span><span class="p">)]</span>
    <span class="k">public</span> <span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">char</span><span class="p">&gt;</span> <span class="nf">Slice</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">Text</span><span class="p">.</span><span class="nf">AsSpan</span><span class="p">().</span><span class="nf">Slice</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="n">Text</span><span class="p">.</span><span class="n">Length</span> <span class="p">/</span> <span class="m">2</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.8, OS=Windows 10 Redstone 1 (10.0.14393)
Processor=Intel Core i7-6600U CPU 2.60GHz (Skylake), ProcessorCount=4
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0
  Job-NJYLUU : Clr 4.0.30319.42000, 64bit RyuJIT-v4.7.2053.0

Force=False  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>CharactersCount</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">StdDev</th>
        <th style="text-align: right">Scaled</th>
        <th style="text-align: right">Gen 0</th>
        <th style="text-align: right">Allocated</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Substring</td>
        <td>10</td>
        <td style="text-align: right"><strong>8.277</strong> ns</td>
        <td style="text-align: right">0.1938 ns</td>
        <td style="text-align: right"><strong>4.54</strong></td>
        <td style="text-align: right"><strong>0.0191</strong></td>
        <td style="text-align: right"><strong>40 B</strong></td>
      </tr>
      <tr>
        <td>Slice</td>
        <td>10</td>
        <td style="text-align: right"><strong>1.822</strong> ns</td>
        <td style="text-align: right"><strong>0.0383</strong> ns</td>
        <td style="text-align: right">1.00</td>
        <td style="text-align: right"><strong>-</strong></td>
        <td style="text-align: right">0 B</td>
      </tr>
      <tr>
        <td>Substring</td>
        <td>1000</td>
        <td style="text-align: right"><strong>85.518</strong> ns</td>
        <td style="text-align: right">1.3474 ns</td>
        <td style="text-align: right"><strong>47.22</strong></td>
        <td style="text-align: right"><strong>0.4919</strong></td>
        <td style="text-align: right"><strong>1032 B</strong></td>
      </tr>
      <tr>
        <td>Slice</td>
        <td>1000</td>
        <td style="text-align: right"><strong>1.811</strong> ns</td>
        <td style="text-align: right"><strong>0.0205</strong> ns</td>
        <td style="text-align: right">1.00</td>
        <td style="text-align: right"><strong>-</strong></td>
        <td style="text-align: right">0 B</td>
      </tr>
    </tbody>
  </table>

</div>
<p>It’s clear that:</p>

<ul>
  <li><strong>Slicing does not allocate any managed heap memory.</strong> Substring allocates. (Allocated column)</li>
  <li>Slicing is much faster (Mean column)</li>
  <li><strong>Slicing has constant cost!</strong> Look at the values for Standard Deviation and Mean for CharactersCount = 10 and 1000!</li>
</ul>
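
<p>To make the first point concrete, here is a minimal sketch (not part of the benchmark above) that parses a number out of the middle of a string. The <code class="language-plaintext highlighter-rouge">SliceDemo</code> class and its method names are made up for illustration, and the span-based <code class="language-plaintext highlighter-rouge">int.Parse</code> overload assumes .NET Core 2.1 or newer.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class SliceDemo
{
    // Allocates a new string for the requested range before parsing it.
    public static int ParseWithSubstring(string text, int start, int length)
        =&gt; int.Parse(text.Substring(start, length));

    // Parses the same characters in place; no intermediate string is created.
    // Assumption: the int.Parse(ReadOnlySpan&lt;char&gt;) overload is available (.NET Core 2.1+).
    public static int ParseWithSlice(string text, int start, int length)
        =&gt; int.Parse(text.AsSpan().Slice(start, length));
}
</code></pre></div></div>

<p>Both <code class="language-plaintext highlighter-rouge">SliceDemo.ParseWithSubstring("id=12345;", 3, 5)</code> and <code class="language-plaintext highlighter-rouge">SliceDemo.ParseWithSlice("id=12345;", 3, 5)</code> return 12345, but only the first one allocates a temporary string.</p>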

<h2 id="slow-vs-fast-span">Slow vs Fast Span</h2>

<p>Some people call the 3-field Span “slow Span” and the 2-field “fast Span”. BenchmarkDotNet allows running the same benchmark for multiple runtimes. Let’s use it and compare the indexer for .NET 4.6, .NET Core 1.1 and .NET Core 2.0.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="nf">Config</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">MultipleRuntimesConfig</span><span class="p">))]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">SpanIndexer</span>
<span class="p">{</span>
    <span class="k">protected</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">Loops</span> <span class="p">=</span> <span class="m">100</span><span class="p">;</span>
    <span class="k">protected</span> <span class="k">const</span> <span class="kt">int</span> <span class="n">Count</span> <span class="p">=</span> <span class="m">1000</span><span class="p">;</span>

    <span class="k">protected</span> <span class="kt">byte</span><span class="p">[]</span> <span class="n">arrayField</span><span class="p">;</span>

    <span class="p">[</span><span class="n">GlobalSetup</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">Setup</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">arrayField</span> <span class="p">=</span> <span class="n">Enumerable</span><span class="p">.</span><span class="nf">Repeat</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="n">Count</span><span class="p">).</span><span class="nf">Select</span><span class="p">((</span><span class="n">val</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="p">(</span><span class="kt">byte</span><span class="p">)</span><span class="n">index</span><span class="p">).</span><span class="nf">ToArray</span><span class="p">();</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="n">Loops</span> <span class="p">*</span> <span class="n">Count</span><span class="p">)]</span>
    <span class="k">public</span> <span class="kt">byte</span> <span class="nf">SpanIndexer_Get</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">local</span> <span class="p">=</span> <span class="n">arrayField</span><span class="p">;</span> <span class="c1">// implicit cast to Span, we can't have Span as a field!</span>
        <span class="kt">byte</span> <span class="n">result</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">_</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">_</span> <span class="p">&lt;</span> <span class="n">Loops</span><span class="p">;</span> <span class="n">_</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">j</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">j</span><span class="p">++)</span>
            <span class="p">{</span>
                <span class="n">result</span> <span class="p">=</span> <span class="n">local</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="n">Loops</span> <span class="p">*</span> <span class="n">Count</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">SpanIndexer_Set</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">local</span> <span class="p">=</span> <span class="n">arrayField</span><span class="p">;</span> <span class="c1">// implicit cast to Span, we can't have Span as a field!</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">_</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">_</span> <span class="p">&lt;</span> <span class="n">Loops</span><span class="p">;</span> <span class="n">_</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">j</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">j</span><span class="p">++)</span>
            <span class="p">{</span>
                <span class="n">local</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="p">=</span> <span class="kt">byte</span><span class="p">.</span><span class="n">MaxValue</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">class</span> <span class="nc">MultipleRuntimesConfig</span> <span class="p">:</span> <span class="n">ManualConfig</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="nf">MultipleRuntimesConfig</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">Default</span>
                <span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">CsProjClassicNetToolchain</span><span class="p">.</span><span class="n">Net46</span><span class="p">)</span> <span class="c1">// Span NOT supported by Runtime</span>
                <span class="p">.</span><span class="nf">WithId</span><span class="p">(</span><span class="s">".NET 4.6"</span><span class="p">));</span>

        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">Default</span>
                <span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">CsProjCoreToolchain</span><span class="p">.</span><span class="n">NetCoreApp11</span><span class="p">)</span> <span class="c1">// Span NOT supported by Runtime</span>
                <span class="p">.</span><span class="nf">WithId</span><span class="p">(</span><span class="s">".NET Core 1.1"</span><span class="p">));</span>

        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span><span class="p">.</span><span class="n">Default</span>
                <span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">CsProjCoreToolchain</span><span class="p">.</span><span class="n">NetCoreApp20</span><span class="p">)</span> <span class="c1">// Span SUPPORTED by Runtime</span>
                <span class="p">.</span><span class="nf">WithId</span><span class="p">(</span><span class="s">".NET Core 2.0"</span><span class="p">));</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.10.8, OS=Windows 10 Redstone 1 (10.0.14393)</span>
<span class="py">Processor</span><span class="p">=</span><span class="s">Intel Core i7-6600U CPU 2.60GHz (Skylake), ProcessorCount=4</span>
<span class="py">Frequency</span><span class="p">=</span><span class="s">2742189 Hz, Resolution=364.6722 ns, Timer=TSC</span>
  <span class="nn">[Host]</span>        <span class="err">:</span> <span class="err">Clr</span> <span class="err">4.0.30319.42000,</span> <span class="err">64bit</span> <span class="err">RyuJIT-v4.7.2053.0</span>
  <span class="err">.NET</span> <span class="err">4.6</span>      <span class="err">:</span> <span class="err">Clr</span> <span class="err">4.0.30319.42000,</span> <span class="err">64bit</span> <span class="err">RyuJIT-v4.7.2053.0</span>
  <span class="err">.NET</span> <span class="err">Core</span> <span class="err">1.1</span> <span class="err">:</span> <span class="err">.NET</span> <span class="err">Core</span> <span class="err">4.6.25211.01,</span> <span class="err">64bit</span> <span class="err">RyuJIT</span>
  <span class="err">.NET</span> <span class="err">Core</span> <span class="err">2.0</span> <span class="err">:</span> <span class="err">.NET</span> <span class="err">Core</span> <span class="err">4.6.25302.01,</span> <span class="err">64bit</span> <span class="err">RyuJIT</span>
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Job</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>SpanIndexer_Get</td>
        <td>.NET 4.6</td>
        <td style="text-align: right">0.6054 ns</td>
        <td style="text-align: right">0.0007 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Get</td>
        <td>.NET Core 1.1</td>
        <td style="text-align: right">0.6047 ns</td>
        <td style="text-align: right">0.0008 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Get</td>
        <td>.NET Core 2.0</td>
        <td style="text-align: right"><strong>0.5333 ns</strong></td>
        <td style="text-align: right">0.0006 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Set</td>
        <td>.NET 4.6</td>
        <td style="text-align: right">0.6059 ns</td>
        <td style="text-align: right">0.0007 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Set</td>
        <td>.NET Core 1.1</td>
        <td style="text-align: right">0.6042 ns</td>
        <td style="text-align: right">0.0002 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Set</td>
        <td>.NET Core 2.0</td>
        <td style="text-align: right"><strong>0.5205 ns</strong></td>
        <td style="text-align: right">0.0003 ns</td>
      </tr>
    </tbody>
  </table>

</div>
<p>The difference is around 12-14%. In my opinion, it proves that people should not be afraid of using the “slow” Span for existing runtimes. But there is still room for further improvement in the new runtimes, so the gap might get bigger soon.</p>

<p><strong>Note:</strong> Please keep in mind that this benchmark is not a perfect indexer benchmark. It relies heavily on the CPU cache. I am not sure if it is even possible to design a perfect benchmark for the indexer.</p>

<h2 id="span-vs-array">Span vs Array</h2>

<p>When we take a look at the official <a href="https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md#requirements">requirements</a> for Span we can find:</p>

<blockquote>
  <p>“Performance characteristics on par with arrays.”</p>
</blockquote>

<p>Let’s measure it ;)</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">SpanVsArray_Indexer</span> <span class="p">:</span> <span class="n">SpanIndexer</span>
<span class="p">{</span>
    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="n">Loops</span> <span class="p">*</span> <span class="n">Count</span><span class="p">)]</span>
    <span class="k">public</span> <span class="kt">byte</span> <span class="nf">ArrayIndexer_Get</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">local</span> <span class="p">=</span> <span class="n">arrayField</span><span class="p">;</span>
        <span class="kt">byte</span> <span class="n">result</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">_</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">_</span> <span class="p">&lt;</span> <span class="n">Loops</span><span class="p">;</span> <span class="n">_</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">j</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">j</span><span class="p">++)</span>
            <span class="p">{</span>
                <span class="n">result</span> <span class="p">=</span> <span class="n">local</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="n">result</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="n">Loops</span> <span class="p">*</span> <span class="n">Count</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">ArrayIndexer_Set</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">local</span> <span class="p">=</span> <span class="n">arrayField</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">_</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">_</span> <span class="p">&lt;</span> <span class="n">Loops</span><span class="p">;</span> <span class="n">_</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">j</span> <span class="p">&lt;</span> <span class="n">local</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">j</span><span class="p">++)</span>
            <span class="p">{</span>
                <span class="n">local</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="p">=</span> <span class="kt">byte</span><span class="p">.</span><span class="n">MaxValue</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.10.8, OS=Windows 10 Redstone 1 (10.0.14393)</span>
<span class="py">Processor</span><span class="p">=</span><span class="s">Intel Core i7-6600U CPU 2.60GHz (Skylake), ProcessorCount=4</span>
<span class="py">Frequency</span><span class="p">=</span><span class="s">2742189 Hz, Resolution=364.6722 ns, Timer=TSC</span>
  <span class="nn">[Host]</span>        <span class="err">:</span> <span class="err">Clr</span> <span class="err">4.0.30319.42000,</span> <span class="err">64bit</span> <span class="err">RyuJIT-v4.7.2053.0</span>
  <span class="err">.NET</span> <span class="err">4.6</span>      <span class="err">:</span> <span class="err">Clr</span> <span class="err">4.0.30319.42000,</span> <span class="err">64bit</span> <span class="err">RyuJIT-v4.7.2053.0</span>
  <span class="err">.NET</span> <span class="err">Core</span> <span class="err">1.1</span> <span class="err">:</span> <span class="err">.NET</span> <span class="err">Core</span> <span class="err">4.6.25211.01,</span> <span class="err">64bit</span> <span class="err">RyuJIT</span>
  <span class="err">.NET</span> <span class="err">Core</span> <span class="err">2.0</span> <span class="err">:</span> <span class="err">.NET</span> <span class="err">Core</span> <span class="err">4.6.25302.01,</span> <span class="err">64bit</span> <span class="err">RyuJIT</span>
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Job</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>ArrayIndexer_Get</td>
        <td>.NET 4.6</td>
        <td style="text-align: right">0.5499 ns</td>
        <td style="text-align: right">0.0009 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Get</td>
        <td>.NET 4.6</td>
        <td style="text-align: right">0.6073 ns</td>
        <td style="text-align: right">0.0016 ns</td>
      </tr>
      <tr>
        <td>ArrayIndexer_Get</td>
        <td>.NET Core 1.1</td>
        <td style="text-align: right">0.5455 ns</td>
        <td style="text-align: right">0.0006 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Get</td>
        <td>.NET Core 1.1</td>
        <td style="text-align: right">0.6062 ns</td>
        <td style="text-align: right">0.0008 ns</td>
      </tr>
      <tr>
        <td>ArrayIndexer_Get</td>
        <td>.NET Core 2.0</td>
        <td style="text-align: right">0.5401 ns</td>
        <td style="text-align: right">0.0019 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Get</td>
        <td>.NET Core 2.0</td>
        <td style="text-align: right">0.5357 ns</td>
        <td style="text-align: right">0.0010 ns</td>
      </tr>
      <tr>
        <td>ArrayIndexer_Set</td>
        <td>.NET 4.6</td>
        <td style="text-align: right">0.5028 ns</td>
        <td style="text-align: right">0.0010 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Set</td>
        <td>.NET 4.6</td>
        <td style="text-align: right">0.6057 ns</td>
        <td style="text-align: right">0.0005 ns</td>
      </tr>
      <tr>
        <td>ArrayIndexer_Set</td>
        <td>.NET Core 1.1</td>
        <td style="text-align: right">0.5074 ns</td>
        <td style="text-align: right">0.0013 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Set</td>
        <td>.NET Core 1.1</td>
        <td style="text-align: right">0.6056 ns</td>
        <td style="text-align: right">0.0008 ns</td>
      </tr>
      <tr>
        <td>ArrayIndexer_Set</td>
        <td>.NET Core 2.0</td>
        <td style="text-align: right">0.5069 ns</td>
        <td style="text-align: right">0.0010 ns</td>
      </tr>
      <tr>
        <td>SpanIndexer_Set</td>
        <td>.NET Core 2.0</td>
        <td style="text-align: right">0.5219 ns</td>
        <td style="text-align: right">0.0005 ns</td>
      </tr>
    </tbody>
  </table>

</div>
<p>The requirement is met only for the new runtime with native Span support, .NET Core 2.0. Personally, I don’t believe that the existing runtimes can ever meet this requirement. There is no way to add features like array bounds check elimination to the existing runtimes.</p>

<h1 id="the-limitations">The Limitations</h1>

<p>Span supports any kind of memory. It means that it should have the same restrictions as the most demanding type of memory.</p>

<p>In our case, it’s stack memory. A pointer to the stack must not be stored on the managed heap. When the method ends, the stack gets unwound and the pointer becomes invalid. If we somehow store it on the heap, bad things are going to happen. Anything else that contains a pointer to the stack also must not be stored on the managed heap. It means that Span must not be stored on the managed heap.</p>

<p>Moreover, Span, as a value type with more than one field, suffers from <a href="https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md#struct-tearing">Struct Tearing</a>. Span is supposed to be very fast, so we cannot solve the struct tearing issue by introducing access synchronization.</p>
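
<p>To illustrate what tearing means, here is a rough sketch with a hypothetical two-field struct (<code class="language-plaintext highlighter-rouge">Pair</code> is not a real BCL type, it only stands in for a pointer and length pair). A writer thread keeps storing values whose two fields are equal, while a reader copies the struct without any synchronization. Whether a torn read is actually observed depends on the hardware, the JIT and timing, so the method may well return 0.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Threading.Tasks;

public static class TearingDemo
{
    // Two fields that are always written together, like a pointer and a length.
    private struct Pair
    {
        public long A;
        public long B;
    }

    private static Pair shared; // deliberately unsynchronized

    // Returns how many times the reader saw A != B.
    // Any non-zero value means a half-updated (torn) struct was observed.
    public static int CountTornReads(int iterations)
    {
        int torn = 0;

        Task writer = Task.Run(() =&gt;
        {
            for (int i = 0; i &lt; iterations; i++)
            {
                // Copying a multi-field struct is not guaranteed to be atomic.
                shared = new Pair { A = i, B = i };
            }
        });

        Task reader = Task.Run(() =&gt;
        {
            for (int i = 0; i &lt; iterations; i++)
            {
                Pair copy = shared; // may see the new A together with the old B
                if (copy.A != copy.B)
                {
                    torn++; // only this task writes to torn
                }
            }
        });

        Task.WaitAll(writer, reader);
        return torn;
    }
}
</code></pre></div></div>

<p>For a Span, the equivalent of <code class="language-plaintext highlighter-rouge">A != B</code> would be a pointer that no longer matches its length, which is why keeping Span on a single thread’s stack matters.</p>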

<h2 id="stack-only">Stack-only</h2>

<p>Span is a stack-only type:</p>

<ul>
  <li>Span instances can reside only on the stack</li>
  <li>Stacks are not shared across multiple threads, so a single stack is accessed by only one thread at a time. It ensures thread safety for Span!</li>
  <li>Stacks are short-lived, which means that the GC will track fewer pointers. If we let Spans live long (on the heap), we could end up in a situation where Span creates a big overhead for the GC.</li>
</ul>
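
<p>A minimal sketch of what stack-only means in practice, assuming C# 7.2 or newer so that <code class="language-plaintext highlighter-rouge">stackalloc</code> can be assigned to a Span in safe code. The <code class="language-plaintext highlighter-rouge">StackDemo</code> class is made up for this illustration.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class StackDemo
{
    // The span points into this method's stack frame, so it is used only
    // while the frame is alive. Returning `buffer` instead of `sum` is
    // rejected by the compiler under the C# 7.2+ span safety rules.
    public static int SumOfSquares(int count)
    {
        // Keep count small: stack space is limited.
        Span&lt;int&gt; buffer = stackalloc int[count];
        for (int i = 0; i &lt; buffer.Length; i++)
        {
            buffer[i] = i * i;
        }

        int sum = 0;
        for (int i = 0; i &lt; buffer.Length; i++)
        {
            sum += buffer[i];
        }
        return sum;
    }
}
</code></pre></div></div>

<p>For example, <code class="language-plaintext highlighter-rouge">StackDemo.SumOfSquares(4)</code> returns 14 (0 + 1 + 4 + 9) without touching the managed heap.</p>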

<h2 id="no-heap">No Heap</h2>

<p>Because Span is a stack-only type, it must not be stored on the heap. This leads us to a long list of limitations.</p>

<h3 id="span-must-not-be-a-field-in-non-stackonly-type">Span must not be a field in a non-stack-only type</h3>

<p>If you make Span a field in a class it will be stored on the heap. This is prohibited!</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Impossible</span>
<span class="p">{</span>
    <span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">field</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <a href="https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.2/span-safety.md#generalized-ref-like-types-in-source-code">C# 7.2</a> it is possible to have a Span field in another stack-only type.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ref</span> <span class="k">struct</span> <span class="nc">TwoSpans</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span>
<span class="p">{</span>
	<span class="c1">// can have ref-like instance fields</span>
	<span class="k">public</span> <span class="n">Span</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">first</span><span class="p">;</span>
	<span class="k">public</span> <span class="n">Span</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">second</span><span class="p">;</span>
<span class="p">}</span> 
</code></pre></div></div>

<h3 id="span-must-not-implement-any-existing-interface">Span must not implement any existing interface</h3>

<p>Let’s consider the following C# and IL code:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">void</span> <span class="n">NonConstrained</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">IEnumerable</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">collection</span><span class="p">)</span>
<span class="k">struct</span> <span class="nc">SomeValueType</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="p">:</span> <span class="n">IEnumerable</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="p">{</span> <span class="p">}</span>

<span class="k">void</span> <span class="nf">Demo</span><span class="p">()</span>
<span class="p">{</span>
    <span class="kt">var</span> <span class="k">value</span> <span class="p">=</span> <span class="k">new</span> <span class="n">SomeValueType</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;();</span>
    <span class="nf">NonConstrained</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p class="center"><img src="/images/span/boxing.png" alt="Boxing" /></p>

<p>The value got boxed, which means it is stored on the heap, and that is prohibited for Span. You can read more about boxing in my <a href="https://adamsitnik.com/Value-Types-vs-Reference-Types/#boxing">previous</a> blog post.</p>

<p>The point is that, to prevent boxing, Span must not implement any existing interface like <code class="language-plaintext highlighter-rouge">IEnumerable</code>. If in the future C# allows defining an interface that can be implemented only by a <code class="language-plaintext highlighter-rouge">struct</code>, then it might become possible.</p>

<h3 id="span-must-not-be-a-parameter-for-async-method">Span must not be a parameter of an async method</h3>

<p><code class="language-plaintext highlighter-rouge">async</code> and <code class="language-plaintext highlighter-rouge">await</code> are awesome C# features. They make our life easier by solving a lot of problems and hiding a lot of complexity from us.</p>

<p>But whenever <code class="language-plaintext highlighter-rouge">async</code> &amp; <code class="language-plaintext highlighter-rouge">await</code> are used, an AsyncMethodBuilder is created. The builder creates an asynchronous state machine, which at some point might put the parameters of the method on the <strong>heap</strong>. This is why Span must not be an argument for an async method.</p>
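
<p>One common way to live with this restriction, sketched here under the assumption that the data sits in a regular array: the async method takes the array (which may safely live on the heap) and creates a Span only after the <code class="language-plaintext highlighter-rouge">await</code>, for a purely synchronous call. <code class="language-plaintext highlighter-rouge">AsyncDemo</code> and its methods are hypothetical names.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.IO;
using System.Threading.Tasks;

public static class AsyncDemo
{
    // Synchronous helper: a Span parameter is fine here.
    private static int Sum(ReadOnlySpan&lt;byte&gt; bytes)
    {
        int sum = 0;
        for (int i = 0; i &lt; bytes.Length; i++)
        {
            sum += bytes[i];
        }
        return sum;
    }

    // The async method accepts byte[] instead of Span&lt;byte&gt;.
    // The Span exists only inside the synchronous Sum call after the await,
    // so the state machine never has to store it.
    public static async Task&lt;int&gt; ReadAndSumAsync(Stream stream, byte[] buffer)
    {
        int read = await stream.ReadAsync(buffer, 0, buffer.Length);
        return Sum(buffer.AsSpan(0, read));
    }
}
</code></pre></div></div>

<p>Called with a <code class="language-plaintext highlighter-rouge">MemoryStream</code> over the bytes 1, 2 and 3 and an 8-byte buffer, <code class="language-plaintext highlighter-rouge">ReadAndSumAsync</code> completes with 6.</p>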

<h3 id="span-must-not-be-a-generic-type-argument">Span must not be a generic type argument</h3>

<p>Let’s consider the following C# code:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="nf">Allocate</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="k">new</span> <span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;(</span><span class="k">new</span> <span class="kt">byte</span><span class="p">[</span><span class="m">256</span><span class="p">]);</span>

<span class="k">void</span> <span class="n">CallAndPrint</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">Func</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">valueProvider</span><span class="p">)</span> <span class="c1">// no generic requirements for T</span>
<span class="p">{</span>
    <span class="kt">object</span> <span class="k">value</span> <span class="p">=</span> <span class="n">valueProvider</span><span class="p">.</span><span class="nf">Invoke</span><span class="p">();</span> <span class="c1">// boxing!</span>

    <span class="n">Console</span><span class="p">.</span><span class="nf">WriteLine</span><span class="p">(</span><span class="k">value</span><span class="p">.</span><span class="nf">ToString</span><span class="p">());</span>
<span class="p">}</span>

<span class="k">void</span> <span class="nf">Demo</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">Func</span><span class="p">&lt;</span><span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;&gt;</span> <span class="n">spanProvider</span> <span class="p">=</span> <span class="n">Allocate</span><span class="p">;</span>
    <span class="n">CallAndPrint</span><span class="p">&lt;</span><span class="n">Span</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;&gt;(</span><span class="n">spanProvider</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As you can see, the non-boxing requirement cannot be ensured today if we allow Span to be a generic type argument. One possible solution would be to introduce a new generic constraint: <code class="language-plaintext highlighter-rouge">stackonly</code>. But then all the managed compilers would have to respect it and ensure the lack of boxing and the other restrictions. This is why it was decided to simply forbid using Span as a generic argument.</p>

<p>Initially, this requirement was verified at runtime by .NET Core 2.0. Since C# 7.2 it’s also enforced by the compiler at compile time.</p>
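
<p>A minimal sketch of one way to keep the idea of the example above without generics: a custom delegate type can mention Span in its signature, because Span is then not used as a generic type argument. The names <code class="language-plaintext highlighter-rouge">SpanDelegateDemo</code>, <code class="language-plaintext highlighter-rouge">SpanProvider</code> and <code class="language-plaintext highlighter-rouge">CallAndMeasure</code> are hypothetical and exist only for illustration.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class SpanDelegateDemo
{
    // Hypothetical delegate: Span appears in the signature,
    // but it is not a generic type argument of the delegate type.
    public delegate Span&lt;byte&gt; SpanProvider();

    public static Span&lt;byte&gt; Allocate() =&gt; new Span&lt;byte&gt;(new byte[256]);

    public static int CallAndMeasure(SpanProvider provider)
    {
        Span&lt;byte&gt; span = provider(); // stored in a local of type Span, so nothing is boxed
        return span.Length;
    }
}
</code></pre></div></div>

<p>Calling <code class="language-plaintext highlighter-rouge">SpanDelegateDemo.CallAndMeasure(SpanDelegateDemo.Allocate)</code> hands the Span from the provider to the caller without it ever being converted to <code class="language-plaintext highlighter-rouge">object</code>.</p>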

<h2 id="memory">Memory</h2>

<p>Memory is a new type which can point only to managed memory, so it does not have the stack-only limitation. It can be created out of a managed array, string or <code class="language-plaintext highlighter-rouge">IOwnedMemory</code>, passed to async methods or stored in a field of a class. When you need a Span, you just read the <code class="language-plaintext highlighter-rouge">.Span</code> property, which creates the Span on demand. Then you use it within the current scope.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">readonly</span> <span class="k">struct</span> <span class="nc">Memory</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span>
<span class="p">{</span>
    <span class="k">private</span> <span class="k">readonly</span> <span class="kt">object</span> <span class="n">_object</span><span class="p">;</span> <span class="c1">// String, Array or OwnedMemory</span>
    <span class="k">private</span> <span class="k">readonly</span> <span class="kt">int</span> <span class="n">_index</span><span class="p">;</span>
    <span class="k">private</span> <span class="k">readonly</span> <span class="kt">int</span> <span class="n">_length</span><span class="p">;</span>

    <span class="k">public</span> <span class="n">Span</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="n">Span</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="p">}</span>

    <span class="k">public</span> <span class="n">Memory</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="nf">Slice</span><span class="p">(</span><span class="kt">int</span> <span class="n">start</span><span class="p">)</span>
    <span class="k">public</span> <span class="n">Memory</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;</span> <span class="nf">Slice</span><span class="p">(</span><span class="kt">int</span> <span class="n">start</span><span class="p">,</span> <span class="kt">int</span> <span class="n">length</span><span class="p">)</span>
    <span class="k">public</span> <span class="n">MemoryHandle</span> <span class="nf">Pin</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Sample usage:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">byte</span><span class="p">[]</span> <span class="n">buffer</span> <span class="p">=</span> <span class="n">ArrayPool</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;.</span><span class="n">Shared</span><span class="p">.</span><span class="nf">Rent</span><span class="p">(</span><span class="m">16000</span> <span class="p">*</span> <span class="m">8</span><span class="p">);</span> <span class="c1">// we use an Array here, not Span</span>

<span class="k">while</span> <span class="p">((</span><span class="n">bytesRead</span> <span class="p">=</span> <span class="k">await</span> <span class="n">fileStream</span><span class="p">.</span><span class="nf">ReadAsync</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="n">buffer</span><span class="p">.</span><span class="n">Length</span><span class="p">))</span> <span class="p">&gt;</span> <span class="m">0</span><span class="p">)</span> <span class="c1">// AWAIT! writing to array</span>
<span class="p">{</span>
    <span class="nf">ParseBlock</span><span class="p">(</span><span class="k">new</span> <span class="n">ReadOnlyMemory</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;(</span><span class="n">buffer</span><span class="p">,</span> <span class="n">start</span><span class="p">:</span> <span class="m">0</span><span class="p">,</span> <span class="n">length</span><span class="p">:</span> <span class="n">bytesRead</span><span class="p">));</span> <span class="c1">// creating ReadOnlyMemory which points to managed array</span>
<span class="p">}</span>

<span class="k">void</span> <span class="nf">ParseBlock</span><span class="p">(</span><span class="n">ReadOnlyMemory</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">memory</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">ReadOnlySpan</span><span class="p">&lt;</span><span class="kt">byte</span><span class="p">&gt;</span> <span class="n">slice</span> <span class="p">=</span> <span class="n">memory</span><span class="p">.</span><span class="n">Span</span><span class="p">;</span> <span class="c1">// using Span from here</span>
<span class="p">}</span>
</code></pre></div></div>
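
<p>Because Memory is not stack-only, it can also live in a field of a class. The following is a minimal sketch under that assumption; the <code class="language-plaintext highlighter-rouge">BufferedParser</code> class and its members are hypothetical names used only for illustration.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public sealed class BufferedParser
{
    // Storing Memory in a field is allowed; storing Span here would not compile.
    private readonly Memory&lt;byte&gt; _buffer;

    public BufferedParser(byte[] array) =&gt; _buffer = array; // implicit conversion from a managed array

    public void Fill(byte value) =&gt; _buffer.Span.Fill(value); // Span is created on demand

    public int SumFirst(int count)
    {
        // Slicing Memory does not copy the data; the Span is used only within this scope.
        ReadOnlySpan&lt;byte&gt; slice = _buffer.Slice(0, count).Span;

        int sum = 0;
        foreach (byte b in slice)
        {
            sum += b;
        }
        return sum;
    }
}
</code></pre></div></div>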

<h1 id="summary">Summary</h1>

<ul>
  <li>Span allows you to work with any type of memory.</li>
  <li>System.Memory package, C# 7.2.</li>
  <li>It makes working with native memory much easier.</li>
  <li>Simple abstraction over Pointer Arithmetic.</li>
  <li>Avoid allocation and copying of memory with Slicing.</li>
  <li>Supports .NET Standard 1.1+</li>
  <li>Its performance is on par with Array for new runtimes.</li>
  <li>It’s limited due to the stack-only requirements.</li>
</ul>

<h1 id="sources">Sources</h1>

<ul>
  <li><a href="https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md">Span design document</a></li>
  <li><a href="https://github.com/dotnet/csharplang/blob/master/proposals/csharp-7.2/span-safety.md">Compile time enforcement of safety for ref-like types</a> C# 7.2 feature description by Vladimir Sadov</li>
  <li><a href="https://docs.microsoft.com/en-us/dotnet/standard/net-standard">.NET Standard</a> article by MSDN</li>
  <li><a href="https://github.com/dotnet/coreclr/issues/5851">Span</a> issue in coreclr repo</li>
  <li><a href="https://github.com/dotnet/csharplang/issues/666">Span</a> issue in C# language repo</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[tl;dr Use Span to work with ANY kind of memory in a safe and very efficient way. Simplify your APIs and use the full power of unmanaged memory! Contents Introduction Introduction C# gives us great flexibility when it comes to using different kinds of memory. But the majority of the developers use only the managed one. Let’s take a brief look at what C# has to offer for us: Stack memory - allocated on the Stack with the stackalloc keyword. Very fast allocation and deallocation. The size of the Stack is very small (usually &lt; 1 MB) and fits well into CPU cache. But when you try to allocate more, you get StackOverflowException which can not be handled and immediately kills the entire process. Usage is also limited by the very short lifetime of the stack - when the method ends, the stack gets unwinded together with its memory. Stackalloc is commonly used for short operations that must not allocate any managed memory. An example is very fast logging of ETW events in corefx: it has to be as fast as possible and needs very little of memory (so the size limitation is not a problem). internal unsafe void BufferRented(int bufferId, int bufferSize, int poolId, int bucketId) { EventData* payload = stackalloc EventData[4]; payload[0].Size = sizeof(int); payload[0].DataPointer = ((IntPtr)(&amp;bufferId)); payload[1].Size = sizeof(int); payload[1].DataPointer = ((IntPtr)(&amp;bufferSize)); payload[2].Size = sizeof(int); payload[2].DataPointer = ((IntPtr)(&amp;poolId)); payload[3].Size = sizeof(int); payload[3].DataPointer = ((IntPtr)(&amp;bucketId)); WriteEventCore(1, 4, payload); } Unmanaged memory - allocated on the unmanaged heap (invisible to GC) by calling Marshal.AllocHGlobal or Marshal.AllocCoTaskMem methods. This memory must be released by the developer with an explicit call to Marshal.FreeHGlobal or Marshal.FreeCoTaskMem. By using it we don’t add any extra pressure for the GC. 
It’s most commonly used to avoid GC in scenarios where you would normally allocate huge arrays of value types without pointers. Here you can see some real-life use cases from Kestrel. Managed memory - We can allocate it with the new operator. It’s called managed because it’s managed by the Garbage Collector (GC). GC decides when to free the memory, the developer doesn’t need to worry about it. As described in one of my previous blog posts, the GC divides managed objects into two categories: Small objects (size &lt; 85 000 bytes) - allocated in the generational part of the managed heap. The allocation of small objects is fast. When they are promoted to older generations, their memory is usually being copied. The deallocation is non-deterministic and blocking. Short-lived objects are cleaned up in the very fast Gen 0 (or Gen 1) collection. The long living ones are subject of the Gen 2 collection, which usually is very time-consuming. Large objects (size &gt;= 85 000 bytes) - allocated in the Large Object Heap (LOH). Managed with the free list algorithm, which offers slower allocation and can lead to memory fragmentation. The advantage is that large objects are by default never copied. This behavior can be changed on demand. LOH has very expensive deallocation (Full GC) which can be minimized by using ArrayPool.]]></summary></entry><entry><title type="html">ref returns and ref locals</title><link href="https://adamsitnik.com/ref-returns-and-ref-locals/" rel="alternate" type="text/html" title="ref returns and ref locals" /><published>2017-07-04T00:00:00+00:00</published><updated>2017-07-04T00:00:00+00:00</updated><id>https://adamsitnik.com/ref-returns-and-ref-locals</id><content type="html" xml:base="https://adamsitnik.com/ref-returns-and-ref-locals/"><![CDATA[<p>tl;dr Pass and return by reference to avoid large struct copying. It’s type and memory safe. It can be even <strong>faster</strong> than <code class="language-plaintext highlighter-rouge">unsafe</code>!</p>

<h2 id="introduction">Introduction</h2>

<p>Since C# 1.0 we have been able to pass arguments to methods by reference. It means that instead of copying value types every time we pass them to a method, we can just pass them by reference. It allows us to overcome one of the very few disadvantages of value types, which I described in my <a href="https://adamsitnik.com/Value-Types-vs-Reference-Types/">previous</a> blog post “Value Types vs Reference Types”.</p>

<p>Passing is not enough to cover all scenarios. C# 7.0 adds new possibilities: declaring references to local variables and returning by reference from methods.</p>

<p><strong>Note:</strong> I want to focus on the performance aspect here. If you want to learn more about <code class="language-plaintext highlighter-rouge">ref returns and ref locals</code> you should read these awesome <a href="https://mustoverride.com/tags/#refs">blog posts</a> from Vladimir Sadov. He is the software engineer who implemented this feature for the C# compiler. So you can get it straight from the horse’s mouth!</p>

<h3 id="reminder">Reminder</h3>

<p>Let’s analyse some simple C# examples to make sure that we have a good common understanding of the syntax.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">void method(ref int argument)</code> - The argument is passed to the method by <strong>ref</strong>erence.
    <div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">localVariable</span> <span class="p">=</span> <span class="m">123</span><span class="p">;</span>
<span class="k">ref</span> <span class="kt">int</span> <span class="n">localReference</span> <span class="p">=</span> <span class="k">ref</span> <span class="n">localVariable</span><span class="p">;</span>
</code></pre></div>    </div>
  </li>
  <li><code class="language-plaintext highlighter-rouge">ref localVariable</code> - Taking a <strong>ref</strong>erence to a local variable. If you have a C++ background, you can loosely think of it as <code class="language-plaintext highlighter-rouge">&amp;localVariable</code></li>
  <li><code class="language-plaintext highlighter-rouge">ref int localReference</code> - Defining a local <strong>ref</strong>erence. <code class="language-plaintext highlighter-rouge">localReference</code> is an alias of an existing variable.</li>
  <li><code class="language-plaintext highlighter-rouge">ref array[0]</code> - Taking a <strong>ref</strong>erence to the array’s first element.</li>
  <li><code class="language-plaintext highlighter-rouge">ref int method()</code> - The result of the method is passed by <strong>ref</strong>erence. The method still returns an int. <a href="https://mustoverride.com/refs-not-ptrs/">Not a pointer!</a></li>
</ul>
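
<p>The syntax forms listed above can be put together in one small sketch. The <code class="language-plaintext highlighter-rouge">RefSyntaxDemo</code> class and its method names are hypothetical and serve only as an illustration.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public static class RefSyntaxDemo
{
    // The argument is passed by reference, so the caller sees the change.
    public static void Increment(ref int argument) =&gt; argument++;

    // The result is returned by reference: an alias to the first element, not a copy.
    public static ref int First(int[] array) =&gt; ref array[0];

    public static int Run()
    {
        int localVariable = 123;
        ref int localReference = ref localVariable; // alias of localVariable
        localReference = 124;                       // writes to localVariable
        Increment(ref localVariable);               // localVariable is now 125

        int[] numbers = { 1, 2, 3 };
        ref int first = ref First(numbers);         // alias of numbers[0]
        first = 10;                                 // writes to the array

        return localVariable + numbers[0];          // 125 + 10
    }
}
</code></pre></div></div>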

<!--more-->

<h2 id="passing-arguments-to-methods-by-reference">Passing arguments to methods by reference</h2>

<p>In C# all parameters are passed to methods by value by default. It means that a Value Type instance is copied every time we pass it to a method or return it from a method. The bigger the Value Type is, the more expensive it is to copy. For small value types, the JIT compiler might optimize the copying (inline the method, use registers for copying &amp; more).</p>
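
<p>The copy is not only a cost, it is also observable. A minimal sketch of the difference, with hypothetical names (<code class="language-plaintext highlighter-rouge">Point</code>, <code class="language-plaintext highlighter-rouge">CopySemanticsDemo</code>) used only for illustration:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

public struct Point
{
    public int X;
}

public static class CopySemanticsDemo
{
    // Receives a copy of the struct: the caller's value is not affected.
    public static void MoveByValue(Point point) { point.X += 1; }

    // Receives a reference to the caller's struct: the change is visible to the caller.
    public static void MoveByReference(ref Point point) { point.X += 1; }

    public static int ValueAfterByValueCall()
    {
        var point = new Point { X = 1 };
        MoveByValue(point);
        return point.X; // the original was not modified
    }

    public static int ValueAfterByReferenceCall()
    {
        var point = new Point { X = 1 };
        MoveByReference(ref point);
        return point.X; // the original was modified in place
    }
}
</code></pre></div></div>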

<p>We can pass arguments to methods by reference. It’s not a new feature; it has been part of the language since C# 1.0. Anyway, I am going to measure it to make sure that it actually improves the performance. Once again I am using <a href="https://benchmarkdotnet.org/">BenchmarkDotNet</a> for benchmarking.</p>

<h3 id="benchmarks">Benchmarks</h3>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">LegacyJitX86Job</span><span class="p">,</span> <span class="n">LegacyJitX64Job</span><span class="p">,</span> <span class="n">RyuJitX64Job</span><span class="p">]</span> <span class="c1">// run the benchmarks for all available jits</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">PassingByReference</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="nc">BigStruct</span>
    <span class="p">{</span>
        <span class="k">public</span> <span class="kt">int</span> <span class="n">Int1</span><span class="p">,</span> <span class="n">Int2</span><span class="p">,</span> <span class="n">Int3</span><span class="p">,</span> <span class="n">Int4</span><span class="p">,</span> <span class="n">Int5</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">private</span> <span class="n">BigStruct</span> <span class="n">field</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">BigStruct</span><span class="p">();</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="m">16</span><span class="p">,</span> <span class="n">Baseline</span> <span class="p">=</span> <span class="k">true</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">PassByValue</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">copy</span> <span class="p">=</span> <span class="n">field</span><span class="p">;</span> <span class="c1">// access the field only once to not influence the benchmark too much</span>
        <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span>
        <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span>
        <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span>
        <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="n">copy</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="m">16</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">PassByReference</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="k">ref</span> <span class="kt">var</span> <span class="n">local</span> <span class="p">=</span> <span class="k">ref</span> <span class="n">field</span><span class="p">;</span> <span class="c1">// access the field only once to not influence the benchmark too much</span>
        <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span>
        <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span>
        <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span>
        <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">local</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span>
    <span class="k">void</span> <span class="nf">Method</span><span class="p">(</span><span class="n">BigStruct</span> <span class="k">value</span><span class="p">)</span> <span class="p">{</span> <span class="p">}</span>

    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span>
    <span class="k">void</span> <span class="nf">Method</span><span class="p">(</span><span class="k">ref</span> <span class="n">BigStruct</span> <span class="k">value</span><span class="p">)</span> <span class="p">{</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="results">Results</h3>

<div class="language-ini highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">BenchmarkDotNet</span><span class="p">=</span><span class="s">v0.10.8, OS=Windows 8.1 (6.3.9600)</span>
<span class="py">Processor</span><span class="p">=</span><span class="s">Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8</span>
<span class="py">Frequency</span><span class="p">=</span><span class="s">2338342 Hz, Resolution=427.6534 ns, Timer=TSC</span>
  <span class="nn">[Host]</span>       <span class="err">:</span> <span class="err">Clr</span> <span class="err">4.0.30319.42000,</span> <span class="err">64bit</span> <span class="err">RyuJIT-v4.7.2053.0</span>
  <span class="err">LegacyJitX64</span> <span class="err">:</span> <span class="err">Clr</span> <span class="err">4.0.30319.42000,</span> <span class="err">64bit</span> <span class="err">LegacyJIT/clrjit-v4.7.2053.0</span><span class="c">;compatjit-v4.7.2053.0
</span>  <span class="err">LegacyJitX86</span> <span class="err">:</span> <span class="err">Clr</span> <span class="err">4.0.30319.42000,</span> <span class="err">32bit</span> <span class="err">LegacyJIT-v4.7.2053.0</span>
  <span class="err">RyuJitX64</span>    <span class="err">:</span> <span class="err">Clr</span> <span class="err">4.0.30319.42000,</span> <span class="err">64bit</span> <span class="err">RyuJIT-v4.7.2053.0</span>

<span class="py">Runtime</span><span class="p">=</span><span class="s">Clr  </span>
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Jit</th>
      <th>Platform</th>
      <th style="text-align: right">Mean</th>
      <th style="text-align: right">Scaled</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>PassByValue</td>
      <td>LegacyJit</td>
      <td>X86</td>
      <td style="text-align: right">2.868 ns</td>
      <td style="text-align: right">1.00</td>
    </tr>
    <tr>
      <td>PassByReference</td>
      <td>LegacyJit</td>
      <td>X86</td>
      <td style="text-align: right">1.434 ns</td>
      <td style="text-align: right"><strong>0.50</strong></td>
    </tr>
  </tbody>
</table>

<p>As you can see, the 32-bit JIT struggles with copying large value types. Passing them by reference gave us a 2x speedup in this scenario. But we have measured only the time required to pass an argument to a method. If the method itself is complex and time-consuming, the performance improvement might be very small. The smaller the method, the bigger the improvement you will see.</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Jit</th>
      <th>Platform</th>
      <th style="text-align: right">Mean</th>
      <th style="text-align: right">Scaled</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>PassByValue</td>
      <td>LegacyJit</td>
      <td>X64</td>
      <td style="text-align: right">2.062 ns</td>
      <td style="text-align: right">1.00</td>
    </tr>
    <tr>
      <td>PassByReference</td>
      <td>LegacyJit</td>
      <td>X64</td>
      <td style="text-align: right">1.470 ns</td>
      <td style="text-align: right"><strong>0.71</strong></td>
    </tr>
    <tr>
      <td>PassByValue</td>
      <td>RyuJit</td>
      <td>X64</td>
      <td style="text-align: right">2.098 ns</td>
      <td style="text-align: right">1.00</td>
    </tr>
    <tr>
      <td>PassByReference</td>
      <td>RyuJit</td>
      <td>X64</td>
      <td style="text-align: right">1.593 ns</td>
      <td style="text-align: right"><strong>0.76</strong></td>
    </tr>
  </tbody>
</table>

<p>For 64 bit the difference is smaller but still noticeable.</p>

<p>We can say that passing arguments by reference can bring you some benefits, but you should not expect a 10x improvement. The code also gets more complex. Measure your scenario, and use it only if you can prove that it brings a worthwhile performance improvement.</p>

<h2 id="local-references">Local references</h2>

<p>Using local references is another way to avoid copying of memory. Let’s try to initialize an array of large value types and see how fast we can get with <code class="language-plaintext highlighter-rouge">ref locals</code>.</p>

<h3 id="benchmarks-1">Benchmarks</h3>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">LegacyJitX86Job</span><span class="p">,</span> <span class="n">LegacyJitX64Job</span><span class="p">,</span> <span class="n">RyuJitX64Job</span><span class="p">]</span>
<span class="p">[</span><span class="n">RPlotExporter</span><span class="p">]</span> <span class="c1">// use R to get nice charts!</span>
<span class="p">[</span><span class="n">CsvMeasurementsExporter</span><span class="p">]</span> <span class="c1">// export the raw measurements to CSV, the input for the R plots</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">InitializingBigStructs</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="nc">BigStruct</span>
    <span class="p">{</span>
        <span class="k">public</span> <span class="kt">int</span> <span class="n">Int1</span><span class="p">,</span> <span class="n">Int2</span><span class="p">,</span> <span class="n">Int3</span><span class="p">,</span> <span class="n">Int4</span><span class="p">,</span> <span class="n">Int5</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">private</span> <span class="n">BigStruct</span><span class="p">[]</span> <span class="n">array</span><span class="p">;</span>

    <span class="p">[</span><span class="n">GlobalSetup</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">Setup</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">array</span> <span class="p">=</span> <span class="k">new</span> <span class="n">BigStruct</span><span class="p">[</span><span class="m">1000</span><span class="p">];</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">ByValue</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="n">BigStruct</span><span class="p">[]</span> <span class="n">variable</span> <span class="p">=</span> <span class="n">array</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">variable</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="n">BigStruct</span> <span class="k">value</span> <span class="p">=</span> <span class="n">variable</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="c1">// copy the value 1st time</span>

            <span class="k">value</span><span class="p">.</span><span class="n">Int1</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
            <span class="k">value</span><span class="p">.</span><span class="n">Int2</span> <span class="p">=</span> <span class="m">2</span><span class="p">;</span>
            <span class="k">value</span><span class="p">.</span><span class="n">Int3</span> <span class="p">=</span> <span class="m">3</span><span class="p">;</span>
            <span class="k">value</span><span class="p">.</span><span class="n">Int4</span> <span class="p">=</span> <span class="m">4</span><span class="p">;</span>
            <span class="k">value</span><span class="p">.</span><span class="n">Int5</span> <span class="p">=</span> <span class="m">5</span><span class="p">;</span>

            <span class="n">variable</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="p">=</span> <span class="k">value</span><span class="p">;</span> <span class="c1">// copy the value 2nd time</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">Baseline</span> <span class="p">=</span> <span class="k">true</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">ByReference</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="n">BigStruct</span><span class="p">[]</span> <span class="n">variable</span> <span class="p">=</span> <span class="n">array</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">variable</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="k">ref</span> <span class="n">BigStruct</span> <span class="n">reference</span> <span class="p">=</span> <span class="k">ref</span> <span class="n">variable</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> <span class="c1">// create local alias to array storage</span>

            <span class="n">reference</span><span class="p">.</span><span class="n">Int1</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
            <span class="n">reference</span><span class="p">.</span><span class="n">Int2</span> <span class="p">=</span> <span class="m">2</span><span class="p">;</span>
            <span class="n">reference</span><span class="p">.</span><span class="n">Int3</span> <span class="p">=</span> <span class="m">3</span><span class="p">;</span>
            <span class="n">reference</span><span class="p">.</span><span class="n">Int4</span> <span class="p">=</span> <span class="m">4</span><span class="p">;</span>
            <span class="n">reference</span><span class="p">.</span><span class="n">Int5</span> <span class="p">=</span> <span class="m">5</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Note:</strong> This scenario could have been handled without ref locals. We could simply pass the argument by reference like this:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">void</span> <span class="nf">ByReferenceOldWay</span><span class="p">()</span>
<span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">array</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
    <span class="p">{</span>
        <span class="nf">Init</span><span class="p">(</span><span class="k">ref</span> <span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// try it with: [MethodImpl(MethodImplOptions.NoInlining)]</span>
<span class="k">private</span> <span class="k">void</span> <span class="nf">Init</span><span class="p">(</span><span class="k">ref</span> <span class="n">BigStruct</span> <span class="n">reference</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">reference</span><span class="p">.</span><span class="n">Int1</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
    <span class="n">reference</span><span class="p">.</span><span class="n">Int2</span> <span class="p">=</span> <span class="m">2</span><span class="p">;</span>
    <span class="n">reference</span><span class="p">.</span><span class="n">Int3</span> <span class="p">=</span> <span class="m">3</span><span class="p">;</span>
    <span class="n">reference</span><span class="p">.</span><span class="n">Int4</span> <span class="p">=</span> <span class="m">4</span><span class="p">;</span>
    <span class="n">reference</span><span class="p">.</span><span class="n">Int5</span> <span class="p">=</span> <span class="m">5</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="results-1">Results</h3>

<p>This time I have used <code class="language-plaintext highlighter-rouge">RPlotExporter</code>, which produces some fancy charts if you have <code class="language-plaintext highlighter-rouge">R</code> installed and the <code class="language-plaintext highlighter-rouge">R_HOME</code> environment variable configured. More info can be obtained <a href="https://benchmarkdotnet.org/Configs/Exporters.htm#plots">here</a>.</p>

<p class="center"><img src="/images/references/InitializingBigStructs-barplot.png" alt="Initializing Big Structs" /></p>

<p>As you can see, the difference is HUGE. In this simple scenario, we got a 4.51x performance improvement for RyuJit, 5.58x for LegacyJitX64 and 5.05x for LegacyJitX86. It’s clear that this feature can be very useful in similar scenarios.</p>

<p>But how is it possible that the new <code class="language-plaintext highlighter-rouge">ref locals</code> feature works with the legacy Jits? Has Microsoft released a Windows patch with .NET Framework improvements? No! C# features like ref parameters, locals and returns just use an existing CLR feature called <a href="https://mustoverride.com/managed-refs-CLR/">managed pointers</a>. So to use them you just need an IDE with Roslyn 2.0 (Visual Studio 2017 or Rider), and you can deploy the code to your client’s old virtual machine, which might be running some very old .NET Framework ;)</p>

<h2 id="returning-references">Returning references</h2>

<p>Returning results by reference can bring us performance gains similar to passing arguments by reference. This is why I am not going to run any benchmarks here. The important thing is that ref returns help to make things like <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code> possible. I’ll describe <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code> in my next blog post. Stay tuned!</p>
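
<p>To make the syntax concrete, here is a minimal sketch of a ref return. <code class="language-plaintext highlighter-rouge">BigStructStorage</code>, its members and the array size are hypothetical names made up for this illustration, and the <code class="language-plaintext highlighter-rouge">BigStruct</code> below is just a stand-in with the five <code class="language-plaintext highlighter-rouge">int</code> fields used in the benchmarks:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// stand-in for the struct used in the benchmarks (assumption: five int fields)
public struct BigStruct
{
    public int Int1, Int2, Int3, Int4, Int5;
}

// hypothetical container, used only for this illustration
public class BigStructStorage
{
    private readonly BigStruct[] items = new BigStruct[16];

    // returns an alias to the array slot, so nothing is copied
    public ref BigStruct GetByReference(int index) =&gt; ref items[index];

    // returns a copy of the whole struct
    public BigStruct GetByValue(int index) =&gt; items[index];
}

public static class RefReturnDemo
{
    public static int Run()
    {
        var storage = new BigStructStorage();

        ref BigStruct first = ref storage.GetByReference(0);
        first.Int1 = 1; // writes directly to the array element

        BigStruct copy = storage.GetByValue(0);
        copy.Int1 = 2; // modifies only the local copy

        return storage.GetByValue(0).Int1; // still 1
    }
}
</code></pre></div></div>

<p>The caller has to repeat <code class="language-plaintext highlighter-rouge">ref</code> on both sides of the assignment. Without it, the result of <code class="language-plaintext highlighter-rouge">GetByReference</code> is copied into an ordinary local.</p>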

<h2 id="safety">Safety</h2>

<p>As you most probably know, C# allows us to use <code class="language-plaintext highlighter-rouge">unsafe</code> <code class="language-plaintext highlighter-rouge">C++</code>-like pointers. Unsafe code cannot pass IL verification, which is one of the CLR mechanisms that ensure type and memory safety (<a href="https://docs.microsoft.com/en-us/dotnet/framework/tools/peverify-exe-peverify-tool">PEVerify</a>). This is why code that uses <code class="language-plaintext highlighter-rouge">unsafe</code> requires <a href="https://stackoverflow.com/a/706578/5852046">FullTrust</a> to be executed. That might not be an option in some environments. I remember that a long time ago the default settings in Azure did not allow <code class="language-plaintext highlighter-rouge">unsafe</code> code to run. This is why a lot of common high-performance .NET libraries do not use <code class="language-plaintext highlighter-rouge">unsafe</code>. <code class="language-plaintext highlighter-rouge">mscorlib.dll</code> does use <code class="language-plaintext highlighter-rouge">unsafe</code>, but as an exception it is not verified at runtime ;)</p>

<p>Let’s compare the performance of safe vs unsafe in the scenario described previously.</p>

<h3 id="benchmarks-2">Benchmarks</h3>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
<span class="k">public</span> <span class="k">unsafe</span> <span class="k">void</span> <span class="nf">ByReferenceUnsafe</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">BigStruct</span><span class="p">[]</span> <span class="n">variable</span> <span class="p">=</span> <span class="n">array</span><span class="p">;</span>
    <span class="k">fixed</span> <span class="p">(</span><span class="n">BigStruct</span><span class="p">*</span> <span class="n">pinned</span> <span class="p">=</span> <span class="n">variable</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">variable</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="n">BigStruct</span><span class="p">*</span> <span class="n">pointer</span> <span class="p">=</span> <span class="p">&amp;</span><span class="n">pinned</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="p">(*</span><span class="n">pointer</span><span class="p">).</span><span class="n">Int1</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
            <span class="p">(*</span><span class="n">pointer</span><span class="p">).</span><span class="n">Int2</span> <span class="p">=</span> <span class="m">2</span><span class="p">;</span>
            <span class="p">(*</span><span class="n">pointer</span><span class="p">).</span><span class="n">Int3</span> <span class="p">=</span> <span class="m">3</span><span class="p">;</span>
            <span class="p">(*</span><span class="n">pointer</span><span class="p">).</span><span class="n">Int4</span> <span class="p">=</span> <span class="m">4</span><span class="p">;</span>
            <span class="p">(*</span><span class="n">pointer</span><span class="p">).</span><span class="n">Int5</span> <span class="p">=</span> <span class="m">5</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="results-2">Results</h3>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>Jit</th>
      <th>Platform</th>
      <th style="text-align: right">Mean</th>
      <th style="text-align: right">Scaled</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ByReference</td>
      <td>LegacyJit</td>
      <td>X64</td>
      <td style="text-align: right">1.649 us</td>
      <td style="text-align: right">1.00</td>
    </tr>
    <tr>
      <td>ByReferenceUnsafe</td>
      <td>LegacyJit</td>
      <td>X64</td>
      <td style="text-align: right">1.721 us</td>
      <td style="text-align: right"><strong>1.04</strong></td>
    </tr>
    <tr>
      <td>ByReference</td>
      <td>LegacyJit</td>
      <td>X86</td>
      <td style="text-align: right"><strong>1.666</strong> us</td>
      <td style="text-align: right">1.00</td>
    </tr>
    <tr>
      <td>ByReferenceUnsafe</td>
      <td>LegacyJit</td>
      <td>X86</td>
      <td style="text-align: right"><strong>1.673</strong> us</td>
      <td style="text-align: right">1.00</td>
    </tr>
    <tr>
      <td>ByReference</td>
      <td>RyuJit</td>
      <td>X64</td>
      <td style="text-align: right">1.684 us</td>
      <td style="text-align: right">1.00</td>
    </tr>
    <tr>
      <td>ByReferenceUnsafe</td>
      <td>RyuJit</td>
      <td>X64</td>
      <td style="text-align: right">1.709 us</td>
      <td style="text-align: right"><strong>1.02</strong></td>
    </tr>
  </tbody>
</table>

<p>To our surprise, the safe way is not slower than unsafe. On x64 it is even 2-4% faster! Why is that?</p>

<p>When we are using safe references, we don’t need to pin objects in memory. The GC understands managed pointers and knows how to update them when it’s compacting the memory. With <code class="language-plaintext highlighter-rouge">unsafe</code> this is not true: the managed memory needs to be pinned before it can be used. As mentioned by <a href="https://twitter.com/buybackoff">Victor Baybekov</a> on Twitter, using the <code class="language-plaintext highlighter-rouge">fixed</code> keyword also prevents the method from being inlined.</p>

<p><strong>Note:</strong> This micro-benchmark does not include the side effects of pinning memory. If you pin many managed arrays in memory, then the GC has a lot of extra work to do when it’s compacting the memory. Once again I will redirect you to the <a href="https://www.amazon.com/dp/1430244585">Pro .NET Performance</a> book by Sasha Goldshtein, Dima Zurbalev and Ido Flatow, which has a whole chapter dedicated to Garbage Collection in .NET.</p>

<h2 id="managed-pointers-arithmetic">Managed pointers arithmetic</h2>

<p>C# 7.0 does not expose managed pointer arithmetic, which is part of IL. But the <code class="language-plaintext highlighter-rouge">System.Runtime.CompilerServices.Unsafe</code> class does. You can use it whenever you want to compare references, move them by a given offset, etc.</p>

<p>You can find it in the <code class="language-plaintext highlighter-rouge">System.Runtime.CompilerServices.Unsafe</code> <a href="https://www.nuget.org/packages/System.Runtime.CompilerServices.Unsafe/4.3.0">NuGet package</a>. It targets .NET Standard 1.0, so you can use it in both .NET 4.5+ and .NET Core 1.0 apps, not to mention other frameworks that implement the standard. The API is as follows:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">namespace</span> <span class="nn">System.Runtime.CompilerServices</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="k">static</span> <span class="k">partial</span> <span class="k">class</span> <span class="nc">Unsafe</span>
    <span class="p">{</span>
        <span class="k">public</span> <span class="k">static</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">AddByteOffset</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">source</span><span class="p">,</span> <span class="n">System</span><span class="p">.</span><span class="n">IntPtr</span> <span class="n">byteOffset</span><span class="p">)</span> 
        <span class="k">public</span> <span class="k">static</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">Add</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">source</span><span class="p">,</span> <span class="kt">int</span> <span class="n">elementOffset</span><span class="p">)</span>
        <span class="k">public</span> <span class="k">static</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">Add</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">source</span><span class="p">,</span> <span class="n">System</span><span class="p">.</span><span class="n">IntPtr</span> <span class="n">elementOffset</span><span class="p">)</span> 
        <span class="k">public</span> <span class="k">static</span> <span class="kt">bool</span> <span class="n">AreSame</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">left</span><span class="p">,</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">right</span><span class="p">)</span>
        <span class="k">public</span> <span class="k">unsafe</span> <span class="k">static</span> <span class="k">void</span><span class="p">*</span> <span class="n">AsPointer</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="k">value</span><span class="p">)</span>
        <span class="k">public</span> <span class="k">unsafe</span> <span class="k">static</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">AsRef</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">void</span><span class="p">*</span> <span class="n">source</span><span class="p">)</span>
        <span class="k">public</span> <span class="k">static</span> <span class="n">T</span> <span class="n">As</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="kt">object</span> <span class="n">o</span><span class="p">)</span> <span class="k">where</span> <span class="n">T</span> <span class="p">:</span> <span class="k">class</span>
        <span class="k">public</span> <span class="k">static</span> <span class="k">ref</span> <span class="n">TTo</span> <span class="n">As</span><span class="p">&lt;</span><span class="n">TFrom</span><span class="p">,</span> <span class="n">TTo</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">TFrom</span> <span class="n">source</span><span class="p">)</span>
        <span class="k">public</span> <span class="k">static</span> <span class="n">System</span><span class="p">.</span><span class="n">IntPtr</span> <span class="n">ByteOffset</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">origin</span><span class="p">,</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">target</span><span class="p">)</span> 
        <span class="k">public</span> <span class="k">static</span> <span class="kt">int</span> <span class="n">SizeOf</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;()</span>
        <span class="k">public</span> <span class="k">static</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">SubtractByteOffset</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">source</span><span class="p">,</span> <span class="n">System</span><span class="p">.</span><span class="n">IntPtr</span> <span class="n">byteOffset</span><span class="p">)</span> 
        <span class="k">public</span> <span class="k">static</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">Subtract</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">source</span><span class="p">,</span> <span class="kt">int</span> <span class="n">elementOffset</span><span class="p">)</span>
        <span class="k">public</span> <span class="k">static</span> <span class="k">ref</span> <span class="n">T</span> <span class="n">Subtract</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="k">ref</span> <span class="n">T</span> <span class="n">source</span><span class="p">,</span> <span class="n">System</span><span class="p">.</span><span class="n">IntPtr</span> <span class="n">elementOffset</span><span class="p">)</span> 
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Note:</strong> this is not the full API. Some methods were removed for brevity. You can find the single <code class="language-plaintext highlighter-rouge">.il</code> file that contains the implementation in the <a href="https://github.com/dotnet/corefx/blob/master/src/System.Runtime.CompilerServices.Unsafe/src/System.Runtime.CompilerServices.Unsafe.il">corefx repo</a>.</p>
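
<p>As an illustration, here is a minimal sketch that uses <code class="language-plaintext highlighter-rouge">Unsafe.Add</code> from the listing above to walk over an array through a managed pointer. It assumes the <code class="language-plaintext highlighter-rouge">System.Runtime.CompilerServices.Unsafe</code> package is referenced; <code class="language-plaintext highlighter-rouge">UnsafeSample</code> and <code class="language-plaintext highlighter-rouge">Sum</code> are hypothetical names used only for this example:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System.Runtime.CompilerServices;

// hypothetical helper, used only for this illustration
public static class UnsafeSample
{
    public static int Sum(int[] numbers)
    {
        if (numbers.Length == 0)
            return 0;

        ref int first = ref numbers[0]; // managed pointer to the first element
        int sum = 0;

        for (int i = 0; i &lt; numbers.Length; i++)
        {
            // move the reference by i elements;
            // Unsafe.Add does not perform any bounds check, so the loop condition has to guarantee safety
            sum += Unsafe.Add(ref first, i);
        }

        return sum;
    }
}
</code></pre></div></div>

<p>No <code class="language-plaintext highlighter-rouge">unsafe</code> keyword and no <code class="language-plaintext highlighter-rouge">fixed</code> statement are needed here, but the name of the class is a fair warning: moving a reference outside of the array is possible and the result is undefined.</p>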

<h2 id="current-limitations">Current limitations</h2>

<p>C# 7.0, which is the current version of the C# language as of today (3rd of July 2017), does not allow you to:</p>

<ul>
  <li>use <code class="language-plaintext highlighter-rouge">readonly</code> references (you cannot take a reference to a <code class="language-plaintext highlighter-rouge">readonly</code> field)</li>
  <li>treat <code class="language-plaintext highlighter-rouge">this</code> as a readonly reference in <code class="language-plaintext highlighter-rouge">readonly</code> structs</li>
  <li>define <code class="language-plaintext highlighter-rouge">by-ref</code> fields</li>
  <li>define <code class="language-plaintext highlighter-rouge">by-ref</code> extension methods. It’s possible today with Visual Basic!</li>
  <li>use conditional operator with refs (<code class="language-plaintext highlighter-rouge">condition ? ref left : ref right</code>)</li>
</ul>

<p>Hopefully all these limitations are going to be addressed by C# 7.2. You can follow the issues today on GitHub:</p>

<ul>
  <li><a href="https://github.com/dotnet/csharplang/issues/38">Champion “Readonly ref”</a></li>
  <li><a href="https://github.com/dotnet/csharplang/issues/188">Champion “readonly for locals and parameters”</a></li>
  <li><a href="https://github.com/dotnet/csharplang/issues/186">Champion “ref extension methods on structs”</a></li>
  <li><a href="https://github.com/dotnet/csharplang/issues/223">Champion “conditional ref operator”</a></li>
</ul>

<p>Don’t forget to support the new C# 7.2 features! Thumbs up!</p>

<p class="center"><img src="/images/references/thumbsup.png" alt="Thumbs up" /></p>

<h2 id="summary">Summary</h2>

<ul>
  <li>C# features like ref parameters, locals and returns can help us to avoid copying the memory.</li>
  <li>They use an existing CLR feature called “managed pointers”. You need a modern IDE to use them, but they will run anywhere.</li>
  <li>Using them is safe and can be faster than using <code class="language-plaintext highlighter-rouge">unsafe</code> pointers.</li>
  <li>We need to measure and prove that using them is beneficial before we increase the complexity of our code.</li>
  <li>There are many limitations as of today. Some of them will be addressed by C# 7.2. Some are already solved by the <code class="language-plaintext highlighter-rouge">System.Runtime.CompilerServices.Unsafe</code> API.</li>
</ul>

<h2 id="sources">Sources</h2>

<ul>
  <li><a href="https://mustoverride.com/refs-not-ptrs/">ref returns are not pointers</a> blog post by Vladimir Sadov</li>
  <li><a href="https://mustoverride.com/managed-refs-CLR/">Managed pointers</a> blog post by Vladimir Sadov</li>
  <li><a href="https://mustoverride.com/ref-returns-and-locals/">Local variables cannot be returned by reference</a> blog post by Vladimir Sadov</li>
  <li><a href="https://mustoverride.com/safe-to-return/">Safe to return rules for ref returns.</a> blog post by Vladimir Sadov</li>
  <li><a href="https://mustoverride.com/ref-locals_single-assignment/">Why ref locals allow only a single binding?</a> blog post by Vladimir Sadov</li>
  <li><a href="https://blog.marcgravell.com/2017/04/spans-and-ref-part-1-ref.html">Spans and ref part 1 : ref</a> blog post by Marc Gravell</li>
  <li><a href="https://stackoverflow.com/a/706578/5852046">What are the implications of using unsafe code?</a> Stack Overflow answer by Jared Par</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[tl;dr Pass and return by reference to avoid large struct copying. It’s type and memory safe. It can be even faster than unsafe! Introduction Since C# 1.0 we could pass arguments to methods by reference. It means that instead of copying value types every time we pass them to a method we can just pass them by reference. It allows us to overcome one of the very few disadvantages of value types which I described in my previous blog post “Value Types vs Reference Types”. Passing is not enough to cover all scenarios. C# 7.0 adds new possibilities: declaring references to local variables and returning by reference from methods. Note: I want to focus on the performance aspect here. If you want to learn more about ref returns and ref locals you should read these awesome blog posts from Vladimir Sadov. He is the software engineer who has implemented this feature for C# compiler. So you can get it straight from the horse’s mouth! Reminder Let’s analyse some simple C# examples to make sure that we have good common understanding of the syntax. void method(ref int argument) - The argument is passed to the method by reference. int localVariable = 123; ref int localReference = ref localVariable; ref localVariable - Dereferencing local variable. If you have C++ background you can think of it as of *localVariable ref int localReference - Defining local reference. localReference is an alias of an existing variable. ref array[0] - Dereferencing array’s first element. ref int method() - The result of the method is passed by reference. The method still returns an int. 
Not a pointer!]]></summary></entry><entry><title type="html">Value Types vs Reference Types</title><link href="https://adamsitnik.com/Value-Types-vs-Reference-Types/" rel="alternate" type="text/html" title="Value Types vs Reference Types" /><published>2017-06-26T00:00:00+00:00</published><updated>2017-06-26T00:00:00+00:00</updated><id>https://adamsitnik.com/Value-Types-vs-Reference-Types</id><content type="html" xml:base="https://adamsitnik.com/Value-Types-vs-Reference-Types/"><![CDATA[<p>tl;dr <code class="language-plaintext highlighter-rouge">structs</code> have better data locality. Value types add much less pressure for the GC than reference types. But big value types are expensive to copy and you can accidentally box them which is bad.</p>

<h3 id="introduction">Introduction</h3>

<p>The .NET Framework implements Reference Types and Value Types. <code class="language-plaintext highlighter-rouge">C#</code> allows us to define custom value types by using the <code class="language-plaintext highlighter-rouge">struct</code> and <code class="language-plaintext highlighter-rouge">enum</code> keywords. <code class="language-plaintext highlighter-rouge">class</code>, <code class="language-plaintext highlighter-rouge">delegate</code> and <code class="language-plaintext highlighter-rouge">interface</code> are for reference types. Primitive types, like <code class="language-plaintext highlighter-rouge">byte</code>, <code class="language-plaintext highlighter-rouge">char</code>, <code class="language-plaintext highlighter-rouge">short</code>, <code class="language-plaintext highlighter-rouge">int</code> and <code class="language-plaintext highlighter-rouge">long</code> are value types, but developers can’t define custom primitive types. In Java primitive types are also value types, but Java does not let developers define custom value types ;)</p>

<p><strong>Value Types and Reference Types are very different in terms of performance characteristics</strong>. In my next blog posts, I am going to describe <code class="language-plaintext highlighter-rouge">ref returns and locals</code>, <code class="language-plaintext highlighter-rouge">ValueTask&lt;T&gt;</code> and <code class="language-plaintext highlighter-rouge">Span&lt;T&gt;</code>. But I need to clarify this matter first, so the readers can understand the benefits.</p>

<p><strong>Note:</strong> To keep my comparison simple I am going to use <code class="language-plaintext highlighter-rouge">ValueTuple&lt;int, int&gt;</code> and <code class="language-plaintext highlighter-rouge">Tuple&lt;int, int&gt;</code> as the examples.</p>

<h2 id="memory-layout">Memory Layout</h2>

<p><strong>Every instance of a reference type has two extra fields that are used internally by the CLR.</strong></p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">ObjectHeader</code> is a bitmask, which is used by the CLR to store some additional information. For example: if you take a lock on a given object instance, this information is stored in <code class="language-plaintext highlighter-rouge">ObjectHeader</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">MethodTable</code> is a pointer to the Method Table, which is a set of metadata about the given type. If you call a virtual method, then the CLR jumps to the Method Table, obtains the address of the actual implementation and performs the actual call.</li>
</ul>

<p>Each of the hidden fields has the size of a pointer. So on a <code class="language-plaintext highlighter-rouge">32 bit</code> architecture we have 8 bytes of overhead, and on <code class="language-plaintext highlighter-rouge">64 bit</code> 16 bytes.</p>

<p class="center"><img src="/images/valueTypesVsReferenceTypes/ReferenceTypes_MemoryLayout.png" alt="Reference Type Memory Layout" /></p>

<p><strong>Value Types don’t have any additional overhead members</strong>. What you see is what you get. This is why they are more limited in terms of features. You cannot derive from a <code class="language-plaintext highlighter-rouge">struct</code>, <code class="language-plaintext highlighter-rouge">lock</code> it or write a finalizer for it.</p>

<p class="center"><img src="/images/valueTypesVsReferenceTypes/ValueTypes_MemoryLayout.png" alt="Value Type Memory Layout" /></p>

<p>RAM is very cheap. So, what’s all the fuss about?</p>

<!--more-->

<h3 id="cpu-cache">CPU Cache</h3>

<p>The CPU implements numerous performance optimizations. One of them is the cache, which is a small, fast memory that holds the most recently used data.</p>

<p class="center"><img src="/images/valueTypesVsReferenceTypes/Cache.png" alt="How the CPU cache works" /></p>

<p><strong>Note:</strong> Multithreading affects CPU cache performance. To make it easier to understand, the following description assumes a single core.</p>

<p>Whenever you try to read a value, the CPU checks the first level of cache (L1). If it’s a <strong>hit</strong>, the value is returned. Otherwise, it checks the second level of cache (L2). If the value is there, it’s copied to L1 and returned. Otherwise, it checks L3 (if it’s present).</p>

<p>If the data is not in the cache, the CPU goes to the main memory and copies it to the cache. This is called a <strong>cache miss</strong>.</p>

<h3 id="latency-numbers-every-programmer-should-know">Latency Numbers Every Programmer Should Know</h3>

<p>According to <a href="https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html">Latency Numbers Every Programmer Should Know</a>, going to main memory is really expensive when compared to referencing cache.</p>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Time</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>L1 cache reference</td>
      <td>1 ns</td>
    </tr>
    <tr>
      <td>L2 cache reference</td>
      <td>4 ns</td>
    </tr>
    <tr>
      <td>Main memory reference</td>
      <td>100 ns</td>
    </tr>
  </tbody>
</table>

<p>So how can we reduce the ratio of cache misses?</p>

<h3 id="data-locality">Data Locality</h3>

<p>The CPU is smart. It’s aware of the following data locality principles:</p>

<ul>
  <li>Spatial
    <blockquote>
      <p>If a particular storage location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future.</p>
    </blockquote>
  </li>
  <li>Temporal
    <blockquote>
      <p>If at one point a particular memory location is referenced, then it is likely that the same location will be referenced again in the near future.</p>
    </blockquote>
  </li>
</ul>

<p>The CPU takes advantage of this knowledge. Whenever it copies a value from main memory to cache, it copies a whole <strong>cache line</strong>, not just the value. A cache line is usually 64 bytes. So the CPU is well prepared in case you ask for a nearby memory location.</p>

<h3 id="the-net-story">The .NET Story</h3>

<p>How do the two extra fields in every reference type instance affect data locality? Let’s take a look at the following diagram, which shows how many instances of <code class="language-plaintext highlighter-rouge">ValueTuple&lt;int, int&gt;</code> and <code class="language-plaintext highlighter-rouge">Tuple&lt;int, int&gt;</code> can fit into a single cache line on a <code class="language-plaintext highlighter-rouge">64bit</code> architecture.</p>

<p class="center"><img src="/images/valueTypesVsReferenceTypes/CacheLines.png" alt="CPU Cache Line" /></p>

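<p>For this simple example, the difference is really huge. In our case, a single cache line can fit 8 instances of the value type (8 bytes each), but only 2.66 instances of the reference type (24 bytes each: 16 bytes of overhead plus 8 bytes of data).</p>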
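
<p>The arithmetic behind these two numbers can be sketched as follows. It is a simplified model that assumes a 64-byte cache line and a 64-bit process, and it ignores alignment as well as the references stored in the array itself. <code class="language-plaintext highlighter-rouge">CacheLineMath</code> and its members are hypothetical names used only for this example:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// hypothetical helper, used only for this illustration
public static class CacheLineMath
{
    const int CacheLineSize = 64;       // bytes, the usual size mentioned above
    const int PointerSize = 8;          // bytes, 64-bit architecture
    const int Fields = 2 * sizeof(int); // two int fields = 8 bytes

    // object header + method table pointer + the fields = 24 bytes
    const int ReferenceTypeSize = 2 * PointerSize + Fields;

    // 64 / 8 = 8 instances of ValueTuple&lt;int, int&gt;
    public static double ValueTypesPerLine() =&gt; (double)CacheLineSize / Fields;

    // 64 / 24 = about 2.66 instances of Tuple&lt;int, int&gt;
    public static double ReferenceTypesPerLine() =&gt; (double)CacheLineSize / ReferenceTypeSize;
}
</code></pre></div></div>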

<h3 id="benchmarks">Benchmarks!</h3>

<p>It’s important to know the theory, but we need to run some benchmarks to measure the performance difference. Once again I am using <code class="language-plaintext highlighter-rouge">BenchmarkDotNet</code> and its feature called <code class="language-plaintext highlighter-rouge">HardwareCounters</code>, which allows me to track CPU cache misses. <a href="https://adamsitnik.com/Hardware-Counters-Diagnoser/">Here</a> you can find my blog post about Collecting Hardware Performance Counters with BenchmarkDotNet.
The benchmark is a simple loop with a read access in every iteration. I would say that it’s just a CPU cache benchmark.</p>

<p><strong>Note</strong>: This benchmark is not a real life scenario. In real life, your struct would most probably be bigger (usually two fields is not enough). Hence the extra overhead of two fields for reference types would have a smaller performance impact. Smaller but still significant in high-performance scenarios!</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Program</span>
<span class="p">{</span>
    <span class="k">static</span> <span class="k">void</span> <span class="nf">Main</span><span class="p">(</span><span class="kt">string</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">BenchmarkRunner</span><span class="p">.</span><span class="n">Run</span><span class="p">&lt;</span><span class="n">DataLocality</span><span class="p">&gt;();</span>
<span class="p">}</span>

<span class="p">[</span><span class="nf">HardwareCounters</span><span class="p">(</span><span class="n">HardwareCounter</span><span class="p">.</span><span class="n">CacheMisses</span><span class="p">)]</span>
<span class="p">[</span><span class="n">RyuJitX64Job</span><span class="p">,</span> <span class="n">LegacyJitX86Job</span><span class="p">]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">DataLocality</span>
<span class="p">{</span>
    <span class="p">[</span><span class="nf">Params</span><span class="p">(</span>
        <span class="m">100</span><span class="p">,</span>
        <span class="m">1000000</span><span class="p">,</span>
        <span class="m">10000000</span><span class="p">,</span>
        <span class="m">100000000</span><span class="p">)]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">Count</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span> <span class="c1">// for smaller arrays we don't get enough of Cache Miss events</span>

    <span class="n">Tuple</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">&gt;[]</span> <span class="n">arrayOfRef</span><span class="p">;</span>
    <span class="n">ValueTuple</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">&gt;[]</span> <span class="n">arrayOfVal</span><span class="p">;</span>

    <span class="p">[</span><span class="n">GlobalSetup</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">Setup</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="n">arrayOfRef</span> <span class="p">=</span> <span class="n">Enumerable</span><span class="p">.</span><span class="nf">Repeat</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="n">Count</span><span class="p">).</span><span class="nf">Select</span><span class="p">((</span><span class="n">val</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">Tuple</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="n">val</span><span class="p">,</span> <span class="n">index</span><span class="p">)).</span><span class="nf">ToArray</span><span class="p">();</span>
        <span class="n">arrayOfVal</span> <span class="p">=</span> <span class="n">Enumerable</span><span class="p">.</span><span class="nf">Repeat</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="n">Count</span><span class="p">).</span><span class="nf">Select</span><span class="p">((</span><span class="n">val</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="k">new</span> <span class="n">ValueTuple</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">&gt;(</span><span class="n">val</span><span class="p">,</span> <span class="n">index</span><span class="p">)).</span><span class="nf">ToArray</span><span class="p">();</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">Baseline</span> <span class="p">=</span> <span class="k">true</span><span class="p">)]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="nf">IterateValueTypes</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">int</span> <span class="n">item1Sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">item2Sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>

        <span class="kt">var</span> <span class="n">array</span> <span class="p">=</span> <span class="n">arrayOfVal</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">array</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="k">ref</span> <span class="n">ValueTuple</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">&gt;</span> <span class="n">reference</span> <span class="p">=</span> <span class="k">ref</span> <span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="n">item1Sum</span> <span class="p">+=</span> <span class="n">reference</span><span class="p">.</span><span class="n">Item1</span><span class="p">;</span>
            <span class="n">item2Sum</span> <span class="p">+=</span> <span class="n">reference</span><span class="p">.</span><span class="n">Item2</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="n">item1Sum</span> <span class="p">+</span> <span class="n">item2Sum</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="nf">IterateReferenceTypes</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">int</span> <span class="n">item1Sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">,</span> <span class="n">item2Sum</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>

        <span class="kt">var</span> <span class="n">array</span> <span class="p">=</span> <span class="n">arrayOfRef</span><span class="p">;</span>
        <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p">&lt;</span> <span class="n">array</span><span class="p">.</span><span class="n">Length</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
        <span class="p">{</span>
            <span class="k">ref</span> <span class="n">Tuple</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">&gt;</span> <span class="n">reference</span> <span class="p">=</span> <span class="k">ref</span> <span class="n">array</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
            <span class="n">item1Sum</span> <span class="p">+=</span> <span class="n">reference</span><span class="p">.</span><span class="n">Item1</span><span class="p">;</span>
            <span class="n">item2Sum</span> <span class="p">+=</span> <span class="n">reference</span><span class="p">.</span><span class="n">Item2</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="n">item1Sum</span> <span class="p">+</span> <span class="n">item2Sum</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="the-results">The Results</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Runtime=Clr  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Jit</th>
        <th>Platform</th>
        <th>Count</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Scaled</th>
        <th style="text-align: right">CacheMisses/Op</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>IterateValueTypes</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td>100</td>
        <td style="text-align: right">68.96 ns</td>
        <td style="text-align: right">1.00</td>
        <td style="text-align: right"><strong>0</strong></td>
      </tr>
      <tr>
        <td>IterateReferenceTypes</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td>100</td>
        <td style="text-align: right">317.49 ns</td>
        <td style="text-align: right"><strong>4.60</strong></td>
        <td style="text-align: right"><strong>0</strong></td>
      </tr>
      <tr>
        <td>IterateValueTypes</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td>100</td>
        <td style="text-align: right">76.56 ns</td>
        <td style="text-align: right">1.00</td>
        <td style="text-align: right"><strong>0</strong></td>
      </tr>
      <tr>
        <td>IterateReferenceTypes</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td>100</td>
        <td style="text-align: right">252.23 ns</td>
        <td style="text-align: right"><strong>3.29</strong></td>
        <td style="text-align: right"><strong>0</strong></td>
      </tr>
    </tbody>
  </table>

</div>
<p>As you can see the difference (Scaled column) is really significant!</p>

<p>But the <code class="language-plaintext highlighter-rouge">CacheMisses/Op</code> column shows only zeros?!? What does it mean? In this case, it means that I ran too few loop iterations (just 100).</p>

<p>An explanation for the curious: BenchmarkDotNet is using <a href="https://adamsitnik.com/Hardware-Counters-ETW/">ETW</a> to collect hardware counters. ETW simply exposes what the hardware has to offer. Each Performance Monitoring Unit (PMU) register is configured to count a specific event and is given a sample-after value (SAV). On my PC the minimum Cache Miss HC sampling interval is 4000. In the value type benchmark I should get a Cache Miss once every 8 loop iterations (<code class="language-plaintext highlighter-rouge">cacheLineSize / sizeOf(ValueTuple&lt;int, int&gt;) = 64 / 8 = 8</code>). I have 100 iterations here, so it should be about 12 Cache Misses per benchmark invocation. But the PMU notifies ETW, which in turn notifies BenchmarkDotNet, only once every 4 000 events. That is once every 333 (<code class="language-plaintext highlighter-rouge">4 000 / 12</code>) benchmark invocations. BenchmarkDotNet implements a heuristic which decides how many times the benchmarked method should be invoked. In this example the method was executed too few times to capture enough events. <strong>So if you want to capture some hardware counters with BenchmarkDotNet you need to perform plenty of iterations!</strong> For more info about PMU you can refer to <a href="https://software.intel.com/en-us/articles/understanding-how-general-exploration-works-in-intel-vtune-amplifier-xe">this article</a> by Jackson Marusarz (Intel).</p>
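
<p>The same back-of-the-envelope calculation, written as code. It is a rough sketch that only repeats the numbers from the paragraph above (64-byte cache line, 8-byte elements, 100 loop iterations, sampling interval of 4000). The <code class="language-plaintext highlighter-rouge">SamplingMath</code> class and its methods are hypothetical helpers, and real hardware will not behave this regularly.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

static class SamplingMath
{
    const int CacheLineSize = 64; // bytes, an assumption (typical for x64 CPUs)

    // Expected number of cache misses for one sequential pass over the array.
    public static int MissesPerInvocation(int loopIterations, int elementSize)
    {
        int elementsPerCacheLine = CacheLineSize / elementSize; // 64 / 8 = 8
        return loopIterations / elementsPerCacheLine;           // integer division
    }

    // How many benchmark invocations it takes to produce a single sample.
    public static int InvocationsPerSample(int samplingInterval, int missesPerInvocation) =&gt;
        samplingInterval / missesPerInvocation;

    static void Main()
    {
        int misses = MissesPerInvocation(100, 8);              // 12
        int invocations = InvocationsPerSample(4000, misses);  // 333

        Console.WriteLine($"{misses} misses per invocation, one sample every {invocations} invocations");
    }
}
</code></pre></div></div>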
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Jit</th>
        <th>Platform</th>
        <th>Count</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Scaled</th>
        <th style="text-align: right">CacheMisses/Op</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>IterateValueTypes</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td>100 000 000</td>
        <td style="text-align: right">88,735,182.11 ns</td>
        <td style="text-align: right">1.00</td>
        <td style="text-align: right"><strong>3545088</strong></td>
      </tr>
      <tr>
        <td>IterateReferenceTypes</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td>100 000 000</td>
        <td style="text-align: right">280,721,189.70 ns</td>
        <td style="text-align: right"><strong>3.16</strong></td>
        <td style="text-align: right"><strong>8456940</strong></td>
      </tr>
    </tbody>
  </table>

</div>
<p>The more loop iterations (Count column), the more Cache Miss events we get. <strong>When iterating over reference types, cache misses were 2.38 times more common</strong> (8456940 / 3545088).</p>

<p><strong>Note:</strong> The accuracy of the Hardware Counters diagnoser in BenchmarkDotNet is limited by the sampling frequency and by the additional code that our Engine executes in the benchmarked process. It’s good but not perfect. For more accurate results you should use a profiler such as Intel VTune Amplifier.</p>

<h2 id="gc-impact">GC Impact</h2>

<p>Reference Types are always allocated on the managed heap (it may change in the <a href="https://xoofx.com/blog/2015/10/08/stackalloc-for-class-with-roslyn-and-coreclr/">future</a>). The heap is managed by the Garbage Collector (GC). The allocation of heap memory is fast. <strong>The problem is that the deallocation is performed by the non-deterministic GC</strong>. The GC implements its own heuristics, which allow it to decide when to perform the cleanup. The cleanup itself takes some time. It means that you cannot predict when the cleanup will take place, and that it adds extra overhead.</p>

<p>Value Types can be allocated both on the stack and on the heap. The stack is not managed by the GC. Anytime you declare a local value type variable, it’s allocated on the stack. When the method ends, the stack is unwound and the value is gone. <strong>This deallocation is super fast. And overall we put less pressure on the GC!</strong> The pressure is not equal to zero, because the GC traverses the stacks anyway, so the deeper the stack, the more work it might have.</p>

<p>But Value Types can also be allocated on the managed heap. If you allocate an array of bytes, then the array is allocated on the managed heap. Its content is transparent to the GC. The elements are not reference type instances, so the GC does not track them in any way. But when a small array of value types gets promoted to an older GC generation, the content will be copied by the GC.</p>

<h3 id="benchmarks-1">Benchmarks</h3>

<p>Let’s run a benchmark that includes the cost of allocation and deallocation for Value Types and Reference Types.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="nf">Config</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">AllocationsConfig</span><span class="p">))]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">NoGC</span>
<span class="p">{</span>
    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">Baseline</span> <span class="p">=</span> <span class="k">true</span><span class="p">)]</span>
    <span class="k">public</span> <span class="n">ValueTuple</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">&gt;</span> <span class="nf">CreateValueTuple</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">ValueTuple</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">);</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="n">Tuple</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">&gt;</span> <span class="nf">CreateTuple</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="n">Tuple</span><span class="p">.</span><span class="nf">Create</span><span class="p">(</span><span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">class</span> <span class="nc">AllocationsConfig</span> <span class="p">:</span> <span class="n">ManualConfig</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="nf">AllocationsConfig</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">gcSettings</span> <span class="p">=</span> <span class="k">new</span> <span class="n">GcMode</span>
        <span class="p">{</span>
            <span class="n">Force</span> <span class="p">=</span> <span class="k">false</span> <span class="c1">// tell BenchmarkDotNet not to force GC collections after every iteration</span>
        <span class="p">};</span>

        <span class="k">const</span> <span class="kt">int</span> <span class="n">invocationCount</span> <span class="p">=</span> <span class="m">1</span> <span class="p">&lt;&lt;</span> <span class="m">20</span><span class="p">;</span> <span class="c1">// let's run it very fast, we are here only for the GC stats</span>

        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span>
            <span class="p">.</span><span class="n">RyuJitX64</span> <span class="c1">// 64 bit</span>
            <span class="p">.</span><span class="nf">WithInvocationCount</span><span class="p">(</span><span class="n">invocationCount</span><span class="p">)</span>
            <span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">gcSettings</span><span class="p">.</span><span class="nf">UnfreezeCopy</span><span class="p">()));</span>
        <span class="nf">Add</span><span class="p">(</span><span class="n">Job</span>
            <span class="p">.</span><span class="n">LegacyJitX86</span> <span class="c1">// 32 bit</span>
            <span class="p">.</span><span class="nf">WithInvocationCount</span><span class="p">(</span><span class="n">invocationCount</span><span class="p">)</span>
            <span class="p">.</span><span class="nf">With</span><span class="p">(</span><span class="n">gcSettings</span><span class="p">.</span><span class="nf">UnfreezeCopy</span><span class="p">()));</span>

        <span class="nf">Add</span><span class="p">(</span><span class="n">MemoryDiagnoser</span><span class="p">.</span><span class="n">Default</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="the-results-1">The Results</h3>

<p>If you are not familiar with the output produced by BenchmarkDotNet with Memory Diagnoser enabled, you can read my <a href="https://adamsitnik.com/the-new-Memory-Diagnoser/#how-to-read-the-results">dedicated blog post</a> to find out how to read these results.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]     : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  Job-QZDRYZ : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  Job-XFJRTH : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Runtime=Clr  Force=False  InvocationCount=1048576  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Jit</th>
        <th>Platform</th>
        <th style="text-align: right">Gen 0</th>
        <th style="text-align: right">Allocated</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>CreateValueTuple</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right"><strong>-</strong></td>
        <td style="text-align: right"><strong>0 B</strong></td>
      </tr>
      <tr>
        <td>CreateTuple</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right"><strong>0.0050</strong></td>
        <td style="text-align: right"><strong>16 B</strong></td>
      </tr>
      <tr>
        <td>CreateValueTuple</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right"><strong>-</strong></td>
        <td style="text-align: right"><strong>0 B</strong></td>
      </tr>
      <tr>
        <td>CreateTuple</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right"><strong>0.0076</strong></td>
        <td style="text-align: right"><strong>24 B</strong></td>
      </tr>
    </tbody>
  </table>

</div>
<p>As you can see, creating Value Types means No GC (<code class="language-plaintext highlighter-rouge">-</code> in Gen 0 column).</p>

<p><strong>Note:</strong> If a value type contains reference type fields, write barriers will be emitted for write access to those fields. So No GC is not 100% true for value types that contain references.</p>

<h2 id="boxing">Boxing</h2>

<p>Whenever a reference is required, value types get boxed. When the CLR boxes a value type, it wraps the value inside a <code class="language-plaintext highlighter-rouge">System.Object</code> and stores it on the managed heap. <strong>GC tracks references to boxed Value Types!</strong> This is something you definitely want to avoid.</p>

<p>Obvious boxing example:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">string</span> <span class="nf">CallToString</span><span class="p">(</span><span class="kt">object</span> <span class="n">input</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">input</span><span class="p">.</span><span class="nf">ToString</span><span class="p">();</span>

<span class="kt">int</span> <span class="k">value</span> <span class="p">=</span> <span class="m">123</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">text</span> <span class="p">=</span> <span class="nf">CallToString</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">CallToString</code> accepts <code class="language-plaintext highlighter-rouge">object</code>. CLR needs to box the value before passing it to this method. It’s clear when you analyse the IL code:</p>

<p class="center"><img src="/images/valueTypesVsReferenceTypes/boxing.png" alt="Boxing" /></p>

<p><strong>Note:</strong> You can use ReSharper’s <a href="https://blog.jetbrains.com/dotnet/2014/06/06/heap-allocations-viewer-plugin/">Heap Allocation Viewer plugin</a> to detect boxing in your code.</p>
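
<p>A small sketch that makes the copy visible: once the value is boxed, the box on the managed heap lives independently of the original variable. The <code class="language-plaintext highlighter-rouge">BoxingCopyDemo</code> class is a made-up name used only for this illustration.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

static class BoxingCopyDemo
{
    public static int ReadBoxedAfterMutation()
    {
        int value = 123;
        object boxed = value; // boxing: copies the value into a new object on the managed heap
        value = 456;          // modifies only the local variable, the box is untouched
        return (int)boxed;    // unboxing: copies the value out of the box
    }

    static void Main() =&gt; Console.WriteLine(ReadBoxedAfterMutation()); // 123
}
</code></pre></div></div>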

<h3 id="invoking-interface-methods-with-value-types">Invoking interface methods with Value Types</h3>

<p>The previous example was obvious. But what happens when we try to pass a struct to a method that accepts an interface instance? Let’s take a look.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">MemoryDiagnoser</span><span class="p">]</span>
<span class="p">[</span><span class="n">RyuJitX64Job</span><span class="p">,</span> <span class="n">LegacyJitX86Job</span><span class="p">]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">ValueTypeInvokingInterfaceMethod</span>
<span class="p">{</span>
    <span class="k">interface</span> <span class="nc">IInterface</span>
    <span class="p">{</span>
        <span class="k">void</span> <span class="nf">DoNothing</span><span class="p">();</span>
    <span class="p">}</span>

    <span class="k">class</span> <span class="nc">ReferenceTypeImplementingInterface</span> <span class="p">:</span> <span class="n">IInterface</span>
    <span class="p">{</span>
        <span class="k">public</span> <span class="k">void</span> <span class="nf">DoNothing</span><span class="p">()</span> <span class="p">{</span> <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">struct</span> <span class="nc">ValueTypeImplementingInterface</span> <span class="p">:</span> <span class="n">IInterface</span>
    <span class="p">{</span>
        <span class="k">public</span> <span class="k">void</span> <span class="nf">DoNothing</span><span class="p">()</span> <span class="p">{</span> <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">private</span> <span class="n">ReferenceTypeImplementingInterface</span> <span class="n">reference</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">ReferenceTypeImplementingInterface</span><span class="p">();</span>
    <span class="k">private</span> <span class="n">ValueTypeImplementingInterface</span> <span class="k">value</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">ValueTypeImplementingInterface</span><span class="p">();</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">Baseline</span> <span class="p">=</span> <span class="k">true</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">ValueType</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>

    <span class="p">[</span><span class="n">Benchmark</span><span class="p">]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">ReferenceType</span><span class="p">()</span> <span class="p">=&gt;</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span>

    <span class="k">void</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">IInterface</span> <span class="n">instance</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">instance</span><span class="p">.</span><span class="nf">DoNothing</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Runtime=Clr  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Jit</th>
        <th>Platform</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Scaled</th>
        <th style="text-align: right">Gen 0</th>
        <th style="text-align: right">Allocated</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>ValueType</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">5.738 ns</td>
        <td style="text-align: right">1.00</td>
        <td style="text-align: right"><strong>0.0038</strong></td>
        <td style="text-align: right"><strong>12 B</strong></td>
      </tr>
      <tr>
        <td>ReferenceType</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">1.910 ns</td>
        <td style="text-align: right"><strong>0.33</strong></td>
        <td style="text-align: right">-</td>
        <td style="text-align: right">0 B</td>
      </tr>
      <tr>
        <td>ValueType</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">5.754 ns</td>
        <td style="text-align: right">1.00</td>
        <td style="text-align: right"><strong>0.0076</strong></td>
        <td style="text-align: right"><strong>24 B</strong></td>
      </tr>
      <tr>
        <td>ReferenceType</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">1.845 ns</td>
        <td style="text-align: right"><strong>0.32</strong></td>
        <td style="text-align: right">-</td>
        <td style="text-align: right">0 B</td>
      </tr>
    </tbody>
  </table>

</div>
<p>Once again we ran into boxing. Did you expect it?!</p>

<h3 id="how-to-avoid-boxing-with-value-types-that-implement-interfaces">How to avoid boxing with value types that implement interfaces?</h3>

<p>We need to use generic constraints. The method should not accept <code class="language-plaintext highlighter-rouge">IInterface</code> but <code class="language-plaintext highlighter-rouge">T</code> which implements <code class="language-plaintext highlighter-rouge">IInterface</code>.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">void</span> <span class="n">Trick</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">T</span> <span class="n">instance</span><span class="p">)</span>
    <span class="k">where</span> <span class="n">T</span> <span class="p">:</span> <span class="n">IInterface</span>
<span class="p">{</span>
    <span class="n">instance</span><span class="p">.</span><span class="nf">DoNothing</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="benchmarks-2">Benchmarks</h4>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">MemoryDiagnoser</span><span class="p">]</span>
<span class="p">[</span><span class="n">RyuJitX64Job</span><span class="p">]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">ValueTypeInvokingInterfaceMethodSmart</span>
<span class="p">{</span>
    <span class="c1">// IInterface, ReferenceTypeImplementingInterface, ValueTypeImplementingInterface and fields are declared in previous benchmark</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">Baseline</span> <span class="p">=</span> <span class="k">true</span><span class="p">,</span> <span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="m">16</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">ValueType</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
        <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
        <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
        <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="m">16</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">ValueTypeSmart</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
        <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
        <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
        <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
        <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span> <span class="nf">AcceptingSomethingThatImplementsInterface</span><span class="p">(</span><span class="k">value</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="m">16</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">ReferenceType</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span>
        <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span>
        <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span>
        <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">reference</span><span class="p">);</span>
    <span class="p">}</span> 

    <span class="k">void</span> <span class="nf">AcceptingInterface</span><span class="p">(</span><span class="n">IInterface</span> <span class="n">instance</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">instance</span><span class="p">.</span><span class="nf">DoNothing</span><span class="p">();</span>

    <span class="k">void</span> <span class="n">AcceptingSomethingThatImplementsInterface</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">T</span> <span class="n">instance</span><span class="p">)</span>
        <span class="k">where</span> <span class="n">T</span> <span class="p">:</span> <span class="n">IInterface</span>
    <span class="p">{</span>
        <span class="n">instance</span><span class="p">.</span><span class="nf">DoNothing</span><span class="p">();</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Note:</strong> I have used the <code class="language-plaintext highlighter-rouge">OperationsPerInvoke</code> feature of BenchmarkDotNet, which is very useful for nano-benchmarks.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]    : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  RyuJitX64 : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Job=RyuJitX64  Jit=RyuJit  Platform=X64  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Error</th>
        <th style="text-align: right">StdDev</th>
        <th style="text-align: right">Scaled</th>
        <th style="text-align: right">Gen 0</th>
        <th style="text-align: right">Allocated</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>ValueType</td>
        <td style="text-align: right">5.572 ns</td>
        <td style="text-align: right">0.0322 ns</td>
        <td style="text-align: right">0.0252 ns</td>
        <td style="text-align: right">1.00</td>
        <td style="text-align: right">0.0076</td>
        <td style="text-align: right">24 B</td>
      </tr>
      <tr>
        <td>ValueTypeSmart</td>
        <td style="text-align: right"><strong>1.145 ns</strong></td>
        <td style="text-align: right">0.0101 ns</td>
        <td style="text-align: right">0.0094 ns</td>
        <td style="text-align: right"><strong>0.21</strong></td>
        <td style="text-align: right">-</td>
        <td style="text-align: right"><strong>0 B</strong></td>
      </tr>
      <tr>
        <td>ReferenceType</td>
        <td style="text-align: right"><strong>2.212 ns</strong></td>
        <td style="text-align: right">0.0096 ns</td>
        <td style="text-align: right">0.0081 ns</td>
        <td style="text-align: right"><strong>0.40</strong></td>
        <td style="text-align: right">-</td>
        <td style="text-align: right">0 B</td>
      </tr>
    </tbody>
  </table>

</div>

<p>By applying this simple trick we were able to not only avoid boxing but also outperform reference type interface method invocation! It was possible due to an optimization performed by the JIT. I am going to call it method de-virtualization because I don’t have a better name for it. How does it work? Let’s consider the following example:</p>

<p><strong>Note:</strong> A previous version of this blog post had a bug, which was spotted by Fons Sonnemans. There is no need for an extra <code class="language-plaintext highlighter-rouge">struct</code> constraint to avoid boxing. Thank you Fons!</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">void</span> <span class="n">Method</span><span class="p">&lt;</span><span class="n">T</span><span class="p">&gt;(</span><span class="n">T</span> <span class="n">instance</span><span class="p">)</span>
        <span class="k">where</span> <span class="n">T</span> <span class="p">:</span> <span class="n">IDisposable</span>
<span class="p">{</span>
        <span class="n">instance</span><span class="p">.</span><span class="nf">Dispose</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">T</code> is constrained with <code class="language-plaintext highlighter-rouge">where T : INameOfTheInterface</code>, the C# compiler emits an additional <code class="language-plaintext highlighter-rouge">IL</code> prefix instruction called <code class="language-plaintext highlighter-rouge">constrained</code> (<a href="https://msdn.microsoft.com/en-us/library/system.reflection.emit.opcodes.constrained.aspx">Docs</a>).</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="p">.</span><span class="n">method</span> <span class="k">public</span> <span class="n">hidebysig</span> 
    <span class="n">instance</span> <span class="k">void</span> <span class="n">Method</span><span class="p">&lt;([</span><span class="n">mscorlib</span><span class="p">]</span><span class="n">System</span><span class="p">.</span><span class="n">IDisposable</span><span class="p">)</span> <span class="n">T</span><span class="p">&gt;</span> <span class="p">(</span>
        <span class="p">!!</span><span class="n">T</span> <span class="err">'</span><span class="n">instance</span><span class="err">'</span>
    <span class="p">)</span> <span class="n">cil</span> <span class="n">managed</span> 
<span class="p">{</span>
    <span class="p">.</span><span class="n">maxstack</span> <span class="m">8</span>

    <span class="n">IL_0000</span><span class="p">:</span> <span class="n">ldarga</span><span class="p">.</span><span class="n">s</span> <span class="err">'</span><span class="n">instance</span><span class="err">'</span>
    <span class="n">IL_0002</span><span class="p">:</span> <span class="n">constrained</span><span class="p">.</span> <span class="p">!!</span><span class="n">T</span>
    <span class="n">IL_0008</span><span class="p">:</span> <span class="n">callvirt</span> <span class="n">instance</span> <span class="k">void</span> <span class="p">[</span><span class="n">mscorlib</span><span class="p">]</span><span class="n">System</span><span class="p">.</span><span class="n">IDisposable</span><span class="p">::</span><span class="nf">Dispose</span><span class="p">()</span>
    <span class="n">IL_000d</span><span class="p">:</span> <span class="n">ret</span>
<span class="p">}</span> <span class="c1">// end of method C::Method</span>
</code></pre></div></div>
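
<p>At the C# level, the difference between a non-generic and a constrained generic method can be sketched as follows. This is only an illustration: the <code class="language-plaintext highlighter-rouge">IShape</code>, <code class="language-plaintext highlighter-rouge">Square</code>, <code class="language-plaintext highlighter-rouge">AreaBoxed</code> and <code class="language-plaintext highlighter-rouge">AreaConstrained</code> names are hypothetical and are not part of the benchmarks above.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

interface IShape { int Area(); }

struct Square : IShape
{
    public int Side;
    public int Area() =&gt; Side * Side;
}

static class Demo
{
    // Non-generic: the parameter is an interface reference,
    // so the C# compiler emits a box instruction for a Square argument.
    static int AreaBoxed(IShape shape) =&gt; shape.Area();

    // Generic with an interface constraint: T is Square itself,
    // so the call goes through the constrained prefix and no box is needed.
    static int AreaConstrained&lt;T&gt;(T shape) where T : IShape =&gt; shape.Area();

    static void Main()
    {
        var square = new Square { Side = 3 };

        Console.WriteLine(AreaBoxed(square));       // prints 9, the argument is boxed
        Console.WriteLine(AreaConstrained(square)); // prints 9, T is inferred as Square
    }
}
</code></pre></div></div>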

<p>If the method is not generic, there is no constraint and the instance can be anything: a value or a reference type. If it is a value type, it has to be boxed.
When the method is generic, the JIT compiles a separate version of it for every value type, which prevents boxing! How does it work?</p>

<p>The JIT handles value types differently than reference types. Operations like passing an argument to a method or returning a value from it are the same for all reference types. We always deal with pointers, which have the same size for all reference types. So the JIT reuses the compiled generic code for reference types, because it can treat them all in the same way. Imagine an array of <code class="language-plaintext highlighter-rouge">objects</code> or <code class="language-plaintext highlighter-rouge">strings</code>. From the JIT’s perspective, it is just an array of pointers. So the array’s indexer implementation will be the same for all reference types.</p>

<p>Value Types are different. Each of them can have a different size. For example, passing an <code class="language-plaintext highlighter-rouge">integer</code> and a custom <code class="language-plaintext highlighter-rouge">struct</code> with two integer fields to a method has a different native implementation. In one case we push a single int to the stack, in the other, we might need to move two fields to the registers and then push them to the stack. So it’s different for every value type.</p>

<p>This is why the JIT compiles every generic method/type separately for each value type argument.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Method</span><span class="p">&lt;</span><span class="kt">object</span><span class="p">&gt;();</span> <span class="c1">// JIT compiled code is common for all reference types</span>
<span class="n">Method</span><span class="p">&lt;</span><span class="kt">string</span><span class="p">&gt;();</span> <span class="c1">// JIT compiled code is common for all reference types</span>
<span class="n">Method</span><span class="p">&lt;</span><span class="kt">int</span><span class="p">&gt;();</span> <span class="c1">// dedicated version for int</span>
<span class="n">Method</span><span class="p">&lt;</span><span class="kt">long</span><span class="p">&gt;();</span> <span class="c1">// dedicated version for long</span>
<span class="n">Method</span><span class="p">&lt;</span><span class="n">DateTime</span><span class="p">&gt;();</span> <span class="c1">// dedicated version for DateTime</span>
</code></pre></div></div>

<p>It might lead to <a href="https://blogs.msdn.microsoft.com/carlos/2009/11/09/net-generics-and-code-bloat-or-its-lack-thereof/">generic Code Bloat</a>. But the great thing is that at this point in time, the JIT can compile <strong>tailored</strong> code per type. And since the type is known, it can <strong>replace the virtual call with a direct call</strong>. As <a href="https://twitter.com/buybackoff">Victor Baybekov</a> mentioned in the comments, it can even remove the unnecessary null check for the call. It’s a value type, so it cannot be null. Inlining is also possible.
For small methods which are executed very often, like <code class="language-plaintext highlighter-rouge">.Equals()</code> in a <a href="https://github.com/Spreads/Spreads.Unsafe#fastdictionary">custom Dictionary implementation</a>, it can be a very big performance gain.</p>
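
<p>A minimal sketch of that idea, assuming a hypothetical <code class="language-plaintext highlighter-rouge">IntComparer</code> struct and a hypothetical <code class="language-plaintext highlighter-rouge">Lookup.IndexOf</code> helper (neither comes from the linked library): the comparer is passed as a generic argument, so the JIT compiles a dedicated version of the method for it and can call <code class="language-plaintext highlighter-rouge">Equals</code> directly.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Collections.Generic;

// Hypothetical value type comparer, used only for this illustration.
struct IntComparer : IEqualityComparer&lt;int&gt;
{
    public bool Equals(int x, int y) =&gt; x == y;
    public int GetHashCode(int value) =&gt; value;
}

static class Lookup
{
    // TComparer is a value type here, so the JIT compiles this method
    // separately for it and knows the exact Equals implementation to call.
    public static int IndexOf&lt;T, TComparer&gt;(T[] items, T value, TComparer comparer)
        where TComparer : struct, IEqualityComparer&lt;T&gt;
    {
        for (int i = 0; i &lt; items.Length; i++)
        {
            if (comparer.Equals(items[i], value))
                return i;
        }

        return -1;
    }
}

static class Program
{
    static void Main()
    {
        int[] numbers = { 10, 20, 30 };

        Console.WriteLine(Lookup.IndexOf(numbers, 30, new IntComparer())); // prints 2
    }
}
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">struct</code> constraint in this sketch is not needed to avoid boxing. It is there only to make sure that callers pass a value type, so the specialized code is the one being used.</p>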

<p>We can see the effect of inlining if we run the same benchmarks for .NET 4.7, where RyuJit got improved and inlines all calls to <code class="language-plaintext highlighter-rouge">AcceptingSomethingThatImplementsInterface</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.9.313-nightly, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338348 Hz, Resolution=427.6523 ns, Timer=TSC
  [Host]       : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2106.0
  LegacyJitX64 : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit LegacyJIT/clrjit-v4.7.2106.0;compatjit-v4.7.2106.0
  LegacyJitX86 : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2106.0
  RyuJitX64    : .NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2106.0
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Job</th>
        <th>Jit</th>
        <th>Platform</th>
        <th style="text-align: right">Mean</th>
        <th style="text-align: right">Error</th>
        <th style="text-align: right">StdDev</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>ValueTypeSmart</td>
        <td>LegacyJitX64</td>
        <td>LegacyJit</td>
        <td>X64</td>
        <td style="text-align: right">1.2906 ns</td>
        <td style="text-align: right">0.0217 ns</td>
        <td style="text-align: right">0.0182 ns</td>
      </tr>
      <tr>
        <td>ValueTypeSmart</td>
        <td>LegacyJitX86</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">0.3367 ns</td>
        <td style="text-align: right">0.0064 ns</td>
        <td style="text-align: right">0.0060 ns</td>
      </tr>
      <tr>
        <td>ValueTypeSmart</td>
        <td>RyuJitX64</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">0.0004 ns</td>
        <td style="text-align: right">0.0006 ns</td>
        <td style="text-align: right">0.0005 ns</td>
      </tr>
    </tbody>
  </table>

</div>

<p><strong>Note:</strong> If you would like to play with generated IL code you can use the awesome <a href="https://sharplab.io/#v2:C4LglgNgNAJiDUAfAAgBgATIIwG4CwAUMgMyYBM6AwugN6HoOanIAs6AsgKbAAWA9jAA8AFQB8ACmHowAOwDOwAIYyAxpwCU9RtoDuPTgCdO6KSHQKDAVxXAo6AJIARMHIAOfOYoBGETloZ0BNrBjLIKymoAdM5uHpzi6vhBjAC+hClAA===">SharpLab</a>.</p>

<h2 id="copying">Copying</h2>

<p>In C#, Value Types are passed to methods by value by default. This means that a Value Type instance is copied every time we pass it to a method or return it from one. The bigger the Value Type is, the more expensive it is to copy it. For small value types, the JIT compiler might optimize the copying (inline the method, use registers for copying &amp; more).</p>
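
<p>A minimal sketch of what “passed by value” means in practice. The <code class="language-plaintext highlighter-rouge">Point</code> struct and the <code class="language-plaintext highlighter-rouge">MoveByValue</code> method are hypothetical names used only for illustration:</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

struct Point { public int X, Y; }

static class Program
{
    // The parameter is a copy of the caller's struct,
    // so the change below is not visible to the caller.
    static void MoveByValue(Point point)
    {
        point.X += 10;
    }

    static void Main()
    {
        var point = new Point { X = 1, Y = 2 };

        MoveByValue(point);

        Console.WriteLine(point.X); // prints 1, only the copy was modified
    }
}
</code></pre></div></div>

<p>The benchmark below measures how much this copying costs for types of different sizes.</p>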

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">RyuJitX64Job</span><span class="p">,</span> <span class="n">LegacyJitX86Job</span><span class="p">]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">CopyingValueTypes</span>
<span class="p">{</span>
    <span class="k">class</span> <span class="nc">ReferenceType1Field</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">X</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">class</span> <span class="nc">ReferenceType2Fields</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">class</span> <span class="nc">ReferenceType3Fields</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">Z</span><span class="p">;</span> <span class="p">}</span>

    <span class="k">struct</span> <span class="nc">ValueType1Field</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">X</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">struct</span> <span class="nc">ValueType2Fields</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">struct</span> <span class="nc">ValueType3Fields</span> <span class="p">{</span> <span class="kt">int</span> <span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">,</span> <span class="n">Z</span><span class="p">;</span> <span class="p">}</span>

    <span class="n">ReferenceType1Field</span> <span class="n">fieldReferenceType1Field</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">ReferenceType1Field</span><span class="p">();</span>
    <span class="n">ReferenceType2Fields</span> <span class="n">fieldReferenceType2Fields</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">ReferenceType2Fields</span><span class="p">();</span>
    <span class="n">ReferenceType3Fields</span> <span class="n">fieldReferenceType3Fields</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">ReferenceType3Fields</span><span class="p">();</span>

    <span class="n">ValueType1Field</span> <span class="n">fieldValueType1Field</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">ValueType1Field</span><span class="p">();</span>
    <span class="n">ValueType2Fields</span> <span class="n">fieldValueType2Fields</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">ValueType2Fields</span><span class="p">();</span>
    <span class="n">ValueType3Fields</span> <span class="n">fieldValueType3Fields</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">ValueType3Fields</span><span class="p">();</span>

    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span> <span class="n">ReferenceType1Field</span> <span class="nf">Return</span><span class="p">(</span><span class="n">ReferenceType1Field</span> <span class="n">instance</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">instance</span><span class="p">;</span>
    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span> <span class="n">ReferenceType2Fields</span> <span class="nf">Return</span><span class="p">(</span><span class="n">ReferenceType2Fields</span> <span class="n">instance</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">instance</span><span class="p">;</span>
    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span> <span class="n">ReferenceType3Fields</span> <span class="nf">Return</span><span class="p">(</span><span class="n">ReferenceType3Fields</span> <span class="n">instance</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">instance</span><span class="p">;</span>

    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span> <span class="n">ValueType1Field</span> <span class="nf">Return</span><span class="p">(</span><span class="n">ValueType1Field</span> <span class="n">instance</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">instance</span><span class="p">;</span>
    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span> <span class="n">ValueType2Fields</span> <span class="nf">Return</span><span class="p">(</span><span class="n">ValueType2Fields</span> <span class="n">instance</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">instance</span><span class="p">;</span>
    <span class="p">[</span><span class="nf">MethodImpl</span><span class="p">(</span><span class="n">MethodImplOptions</span><span class="p">.</span><span class="n">NoInlining</span><span class="p">)]</span> <span class="n">ValueType3Fields</span> <span class="nf">Return</span><span class="p">(</span><span class="n">ValueType3Fields</span> <span class="n">instance</span><span class="p">)</span> <span class="p">=&gt;</span> <span class="n">instance</span><span class="p">;</span>

    <span class="p">[</span><span class="nf">Benchmark</span><span class="p">(</span><span class="n">OperationsPerInvoke</span> <span class="p">=</span> <span class="m">16</span><span class="p">)]</span>
    <span class="k">public</span> <span class="k">void</span> <span class="nf">TestReferenceType1Field</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="kt">var</span> <span class="n">instance</span> <span class="p">=</span> <span class="n">fieldReferenceType1Field</span><span class="p">;</span>
        <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span>
        <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span>
        <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span>
        <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span> <span class="n">instance</span> <span class="p">=</span> <span class="nf">Return</span><span class="p">(</span><span class="n">instance</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="c1">// removed</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The rest of the code was removed for brevity. You can find the full code <a href="https://gist.github.com/adamsitnik/5c1b36c75c94c3ab819de47b5addf3bc">here</a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BenchmarkDotNet=v0.10.8, OS=Windows 8.1 (6.3.9600)
Processor=Intel Core i7-4700MQ CPU 2.40GHz (Haswell), ProcessorCount=8
Frequency=2338337 Hz, Resolution=427.6544 ns, Timer=TSC
  [Host]       : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  LegacyJitX86 : Clr 4.0.30319.42000, 32bit LegacyJIT-v4.6.1649.1
  RyuJitX64    : Clr 4.0.30319.42000, 64bit RyuJIT-v4.6.1649.1

Runtime=Clr  
</code></pre></div></div>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Jit</th>
        <th>Platform</th>
        <th style="text-align: right">Mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>TestReferenceType1Field</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">1.399 ns</td>
      </tr>
      <tr>
        <td>TestReferenceType2Fields</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">1.392 ns</td>
      </tr>
      <tr>
        <td>TestReferenceType3Fields</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">1.388 ns</td>
      </tr>
      <tr>
        <td>TestReferenceType1Field</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">1.737 ns</td>
      </tr>
      <tr>
        <td>TestReferenceType2Fields</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">1.770 ns</td>
      </tr>
      <tr>
        <td>TestReferenceType3Fields</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">1.711 ns</td>
      </tr>
    </tbody>
  </table>

</div>
<p>Passing and returning Reference Types is size-independent. Only a copy of the pointer is passed, and a pointer always fits into a CPU register.</p>
<div class="scrollable-table-wrapper">

  <table>
    <thead>
      <tr>
        <th>Method</th>
        <th>Jit</th>
        <th>Platform</th>
        <th style="text-align: right">Mean</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>TestValueType1Field</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">1.410 ns</td>
      </tr>
      <tr>
        <td>TestValueType2Fields</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">6.859 ns</td>
      </tr>
      <tr>
        <td>TestValueType3Fields</td>
        <td>LegacyJit</td>
        <td>X86</td>
        <td style="text-align: right">6.837 ns</td>
      </tr>
      <tr>
        <td>TestValueType1Field</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">1.465 ns</td>
      </tr>
      <tr>
        <td>TestValueType2Fields</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">8.403 ns</td>
      </tr>
      <tr>
        <td>TestValueType3Fields</td>
        <td>RyuJit</td>
        <td>X64</td>
        <td style="text-align: right">2.627 ns</td>
      </tr>
    </tbody>
  </table>

</div>

<p>The bigger the Value Type is, the more expensive it is to copy. 
 Have you noticed that <code class="language-plaintext highlighter-rouge">TestValueType3Fields</code> was faster than <code class="language-plaintext highlighter-rouge">TestValueType2Fields</code> for <code class="language-plaintext highlighter-rouge">RyuJit</code>? To find out why, we would need to analyse the generated native assembly code.</p>
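<p>A small sketch of why the size matters, assuming .NET 5 or later where <code class="language-plaintext highlighter-rouge">Unsafe.SizeOf&lt;T&gt;</code> is available out of the box. The <code class="language-plaintext highlighter-rouge">OneField</code> and <code class="language-plaintext highlighter-rouge">ThreeFields</code> structs are hypothetical stand-ins, not the types used in the benchmark above.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;
using System.Runtime.CompilerServices;

// Hypothetical value types made of int fields, used only for illustration.
public struct OneField { public int A; }
public struct ThreeFields { public int A, B, C; }

public static class ValueCopyDemo
{
    // The whole struct is copied into the parameter, so the caller's value is untouched.
    public static void Mutate(ThreeFields value) =&gt; value.A = 42;

    public static void Main()
    {
        // The number of bytes that travel with every by-value pass or return.
        Console.WriteLine(Unsafe.SizeOf&lt;OneField&gt;());    // 4
        Console.WriteLine(Unsafe.SizeOf&lt;ThreeFields&gt;()); // 12

        // For a reference type, only the pointer-sized reference is copied.
        Console.WriteLine(Unsafe.SizeOf&lt;object&gt;() == IntPtr.Size); // True

        var original = new ThreeFields { A = 1, B = 2, C = 3 };
        Mutate(original);
        Console.WriteLine(original.A); // 1 (only the copy was changed)
    }
}
</code></pre></div></div>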

<p><strong>How can we avoid copying big Value Types? We should pass and return them by Reference!</strong>
 I am going to leave it here, and continue with my <a href="https://adamsitnik.com/ref-returns-and-ref-locals/">ref returns and locals</a> blog post next week.</p>
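<p>As a small preview, here is a sketch of passing and returning a Value Type by reference, assuming C# 7.2 or later. <code class="language-plaintext highlighter-rouge">BigStruct</code> and the method names are hypothetical and exist only for this example.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

// Hypothetical 32-byte value type, used only for illustration.
public struct BigStruct
{
    public long A, B, C, D;
}

public static class ByReferenceDemo
{
    private static BigStruct _storage;

    // By value: all 32 bytes are copied into the parameter.
    public static long SumByValue(BigStruct value) =&gt; value.A + value.B + value.C + value.D;

    // 'in' passes a read-only reference, so the struct itself is not copied at the call site.
    public static long SumByReference(in BigStruct value) =&gt; value.A + value.B + value.C + value.D;

    // A ref return hands out a reference to the stored struct instead of a copy.
    public static ref BigStruct GetStorage() =&gt; ref _storage;

    public static void Main()
    {
        var big = new BigStruct { A = 1, B = 2, C = 3, D = 4 };
        Console.WriteLine(SumByValue(big));        // 10
        Console.WriteLine(SumByReference(in big)); // 10

        ref BigStruct stored = ref GetStorage();
        stored.A = 5;                              // writes through to _storage
        Console.WriteLine(GetStorage().A);         // 5
    }
}
</code></pre></div></div>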

<h2 id="summary">Summary</h2>

<ul>
  <li>Every instance of a reference type has two extra fields used internally by the CLR.</li>
  <li>Value Types have no hidden overhead, so they have better data locality.</li>
  <li>Reference Types are managed by the GC. It tracks the references and offers fast allocation, but deallocation is expensive and non-deterministic.</li>
  <li>Value Types are not managed by the GC. Value Types = No GC. And No GC is better than any GC!</li>
  <li>Whenever a reference is required, value types are boxed. Boxing is expensive and adds extra pressure on the GC. You should avoid boxing if you can.</li>
  <li>By using generic constraints we can avoid boxing and even de-virtualize interface method calls for Value Types!</li>
  <li>Value Types are passed to and returned from methods by Value. So by default, they are copied all the time.</li>
</ul>
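<p>The boxing and generic constraint points from the list above can be sketched as follows. <code class="language-plaintext highlighter-rouge">IShape</code>, <code class="language-plaintext highlighter-rouge">Square</code> and both <code class="language-plaintext highlighter-rouge">Area*</code> methods are hypothetical names chosen for this illustration.</p>

<div class="language-cs highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System;

// Hypothetical interface and value type, used only for illustration.
public interface IShape { int Area(); }

public struct Square : IShape
{
    public int Side;
    public int Area() =&gt; Side * Side;
}

public static class BoxingDemo
{
    // The struct is boxed when it is converted to the interface type at the call site.
    public static int AreaBoxed(IShape shape) =&gt; shape.Area();

    // The constraint keeps T a value type, so the argument is passed without boxing.
    public static int AreaGeneric&lt;T&gt;(T shape) where T : struct, IShape =&gt; shape.Area();

    public static void Main()
    {
        var square = new Square { Side = 3 };
        Console.WriteLine(AreaBoxed(square));   // 9
        Console.WriteLine(AreaGeneric(square)); // 9

        // Boxing copies the value into a new heap object.
        object boxed = square;
        square.Side = 10;
        Console.WriteLine(((Square)boxed).Side); // 3 (the box holds its own copy)
    }
}
</code></pre></div></div>

<p>Both calls return the same result. The difference is that <code class="language-plaintext highlighter-rouge">AreaBoxed</code> allocates a box on the heap for every call with a struct argument, while <code class="language-plaintext highlighter-rouge">AreaGeneric</code> does not.</p>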

<p><strong>VERY Important!!</strong> The <a href="https://www.amazon.com/dp/1430244585">Pro .NET Performance</a> book by Sasha Goldshtein, Dima Zurbalev and Ido Flatow has a whole chapter dedicated to Type Internals. If you want to learn more about it, you should definitely read it. It’s the best source available; my blog post is just an overview!</p>

<h2 id="sources">Sources</h2>

<ul>
  <li><a href="https://www.amazon.com/dp/1430244585">Pro .NET Performance</a> book by Sasha Goldshtein, Dima Zurbalev, Ido Flatow</li>
  <li><a href="https://tooslowexception.com/how-does-gettype-work/">How does Object.GetType() really work?</a> blog post by Konrad Kokosa</li>
  <li><a href="https://www.infoq.com/presentations/csharp-systems-programming">Safe Systems Programming in C# and .NET</a> video by Joe Duffy</li>
  <li><a href="https://cs.umw.edu/~finlayson/class/spring16/cpsc305/notes/14-memory.html">Memory Systems</a> article by University of Mary Washington</li>
  <li><a href="https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html">Latency Numbers Every Programmer Should Know</a> article by Berkeley University</li>
  <li><a href="https://en.wikipedia.org/wiki/Locality_of_reference#Types_of_locality">Types of locality</a> definition by Wikipedia</li>
  <li><a href="https://software.intel.com/en-us/articles/understanding-how-general-exploration-works-in-intel-vtune-amplifier-xe">Understanding How General Exploration Works in Intel® VTune™ Amplifier XE</a> by Jackson Marusarz (Intel)</li>
  <li><a href="https://xoofx.com/blog/2015/10/08/stackalloc-for-class-with-roslyn-and-coreclr/">A new stackalloc operator for reference types with CoreCLR and Roslyn</a> blog post by Alexandre Mutel</li>
  <li><a href="https://msdn.microsoft.com/pl-pl/library/yz2be5wk(v=vs.90).aspx">Boxing and Unboxing</a> article by MSDN</li>
  <li><a href="https://blog.jetbrains.com/dotnet/2014/06/06/heap-allocations-viewer-plugin/">Heap Allocations Viewer plugin</a> blog post by Matt Ellis (JetBrains)</li>
  <li><a href="https://sharplab.io/">SharpLab.io</a></li>
  <li><a href="https://msdn.microsoft.com/en-us/library/system.reflection.emit.opcodes.constrained.aspx">OpCodes.Constrained Field</a> article by MSDN</li>
  <li><a href="https://blogs.msdn.microsoft.com/carlos/2009/11/09/net-generics-and-code-bloat-or-its-lack-thereof/">.NET Generics and Code Bloat</a> article by MSDN</li>
  <li><a href="https://stackoverflow.com/a/5532061">What happens with a generic constraint that removes this requirement?</a> Stack Overflow answer by Eric Lippert</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[tl;dr structs have better data locality. Value types add much less pressure for the GC than reference types. But big value types are expensive to copy and you can accidentally box them which is bad. Introduction The .NET framework implements Reference Types and Value Types. C# allows us to define custom value types by using struct and enum keywords. class, delegate and interface are for reference types. Primitive types, like byte, char, short, int and long are value types, but developers can’t define custom primitive types. In Java primitive types are also value types, but Java does not expose a possibility to define custom value types for developers ;) Value Types and Reference Types are very different in terms of performance characteristics. In my next blog posts, I am going to describe ref returns and locals, ValueTask&lt;T&gt; and Span&lt;T&gt;. But I need to clarify this matter first, so the readers can understand the benefits. Note: To keep my comparison simple I am going to use ValueTuple&lt;int, int&gt; and Tuple&lt;int, int&gt; as the examples. Memory Layout Every instance of a reference type has extra two fields that are used internally by CLR. ObjectHeader is a bitmask, which is used by CLR to store some additional information. For example: if you take a lock on a given object instance, this information is stored in ObjectHeader. MethodTable is a pointer to the Method Table, which is a set of metadata about given type. If you call a virtual method, then CLR jumps to the Method Table and obtains the address of the actual implementation and performs the actual call. Both hidden fields size is equal to the size of a pointer. So for 32 bit architecture, we have 8 bytes overhead and for 64 bit 16 bytes. Value Types don’t have any additional overhead members. What you see is what you get. This is why they are more limited in terms of features. 
You cannot derive from struct, lock it or write finalizer for it. RAM is very cheap. So, what’s all the fuss about?]]></summary></entry></feed>