Pooling large arrays with ArrayPool

tl;dr Use ArrayPool<T> for large arrays to avoid Full GC.

Introduction

.NET’s Garbage Collector (GC) implements many performance optimizations. One of them, the generational model assumes that young objects die quickly, whereas old live longer. This is why managed heap is divided into three Generations. We call them Gen 0 (youngest), Gen 1 (short living) and Gen 2 (oldest). New objects are allocated in Gen 0. When GC tries to allocate a new object and Gen 0 is full, it performs the Gen 0 cleanup. So it performs a partial cleanup (Gen 0 only)! It is traversing the object’s graph, starting from the roots (local variables, static fields & more) and marks all of the referenced objects as living objects.

This is the first phase, called “mark”. This phase can be nonblocking, everything else that GC does is fully blocking. GC suspends all of the application threads to perform next steps!

Living objects are being promoted (most of the time moved == copied!) to Gen 1, and the memory of Gen 0 is being cleaned up. Gen 0 is usually very small, so this is very fast. In a perfect scenario, which could be a web request, none of the objects survive. All allocated objects should die when the request ends. So GC just sets the next object pointer to the beginning of Gen 0. After some Gen 0 collections, we get to the situation, when Gen 1 is also full, so GC can’t just promote more objects to it. Then it simply collects Gen 1 memory. Gen 1 is also small, so it’s fast. Anyway, the Gen 1 survivors are being promoted to Gen 2. Gen 2 objects are supposed to be long living objects. Gen 2 is very big and it’s very time-consuming to collect its memory. So garbage collection of Gen 2 is something that we want to avoid. Why? let’s take a look at the following video to find out how the Gen 2 collection can affect user experience:

Large Object Heap (LOH)

When GC is promoting objects to next generation it’s copying the memory. As you can imagine, it would be very time-consuming for large objects like big arrays or strings. This is why GC has another optimization. Any object that is bigger than 85 000 bytes is considered to be large. Large objects are stored in a separate part of the managed heap, called Large Object Heap (LOH). This part is managed with free list algorithm. It means that GC has a list of free segments of memory, and when we want to allocate something big, it’s searching through the list to find a feasible segment of memory for it. So large objects are by default never moved in memory. However, if you run into LOH fragmentation issues you need to compact LOH. Since .NET 4.5.1 you can do this on demand.

The Problem

When a large object is allocated, it’s marked as Gen 2 object. Not Gen 0 as for small objects. The consequences are that if you run out of memory in LOH, GC cleans up whole managed heap, not only LOH. So it cleans up Gen 0, Gen 1 and Gen 2 including LOH. This is called full garbage collection and is the most time-consuming garbage collection. For many applications, it can be acceptable. But definitely not for high-performance web servers, where few big memory buffers are needed to handle an average web request (read from a socket, decompress, decode JSON & more).

.NET Managed Heap

Is it a problem for your application? You can use the built-in perfmon.exe to get an initial overview.

Perfmon

Perfmon results

As you can see it’s not a problem for my Visual Studio. It was running for few hours and #Gen 2 Collections is very low compared to Gen 0/1.

The Solution

The solution is very simple: buffer pooling. Pool is a set of initialized objects that are ready to use. Instead of allocating a new object, we rent it from the pool. Once we are done using it, we return it to the pool. Every large managed object is an array or an array wrapper (string contains a length field and an array of chars). So we need to pool arrays to avoid this problem.

ArrayPool<T> is a high performance pool of managed arrays. You can find it in System.Buffers package and it’s source code is available on GitHub. It’s mature and ready to use in the production. It targets .NET Stadard 1.1 which means that you can use it not only in your new and fancy .NET Core apps, but also in the existing .NET 4.5.1 apps as well!

Sample

var samePool = ArrayPool<byte>.Shared;
byte[] buffer = samePool.Rent(minLength);
try
{
    Use(buffer);
}
finally
{
    samePool.Return(buffer);
    // don't use the reference to the buffer after returning it!
}

void Use(byte[] buffer) // it's an array

How to use it?

First of all you need to obtain an instance of the pool. You can do in at least three ways:

  • Recommended: use the ArrayPool<T>.Shared property, which returns a shared pool instance. It’s thread safe and all you need to remember is that it has a default max array length, equal to 2^20 (1024*1024 = 1 048 576).
  • Call the static ArrayPool<T>.Create method, which creates a thread safe pool with custom maxArrayLength and maxArraysPerBucket. You might need it if the default max array length is not enough for you. Please be warned, that once you create it, you are responsible for keeping it alive.
  • Derive a custom class from abstract ArrayPool<T> and handle everything on your own.

Next thing is to call the Rent method which requires you to specify minimum length of the buffer. Keep in mind, that what Rent returns might be bigger than what you have asked for.

byte[] webRequest = request.Bytes;
byte[] buffer = ArrayPool<byte>.Shared.Rent(webRequest.Length);

Array.Copy(
    sourceArray: webRequest, 
    destinationArray: buffer, 
    length: webRequest.Length); // webRequest.Length != buffer.Length!!

Once you are done using it, you just Return it to the SAME pool. Return method has an overload, which allows you to cleanup the buffer so subsequent consumer via Rent will not see the previous consumer’s content. By default the contents are left unchanged.

Very imporant note from ArrayPool code:

Once a buffer has been returned to the pool, the caller gives up all ownership of the buffer. The reference returned from a given call to Rent must be returned via Return only once.

It means, that the developer is responsible for doing things right. If you keep using the reference to the buffer after returning it to the pool, you are risking unexpected behavior. As far as I know, there is no static code analysis tool that can verify the correct usage (as of today). ArrayPool is part of the corefx library, it’s not a part of the C# language.

The Benchmarks!!!

Let’s use BenchmarkDotNet and compare the cost of allocating arrays with the new operator vs pooling them with ArrayPool<T>. To make sure that allocation benchmark is including time spent in GC, I am configuring BenchmarkDotNet explicity to not force GC collections. Pooling is combined cost of Rent and Return. I am running the benchmarks for .NET Core 2.0, which is important because it has faster version of ArrayPool<T>. For .NET Core 2.0 ArrayPool<T> is part of the clr, whereas older frameworks use corefx version. Both versions are really fast, comparison of them and analysis of their design could be a separate blog post.

class Program
{
    static void Main(string[] args) => BenchmarkRunner.Run<Pooling>();
}

[MemoryDiagnoser]
[Config(typeof(DontForceGcCollectionsConfig))] // we don't want to interfere with GC, we want to include it's impact
public class Pooling
{
    [Params((int)1E+2, // 100 bytes
        (int)1E+3, // 1 000 bytes = 1 KB
        (int)1E+4, // 10 000 bytes = 10 KB
        (int)1E+5, // 100 000 bytes = 100 KB
        (int)1E+6, // 1 000 000 bytes = 1 MB
        (int)1E+7)] // 10 000 000 bytes = 10 MB
    public int SizeInBytes { get; set; }

    private ArrayPool<byte> sizeAwarePool;

    [GlobalSetup]
    public void GlobalSetup() 
        => sizeAwarePool = ArrayPool<byte>.Create(SizeInBytes + 1, 10); // let's create the pool that knows the real max size

    [Benchmark]
    public void Allocate() 
        => DeadCodeEliminationHelper.KeepAliveWithoutBoxing(new byte[SizeInBytes]);

    [Benchmark]
    public void RentAndReturn_Shared()
    {
        var pool = ArrayPool<byte>.Shared;
        byte[] array = pool.Rent(SizeInBytes);
        pool.Return(array);
    }

    [Benchmark]
    public void RentAndReturn_Aware()
    {
        var pool = sizeAwarePool;
        byte[] array = pool.Rent(SizeInBytes);
        pool.Return(array);
    }
}

public class DontForceGcCollectionsConfig : ManualConfig
{
    public DontForceGcCollectionsConfig()
    {
        Add(Job.Default
            .With(new GcMode()
            {
                Force = false // tell BenchmarkDotNet not to force GC collections after every iteration
            }));
    }
}

The Results

If you are not familiar with the output produced by BenchmarkDotNet with Memory Diagnoser enabled, you can read my dedicated blog post to find out how to read these results.

BenchmarkDotNet=v0.10.7, OS=Windows 10 Redstone 1 (10.0.14393)
Processor=Intel Core i7-6600U CPU 2.60GHz (Skylake), ProcessorCount=4
Frequency=2742189 Hz, Resolution=364.6722 ns, Timer=TSC
dotnet cli version=2.0.0-preview1-005977
  [Host]     : .NET Core 4.6.25302.01, 64bit RyuJIT
  Job-EBWZVT : .NET Core 4.6.25302.01, 64bit RyuJIT
Method SizeInBytes Mean Gen 0 Gen 1 Gen 2 Allocated
Allocate 100 8.078 ns 0.0610 - - 128 B
RentAndReturn_Shared 100 44.219 ns - - - 0 B

For very small chunks of memory the default allocator can be faster.

Method SizeInBytes Mean Gen 0 Gen 1 Gen 2 Allocated
Allocate 1 000 41.330 ns 0.4880 0.0000 - 1024 B
RentAndReturn_Shared 1 000 43.739 ns - - - 0 B

For 1 000 bytes they are almost on par.

Method SizeInBytes Mean Gen 0 Gen 1 Gen 2 Allocated
Allocate 10 000 374.564 ns 4.7847 0.0000 - 10024 B
RentAndReturn_Shared 10 000 44.223 ns - - - 0 B

The bigger it gets, the slower it takes to allocate the memory.

Method SizeInBytes Mean Gen 0 Gen 1 Gen 2 Allocated
Allocate 100 000 3,637.110 ns 31.2497 31.2497 31.2497 100024 B
RentAndReturn_Shared 100 000 46.649 ns - - - 0 B

Gen 2 collections! 100 000 > 85 000, so we get our first Full Garbage Collections!

Method SizeInBytes Mean StdDev Gen 0/1/2 Allocated
RentAndReturn_Shared 100 44.219 ns 0.0314 ns - 0 B
RentAndReturn_Shared 1 000 43.739 ns 0.0337 ns - 0 B
RentAndReturn_Shared 10 000 44.223 ns 0.0333 ns - 0 B
RentAndReturn_Shared 100 000 46.649 ns 0.0346 ns - 0 B
RentAndReturn_Shared 1 000 000 42.423 ns 0.0623 ns - 0 B

At this point of time, you should have noticed, that the cost of pooling with ArrayPool<T> is constant and size-independent! It’s great, because you can predict the behaviour of your code.

BufferAllocated

But what happens if you try to rent a buffer, that exceeds the max array length of given pool (2^20 for ArrayPool.Shared)?

Method SizeInBytes Mean Gen 0 Gen 1 Gen 2 Allocated
Allocate 10 000 000 557,963.968 ns 211.5625 211.5625 211.5625 10000024 B
RentAndReturn_Shared 10 000 000 651,147.998 ns 207.1484 207.1484 207.1484 10000024 B
RentAndReturn_Aware 10 000 000 47.033 ns - - - 0 B

A new buffer is allocated every time you ask for more than the max array length! And when you return it to the pool, it’s just being ignored. Not somehow added to the pool.

Don’t worry! ArrayPool<T> has it’s own ETW Event Provider, so you can use PerfView or any other tool to profile or monitor your application and watch for the BufferAllocated event.

System.Buffers.ArrayPoolEventSource

BufferAllocated event

To avoid this problem you can use ArrayPool<T>.Create method, which creates a pool with custom maxArrayLength. But don’t create too many custom pools!! The goal of pooling is to keep LOH small. If you create too many pools, you will end up with large LOH, full of big arrays that can not be reclaimed by GC (because they are going to be rooted by your custom pools). This is why all popular libraries like ASP.NET Core or ImageSharp use ArrayPool<T>.Shared only. If you start using ArrayPool<T>.Shared instead of allocating with new operator, then in the pessimistic scenario (asking it for array > default max size) you will be slightly slower than before (you will do an extra check and then allocate). But in the optimistic scenario, you will be much faster, because you will just rent it from the pool. So this is why I believe that you can use ArrayPool<T>.Shared by default. ArrayPool<T>.Create should be used if BufferAllocated events are frequent.

Pooling MemoryStream(s)

Sometimes an array might be not enough to avoid LOH allocations. An example can be 3rd party api that accepts Stream instance. Thanks to Victor Baybekov I have discovered Microsoft.IO.RecyclableMemoryStream. This library provides pooling for MemoryStream objects. It was designed by Bing engineers to help with LOH issues. For more details you can go this blog post by Ben Watson.

Summary

  • LOH = Gen 2 = Full GC = bad performance
  • ArrayPool was designed for best possible performance
  • Pool the memory if you can control the lifetime
  • Use ArrayPool<T>.Shared by default
  • Pool allocates the memory for buffers > maxArrayLength
  • The fewer pools, the smaller LOH, the better!

Sources

Written on June 20, 2017