Grass Fin Fill Rate

The previous post illustrated a method of generating grass on the GPU using Compute Shaders. One thing that this doesn’t improve upon is the problem of Fill Rate.

Fill Rate, as used here, is the number of times a specific pixel on screen is written to by the GPU. It's really easy to see if your application is Fill Rate restricted; just shrink the viewport. If the FPS goes up then you are Fill Rate throttled.

Where the previous post talked about the use of billboards/imposters/fins to generate complex grass, it ignored the resulting fill rate impact. All those GPU generated quads had a pixel hit rate of about 10% (i.e. 90% of the texture space was alpha 0 and culled). Because clip() was being used rather than alpha blending it wasn’t as bad as it could have been, but the use of clip() prevents early depth rejection on the GPU. Despite being in Opaque alpha mode, the GPU cannot assume that any triangle actually covers any other triangle until the pixel shader has run, so it must shade them all.

There isn’t really any way around this if you are using imposters; so I created a grass model patch of opaque blades and rendered that. I looked for a good low poly model but they were all generated for use in static scenes and grass is a pretty easy model to dynamically generate;

1) Generate a 2D grid of seed points. The model size should be the same as the voxel size of the orthographic planting texture discussed in the last post, such that each pixel of that (and consequent AppendBuffer point) represents a single instance of the grass patch.
2) For each seed point, generate a 0 origin blade of 7 points – two quads with a triangle on top.
3) Pick a ‘direction’ normal based on a random X,Z value and translate the resulting blade points based on the square of the Y value; this means the grass bends towards the intended direction more acutely the taller it is, with the base quad being only slightly tilted toward the ‘direction’.
4) Rotate the blade points around the Y axis in a random rotation.
5) Calculate the normals of each triangle.
6) Translate to the model seed point.
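
As a rough illustration, the steps above can be sketched on the CPU in Python. This is not the actual generator: the function name make_blade, the default height, half_width and bend values, and the way the three levels are spaced are all assumptions, and step 5 (normals) is left out to keep it short.

```python
import math
import random

def make_blade(seed_x, seed_z, height=0.35, half_width=0.01, bend=0.1, rng=random):
    """Return the 7 points of one blade: two stacked quads plus a tip."""
    # Step 2: zero-origin blade. Three left/right pairs form two quads,
    # and a single tip point closes the triangle on top.
    levels = [0.0, height / 3.0, 2.0 * height / 3.0]
    points = []
    for y in levels:
        points.append([-half_width, y, 0.0])
        points.append([half_width, y, 0.0])
    points.append([0.0, height, 0.0])

    # Step 3: pick a random XZ lean direction and displace each point by
    # the square of its (normalised) height, so the tip bends the most.
    lean = rng.uniform(0.0, 2.0 * math.pi)
    dir_x, dir_z = math.cos(lean), math.sin(lean)
    for p in points:
        t = (p[1] / height) ** 2
        p[0] += dir_x * bend * t
        p[2] += dir_z * bend * t

    # Step 4: random rotation around the Y axis.
    yaw = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(yaw), math.sin(yaw)
    rotated = [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]

    # Step 6: translate to the model seed point.
    return [(x + seed_x, y, z + seed_z) for x, y, z in rotated]

blade = make_blade(1.0, 2.0, rng=random.Random(0))
```

Because the displacement is zero at y = 0, the base of the blade stays centred on the seed point whatever the lean and rotation turn out to be.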

Once that is created it can be set as a shader resource;

// Buffer that will be filled with the model triangle list.
struct ModelVertex
{
	float3 Position;
	float3 Normal;
	float2 TexCoord;
};
StructuredBuffer<ModelVertex> Param_ModelBuffer : register(t15);

I create three of these models by eliminating every other seed point each time, to generate a less and less dense model. While doing that I double the width of the remaining grass blades.
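
A minimal sketch of that thinning, in Python for brevity. It assumes the first model is the full-density one and treats the seed points as a flat list; build_lod_levels and its parameters are hypothetical names, not part of my library.

```python
def build_lod_levels(seed_points, base_width, levels=3):
    """Return (seed_points, blade_width) pairs, densest model first."""
    lods = []
    points, width = list(seed_points), base_width
    for _ in range(levels):
        lods.append((points, width))
        points = points[::2]  # eliminate every other seed point
        width *= 2.0          # widen the survivors to keep the coverage
    return lods

lods = build_lod_levels(range(16), base_width=0.02)
counts = [len(points) for points, _ in lods]
widths = [width for _, width in lods]
```

Doubling the width as the count halves keeps the total blade area roughly constant, which is why the sparser models still read as the same patch of grass at a distance.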

To finish off the effect I blend the grass blade color into the ground color by lerping on the distance from the viewer; I do the same trick with the grass blade normal as well. This helps it blend into the flat ground texture.
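
The blend can be sketched as follows, in Python rather than shader code. The fade_start and fade_end distances are assumptions (the post does not give the real values), and a blended normal would need renormalising afterwards.

```python
def lerp(a, b, t):
    """Component-wise linear interpolation, like HLSL lerp()."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def blend_to_ground(blade_value, ground_value, distance, fade_start, fade_end):
    """Fade a blade colour (or normal) into the ground value with distance."""
    t = (distance - fade_start) / (fade_end - fade_start)
    t = min(max(t, 0.0), 1.0)  # same as HLSL saturate()
    return lerp(blade_value, ground_value, t)

blade_colour = (0.0, 1.0, 0.0)
ground_colour = (1.0, 0.0, 0.0)
near = blend_to_ground(blade_colour, ground_colour, 5.0, 10.0, 20.0)
mid = blend_to_ground(blade_colour, ground_colour, 15.0, 10.0, 20.0)
far = blend_to_ground(blade_colour, ground_colour, 25.0, 10.0, 20.0)
```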

You can see the effect in the YouTube video here


Unblocking the Bottleneck – Better Grass

So having tried Geometry Shader generation and simple “Big Vertex Buffer” grass generation I stumbled on the post http://blog.codemasters.com/grid/10/rendering-fields-of-grass-in-grid-autosport which illustrated a pretty common (but unknown to me) method of vertex generation on the GPU that looks like it could be the solution I was after.

I searched for some time for a really clear guide to what was going on, and although the codemasters blog hinted at the basics, what I needed was a step-by-step guide to follow to see how it worked.

I couldn’t find one, so I thought I’d write one.

Preparation

The method assumes you are generating an orthogonal (look down) image of the immediate surroundings of the eye camera in world space. I typically create two orthogonal textures per frame; one is a simple rendering of the basic landscape (without trees or other infill) and the second is a heightmap. This guide doesn’t describe creating those – it assumes you have them.

The Method

Code later; first a description of how you go about doing the work.

  1. Create a Generation Compute Shader and an Append Buffer. This will generate vertexes by examining the two orthogonal textures generated above. As it samples the two textures it will emit (append) new Vertexes for each world location where a grass blade should be, and sample the height map to work out how high it is.
  2. Create a Vertex Count Compute Shader and an Unordered Access Buffer to store the count of the vertexes generated in stage 1. This is fiddly to understand but crucial for performance.
  3. Create a Vertex and Pixel Shader pair for rendering the vertexes generated in stage 1, along with a Structured Buffer to transfer the data from the Append Buffer so it can be used.

That's it. You will of course need some grass textures (imposters/fins) to actually render, but this example will be complete when we are rendering a million quads.

The ‘trick’ to good performance with this method is that none of the generated data is allowed to move between the GPU and the CPU at any point in the rendering sequence. All of the work is done on the GPU and nothing transits the memory pipeline.

Definitions

You may not have come across Compute Shaders and Append Buffers before; so what are they?

A Compute Shader is an arbitrary piece of code executed on the GPU with massive parallelization. It's important to note that the order in which your code is executed is not defined, so the outputs from the Compute Shader can come out in any order, and this matters if you are using the Compute Shader to generate mesh vertexes. Normally you would output a mesh in some kind of winding order, but a Compute Shader gives you no control over the order in which vertexes are appended. On that basis it's not a good idea to try to generate meshes from them directly, so we output Particles which we later expand into meshes. For this reason the method shown here is sometimes termed Particle Shading.

We’re used to writing Vertex Shader code and Pixel Shader code, both of which have pretty well-defined inputs and outputs; a VS takes a Vertex from a Vertex Buffer and outputs a struct per-vertex. The values of the struct are interpolated (by the rasteriser, a hidden fixed-function step) to provide a per-pixel set of values for the same struct to pass into the Pixel Shader. The PS outputs a set of scalar values, typically just a float4 set of colours (although sometimes more when performing multiple render-target operations).

The key to understanding Compute Shaders is to realise the VS and PS pair are just specialised forms of Compute Shader; they have very distinct input and output expectations which they inherit from their fixed function pipeline past, but ultimately they are just Compute Shaders with added restrictions.

A Compute Shader can take its input from any Buffer passed in as a resource, and write its output to any Buffer passed in as a resource. The VS can only read from the buffer passed in as a “VertexBuffer” and the PS can only write to the resource passed in as a “RenderTarget”; they are just very restricted forms of Compute Shader.

In this post I will be using my Compute Shader to write to an AppendBuffer and read from a Texture. An AppendBuffer is an optimised memory structure which can only be written to, so is perfect for creating new geometry.

Using Compute Shaders

Thinking back to a VS/PS pair, we can easily visualise how many times the VS is going to be called, because it's simply the number of Indexes specified in the DrawIndexed call. The Struct passed into the VS is the n’th item in the input buffer of Structs which describe the geometry.

When we call a Compute Shader it's not operating on any specific Buffer resource that's been bound to the shader step; it could operate on one, many or none of the Buffers that it can access. The Compute Shader is simply told to execute ‘n’ times by the programmer using two controlling parameters: the numthreads attribute in the shader and the arguments to the Dispatch call.

In HLSL the Compute Shader must have the following prefix;

[numthreads(32,32,1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)
{

}

The numthreads values define the size of one thread group. Each thread is told where it sits via the threadID parameter, a 3 dimensional uint. If a single group is dispatched with the above example, threadID.x and threadID.y each run from 0 to 31 and threadID.z is 0. This means the Compute Shader will be executed 32x32x1 = 1,024 times.

If we are intending to sample a Texture2D resource passed into the shader, we can use the threadID as a parameter to the Load function to load and examine a specific pixel.

Texture2D inputTexture;
uint2 textureSize;

[numthreads(32,32,1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)
{
  if(threadID.x < textureSize.x && threadID.y < textureSize.y)
  {
     // Load takes an int3: x, y and the mip level
     float4 pixelValue = inputTexture.Load(int3(threadID.xy, 0));
  }
}

Note the need for the guard code; the number of threads launched is always a multiple of 32×32, so we need to make sure we only sample the texture within the texture's boundaries (which we explicitly pass in via the textureSize parameter).

What should we do with the sampled pixel ?


Texture2D inputTexture;
uint2 textureSize;
AppendStructuredBuffer<float3> foliageLocation;
float voxelSize;
float2 worldOffset;

[numthreads(32,32,1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)
{
  if(threadID.x < textureSize.x && threadID.y < textureSize.y)
  {
    // Emit a world-space location for this pixel. A real shader would
    // test the sampled pixel first to decide whether to plant here.
    foliageLocation.Append(float3(
      worldOffset.x + (threadID.x * voxelSize),
      0,
      worldOffset.y + (threadID.y * voxelSize)
      ));
  }
}

In the example of a Compute Shader sampling a texture it makes sense to define a sensible set of numthreads() in the shader such as 32,32,1 and then calculate the number of thread groups to dispatch within the program. It's there that we can tune the number of threads to the actual size of the texture being passed in;

  // Round up so that a partial group covers the edge of the texture
  device.ImmediateContext.Dispatch((textureSize.x + 31) / 32, (textureSize.y + 31) / 32, 1);

In the above case, for a texture size of 100×100 pixels, 16 thread groups (4×4) will be dispatched and the threadID.x and threadID.y values will go from 0 to 127. We must always bear in mind that the shader may be called with parameters outside the bounds of the Texture we are passing in, and this is even more important if the shader is accessing a data structure that is not boundary-checked; we could easily get errant results by sampling beyond an array boundary.
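
The group arithmetic is easy to get wrong by one, so here is a small Python check of it. The helper names are made up for illustration; only the 32-thread group size and the 100 and 2048 pixel sizes come from this post.

```python
def dispatch_groups(texture_size, threads_per_group=32):
    """Thread groups needed to cover texture_size pixels (rounds up)."""
    return (texture_size + threads_per_group - 1) // threads_per_group

def thread_id_range(groups, threads_per_group=32):
    """Smallest and largest SV_DispatchThreadID value in one dimension."""
    return 0, groups * threads_per_group - 1

groups = dispatch_groups(100)        # 4 groups per dimension
id_range = thread_id_range(groups)   # (0, 127)
truncated = 100 // 32                # 3: plain division misses the last 4 pixels
```

The 28 thread IDs between 100 and 127 fall outside the texture in each dimension, which is exactly what the guard code in the shader is there to reject.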

The Code

The Generation Step

There will be a lot of code examples, and they use my own SharpDX library to prevent code bloat. The referenced FoliageEffect class is simply a collection of the resources needed to complete the render, and it just exposes SharpDX resources to the render loop. Hopefully it's clear how it works; I’m happy to add more code examples if anything is not clear.

Create a constant buffer (DrapeConstants) which describes the orthogonal “drape” texture that we will use to drive the generation;

// Cull the elements generated to those we can actually see
BoundingBox bb = BoundingBox.FromPoints(eyeCamera.ViewFrustum.GetCorners());
// Create a struct to pass to the shader 
RTC5.Runtime.Landscape.RenderedComponents.FoliageEffect.DrapeConstants
  drapeConstants = new RenderedComponents.FoliageEffect.DrapeConstants();

// give a world bottom left offset
drapeConstants.DrapeSmallestXZCorner = new Vector2(
  drapeTextureWorldRectangle.Left, 
  drapeTextureWorldRectangle.LowestZValue);

// tell the shader the texture pixel size and world size
drapeConstants.DrapeTextureSize = drapeTexture.Texture.Description.Width;
drapeConstants.DrapeWorldSize = drapeTextureWorldRectangle.Width;

// tell the shader how tall we will want the generated grass
drapeConstants.BladeMaxHeight = 0.35f;

// tell the shader how far from the camera position the grass will
// be visible
drapeConstants.FoliageRadiusFromCentre = 20.0f;

// tell the shader how much of the world is currently visible. We
// will use this to restrict the zone of the texture that we need to sample
drapeConstants.VisibleBox = new Vector4(
  bb.Minimum.X, bb.Minimum.Z, bb.Maximum.X, bb.Maximum.Z);

// We will later pass in a texture array of foliage vertexes; we pass
// in here how big that array will be.
drapeConstants.TextureAtlasTileCount = 7;

// Update the buffer and pass it into the Compute Shader stage.
foliageEffect.DrapeConstantBuffer.UpdateValueAndBindToShader(
  ref drapeConstants,
  this.MonitoredGraphicsDevice,
  MonitoredGraphicsDevice.enumShaderStage.Compute);

Next bind the ‘drape’ texture and heightmap to the shader ready for sampling

// Bind a texture buffer containing the orthogonal 'drape' picture of our 
// immediate surroundings, and resource view to the compute shader stage
foliageEffect.DrapeTexture.Texture = drapeTexture.Texture;
foliageEffect.DrapeTexture.ShaderResourceView =
  drapeTexture.ShaderResourceView;
foliageEffect.DrapeTexture.BindToShader(
  this.MonitoredGraphicsDevice,
  MonitoredGraphicsDevice.enumShaderStage.Compute);

// Bind the height map and its SRV to the compute shader stage
foliageEffect.HeightMap.Texture = heightTexture.Texture;
foliageEffect.HeightMap.ShaderResourceView = heightTexture.ShaderResourceView;
foliageEffect.HeightMap.BindToShader(
   this.MonitoredGraphicsDevice,
   MonitoredGraphicsDevice.enumShaderStage.Compute);

Create a Buffer to which the generated vertexes will be appended. DirectX 11 has the concept of an append-only buffer that a compute shader can write to. Declare it like this (the value initialData is a single dimension array with enough space to fit the maximum number of grass vertexes, in theory one element per pixel; it's only used in initialization and can be discarded afterwards).


// The Vertex append buffer is an Unordered Access Buffer
SharpDX.Direct3D11.Buffer vertexAppendBuffer =
  SharpDX.Direct3D11.Buffer.Create(
    device,
    SharpDX.Direct3D11.BindFlags.UnorderedAccess |
    SharpDX.Direct3D11.BindFlags.ShaderResource,
    initialData,
    usage: SharpDX.Direct3D11.ResourceUsage.Default,
    // Will not be read back to the CPU
    accessFlags: SharpDX.Direct3D11.CpuAccessFlags.None,
    optionFlags: SharpDX.Direct3D11.ResourceOptionFlags.BufferStructured,
    structureByteStride: FoliageFinVertex.GetSize());

// Create a view on this buffer
SharpDX.Direct3D11.UnorderedAccessView vertexAppendBufferView = 
  new SharpDX.Direct3D11.UnorderedAccessView(
    device,
    vertexAppendBuffer,
    new SharpDX.Direct3D11.UnorderedAccessViewDescription()
    {
      Format = SharpDX.DXGI.Format.Unknown,
      Dimension = SharpDX.Direct3D11.UnorderedAccessViewDimension.Buffer,
      Buffer = new
      SharpDX.Direct3D11.UnorderedAccessViewDescription.BufferResource()
      {
        FirstElement = 0,
        // This represents the maximum number of elements that can be in
        // the buffer, in theory one value per pixel of the drape texture
        ElementCount = initialData.Length,
        Flags = SharpDX.Direct3D11.UnorderedAccessViewBufferFlags.Append
      }
  });

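The struct FoliageFinVertex is the data that will be generated per grass patch by the Compute Shader. My simple set of data is a position, a blade height, a blade type and a rotation; the HLSL definition appears in the shader listing further down.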

Create a Buffer which will store the number of vertexes generated.


[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct VertexCountConstants
{
   [FieldOffset(0)]
   public uint Param_VertexCountPerInstance;
   [FieldOffset(4)]
   public uint Param_InstanceCount;
   [FieldOffset(8)]
   public uint Param_StartVertex;
   [FieldOffset(12)]
   public uint Param_StartInstance;
   /// <summary>
   /// Standard binding slot of B12
   /// </summary>
   public static int BindingSlot { get { return 12; } }
}

Bind both the Append Buffer and the ‘count’ constant buffer to the shader

foliageEffect.VertexAppendBuffer.BindToShader(
   this.MonitoredGraphicsDevice,
   MonitoredGraphicsDevice.enumShaderStage.Compute,
   // The uavInitialCount sets the starting point for any Append operations.
   uavInitialCount: 0);

foliageEffect.CountConstantBuffer.BindToShader(
   this.MonitoredGraphicsDevice,
   MonitoredGraphicsDevice.enumShaderStage.Compute);

Bind all the shader stages and execute the Compute Shader

this.MonitoredGraphicsDevice.BindPixelShader(null);
this.MonitoredGraphicsDevice.BindVertexShader(null, null,
   SharpDX.Direct3D.PrimitiveTopology.TriangleList);

this.MonitoredGraphicsDevice.BindGeometryShader(null);
this.MonitoredGraphicsDevice.BindHullShader(null);
this.MonitoredGraphicsDevice.BindComputeShader(
   foliageEffect.GenerateFoliageShader);

// It's defined as being a 32x32 thread shader, so we need to execute that
// multiple times. Since we want to span our entire texture,
// we need to run it Width/32 times in each dimension - so for a texture of
// width 2048 we dispatch 64 x 64 thread groups.

this.MonitoredGraphicsDevice.GraphicsDevice.ImmediateContext.Dispatch(
   drapeTexture.Texture.Description.Width / 32,
   drapeTexture.Texture.Description.Width / 32, 1);

The Generation Compute Shader looks like this;


// The buffer containing information about the drape texture and other controlling variables
cbuffer PerDrapeBuffer : register(b11)
{
  float2 Param_DrapeSmallestXZCorner;
  float Param_DrapeWorldSize;
  float Param_DrapeTextureSize;

  float Param_BladeMaxHeight;
  float Param_TextureAtlasTileCount;
  float Param_FoliageRadiusFromCentre;
  float _filler4;

  // SmallestXZ, LargestXZ of the visible box.
  float4 Param_VisibleBox;
};

// The data we will create for each particle that we will elaborate into a billboard/imposter/fin later
struct FoliageFinVertex
{
  float3 Position;
  float BladeHeight;
  int BladeType;
  float Rotation;
};

// Scatter of foliage around a fin centre. 
static const float2 scatterKernel[8] =
{
  float2(0 , 0),
  float2(0.8418381f , -0.8170416f),
  float2(-0.9523101f , 0.5290064f),
  float2(-0.1188585f , -0.1276977f),
  float2(-0.207716f, 0.09361804f),
  float2(0.1588526f , 0.440437f),
  float2(-0.6105742f , 0.07276237f),
  float2(-0.09883061f , 0.4942337f)
};

// This is the top-down orthogonal view of the immediate surroundings
Texture2D Param_DrapeTexture : register(t10);
// This is the orthogonal height map - same world size and origin as the Drape texture
Texture2D Param_HeightTexture: register(t11);
// This is the Buffer we will add Fin Vertexes (particles) into.
AppendStructuredBuffer<FoliageFinVertex> Param_AppendBuffer: register(u0);

// This gives me 32x32 threads.
[numthreads(32, 32, 1)]
void GenerateFoliage(uint3 threadID : SV_DispatchThreadID)
{
  // For example if Dispatch(2, 2, 2) is called on a compute shader with numthreads(3, 3, 3) SV_DispatchThreadID will have a range of 
  // 0..5 for each dimension.

  // threadID.xy is the first two dimensions - the pixel to sample
  float4 drapePixel = Param_DrapeTexture.Load(int3(threadID.xy,0));

  // Typically voxel size is 0.25m. To get good coverage we generate a scatter per pixel.
  float voxelSize = Param_DrapeWorldSize / Param_DrapeTextureSize;
	
  // Need to invert y coord when sampling from the height map
  // (the last row is at index size - 1; .r takes the single height channel)
  float height = Param_HeightTexture.Load(int3(threadID.x, Param_DrapeTextureSize - 1 - threadID.y, 0)).r;
  // Generate a reference world position
  float3 worldPosition = float3(
    Param_DrapeSmallestXZCorner.x + (threadID.x * voxelSize), 
    height, 
    Param_DrapeSmallestXZCorner.y + (threadID.y * voxelSize));

  // Work out where the camera must have been to generate the drape texture we see
  float2 cameraCentre = float2(
     Param_DrapeSmallestXZCorner.x + (Param_DrapeWorldSize / 2), 
     Param_DrapeSmallestXZCorner.y + (Param_DrapeWorldSize / 2));
		
  // Guard code to make sure we are sampling within the scope of the texture. 
  if (threadID.x < (uint)Param_DrapeTextureSize && threadID.y < (uint)Param_DrapeTextureSize)
  {
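    // Make sure we are within our clipping radius
    if (length(worldPosition.xz - cameraCentre) <= Param_FoliageRadiusFromCentre)
    {
      // Make sure we are within the visible box (smallest X, smallest Z, largest X, largest Z)
      if (worldPosition.x >= Param_VisibleBox.x && worldPosition.x <= Param_VisibleBox.z
        && worldPosition.z >= Param_VisibleBox.y && worldPosition.z <= Param_VisibleBox.w)
      {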
    
        //
        // In a real example we wouldn't just create a Fin for every pixel - we'd sample the colour of the 
        // drape and other aspects which would control the planting decision. Here we're just creating a 
        // grass patch for every pixel
        //


        // Create a scatter around the single pixel sampled from the drape.
        [unroll(8)]
        for (int scatterKernelIndex = 0; scatterKernelIndex < 8; scatterKernelIndex++)
        {

          float3 finPosition = float3(worldPosition.x + (scatterKernel[scatterKernelIndex].x * voxelSize), 
                                    worldPosition.y, 
                                    worldPosition.z + (scatterKernel[scatterKernelIndex].y * voxelSize));

          FoliageFinVertex fin = (FoliageFinVertex)0;
          fin.Position = finPosition;
          //
          // We should vary the grass patch height randomly or via Simplex Noise - but for the example we'll leave them constant
          //
          fin.BladeHeight = Param_BladeMaxHeight ;
          fin.BladeType = Param_TextureAtlasTileCount - 1;
          //
          // Again; we'd rotate based on simplex noise, but for the sake of example we'll leave it at 0.
          //
          fin.Rotation = 0;
          
          // Add the fin location to the append buffer
          Param_AppendBuffer.Append(fin);
        }
      }
    }
  }
}
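
To make the planting logic easier to follow, here is a CPU-side Python sketch of the same loop. It only mirrors the shader under some assumptions: that the radius test compares against FoliageRadiusFromCentre and the box test against VisibleBox as (min X, min Z, max X, max Z), and that the height lookup can be stood in for by a hypothetical height_at callback. The function and argument names are illustrative only.

```python
import math

# Same scatter offsets as scatterKernel in the shader
SCATTER_KERNEL = [
    (0.0, 0.0),
    (0.8418381, -0.8170416),
    (-0.9523101, 0.5290064),
    (-0.1188585, -0.1276977),
    (-0.207716, 0.09361804),
    (0.1588526, 0.440437),
    (-0.6105742, 0.07276237),
    (-0.09883061, 0.4942337),
]

def generate_fins(texture_size, world_size, smallest_xz, radius,
                  visible_box, height_at=lambda tx, ty: 0.0):
    """Return the fin positions the shader would append, in scan order."""
    voxel = world_size / texture_size
    centre_x = smallest_xz[0] + world_size / 2
    centre_z = smallest_xz[1] + world_size / 2
    min_x, min_z, max_x, max_z = visible_box
    fins = []
    for ty in range(texture_size):      # one iteration per GPU thread
        for tx in range(texture_size):
            wx = smallest_xz[0] + tx * voxel
            wz = smallest_xz[1] + ty * voxel
            if math.hypot(wx - centre_x, wz - centre_z) > radius:
                continue                # outside the clipping radius
            if not (min_x <= wx <= max_x and min_z <= wz <= max_z):
                continue                # outside the visible box
            h = height_at(tx, ty)
            for kx, kz in SCATTER_KERNEL:
                fins.append((wx + kx * voxel, h, wz + kz * voxel))
    return fins

everything = generate_fins(8, 2.0, (0.0, 0.0), 100.0, (-10.0, -10.0, 10.0, 10.0))
centre_only = generate_fins(8, 2.0, (0.0, 0.0), 0.0, (-10.0, -10.0, 10.0, 10.0))
```

On the GPU the append order is undefined, so unlike this sketch the real buffer will not come out in scan order.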

The Counting Step

Great – we now have a memory buffer on the GPU filled with thousands of FoliageFinVertex structures – so now we want to render them via a VS/PS pair. But hold on a second – we don’t have a Vertex Buffer – how do we call Draw ? It takes a parameter which is the number of vertexes; how do we know how many we’ve created ?

We could drag the entire AppendBuffer back to the CPU via a Map command and count it, but this would cause a massive pipeline stall.

Luckily there is a pipeline command which allows us to copy the number of outputs generated to a much smaller GPU memory buffer. There is a fixed format for this small memory buffer; it must be uint4. We can make it less abstract by listing each field separately, showing what each uint means, by creating a Constant Buffer as follows.

// Constant Buffer into which the append buffer data has been copied
[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct VertexCountConstants 
{
  [FieldOffset(0)]
  public uint Param_VertexCountPerInstance;
  [FieldOffset(4)]
  public uint Param_InstanceCount;
  [FieldOffset(8)]
  public uint Param_StartVertex;
  [FieldOffset(12)]
  public uint Param_StartInstance;

  /// <summary>
  /// Standard binding slot of B12
  /// </summary>
  public static int BindingSlot { get { return 12; } }

}

We create a Constant Buffer to hold this counting data and call it CountConstantBuffer

Immediately after our Dispatch call to fill the AppendBuffer we call the CopyStructureCount method, which interrogates the buffer's hidden counter and writes the result to the CountConstantBuffer.

this.MonitoredGraphicsDevice.GraphicsDevice.ImmediateContext.CopyStructureCount(
   dstBufferRef: foliageEffect.CountConstantBuffer.Buffer, 
   dstAlignedByteOffset: 0, 
   srcViewRef: foliageEffect.VertexAppendBuffer.UnorderedAccessView);

Now we have a really small buffer on the GPU holding the data about how much was created by our AppendBuffer call. This is useful, but getting it back to the CPU would again stall the pipeline, and it still wouldn’t give us the number of vertexes we want to create – we want to create 6 Vertexes per particle to form our Quad billboard.

The key to using this data is to use the DrawInstancedIndirect pipeline method. This method is the same as “DrawInstanced” and takes the same parameters but instead of the parameters being passed by the Program to the shader, the shader is instructed to get the parameters from an existing GPU memory buffer.

The format of the buffer for passing these parameters is particular; it must be a uint4. Here's mine;

uint[] bufferData = new uint[4];
SharpDX.Direct3D11.Buffer drawIndirectArgumentsBuffer =
  SharpDX.Direct3D11.Buffer.Create(
    device,
    bindFlags: SharpDX.Direct3D11.BindFlags.UnorderedAccess |
               SharpDX.Direct3D11.BindFlags.ShaderResource,
    data: bufferData,
    usage: SharpDX.Direct3D11.ResourceUsage.Default,
    accessFlags: SharpDX.Direct3D11.CpuAccessFlags.None,
    optionFlags:
       SharpDX.Direct3D11.ResourceOptionFlags.DrawIndirectArguments,
    structureByteStride: sizeof(uint));

SharpDX.Direct3D11.UnorderedAccessView drawIndirectArgumentsBufferView = 
  new SharpDX.Direct3D11.UnorderedAccessView(
    device,
    drawIndirectArgumentsBuffer,
    new SharpDX.Direct3D11.UnorderedAccessViewDescription()
    {
       Format = SharpDX.DXGI.Format.R32_UInt,
       Dimension = SharpDX.Direct3D11.UnorderedAccessViewDimension.Buffer,
       Buffer = new
       SharpDX.Direct3D11.UnorderedAccessViewDescription.BufferResource()
       {
         FirstElement = 0,
         ElementCount = bufferData.Length,
         Flags = SharpDX.Direct3D11.UnorderedAccessViewBufferFlags.None
       }
    });

So how do we get our existing VertexCountConstants constant buffer into the DrawIndirectArgumentsBuffer, especially when we need to multiply it up by 6 ? Back to Compute Shaders again. Just pass both buffers to a compute shader and let it fill the DrawIndirectArgumentsBuffer from the data in the VertexCountConstants;

// Release the height and drape textures so they can be written to again.
foliageEffect.DrapeTexture.ReleaseFromShader(this.MonitoredGraphicsDevice, 
  MonitoredGraphicsDevice.enumShaderStage.Compute);

foliageEffect.HeightMap.ReleaseFromShader(this.MonitoredGraphicsDevice, 
  MonitoredGraphicsDevice.enumShaderStage.Compute);

// Bind the draw arguments buffer as a UAV so the counting compute shader
// can write the final draw parameters into it
foliageEffect.DrawIndirectArgumentsBuffer.BindToShader(
  this.MonitoredGraphicsDevice, 
  MonitoredGraphicsDevice.enumShaderStage.Compute, 
  uavInitialCount: 0);
                

// Now call the CS again to write the constant buffer variables into 
// the parameter buffer. We want to multiply up the number of 
// vertexes generated, so we need to render this data again.
this.MonitoredGraphicsDevice.BindComputeShader( 
  foliageEffect.CountVertexesShader);
this.MonitoredGraphicsDevice.GraphicsDevice.ImmediateContext.Dispatch(
   1, 1, 1);

Here's the corresponding compute shader;


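// Declarations assumed here for completeness; they mirror the
// VertexCountConstants struct (slot b12) shown earlier.
cbuffer VertexCountConstants : register(b12)
{
  uint Param_VertexCountPerInstance;
  uint Param_InstanceCount;
  uint Param_StartVertex;
  uint Param_StartInstance;
};

// The draw arguments buffer bound as a UAV (add the u-slot you bind it to)
RWBuffer<uint> Param_DispatchIndirectArguments;

[numthreads(1, 1, 1)]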
void CountVertexes(uint3 id : SV_DispatchThreadID)
{
   if (id.x == 0 && id.y == 0 && id.z == 0)
   {
     // We multiply by 6 because we want two triangles rendered for 
     // this position.
     Param_DispatchIndirectArguments[0] = Param_VertexCountPerInstance * 6;
     // InstanceCount
     Param_DispatchIndirectArguments[1] = 1;
     // StartVertex
     Param_DispatchIndirectArguments[2] = 0;
     // StartInstance
     Param_DispatchIndirectArguments[3] = 0;
   }
}
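
For reference, the four values that end up in the arguments buffer can be worked out by hand. This Python sketch just reproduces the arithmetic of the shader above and packs the result the way the 16-byte buffer lays it out; draw_indirect_args is a hypothetical helper name and little-endian packing is an assumption about the target platform.

```python
import struct

def draw_indirect_args(particle_count, vertices_per_particle=6):
    """The uint4 read by DrawInstancedIndirect, in field order."""
    return [
        particle_count * vertices_per_particle,  # VertexCountPerInstance
        1,                                       # InstanceCount
        0,                                       # StartVertexLocation
        0,                                       # StartInstanceLocation
    ]

args = draw_indirect_args(1000)
packed = struct.pack("<4I", *args)  # four 32-bit unsigned ints
```

So 1,000 appended particles turn into a single instance of 6,000 vertexes, which the vertex shader then folds back into quads.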

Finally, we come to actually rendering the billboards. I won't post my billboard VS/PS pair here as they are really standard code. The key is how this VS/PS pair is now called;

// Pass vertex buffers - none in this case.
this.MonitoredGraphicsDevice.GraphicsDevice.ImmediateContext.
  InputAssembler.SetVertexBuffers
  (
    0, 
    new SharpDX.Direct3D11.VertexBufferBinding[] { }
  );
// No indexes either
this.MonitoredGraphicsDevice.Indices = null;
            
// Need to use DrawIndirect here.
this.MonitoredGraphicsDevice.Context.DrawInstancedIndirect
  (foliageEffect.DrawIndirectArgumentsBuffer.Buffer, 0);

Because you are not passing any VertexBuffer data into the VS it looks a little different to normal;


// The vertex shader code. Simple VSPS pair
psIn_FoliagePatch vsFoliagePatch_FromAppendBuffer(uint vertexID:  SV_VertexID)
{

  // Get the correct vertex definition. Since we are creating 6 vertexes 
  // for each item in the vertexes StructuredBuffer we should divide 
  // the vertexID by 6 (integer division already rounds down).
  uint foliageFinID = vertexID / 6; 
  // finVertexID is the n'th vertex for the specific fin, from 0->5
  // later code in the VS can create an offset from the particle location
  // based on the finVertexID for the top-left, bottom-left etc vertex
  // relative positions.
  uint finVertexID = vertexID - (foliageFinID * 6);
	
  FoliageFinVertex vsInput = Param_VertexReadBuffer[foliageFinID];
  ...
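
The part elided above is where each of the 6 vertexes gets pushed out to a corner of the quad. A rough Python sketch of that idea follows; the corner table, the fixed width and the lack of any camera-facing rotation are all assumptions made for illustration, not a copy of my shader.

```python
# Two triangles per quad: (x offset in widths, y offset in blade heights)
QUAD_CORNERS = [
    (-0.5, 0.0), (-0.5, 1.0), (0.5, 1.0),   # first triangle
    (-0.5, 0.0), (0.5, 1.0), (0.5, 0.0),    # second triangle
]

def split_vertex_id(vertex_id):
    """Return (foliageFinID, finVertexID) for a raw SV_VertexID."""
    return divmod(vertex_id, 6)

def expand_vertex(vertex_id, fins, width):
    """Position of one quad vertex, given the list of appended particles."""
    fin_id, corner_id = split_vertex_id(vertex_id)
    (x, y, z), blade_height = fins[fin_id]
    offset_x, offset_y = QUAD_CORNERS[corner_id]
    return (x + offset_x * width, y + offset_y * blade_height, z)

fins = [((0.0, 0.0, 0.0), 0.35), ((2.0, 1.0, 4.0), 0.35)]
ids = split_vertex_id(7)                   # second fin, second corner
corner = expand_vertex(7, fins, width=1.0)
```

Every vertex of a quad reads the same particle, so the only per-vertex input the shader needs is the vertex ID itself.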

Conclusions

This whole method can be used to replace an existing CPU based method for passing VertexBuffers into an existing imposter/fin renderer. Its advantages over CPU variants are;

  1. It makes use of an existing orthogonal heightmap and drape
  2. It uses no CPU resources at all, and no data transfer back from the GPU
  3. It can be tuned really easily in various places