Multiple Meshes – Single Pass, Ultra Batching

Basic Batched Rendering

It's an axiom in DirectX that you should aim for as few API calls from the CPU as possible, and as few state changes as possible. The basic reasons for this are;

  1. A call from the CPU to the GPU might generate a sync-point between the two parallel processors.
  2. A call to the API is relatively expensive: it means the CPU has carried out work to prepare the call, and that unit of work is slower on the CPU than it would be on the GPU.
  3. Swapping state (depth, render mode, alpha blend) or resources (texture bindings) means that all the work currently underway on the GPU must finish before the work dependent on the state change can be done, and this harms parallelisation.

Typically this is done by ordering meshes by texture, and using Instancing to pass through the parameters for multiple instances of the same mesh. In this way you can render many thousands of meshes, passing their position, rotation and scale variations through an Instance Buffer.

If all the instance information is known at design time, this can be accomplished via a call to DrawIndexedInstanced(), passing the Index Buffer (IB), Vertex Buffer (VB) and Instance Buffer (InstB). One call, and many repeated objects.

There are variations including the Indirect family of calls, where the Instance Buffer information is calculated at runtime in a Compute Shader, with the Instance Buffer being filled dynamically. The Instance Buffer never exists on the CPU, so the call to DrawIndexedInstancedIndirect() passes only the IB, VB and a pointer to the InstB.

In all cases the Draw call needs to know how many Vertexes (or Indexes) to draw, and how many Instances. In the case of Indirect, that data is passed via (yet another) Buffer, because the CPU doesn't know how many instances will be created.
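That argument Buffer has a fixed layout: for DrawIndexedInstancedIndirect, D3D11 expects five 32-bit values in declaration order. As a minimal sketch (in Python purely for illustration; the helper name is hypothetical), packing it would look like this;

```python
import struct

def pack_indexed_indirect_args(index_count_per_instance, instance_count,
                               start_index, base_vertex, start_instance):
    """Pack the five 32-bit values DrawIndexedInstancedIndirect expects to
    find in the argument buffer, in declaration order."""
    return struct.pack("<5I", index_count_per_instance, instance_count,
                       start_index, base_vertex, start_instance)

# Five uints = 20 bytes; in the Indirect case a compute shader writes these
# directly into a GPU buffer rather than the CPU packing them.
args = pack_indexed_indirect_args(10232, 57, 0, 0, 0)
```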

Texture data can either be passed in via a Texture Atlas (a single big texture containing all the sub-textures tessellated) or via a Texture Array. The Texture Array does eliminate all the tedious mucking about required to make a Texture Atlas work, as well as the problems relating to mip-mapping of the Texture Atlas. One side effect of the Texture Array to watch out for is texture-cache misses; if a batch of vertexes swaps backward and forward between texture array indexes as they are being drawn, the GPU isn't as efficient, because the last texture it used has a fair probability of needing to be swapped out for the next texture.

This is all very efficient at rendering the same mesh (or same portion of a mesh) many times in a single call.

What if you have many different meshes though?

Ultimately you need at least one DrawIndexedInstanced call per mesh. If you have a village of 20 different types of house, that's at least 20 calls. Plus the same number again for shadows, perhaps another 20 for a depth pre-pass. Now we're up to 60 calls. Ouch.

Ultra Batched Rendering

Taking the example of 20 houses in a village: 20 different meshes and texture packs are downloaded from your friendly model provider (TurboSquid in my case). If we assume that all the textures can be pre-processed to be the same size, and can therefore live in the same Texture Array, we can achieve basic batched rendering at the cost of 20 separate calls, one per model.

In between each of the calls, we will swap out the Texture Array binding, bind a new VB and a new IB, and create an InstB. In my case the InstB will be created using a Compute Shader from a set of reference locations for the landscape tile, and culled based on visibility, so the InstB is left with only those locations which are visible.

We’ll have many binding changes, at least 20 compute shader calls, and 20 Draw calls. A draw call must have a parameter giving the number of primitives to be rendered, just as a Compute Shader call must know how many threads need to be executed – and probably for exactly the same reasons (conceptually a Vertex Shader is just a special type of Compute Shader).

This process has the following stages per model;

  1.  Set the compute shader
  2.  Set the model parameters, and candidate instance list
  3.  Dispatch the compute shader
  4.  Copy the append buffer count to a new buffer
  5.  Set a compute shader designed to create the Indirect parameters buffer
  6.  Dispatch the Indirect parameters generation compute shader
  7.  Pass the Indirect parameters buffer to a call to DrawIndirect().

Let us assume that, by some happy accident, two of the models have exactly the same number of primitives. We could merge the two models into a single large model, and all the textures into a single large texture array. When creating our InstB in the compute shader, we'll also write a property to the InstB telling the future Vertex Shader call what primitive offset and texture offset to apply to the VertexID passed into it. Our code would look a bit like this;

// Model A occupies the start of the merged buffers
effect.SetParameter("ModelInfo", new ModelInfo(startVertex: 0, startTextureArray: 0));
// Model B starts after Model A's vertexes and textures
effect.SetParameter("ModelInfo", new ModelInfo(startVertex: 10232, startTextureArray: 12));

effect.SetParameter("IndirectParameterBuffer", parameterBuffer);

.. later ..
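The bookkeeping behind those startVertex / startTextureArray values is just a running total over the concatenated models. A minimal sketch of that logic (in Python for illustration; the helper name is hypothetical);

```python
def build_model_infos(vertex_counts, texture_counts):
    """Compute the startVertex / startTextureArray offset each model gets
    when its vertices and textures are appended into the shared buffers."""
    infos, v_off, t_off = [], 0, 0
    for v, t in zip(vertex_counts, texture_counts):
        infos.append({"startVertex": v_off, "startTextureArray": t_off})
        v_off += v
        t_off += t
    return infos

# Two models: the first has 10232 vertices and 12 textures, so the second
# starts at vertex 10232 and texture slice 12 (matching the calls above).
infos = build_model_infos([10232, 9500], [12, 8])
```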

Our compute shader would look something like this;

  InstanceData instData = CandidateInstances[threadID.x];
  .. do some visibility calculations ..
  InstanceVertex instanceVertex = (InstanceVertex)0;
  instanceVertex.Position = ...
  instanceVertex.PrimitiveOffset = modelInfo.startVertex;
  instanceVertex.TextureArrayOffset = modelInfo.startTextureArray;

Each time the Vertex Shader is executed it receives VertexID and InstanceID system-value parameters (SV_VertexID and SV_InstanceID). Instead of passing the Vertex Buffer and Index Buffer using the SetVertexBuffer() and Indexes properties of the API, we will pass them as Structured Buffers and index them ourselves.

The vertex shader would look something like this;

psIn vs(uint vertexID : SV_VertexID, uint instanceID : SV_InstanceID)
{
   // First get our instance data
   InstanceVertex instanceVertex = VisibleInstancesBuffer[instanceID];
   // Now apply our primitive offset
   uint offsetVertexID = vertexID + instanceVertex.PrimitiveOffset;
   // Now read our vertex data
   ModelVertex modelVertex = ModelVertexesBuffer[offsetVertexID];
   // Now get our texture index
   int textureArrayIndex = modelVertex.TextureID + instanceVertex.TextureArrayOffset;
   .. Do normal vertex shader processing ..
}

So we now have one compute shader call per house type, but only one eventual Draw call.

Remember this is only possible because both meshes have the same number of primitives; and this is a requirement because the one Draw call must be told how many times to execute the Vertex Shader.

It's really unlikely that your models will all have the same primitive count. But with a bit of lateral thinking this approach can be made to work for any combination of arbitrary models. One draw call, no matter how many instances and how many models you have.

Model Sharding

I have no idea what this is called in the DirectX gaming community so I’ve called it “model sharding”. A model from an author will have an arbitrary number of primitives, but we need a model with a known size, so we’ll simply split the model into shards of a known size as a pre-process stage. This means we know ahead of time how many primitives will be in the Draw call, because it will always be the same amount.

The process is;

  • We choose an arbitrary number of vertexes; typically 128, but you can experiment.
  • When reading the model, its vertexes are very likely to be aligned with texture resources into blocks (faces, submeshes etc). When reading in the model, output it in blocks of 128 vertexes, making up any shortfall with padding vertexes carrying a signature value indicating they are degenerate. We’ll call this 128-vertex set a Shard; a Shard can only relate to a single texture in the Texture Array. If necessary, split blocks and emit more Shards.
  • Record the total number of Shards in the model.
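The pre-process split can be sketched like this (Python for illustration; `shard_model` and the `None` signature value are hypothetical stand-ins for the real degenerate-vertex marker);

```python
SHARD_SIZE = 128
DEGENERATE = None  # signature value marking a padding vertex

def shard_model(blocks):
    """Split a model's texture-aligned vertex blocks into fixed-size shards.
    `blocks` maps a texture-array index to that block's vertex list; each
    shard holds vertices for exactly one texture, padded with degenerate
    vertices up to SHARD_SIZE."""
    shards = []
    for texture_index, vertices in blocks.items():
        for start in range(0, len(vertices), SHARD_SIZE):
            shard = vertices[start:start + SHARD_SIZE]
            shard += [DEGENERATE] * (SHARD_SIZE - len(shard))
            shards.append((texture_index, shard))
    return shards

# A 300-vertex block becomes three shards: 128 + 128 + 44-plus-84-padding.
shards = shard_model({0: list(range(300))})
```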

The compute shader processing is more complex than before. Since the model is now split into 128-vertex shards, we need to emit an Instance Vertex per shard;

[numthreads(100, 1, 1)]
void GenerateInstanceBuffer(uint3 threadID : SV_DispatchThreadID)
{
  InstanceData instData = CandidateInstances[threadID.x];
  .. do some visibility calculations ..
  for (int shard = 0; shard < modelInfo.ShardCount; shard++)
  {
     InstanceVertex instanceVertex = (InstanceVertex)0;
     instanceVertex.Position = ...
     instanceVertex.PrimitiveOffset = modelInfo.startVertex + (shard * 128);
     instanceVertex.TextureArrayOffset = modelInfo.startTextureArray;
     VisibleInstancesBuffer.Append(instanceVertex);
  }
}



We will have a lot more VisibleInstance entries than before; in some cases many hundreds of times more.

The only other extra piece of work is to deal with the degenerate vertexes that a shard might contain. My favoured approach is to modify the W value of the Position after transformation in the vertex shader to a large value outside of the viewport depth, so they are clipped.

The nice thing about this approach is that all the hard work in the Compute stage does not affect the rendering pipeline at all, since compute shaders don't have state associated with them; and secondly, each Instance call into the vertex shader will only ever address one texture in the Texture Array, which should help with cache coherence.


I don't know yet. I'll let you know when I've coded it.



Landscape Surface Decals

A landscape surface using a LOD algorithm suffers from the problem of uniformity. The example below shows a fractal generated landscape with varying textures based on fractal noise and slope angle. It's a reasonable interpretation of a naturalistic landscape but is completely uniform over a large scale.


Real landscapes have a set of highly localised patterns on them representing manmade objects such as roads, fields etc. as well as natural ‘intrusions’ such as rocks, streams etc. On a LOD based landscape it is not possible to add vertexes into the terrain mesh that allow for detailed placement of each of these items, because the regular LOD grid would be disrupted and it's very difficult to accurately ‘insert’ new mesh vertexes representing linear or point features into an existing regular mesh.

Triangulation becomes extremely difficult to control and the difference in mesh resolution creates visible artefacts.


To resolve this we can use Decals to apply meshes on top of the landscape mesh at runtime. The procedure goes like this (when using deferred rendering);

  • Prepare a series of Decal meshes which will overlay your landscape mesh. In the above image the Blue road mesh is my Decal.
  • Calculate the decal height values using the same algorithm as you use to generate the heights in the landscape mesh.
  • During the render pass, render the landscape mesh as usual to the deferred render target.
  • Prepare a new render target ‘DecalRenderTarget’ the same size as the deferred screen render target.
  • Create a new ShaderResourceView that looks at the depth buffer used during the landscape mesh render pass, and pass that in as a texture into the Decal Mesh shader.
  • Turn off Depth checking.
  • Render the Decal mesh to the DecalRenderTarget, and within the pixel shader, consult the Depth Texture passed in which contains the depth information for landscape at that pixel; compare it with the depth (z) value that is passed into the pixel shader.
  • In the pixel shader clip the pixel if it is grossly different in height to the landscape
float sceneDepth = Param_SceneDepth.Load(int3(psInput.Position.x, psInput.Position.y, 0));
float bias = psInput.Position.z * 0.1f;
float pixelDepth = psInput.Position.z + bias;
clip(pixelDepth - sceneDepth);
  • This will ensure that pixels roughly the same height as the landscape get written to the DecalRenderTarget, but because they don't use depth buffer checking you can apply any bias you need to eliminate decal geometry which would be invisible in the landscape. The actual bias applied depends on the difference in resolution between the landscape mesh and the Decal Mesh.
  • During the Forward Rendering step, which doesn’t use a depth buffer, render the landscape mesh render target and then the DecalRenderTarget over the top using alpha clipping to eliminate any unused pixels in the DecalRenderTarget


Above: The DecalRenderTarget showing a road decal mesh and a landscape local colour mesh rendered to the empty DecalRenderTarget


Above: The DecalRenderTarget rendered on top of the standard landscape mesh.

This technique allows an arbitrary number of decal meshes to be overlaid onto a landscape. Its only weakness is that the landscape mesh it's being projected onto should be relatively uniform and the decal mesh should have roughly the same level of detail – too wide a variation and the bias calculation will need to overcorrect, leading to visual artefacts.

Grass Fin Fill Rate

The previous post illustrated a method of generating grass on the GPU using Compute Shaders. One thing that this doesn’t improve upon is the problem of Fill Rate.

The Fill Rate is the number of times a specific pixel on screen is written to by the GPU. It's really easy to see if your application is Fill Rate restricted; just shrink the viewport. If the FPS goes up then you are Fill Rate throttled.

Where the previous post talked about the use of billboards/imposters/fins to generate complex grass, it ignored the resulting fill rate impact. All those GPU generated quads had a pixel hit rate of about 10% (i.e. 90% of the texture space was alpha 0 and culled). Because clip() was being used rather than alpha blending it wasn't as bad as it could have been, but the use of clip() prevents early depth rejection. Since the GPU cannot assume, despite being in Opaque alpha mode, that any triangle is actually going to cover any other triangle, it must render them all.

There isn’t really any way around this if you are using imposters; so I created a grass model patch of opaque blades and rendered that. I looked for a good low poly model but they were all generated for use in static scenes and grass is a pretty easy model to dynamically generate;

1) Generate a 2D grid of seed points. The model size should be the same as the voxel size of the orthographic planting texture discussed in the last post, such that each pixel of that (and consequent AppendBuffer point) represents a single instance of the grass patch.
2) For each seed point, generate a 0 origin blade of 7 points – two quads with a triangle on top.
3) Pick a ‘direction’ normal based on a random X,Z value and translate the resulting blade points based on the square of the Y value; this means the grass bends towards the intended direction more acutely the taller it is, with the base quad being only slightly tilted toward the ‘direction’.
4) Rotate the blade points around the Y axis in a random rotation.
5) Calculate the normals of each triangle.
6) Translate to the model seed point.
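The steps above can be sketched as follows (Python purely for illustration; the constants and the bend/rotation maths are illustrative assumptions, and the normal calculation in step 5 is omitted);

```python
import math, random

def make_blade(seed_point, height=1.0, width=0.05, rng=random.Random(0)):
    """Build one grass blade as 7 points (two stacked quads plus an apex),
    bent towards a random direction by the square of each point's height,
    rotated randomly about Y, then translated to the seed point."""
    sx, sy, sz = seed_point
    bend = (rng.uniform(-1, 1), rng.uniform(-1, 1))  # 'direction' normal in XZ
    angle = rng.uniform(0, 2 * math.pi)              # random Y rotation
    points = []
    for level in (0.0, 0.5, 1.0):                    # the two quads' rows
        y = level * height
        lean = y * y                                 # taller => bends more
        for edge in (-width, width):
            points.append((edge + bend[0] * lean, y, bend[1] * lean))
    y = 1.2 * height                                 # apex of the top triangle
    lean = y * y
    points.append((bend[0] * lean, y, bend[1] * lean))
    # Rotate about Y, then translate to the seed point.
    c, s = math.cos(angle), math.sin(angle)
    return [(sx + x * c + z * s, sy + y, sz - x * s + z * c) for x, y, z in points]

blade = make_blade((10.0, 0.0, 4.0))
```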

Once that is created it can be set as a shader resource;

// Buffer that will be filled with the model triangle list.
struct ModelVertex
{
	float3 Position;
	float3 Normal;
	float2 TexCoord;
};
StructuredBuffer<ModelVertex> Param_ModelBuffer : register(t15);

I create three of these models by eliminating every other seed point, generating successively less dense models. While doing that I double the width of the remaining grass blades.
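That decimation can be sketched as (Python for illustration; names are hypothetical);

```python
def build_lod_models(seed_points, blade_width, levels=3):
    """Produce successively sparser grass patch models: each level drops
    every other seed point and doubles the remaining blade width."""
    lods = []
    points, width = list(seed_points), blade_width
    for _ in range(levels):
        lods.append((points, width))
        points = points[::2]      # eliminate every other seed point
        width *= 2.0              # fatten survivors to keep coverage
    return lods

lods = build_lod_models([(x * 0.25, 0.0) for x in range(16)], 0.05)
```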

To finish off the effect I blend the grass blade color into the ground color by lerping on the distance from the viewer; I do the same trick with the grass blade normal as well. This helps it blend into the flat ground texture.

You can see the effect in the YouTube video here

Unblocking the Bottleneck – Better Grass

So having tried Geometry Shader generation and simple “Big Vertex Buffer” grass generation, I stumbled on a post which illustrated a pretty common (but unknown to me) method of vertex generation on the GPU that looked like it could be the solution I was after.

I searched for some time for a really clear guide to what was going on, and although the Codemasters blog hinted at the basics, what I needed was to follow a step-by-step guide to see how it worked.

I couldn’t find one, so I thought I’d write one.


The method assumes you are generating an orthogonal (look down) image of the immediate surroundings of the eye camera in world space. I typically create two orthogonal textures per frame; one is a simple rendering of the basic landscape (without trees or other infill) and the second is a heightmap. This guide doesn’t describe creating those – it assumes you have them.

The Method

Code later; first a description of how you go about doing the work.

  1. Create a Generation Compute Shader and an Append Buffer. This will generate vertexes by examining the two orthogonal textures generated above. As it samples the two textures it will emit (append) new Vertexes for each world location where a grass blade should be, and sample the height map to work out how high it is.
  2. Create a Vertex Count Compute Shader and an Unordered Access Buffer to store the count of the vertexes generated in stage 1. This is fiddly to understand but crucial for performance.
  3. Create a Vertex and Pixel Shader pair for rendering the vertexes generated in stage 1, along with a Structured Buffer to transfer the data from the Append Buffer so it can be used.

That's it. You will of course need some grass textures (imposters/fins) to actually render, but this example will be complete when we are rendering a million quads.

The ‘trick’ to good performance with this method is that nothing must be allowed to move from the CPU to the GPU in the entire sequence of rendering. This means all of this is done on the GPU and nothing transits the memory pipeline.


You may not have come across Compute Shaders and Append Buffers before; so what are they?

A Compute Shader is an arbitrary piece of code executed on the GPU with massive parallelisation. It's important to note that the order in which your code is executed is not defined, so the outputs from the Compute Shader can come out in any order – and this matters if you are using the Compute Shader to generate mesh vertexes. Normally you would output a mesh in some kind of winding order, but a Compute Shader gives you no control over the winding order; on that basis it's not a good idea to try to generate meshes from them directly, so we output Particles which we later expand into meshes. For this reason the method shown here is sometimes termed Particle Shading.

We’re used to writing Vertex Shader code and Pixel Shader code, both of which have pretty well-defined inputs and outputs; a VS takes a Vertex from a Vertex Buffer and outputs a struct per-vertex. The values of the struct are interpolated (by the fixed-function rasteriser) to provide a per-pixel set of values for the same struct to pass into the Pixel Shader. The PS outputs a set of scalar values, typically just a float4 set of colours (although sometimes more when performing multiple render-target operations).

The key to understanding Compute Shaders is to realise the VS and PS pair are just specialised forms of Compute Shader – they have very distinct input and output expectations which they inherit from their fixed function pipeline past; but ultimately they are just Compute Shaders with added restrictions.

A Compute Shader can take its input from any Buffer passed in as a resource, and write its output to any Buffer passed in as a resource. The VS can only read from the buffer passed in as a “VertexBuffer” and the PS can only write to the resource passed in as a “RenderTarget”; they are just very restricted forms of Compute Shader.

In this post I will be using my Compute Shader to write to an AppendBuffer and read from a Texture. An AppendBuffer is an optimised memory structure which can only be written to, so is perfect for creating new geometry.

Using Compute Shaders

Thinking back to a VS/PS pair, we can easily visualise how many times the VS is going to be called, because it's simply the number of Indexes specified in the DrawIndexed call. The struct passed into the VS is the n'th item in the input buffer of structs which describe the geometry.

When we call a Compute Shader it's not operating on any specific Buffer resource that's been bound to the shader step – it could operate on one, many or none of the Buffers that it can access. The Compute Shader is simply told to execute 'n' times by the programmer using two controlling parameters.

In HLSL the Compute Shader must have the following prefix;

[numthreads(32, 32, 1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)


The actual numbers here get presented to the Compute Shader code via its input parameter threadID. This is a 3-dimensional uint which tells the code which thread the Compute Shader is currently running. In the above example threadID.x will run from 0 to 31, threadID.y from 0 to 31, and threadID.z will be 0. This means the Compute Shader will be executed 32x32x1 times for each thread group dispatched.

If we are intending to sample a Texture2D resource passed into the shader, we can use the threadID as a parameter to the Load function to load and examine a specific pixel.

Texture2D inputTexture;
uint2 textureSize;

[numthreads(32, 32, 1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)
{
  if (threadID.x < textureSize.x && threadID.y < textureSize.y)
  {
     float4 pixelValue = inputTexture.Load(int3(threadID.xy, 0));
  }
}

Note the need for the guard code; having told the shader to execute 32×32 times we need to make sure we only sample the texture within the texture's boundaries (which we explicitly pass in via the textureSize parameter).

What should we do with the sampled pixel ?

Texture2D inputTexture;
uint2 textureSize;
AppendStructuredBuffer<float2> foliageLocation;
float voxelSize;
float2 worldOffset;

[numthreads(32, 32, 1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)
{
  if (threadID.x < textureSize.x && threadID.y < textureSize.y)
  {
     float4 pixelValue = inputTexture.Load(int3(threadID.xy, 0));
     // Emit a world-space location derived from the pixel we sampled
     foliageLocation.Append(float2(
         worldOffset.x + (threadID.x * voxelSize),
         worldOffset.y + (threadID.y * voxelSize)));
  }
}

In the example of a Compute Shader sampling a texture it makes sense to define a sensible set of numthreads() in the shader, such as 32,32,1, and then calculate the best number of thread groups within the program – it's there that we can tune the number of threads to the actual size of the texture being passed in;

  device.ImmediateContext.Dispatch((textureSize.x + 31) / 32, (textureSize.y + 31) / 32, 1);

In the above case my shader, for a texture size of 100×100 pixels, will be dispatched as 16 thread groups (4×4) and the ThreadID.x and ThreadID.y values will go from 0 to 127. We must always bear in mind that the shader may be called with parameters outside the bounds of the Texture we are passing in, and this is even more important if the shader is accessing a data structure that is not boundary-checked; we could easily get errant results by sampling beyond an array boundary.
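The group-count calculation wants a ceiling division, so the groups always cover the whole texture and the guard code discards the overshoot (a sketch in Python, for illustration);

```python
def dispatch_groups(texture_size, group_size=32):
    """Number of thread groups needed per axis so that group_size-wide
    groups cover every pixel; the shader's guard code rejects the
    threads that land past the texture edge."""
    return (texture_size + group_size - 1) // group_size

# A 100x100 texture needs 4x4 = 16 groups of 32x32 threads; thread IDs run
# 0..127, so IDs 100..127 must be rejected by the guard code.
groups = dispatch_groups(100)
```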

The Code

The Generation Step

There will be a lot of code examples, and they use my own SharpDX library to prevent code bloat. The referenced FoliageEffect class is simply a collection of the resources needed to complete the render, and it just exposes SharpDX resources to the render loop. Hopefully it's clear how it works – I'm happy to add more code examples if anything is not clear.

Create a constant buffer (DrapeConstants) which describes the orthogonal “drape” texture that we will use to drive the generation;

// Cull the elements generated to those we can actually see
BoundingBox bb = BoundingBox.FromPoints(eyeCamera.ViewFrustum.GetCorners());
// Create a struct to pass to the shader 
  drapeConstants = new RenderedComponents.FoliageEffect.DrapeConstants();

// give a world bottom left offset
drapeConstants.DrapeSmallestXZCorner = new Vector2(

// tell the shader the texture pixel size and world size
drapeConstants.DrapeTextureSize = drapeTexture.Texture.Description.Width;
drapeConstants.DrapeWorldSize = drapeTextureWorldRectangle.Width;

// tell the shader how tall we will want the generated grass
drapeConstants.BladeMaxHeight = 0.35f;

// tell the shader how far from the camera position the grass will
// be visible from
drapeConstants.FoliageRadiusFromCentre = 20.0f;

// tell the shader how much of the world is currently visible. We
// will use this to restrict the zone of the texture that we need to sample
drapeConstants.VisibleBox = new Vector4(
  bb.Minimum.X, bb.Minimum.Z, bb.Maximum.X, bb.Maximum.Z);

// We will later pass in a texture atlas of foliage fin textures; we pass
// in here how many tiles that atlas contains.
drapeConstants.TextureAtlasTileCount = 7;

// Update the buffer and pass it into the Compute Shader stage.
  ref drapeConstants,

Next bind the ‘drape’ texture and heightmap to the shader ready for sampling

// Bind a texture buffer containing the orthogonal 'drape' picture of our 
// immediate surroundings, and resource view to the compute shader stage
foliageEffect.DrapeTexture.Texture = drapeTexture.Texture;
foliageEffect.DrapeTexture.ShaderResourceView = drapeTexture.ShaderResourceView;

// Bind the height map and its SRV to the compute shader stage
foliageEffect.HeightMap.Texture = heightTexture.Texture;
foliageEffect.HeightMap.ShaderResourceView = heightTexture.ShaderResourceView;

Create a Buffer which the generated vertexes will be appended to. DirectX 11 has the concept of an append-only buffer that a compute shader can write to. Declare it like this (the value initialData is a single dimension array with enough space to fit the maximum number of grass vertexes in, in theory one element per pixel – it's only used in initialisation and can be discarded afterwards).

// The Vertex append buffer is a Unordered Access Buffer
SharpDX.Direct3D11.Buffer vertexAppendBuffer =
  SharpDX.Direct3D11.Buffer.Create(device,
    bindFlags: SharpDX.Direct3D11.BindFlags.UnorderedAccess |
               SharpDX.Direct3D11.BindFlags.ShaderResource,
    data: initialData,
    usage: SharpDX.Direct3D11.ResourceUsage.Default,
    // Will not be read back to the CPU
    accessFlags: SharpDX.Direct3D11.CpuAccessFlags.None,
    optionFlags: SharpDX.Direct3D11.ResourceOptionFlags.BufferStructured,
    structureByteStride: FoliageFinVertex.GetSize());

// Create an append view on this buffer
SharpDX.Direct3D11.UnorderedAccessView vertexAppendBufferView =
  new SharpDX.Direct3D11.UnorderedAccessView(
    device,
    vertexAppendBuffer,
    new SharpDX.Direct3D11.UnorderedAccessViewDescription()
    {
      Format = SharpDX.DXGI.Format.Unknown,
      Dimension = SharpDX.Direct3D11.UnorderedAccessViewDimension.Buffer,
      Buffer = new SharpDX.Direct3D11.UnorderedAccessViewDescription.BufferResource()
      {
        FirstElement = 0,
        // This represents the maximum number of elements that can be in
        // the buffer, in theory one value per pixel of the drape texture
        ElementCount = initialData.Length,
        Flags = SharpDX.Direct3D11.UnorderedAccessViewBufferFlags.Append
      }
    });

The struct FoliageFinVertex is the data that will be generated per blade of grass by the Compute Shader (its layout appears in the shader code below).

Create a Buffer which will store the number of vertexes generated.

[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct VertexCountConstants
{
   public uint Param_VertexCountPerInstance;
   public uint Param_InstanceCount;
   public uint Param_StartVertex;
   public uint Param_StartInstance;

   /// Standard binding slot of B12
   public static int BindingSlot { get { return 12; } }
}

Bind both the Append Buffer and the ‘count’ constant buffer to the shader

device.ImmediateContext.ComputeShader.SetUnorderedAccessView(
   0, vertexAppendBufferView,
   // The uavInitialCount sets the starting point for any Append operations.
   uavInitialCount: 0);


Bind all the shader stages and execute the Compute Shader

this.MonitoredGraphicsDevice.BindVertexShader(null, null,


// Its defined as being a 32x32 thread shader, so we need to dispatch it as
// multiple thread groups. Since we want to span our entire texture,
// we need Width/32 groups in each dimension - so for a texture of width 2048
// we want 64x64 thread groups.

device.ImmediateContext.Dispatch(
   drapeTexture.Texture.Description.Width / 32,
   drapeTexture.Texture.Description.Width / 32, 1);

The Compute Shader looks like this;

// The buffer containing information about the drape texture and other controlling variables
cbuffer PerDrapeBuffer : register(b11)
{
  float2 Param_DrapeSmallestXZCorner;
  float Param_DrapeWorldSize;
  float Param_DrapeTextureSize;

  float Param_BladeMaxHeight;
  float Param_TextureAtlasTileCount;
  float Param_FoliageRadiusFromCentre;
  float _filler4;

  // SmallestXZ, LargestXZ of the visible box.
  float4 Param_VisibleBox;
};

// The data we will create for each particle that we will elaborate into a billboard/imposter/fin later
struct FoliageFinVertex
{
  float3 Position;
  float BladeHeight;
  int BladeType;
  float Rotation;
};

// Scatter of foliage around a fin centre.
static const float2 scatterKernel[8] =
{
  float2(0 , 0),
  float2(0.8418381f , -0.8170416f),
  float2(-0.9523101f , 0.5290064f),
  float2(-0.1188585f , -0.1276977f),
  float2(-0.207716f, 0.09361804f),
  float2(0.1588526f , 0.440437f),
  float2(-0.6105742f , 0.07276237f),
  float2(-0.09883061f , 0.4942337f)
};

// This is the top-down orthogonal view of the immediate surroundings
Texture2D Param_DrapeTexture : register(t10);
// This is the orthogonal height map - same world size and origin as the Drape texture
Texture2D Param_HeightTexture: register(t11);
// This is the Buffer we will add Fin Vertexes (particles) into.
AppendStructuredBuffer<FoliageFinVertex> Param_AppendBuffer: register(u0);

// This gives me 32x32 threads.
[numthreads(32, 32, 1)]
void GenerateFoliage(uint3 threadID : SV_DispatchThreadID)
{
  // For example if Dispatch(2, 2, 2) is called on a compute shader with numthreads(3, 3, 3)
  // SV_DispatchThreadID will have a range of 0..5 for each dimension.

  // threadID.xy is the first two dimensions -
  float4 drapePixel = Param_DrapeTexture.Load(int3(threadID.xy, 0));

  // Typically voxel size is 0.25m. To get good coverage we generate a scatter per pixel.
  float voxelSize = Param_DrapeWorldSize / Param_DrapeTextureSize;
  // Need to invert y coord when sampling from the height map
  float height = Param_HeightTexture.Load(int3(threadID.x, Param_DrapeTextureSize - threadID.y, 0)).r;
  // Generate a reference world position
  float3 worldPosition = float3(
    Param_DrapeSmallestXZCorner.x + (threadID.x * voxelSize),
    height,
    Param_DrapeSmallestXZCorner.y + (threadID.y * voxelSize));

  // Work out where the camera must have been to generate the drape texture we see
  float2 cameraCentre = float2(
     Param_DrapeSmallestXZCorner.x + (Param_DrapeWorldSize / 2),
     Param_DrapeSmallestXZCorner.y + (Param_DrapeWorldSize / 2));

  // Guard code to make sure we are sampling within the scope of the texture.
  if (threadID.x < (uint)Param_DrapeTextureSize && threadID.y < (uint)Param_DrapeTextureSize)
  {
    // Make sure we are within our clipping radius and the visible box
    if (length(worldPosition.xz - cameraCentre) < Param_FoliageRadiusFromCentre
        && worldPosition.x >= Param_VisibleBox.x && worldPosition.x <= Param_VisibleBox.z
        && worldPosition.z >= Param_VisibleBox.y && worldPosition.z <= Param_VisibleBox.w)
    {
        // In a real example we wouldnt just create a Fin for every pixel - we'd sample the colour of the
        // drape and other aspects which would control the planting decision. Here were just creating a
        // grass patch for every pixel.

        // Create a scatter around the single pixel sampled from the drape.
        for (int scatterKernelIndex = 0; scatterKernelIndex < 8; scatterKernelIndex++)
        {
          float3 finPosition = float3(
            worldPosition.x + (scatterKernel[scatterKernelIndex].x * voxelSize),
            worldPosition.y,
            worldPosition.z + (scatterKernel[scatterKernelIndex].y * voxelSize));

          FoliageFinVertex fin = (FoliageFinVertex)0;
          fin.Position = finPosition;
          // We should vary the grass patch height randomly or via Simplex Noise - but for the example we'll leave them constant
          fin.BladeHeight = Param_BladeMaxHeight;
          fin.BladeType = (int)Param_TextureAtlasTileCount - 1;
          // Again, we'd rotate based on simplex noise, but for the sake of example we'll leave it at 0.
          fin.Rotation = 0;
          // Add the fin location to the append buffer
          Param_AppendBuffer.Append(fin);
        }
    }
  }
}

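For readers without a DirectX setup, the planting loop above can be sketched in plain Python; the kernel values and function names below are my own illustration, not taken from the shader:

```python
# Hypothetical 8-entry scatter kernel, offsets in the range [-0.5, 0.5]
# of a voxel around the sampled drape pixel.
SCATTER_KERNEL = [
    (-0.40, -0.30), (0.10, -0.45), (0.35, -0.10), (-0.25, 0.20),
    (0.45, 0.40), (-0.05, 0.50), (0.20, 0.15), (-0.50, -0.05),
]

def plant_fins(pixel_x, pixel_y, smallest_xz_corner, voxel_size, height):
    """Return 8 world-space fin positions scattered around one drape pixel."""
    world_x = smallest_xz_corner[0] + pixel_x * voxel_size
    world_z = smallest_xz_corner[1] + pixel_y * voxel_size
    return [
        (world_x + kx * voxel_size, height, world_z + kz * voxel_size)
        for kx, kz in SCATTER_KERNEL
    ]
```

A real planting decision would also consult the drape colour before appending, as the shader comment notes.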
The Counting Step

Great – we now have a memory buffer on the GPU filled with thousands of FoliageFinVertex structures – so now we want to render them via a VS/PS pair. But hold on a second – we don't have a Vertex Buffer, so how do we call Draw? It takes a parameter which is the number of vertexes; how do we know how many we've created?

We could drag the entire AppendBuffer back to the CPU via a Map command and count it, but this would cause a massive pipeline stall.

Luckily there is a pipeline command, CopyStructureCount, which copies the number of structures appended into a much smaller GPU memory buffer. There is a fixed format for this small memory buffer – it is four uints. We can make it less abstract by listing each field separately, showing what each uint means, by creating a Constant Buffer as follows.

// Constant Buffer into which the append buffer data has been copied
[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct VertexCountConstants 
{
  [FieldOffset(0)]
  public uint Param_VertexCountPerInstance;
  [FieldOffset(4)]
  public uint Param_InstanceCount;
  [FieldOffset(8)]
  public uint Param_StartVertex;
  [FieldOffset(12)]
  public uint Param_StartInstance;

  /// <summary>
  /// Standard binding slot of B12
  /// </summary>
  public static int BindingSlot { get { return 12; } }
}

We create a Constant Buffer to hold this counting data and call it CountConstantBuffer.

Immediately after our Dispatch call to fill the AppendBuffer, we call the CopyStructureCount method, which interrogates the append buffer's hidden counter and writes the result to the CountConstantBuffer.

// (device context variable name assumed)
context.CopyStructureCount(
   dstBufferRef: foliageEffect.CountConstantBuffer.Buffer, 
   dstAlignedByteOffset: 0, 
   srcViewRef: foliageEffect.VertexAppendBuffer.UnorderedAccessView);

Now we have a really small buffer on the GPU holding the count of structures our AppendBuffer call created. This is useful, but getting it back to the CPU would again stall the pipeline, and it still wouldn't give us the number of vertexes we want to create – we want to create 6 vertexes per particle to form our quad billboard.

The key to using this data is the DrawInstancedIndirect pipeline method. This method is the same as DrawInstanced and takes the same parameters, but instead of the parameters being passed in by the program, the GPU reads them from an existing GPU memory buffer.

The format of the buffer for passing these parameters is particular – it must be four uints, and the buffer must be created with the DrawIndirectArguments option flag. Here's mine;

uint[] bufferData = new uint[4];
SharpDX.Direct3D11.Buffer drawIndirectArgumentsBuffer =
  SharpDX.Direct3D11.Buffer.Create(
    device,
    bindFlags: SharpDX.Direct3D11.BindFlags.UnorderedAccess,
    data: bufferData,
    usage: SharpDX.Direct3D11.ResourceUsage.Default,
    accessFlags: SharpDX.Direct3D11.CpuAccessFlags.None,
    // Required so the buffer can feed DrawInstancedIndirect
    optionFlags: SharpDX.Direct3D11.ResourceOptionFlags.DrawIndirectArguments,
    structureByteStride: sizeof(uint));

SharpDX.Direct3D11.UnorderedAccessView drawIndirectArgumentsBufferView = 
  new SharpDX.Direct3D11.UnorderedAccessView(
    device,
    drawIndirectArgumentsBuffer,
    new SharpDX.Direct3D11.UnorderedAccessViewDescription()
    {
       Format = SharpDX.DXGI.Format.R32_UInt,
       Dimension = SharpDX.Direct3D11.UnorderedAccessViewDimension.Buffer,
       Buffer = new SharpDX.Direct3D11.UnorderedAccessViewDescription.BufferResource()
       {
         FirstElement = 0,
         ElementCount = bufferData.Length,
         Flags = SharpDX.Direct3D11.UnorderedAccessViewBufferFlags.None
       }
    });

So how do we get our existing VertexCountConstants constant buffer into the DrawIndirectArgumentsBuffer, especially when we need to multiply it up by 6? Back to Compute Shaders again. Just pass both buffers to a compute shader and let it fill the DrawIndirectArgumentsBuffer from the data in the VertexCountConstants;

// Release the height and drape textures so they can be written to again.
// (the context variable and resource slots below are assumed)
context.ComputeShader.SetShaderResources(0, new SharpDX.Direct3D11.ShaderResourceView[] { null, null });

// This causes the counter from the vertex append buffer to be written to 
// the appendCountConstantBuffer
context.CopyStructureCount(
   foliageEffect.CountConstantBuffer.Buffer, 0,
   foliageEffect.VertexAppendBuffer.UnorderedAccessView);

// Bind the indirect arguments buffer as a UAV so the counting shader can write it
context.ComputeShader.SetUnorderedAccessView(0,
   drawIndirectArgumentsBufferView,
   uavInitialCount: 0);

// Now call the CS again to write the constant buffer variables into 
// the parameter buffer. We want to multiply up the number of 
// vertexes generated, so we need to render this data again.
context.Dispatch(1, 1, 1);

Here's the corresponding compute shader;

[numthreads(1, 1, 1)]
void CountVertexes(uint3 id : SV_DispatchThreadID)
{
   if (id.x == 0 && id.y == 0 && id.z == 0)
   {
     // We multiply by 6 because we want two triangles rendered for 
     // this position.
     Param_DispatchIndirectArguments[0] = Param_VertexCountPerInstance * 6;
     // InstanceCount
     Param_DispatchIndirectArguments[1] = 1;
     // StartVertex
     Param_DispatchIndirectArguments[2] = 0;
     // StartInstance
     Param_DispatchIndirectArguments[3] = 0;
   }
}

Eventually, we come onto actually rendering the billboards. I won't post my billboard VS/PS pair here as they are really standard code. The key is how this VS/PS pair is now called;

// Pass vertex buffers - none in this case. (context name assumed)
context.InputAssembler.SetVertexBuffers(0,
    new SharpDX.Direct3D11.VertexBufferBinding[] { });
// No indexes either
this.MonitoredGraphicsDevice.Indices = null;
// Need to use DrawInstancedIndirect here.
context.DrawInstancedIndirect
  (foliageEffect.DrawIndirectArgumentsBuffer.Buffer, 0);

Because you are not passing any VertexBuffer data into the VS, it looks a little different to normal;

// The vertex shader code. Simple VS/PS pair
psIn_FoliagePatch vsFoliagePatch_FromAppendBuffer(uint vertexID : SV_VertexID)
{
  // Get the correct vertex definition. Since we are creating 6 vertexes 
  // for each item in the vertexes StructuredBuffer we should divide 
  // the vertexID by 6 (uint division already rounds down).
  uint foliageFinID = vertexID / 6; 
  // finVertexID is the n'th vertex for the specific fin, from 0->5
  // later code in the VS can create an offset from the particle location
  // based on the finVertexID for the top-left, bottom-left etc vertex
  // relative positions.
  uint finVertexID = vertexID - (foliageFinID * 6);
  FoliageFinVertex vsInput = Param_VertexReadBuffer[foliageFinID];
  // ... the rest of the VS expands vsInput into the billboard corner
  // selected by finVertexID.
}

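The integer arithmetic in that VS is easy to sanity-check offline; this small Python sketch mirrors those two lines:

```python
def fin_ids(vertex_id):
    """Map a flat SV_VertexID onto (fin index, corner index 0-5),
    six vertexes per quad billboard (two triangles)."""
    foliage_fin_id = vertex_id // 6                   # integer division, no floor() needed
    fin_vertex_id = vertex_id - foliage_fin_id * 6    # same as vertex_id % 6
    return foliage_fin_id, fin_vertex_id
```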

This whole method can be used to replace an existing CPU-based method of passing VertexBuffers into an existing Imposter/fin renderer. Its advantages over CPU variants are;

  1. Making use of an existing orthogonal heightmap and drape
  2. Uses no CPU resources at all, and no data transfer back from the GPU
  3. Can be tuned really easily in various places

Geometry Shader Woes

So, generating lots of billboards and grass; Geometry Shader, right?

Depends on the GPU. I have a nice implementation of GS-generated grass and billboards based on instanced GS expansion of 2D points passed in a VB. Expansion of the points is done in two stages; by an instance buffer on the CPU giving the world location of a mesh of points, and then by the GPU GS instancing.

The results on a powerful GPU are good; less transfer of data to the GPU and substantially smaller models. Also the elimination of unneeded grass vertexes in the GS (depending on the result of a fractal noise calc) meant no need to pass in and eliminate degenerate triangles.

On a slightly lower-spec card it tanked. Performance was awful. Changing the code to a (much) larger model grass patch and passing it through a simple VS/PS shader was faster by a factor of 4. There were a large number of degenerate triangles discarded between the VS and PS, which in the GS would never even have been generated, but still "the stopwatch never lies".

Vertex Buffer Normals or HeightMap Generated ?

Simple answer; Vertex Buffer is faster.

Calculating normals in the pixel or vertex shader requires a minimum of three linearly interpolated texture taps per normal, plus the cost of keeping and passing in a heightmap texture. Sampling a texture is one of the more expensive operations. For the cost of 12 bytes per vertex, I found that passing in a precalculated normal was faster, especially on less powerful cards.

Since it was necessary to carry out a state change for a landscape tile anyway, to pass in the tile's VB, it seemed a good bet that passing in a new heightmap texture wouldn't hurt performance that much, but it did.

The CPU->GPU bandwidth required for the heightmap is 4 bytes per vertex, against 12 bytes for the Vertex Buffer normals, but with the Vertex Buffer the interpolation is done for free on the VS->PS boundary by the automatic attribute interpolation, whereas with the heightmap method you have to do the calculations yourself, with the additional cost of the texture sampling.
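To make the trade-off concrete, here is the kind of 3-tap normal reconstruction the heightmap approach needs, sketched in Python (my own naming); in a shader each sample() call is a texture fetch, which is exactly the cost the 12-byte precalculated normal avoids:

```python
import math

def heightmap_normal(sample, x, z, spacing=1.0):
    """Forward-difference normal from a height function: three height taps,
    two tangent vectors, one cross product (simplified below)."""
    h0 = sample(x, z)
    hx = sample(x + 1, z)  # tap in +X
    hz = sample(x, z + 1)  # tap in +Z
    # cross((0, hz-h0, spacing), (spacing, hx-h0, 0)) reduces, up to scale, to:
    n = (h0 - hx, spacing, h0 - hz)
    length = math.sqrt(n[0] ** 2 + n[1] ** 2 + n[2] ** 2)
    return tuple(c / length for c in n)
```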



Global World

Long time with no posts …

I wondered how big I could make my world. Could I make it global? What about the world generation time – how long would that take?

What if it was nil, and the world was entirely procedurally generated ?

The components are already there;

  • Diamond square height generation
  • Trees, grass, rivers and erosion
  • DirectX11

There were two key things to overcome

  1. The system, once running, must have a baseline memory allocation that does not grow – consequently all items must be able to be regenerated on demand.
  2. The performance of key items like the diamond-square fractal must be really fast.

Regenerating Landscape

The first problem is solved by having a clear dependency concept behind any renderable object; small tiles need bigger tiles so they can be generated from the parent heights; trees need a heightmap so they can be located in the world; rivers need heightmaps (and need to be able to deform them). Linking all this up with the scene-composition engine, which had previously been able to assume all dependencies were available in the pre-generated landscape store, was a big engineering challenge. The important structural changes were;

  • No component can demand a world resource, they can only request a world resource
  • Code must be resilient to a resource not being available
  • Resource requests may cascade a number of further resource requests which may be carried out over multiple frames

Heightmap Generation Performance

I need the heightmap data on the CPU so I can query the meshes for runtime generation of trees, and pretty much anything else that needs to be height dependent, including the generation of a tile's vertex buffer. The CPU performance of the fractal-based diamond-square algorithm was just about OK, but the real issue came when trying to manipulate the resultant heightfield to overlay deformations (rivers, roads, building-area platforms etc). The time required to query every heightmap point against a large set of deformation meshes was not acceptable.

The answer, like all things DirectX, was to use the shader to implement my diamond square fractal height generation. The steps to being able to implement this were;

  1. Read the undeformed parent height field on the CPU.
  2. Prepare a texture for the child heightfield from one of the quadrants of the parent undeformed heightfield with every other pixel left empty, to be filled in the shader.
  3. Call the height generation shader passing the child height texture, and execute the diamond square code that fills in the missing pixels by reference to adjacent pixels.
  4. Record the output texture and read the data into the child tile class as the undeformed height map
  5. From another data structure describing landscape features like roads and rivers, obtain a vertex buffer which contains the deformations that the feature requires in terms of heightmap offsets
  6. Render the deformation vertex buffers over the top of the child heightmap
  7. Read back the newly deformed heightmap to the child tile CPU, to be used as the ‘true’ heightmap for all subsequent height queries and mesh generation.
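The fill in step 3 is a standard diamond-square refinement; here is a minimal CPU sketch of one pass in Python (my own naming, working on an in-memory grid rather than a texture):

```python
import random

def diamond_square_step(grid, step, roughness, rng):
    """One refinement pass of diamond-square: fill the 'missing' points
    between known points by averaging neighbours and adding jitter."""
    n = len(grid)
    half = step // 2
    # Diamond step: centre of each square from its four corners
    for y in range(half, n, step):
        for x in range(half, n, step):
            avg = (grid[y - half][x - half] + grid[y - half][x + half] +
                   grid[y + half][x - half] + grid[y + half][x + half]) / 4.0
            grid[y][x] = avg + rng.uniform(-roughness, roughness)
    # Square step: edge midpoints from their in-bounds neighbours
    for y in range(0, n, half):
        for x in range((y + half) % step, n, step):
            total, count = 0.0, 0
            for dy, dx in ((-half, 0), (half, 0), (0, -half), (0, half)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < n and 0 <= xx < n:
                    total += grid[yy][xx]
                    count += 1
            grid[y][x] = total / count + rng.uniform(-roughness, roughness)
```

Halving `step` and calling again refines the grid further; the GPU version does the same averaging per missing pixel in a shader.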

All tiles have both a deformed and an undeformed heightmap data array stored. It took a long while to get to this solution; ultimately the problem was that the diamond-square algorithm can only produce a new value with reference to the existing parent values – so it generates a very pleasant 'random' landscape, but it doesn't allow for erosion, rivers, linear features, or any other absolute changes in height.

By storing the raw output of the diamond-square algorithm, any deformations I need can be applied over the top of the raw heightfield to give the same perceived results at any resolution. Since my tile heightfields are only 129×129 pixels it's not a lot of memory.

I immediately hit the problem of pipeline stalling when reading back the rendered heightfield data to the CPU, but a 2 frame delay injected between rendering the heightfield and reading it back was sufficient to remove the stutter. This problem is well documented and relates to the underlying architecture and programming model of GPUs – although the programmer issues commands to the GPU, these are stored in a queue and only executed when the GPU needs them to be, often several frames later than the programmer thinks. If the programmer reads data from the GPU back to the CPU, the GPU must first execute all the stored commands before it can return the required data, losing all the benefit of the parallel execution of the GPU with respect to the CPU. There is no DirectX 11 API for calling the CPU back when a given GPU resource is available for reading (the closest is polling an event query), so most programmers just wait two frames and then retrieve it – it seems to work for me.
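The two-frame delay can be wrapped in a tiny queue; a sketch of the pattern in Python (names are mine):

```python
from collections import deque

class DelayedReadback:
    """Hold GPU resources submitted for readback until `delay` frames
    have elapsed, by which point the GPU has (almost certainly) finished
    writing them and mapping them will not stall the pipeline."""
    def __init__(self, delay=2):
        self.delay = delay
        self.pending = deque()   # (frame_submitted, resource)

    def submit(self, frame, resource):
        self.pending.append((frame, resource))

    def ready(self, current_frame):
        """Return resources submitted at least `delay` frames ago."""
        done = []
        while self.pending and current_frame - self.pending[0][0] >= self.delay:
            done.append(self.pending.popleft()[1])
        return done
```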






Shading Foliage

The majority of tree foliage is made of simplistic models where the leaves are simple quads of leaf textures. This enables the creation of lots of leaves with very few quads. A typical tree model is shown below, with face normals projecting from each leaf quad.


This gives a reasonable effect, but when lit with a directional light things go wrong. Firstly, these models are all drawn with back-face culling disabled – each leaf plane is a single quad seen from both sides, which saves a huge bunch of vertex data. Unfortunately that means each plane has only one normal, so seen from the reverse it has the same lighting reflectivity as when seen from the front.

To counter this it's normal to check the face orientation in the pixel shader, and to invert the normal before calculating the light. In the pixel shader definition we can make use of the SV_ system-value semantics, which always exist but must be declared as pixel shader inputs to be examined. You don't have to pass these SV_ values out of the vertex shader to accept them into the pixel shader.

PixelToSurface psTextureBump(psMeshIn psInput, bool isFrontFace : SV_IsFrontFace)
{
  // ...
  // Reverse the normal if it's facing away from us.
  if (!isFrontFace)
    normal *= -1.0f; // rendering the back face, so I invert the normal
  // ...
}

So now we have 'accurate' normals. But wait … there's a problem; if we use typical directional lighting code the underside of the leaves will now be dark and the upper face light. This isn't what happens when we look at trees – they are somewhat translucent.

A second problem is that normal-based lighting models are based on reflectivity – a plane facing the light is lit 100%; a plane at 45 degrees to the light reflects about 71% of the light (the cosine of the angle), and a plane 60 degrees from the light only 50%. When you look at the tree image above it's clear that for a light source anywhere near horizontal, most of the leaf quads are at an extreme angle to the light source and are rendered very dark – in fact only the very few nearly vertical ones are rendered with realistic levels of reflectivity.

When you consider a tree it's clear that the leaves, although arranged on branches, don't align with the branch in any way that is easily described by a single normal. In fact a tree acts as a fractal in terms of reflection – pretty much all of the visible area of a tree has a good percentage of its leaf cover facing the viewer, no matter what the angle of the branches is.

To get a completely accurate lighting model would require the leaves to be individually rendered with normals, an impossible task.

A good approximation of tree lighting can be done by simply shading a tree's pixels based on the depth from the light source – the 'back' of a tree is darker than the 'front' of a tree when viewed from the light source. To calculate this, we use a model which is centred on 0,0 and spans -1 to +1 on each axis. I happen to store all my models scaled to this range anyway, so I can scale them at runtime without reference to their 'natural' size.

The following code snippet shows how I get the depth of a model


// Rotate vector around Y axis
float3 rotateY(float3 v, float angle)
{
  float s = sin(angle);
  float c = cos(angle);
  return float3(c*v.x + s*v.z, v.y, -s*v.x + c*v.z);
}

// Assuming a model XZ centred around 0,0 and of a scale -1 -> +1 this returns the
// distance from the edge of the model when seen from the light's point of view.
// The value can be used to depth-shade a model. Returns 0 for the nearest point,
// 1 for the most distant point.
float modelDepthFromLight(float3 modelPosition)
{
  // Rotate so that the model faces the light source. The light direction is a
  // normalized vector, so atan2 recovers its yaw angle.
  float lightAngle = atan2(-Param_DirectionalLightDirection.x, -Param_DirectionalLightDirection.z);
  modelPosition = rotateY(modelPosition, lightAngle);
  // reverseLerp maps z from [-1, 1] to [0, 1], i.e. (z + 1) * 0.5
  float distToOrigin = reverseLerp(modelPosition.z, -1.0f, 1.0f);
  return 1.0f - distToOrigin;
}

By applying this as an input to the pixel shader I can choose to darken pixels which are further away from the light within the model's bounding sphere. Although this isn't 'proper' lighting, it does a good job of simulating self-shadowing, and works best on trees that are very symmetrical. If they are asymmetrical you can notice that a branch at the back of a tree is darkened but with no obvious tree cover in front of it to create that shadow.
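The arithmetic in modelDepthFromLight can be verified outside the shader; here is a direct Python transcription (with reverseLerp replaced by its presumed definition, (z + 1) / 2):

```python
import math

def rotate_y(v, angle):
    """Rotate a 3-vector around the Y axis, matching the HLSL rotateY."""
    s, c = math.sin(angle), math.cos(angle)
    return (c * v[0] + s * v[2], v[1], -s * v[0] + c * v[2])

def model_depth_from_light(model_position, light_direction):
    """0 for the model point nearest the light, 1 for the most distant,
    for a model bounded by -1..+1 on each axis."""
    light_angle = math.atan2(-light_direction[0], -light_direction[2])
    p = rotate_y(model_position, light_angle)
    dist_to_origin = (p[2] + 1.0) * 0.5   # reverseLerp(p.z, -1, 1)
    return 1.0 - dist_to_origin
```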


Here are fir trees with the light source coming from top right. The trees are clearly darker on the left than the right, and this can be seen nicely when viewed from behind.


Pleasingly some of the edge leaves are picking up the light, but the mass of the tree is in darkness. This is a much cheaper way of simulating self-shadowing and works within reason – however for very asymmetric models it does give some artefacts.

Better Models and Z-Fighting

I’ve added some better models now from TurboSquid (they do a nice range of <£5.00 medieval buildings) and added them to the landscape. Still not convincing but much better on the eye than the previous ones from the Sketchup store.


One consequence of using professionally constructed models was immediate and extensive Z-fighting. This is the well-known problem where nearly co-planar surfaces interfere with each other, flipping backwards and forwards between frames, because there is not enough depth separation between them.

The problem is that model developers will tend to overlay textured quads on top of the basic model shape to create things like windows, doors and roof textures. It takes fewer vertexes to add a simple slim box on the side of a building and paste the window texture on it than to insert a window frame into the house mesh. Unfortunately the underlying house box model still exists beneath the window box. The diagram shows how a thin window box has been added to the house box (seen from the side).

This looks great in a modelling application, but in a real-world rendering system there just isn't enough differentiation between the plane of the window and the plane of the house. The consequence is that the window keeps flicking between window and wall. This happens because the depth buffer, which keeps track of how far away from the camera a particular pixel is, is typically stored as a 24-bit number, and that number represents the proportional distance between the near clipping plane and the far clipping plane that the pixel lies on. In a modelling application the near and far planes are going to encompass a very short distance; in a real-world application it could be up to 20,000 meters.

This proportional distance is stored non-linearly (it is proportional to 1/z), on the reasonable basis that the further something is from the camera, the less likely any depth differences are going to be visible to the end user. There is a huge amount of literature explaining Z-fighting on the web. The fixes for it are either;

  1. Avoiding using the hardware depth buffer entirely and write out your own calculation for depth into a texture buffer, and recycle that texture buffer back out on each render so the pixel shader can check the current pixel depth by sampling it.
  2. Do something else.

(1) is not recommended because hardware depth buffering is extremely fast and often takes place before your pixel shader is run, eliminating whole triangles where all three of their vertexes are further away than the recorded pixel depth.

So that leaves (2).

The simplest way I found to achieve a massive reduction in Z-fighting was the technique of reversing the depth buffer. There are three stages to this;

  1. When calculating the projection matrix, simply pass in the Far and Near clip values in reverse order;
  2. When clearing the depth buffer, clear the depth value to 0, not 1
  3. When setting the depth comparison function, use GREATER_THAN not LESS_THAN

Step 1

float nearClip = ReverseDepthBuffer ? this.FarClippingDistance : this.NearClippingDistance;

float farClip = ReverseDepthBuffer ? this.NearClippingDistance : this.FarClippingDistance;

if (this.CameraType == enumCameraType.Perspective)
{
    // (field-of-view and aspect-ratio property names assumed)
    return Matrix.PerspectiveFovLH(
        this.FieldOfView,
        this.AspectRatio,
        nearClip,
        farClip);
}

Step 2

if (reverseDepthBuffer)
{
    this.Context.ClearDepthStencilView(dsv, SharpDX.Direct3D11.DepthStencilClearFlags.Depth, 0, 1);
}
else
{
    this.Context.ClearDepthStencilView(dsv, SharpDX.Direct3D11.DepthStencilClearFlags.Depth, 1, 0);
}

I found this entirely eliminated my Z-fighting by more smoothly distributing the available depth precision over the field of view. This is very nicely illustrated in this NVIDIA article.


Settlement Topology with Grass and Trees

I’ve now combined the settlement generation with the grass generation I outlined way back. Although the buildings are still cartoonish, and the grass too uniform, the general outline is OK. A problem persists with the blending of the four different densities of grass (I use the Outerra method of doubling the grass blade width whilst halving the number of blades) – it's easy to see in the YouTube video. By adjusting the terrain base texture with the same mottling interval as the grass planting, distant grass sections don't need to be painted – they just (almost) seamlessly grow out of ground of the same colour as the grass blades.
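The Outerra-style density halving can be expressed in a couple of lines; this Python sketch (my own naming) shows why the rings can blend at all – the total covered width stays constant across levels:

```python
def grass_lod(base_blade_count, base_blade_width, level):
    """Each distance ring halves the blade count and doubles the blade
    width, so count * width (the total covered strip) is invariant."""
    count = base_blade_count >> level         # halve per level
    width = base_blade_width * (1 << level)   # double per level
    return count, width
```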

One thing to work on is shadowing in grass – there are too many polygons being generated to depth-test every one of them – so I think I need to add a 'shadow pass' to generate a shadow map of the current scene, not just a depth map. At the moment a tree's shadow affects the underlying landscape but doesn't affect the grass planted on it.