Grass Fin Fill Rate

The previous post illustrated a method of generating grass on the GPU using Compute Shaders. One thing that this doesn’t improve upon is the problem of Fill Rate.

Fill Rate here means the number of times a specific pixel on screen is written to by the GPU. It’s really easy to see if your application is Fill Rate restricted; just shrink the viewport. If the FPS goes up then you are Fill Rate throttled.

Where the previous post talked about the use of billboards/imposters/fins to generate complex grass, it ignored the resulting fill rate impact. All those GPU generated quads had a pixel hit rate of about 10% (i.e. 90% of the texture space was alpha 0 and culled). Because clip() was being used rather than alpha blending it wasn’t as bad as it could have been, but the use of clip() prevents the GPU from rejecting hidden pixels early, before the pixel shader runs. Since it cannot assume, despite being in Opaque alpha mode, that any triangle is actually going to cover any other triangle, it must shade them all.

There isn’t really any way around this if you are using imposters; so I created a grass model patch of opaque blades and rendered that. I looked for a good low poly model but they were all generated for use in static scenes and grass is a pretty easy model to dynamically generate;

1) Generate a 2D grid of seed points. The model size should be the same as the voxel size of the orthographic planting texture discussed in the last post, such that each pixel of that (and consequent AppendBuffer point) represents a single instance of the grass patch.
2) For each seed point, generate a 0 origin blade of 7 points – two quads with a triangle on top.
3) Pick a ‘direction’ normal based on a random X,Z value and translate the resulting blade points by an amount based on the square of their Y value; this means the grass bends towards the intended direction more acutely the taller it is, with the base quad being only slightly tilted toward the ‘direction’.
4) Rotate the blade points around the Y axis in a random rotation.
5) Calculate the normals of each triangle.
6) Translate to the model seed point.
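
The steps above can be sketched on the CPU as follows. This is a minimal Python sketch rather than the engine's code: the names `make_blade` and `make_patch`, the blade dimensions, the bend strength and the use of `random.Random` are all illustrative assumptions, and step 5 (normals) is left out to keep it short.

```python
import math
import random

def make_blade(seed_x, seed_z, rng, height=0.35, width=0.02, bend=0.3):
    """Return the 7 points of one blade: two stacked quads plus a tip."""
    # Step 2: a zero-origin blade; three left/right pairs and a single tip.
    points = []
    for y in (0.0, height / 3.0, 2.0 * height / 3.0):
        points.append([-width / 2.0, y, 0.0])
        points.append([width / 2.0, y, 0.0])
    points.append([0.0, height, 0.0])

    # Step 3: random X,Z lean direction; the offset grows with the square of Y.
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dir_x, dir_z = math.cos(angle), math.sin(angle)
    for p in points:
        t = (p[1] / height) ** 2
        p[0] += dir_x * bend * t
        p[2] += dir_z * bend * t

    # Step 4: random rotation around the Y axis.
    rot = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(rot), math.sin(rot)
    rotated = [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]

    # Step 6: translate to the seed point.
    return [(x + seed_x, y, z + seed_z) for x, y, z in rotated]

def make_patch(grid=4, spacing=0.0625, seed=1):
    """Step 1: one blade per seed point on a grid x grid layout."""
    rng = random.Random(seed)
    return [make_blade(i * spacing, j * spacing, rng)
            for i in range(grid) for j in range(grid)]
```

With the default arguments the patch is 0.25 units wide, matching the voxel size quoted later in the shader comments; a real patch would use many more seed points.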

Once that is created it can be set as a shader resource;

// Buffer that will be filled with the model triangle list.
struct ModelVertex
{
	float3 Position;
	float3 Normal;
	float2 TexCoord;
};
StructuredBuffer<ModelVertex> Param_ModelBuffer : register(t15);

I create three of these models by eliminating every other seed point each time, to generate a less and less dense model. While doing that I double the width of the remaining grass blades.

To finish off the effect I blend the grass blade color into the ground color by lerp’ing the distance from the viewer; I do the same trick with the grass blade normal as well. This helps it blend into the flat ground texture.
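
As a rough illustration of that blend, here is a small Python sketch. The function names and the `fade_start` / `fade_end` distances are assumptions made for the example; in the shader the same arithmetic would be a `lerp` driven by the distance from the viewer, and a blended normal would need renormalising afterwards.

```python
def lerp(a, b, t):
    """Component-wise linear interpolation between two tuples."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def blend_to_ground(blade_value, ground_value, distance, fade_start, fade_end):
    """Fade a blade colour (or normal) into the ground value with distance."""
    t = (distance - fade_start) / (fade_end - fade_start)
    t = max(0.0, min(1.0, t))  # clamp so near grass is untouched
    return lerp(blade_value, ground_value, t)
```

Close to the camera the blade keeps its own colour, and by `fade_end` it has fully taken on the ground colour, so the far edge of the grass does not stand out.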

You can see the effect in the YouTube video here


Unblocking the Bottleneck – Better Grass

So having tried Geometry Shader generation and simple “Big Vertex Buffer” grass generation, I stumbled on a post which illustrated a pretty common (but unknown to me) method of vertex generation on the GPU that looks like it could be the solution I was after.

I searched for some time for a really clear guide to what was going on, and although the Codemasters blog hinted at the basics, what I needed was a step-by-step guide to follow to see how it worked.

I couldn’t find one, so I thought I’d write one.


The method assumes you are generating an orthogonal (look down) image of the immediate surroundings of the eye camera in world space. I typically create two orthogonal textures per frame; one is a simple rendering of the basic landscape (without trees or other infill) and the second is a heightmap. This guide doesn’t describe creating those – it assumes you have them.

The Method

Code later; first a description of how you go about doing the work.

  1. Create a Generation Compute Shader and an Append Buffer. This will generate vertexes by examining the two orthogonal textures generated above. As it samples the two textures it will emit (append) new Vertexes for each world location where a grass blade should be, and sample the height map to work out how high it is.
  2. Create a Vertex Count Compute Shader and an Unordered Access Buffer to store the count of the vertexes generated in stage 1. This is fiddly to understand but crucial for performance.
  3. Create a Vertex and Pixel Shader pair for rendering the vertexes generated in stage 1, along with a Structured Buffer to transfer the data from the Append Buffer so it can be used.

That’s it. You will of course need some grass textures (imposters/fins) to actually render, but this example will be completed when we are rendering a million quads.

The ‘trick’ to good performance with this method is that nothing must be allowed to move from the CPU to the GPU in the entire sequence of rendering. This means all of this is done on the GPU and nothing transits the memory pipeline.


You may not have come across Compute Shaders and Append Buffers before; so what are they ?

A Compute Shader is an arbitrary piece of code executed on the GPU with massive parallelization. It’s important to note that the order in which your code is executed is not defined, so the outputs from the Compute Shader can come out in any order, and this matters if you are using the Compute Shader to generate mesh vertexes. Normally you would output a mesh in some kind of winding order, but a Compute Shader gives you no control over the order in which vertexes are appended, so it’s not a good idea to try to generate meshes directly. Instead we output Particles which we later expand into meshes. For this reason the method shown here is sometimes termed Particle Shading.

We’re used to writing Vertex Shader code and Pixel Shader code both of which have pretty well-defined inputs and outputs; a VS takes a Vertex from a Vertex Buffer and outputs a struct per-vertex. The values of the struct are interpolated (by the rasterizer, which you can think of as a hidden Compute Shader step) to provide a per-pixel set of values for the same struct to pass into the Pixel Shader. The PS outputs a small set of values, typically just a float4 colour (although sometimes more when performing multiple render-target operations).

The key to understanding Compute Shaders is to realise the VS and PS pair are just specialised forms of Compute Shader – they have very distinct input and output expectations which they inherit from their fixed function pipeline past; but ultimately they are just Compute Shaders with added restrictions.

A Compute Shader can take its input from any Buffer passed in as a resource, and write its output to any Buffer passed in as a resource. The VS can only read from the buffer passed in as a “VertexBuffer” and the PS can only write to the resource passed in as a “RenderTarget”; they are just very restricted forms of Compute Shader.

In this post I will be using my Compute Shader to write to an AppendBuffer and read from a Texture. An AppendBuffer is an optimised memory structure which can only be written to, so is perfect for creating new geometry.

Using Compute Shaders

Thinking back to a VS/PS pair, we can easily visualise how many times the VS is going to be called, because it’s simply the number of Indexes specified in the DrawIndexed call. The Struct passed into the VS is the n’th item in the input buffer of Structs which describe the geometry.

When we call a Compute Shader it’s not operating on any specific Buffer resource that’s been bound to the shader step – it could operate on one, many or none of the Buffers that it can access. The Compute Shader is simply told to execute ‘n’ times by the programmer using two controlling parameters.

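In HLSL the Compute Shader must have the following prefix;

[numthreads(32, 32, 1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)

The two controlling parameters are the numthreads attribute, which sets the number of threads in each thread group, and the number of thread groups requested in the Dispatch call. The resulting numbers get presented to the Compute Shader code via its input parameter threadID. This is a 3 dimensional uint which tells the programmer which thread the Compute Shader is currently running. In the above example, dispatched as a single thread group with Dispatch(1, 1, 1), the value of threadID will be threadID.x 0->31, threadID.y 0->31, threadID.z = 0. This means the Compute Shader will be executed 32x32x1 times.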

If we are intending to sample a Texture2D resource passed into the shader, we can use the threadID as a parameter to the Load function to load and examine a specific pixel.

Texture2D inputTexture;
uint2 textureSize;

[numthreads(32, 32, 1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)
{
  if(threadID.x < textureSize.x && threadID.y < textureSize.y)
  {
     // Load takes an int3; the x,y pixel location plus the mip level.
     float4 pixelValue = inputTexture.Load(int3(threadID.xy, 0));
  }
}

Note the need for the guard code; having told the shader to execute 32×32 times we need to make sure we only sample the texture within the texture’s boundaries (which we explicitly pass in via the textureSize parameter).

What should we do with the sampled pixel ?

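Texture2D inputTexture;
uint2 textureSize;
AppendStructuredBuffer<float2> foliageLocation;
float voxelSize;
float2 worldOffset;

[numthreads(32, 32, 1)]
void MyComputeShader(uint3 threadID : SV_DispatchThreadID)
{
  if(threadID.x < textureSize.x && threadID.y < textureSize.y)
  {
    float4 pixelValue = inputTexture.Load(int3(threadID.xy, 0));
    // Planting test; the channel and threshold used here are illustrative.
    if(pixelValue.g > 0.5f)
      foliageLocation.Append(float2(
         worldOffset.x + (threadID.x * voxelSize),
         worldOffset.y + (threadID.y * voxelSize)));
  }
}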

In the example of a Compute Shader sampling a texture it makes sense to define a sensible set of numthreads() in the shader such as 32,32,1 and then calculate the number of thread groups to dispatch within the program – it is there that we can tune the number of threads to the actual size of the texture being passed in. The division must round up, otherwise the right-hand and bottom edges of a texture whose size is not a multiple of 32 are never visited;

  device.ImmediateContext.Dispatch((textureSize.x + 31) / 32, (textureSize.y + 31) / 32, 1);

In the above case, for a texture size of 100×100 pixels, 16 thread groups (4×4) of 32×32 threads will be run and the threadID.x and threadID.y values will go from 0 to 127. We must always bear in mind that the shader may be called with parameters outside the bounds of the Texture we are passing in, and this is even more important if the shader is accessing a data structure that is not boundary-checked; we could easily get errant results by sampling beyond an array boundary.
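
The group-count arithmetic is easy to get wrong, so here is a small Python sketch of it for one axis. The helper names are made up for the example; the point is only the round-up division and the effect of the guard code.

```python
def group_count(size, threads_per_group=32):
    """Thread groups needed to cover `size` pixels, rounding up."""
    return (size + threads_per_group - 1) // threads_per_group

def visited_pixels(texture_size, threads_per_group=32):
    """Thread IDs along one axis that survive the bounds guard."""
    total_threads = group_count(texture_size, threads_per_group) * threads_per_group
    return [i for i in range(total_threads) if i < texture_size]
```

For a 100 pixel wide texture this gives 4 groups and 128 thread IDs, of which the guard lets 100 through. Plain integer division (100 // 32) would give only 3 groups and leave the last 4 pixels unvisited.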

The Code

The Generation Step

There will be a lot of code examples, and they use my own SharpDX library to prevent code bloat. The referenced FoliageEffect class is simply a collection of the resources needed to complete the render, and it just exposes SharpDX resources to the render loop. Hopefully it’s clear how it works – I’m happy to add more code examples if anything is not clear.

Create a constant buffer (DrapeConstants) which describes the orthogonal “drape” texture that we will use to drive the generation;

// Cull the elements generated to those we can actually see
BoundingBox bb = BoundingBox.FromPoints(eyeCamera.ViewFrustum.GetCorners());
// Create a struct to pass to the shader 
  drapeConstants = new RenderedComponents.FoliageEffect.DrapeConstants();

// give a world bottom left offset
drapeConstants.DrapeSmallestXZCorner = new Vector2(

// tell the shader the texture pixel size and world size
drapeConstants.DrapeTextureSize = drapeTexture.Texture.Description.Width;
drapeConstants.DrapeWorldSize = drapeTextureWorldRectangle.Width;

// tell the shader how tall we will want the generated grass
drapeConstants.BladeMaxHeight = 0.35f;

// tell the shader how far from the camera position the grass will 
// be visible
drapeConstants.FoliageRadiusFromCentre = 20.0f;

// tell the shader how much of the world is currently visible. We
// will use this to restrict the zone of the texture that we need to sample
drapeConstants.VisibleBox = new Vector4(
  bb.Minimum.X, bb.Minimum.Z, bb.Maximum.X, bb.Maximum.Z);

// We will later pass in a texture array of foliage vertexes; we pass
// in here how big that array will be.
drapeConstants.TextureAtlasTileCount = 7;

// Update the buffer and pass it into the Compute Shader stage.
  ref drapeConstants,

Next bind the ‘drape’ texture and heightmap to the shader ready for sampling

// Bind a texture buffer containing the orthogonal 'drape' picture of our 
// immediate surroundings, and resource view to the compute shader stage
foliageEffect.DrapeTexture.Texture = drapeTexture.Texture;
foliageEffect.DrapeTexture.ShaderResourceView = drapeTexture.ShaderResourceView;

// Bind the height map and its SRV to the compute shader stage
foliageEffect.HeightMap.Texture = heightTexture.Texture;
foliageEffect.HeightMap.ShaderResourceView = heightTexture.ShaderResourceView;

Create a Buffer to which the generated vertexes will be appended. DirectX 11 has the concept of an append-only buffer that a compute shader can write to. Declare it like this (the value InitialData is a single dimension array with enough space to fit the maximum number of grass vertexes, in theory one element per pixel – it’s only used in initialization and can be discarded afterwards).

// The Vertex append buffer is a Unordered Access Buffer
SharpDX.Direct3D11.Buffer vertexAppendBuffer =
    SharpDX.Direct3D11.BindFlags.UnorderedAccess |
    usage: SharpDX.Direct3D11.ResourceUsage.Default,
    // Will not be read back to the CPU
    accessFlags: SharpDX.Direct3D11.CpuAccessFlags.None,
    optionFlags: SharpDX.Direct3D11.ResourceOptionFlags.BufferStructured,
    structureByteStride: FoliageFinVertex.GetSize());

// Create a view on this buffer
SharpDX.Direct3D11.UnorderedAccessView vertexAppendBufferView = 
  new SharpDX.Direct3D11.UnorderedAccessView(
    new SharpDX.Direct3D11.UnorderedAccessViewDescription()
      Format = SharpDX.DXGI.Format.Unknown,
      Dimension = SharpDX.Direct3D11.UnorderedAccessViewDimension.Buffer,
      Buffer = new
        FirstElement = 0,
        // This represents the maximum number of elements that can be in
        // the buffer, in theory one value per pixel of the drape texture
        ElementCount = initialData.Length,
        Flags = SharpDX.Direct3D11.UnorderedAccessViewBufferFlags.Append

The struct FoliageFinVertex is the data that will be generated per blade of grass by the generation Compute Shader. My simple set of data (a position, a blade height, a blade type and a rotation) is declared in the shader listing further down.

Create a Buffer which will store the number of vertexes generated.

[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct VertexCountConstants
{
   [FieldOffset(0)] public uint Param_VertexCountPerInstance;
   [FieldOffset(4)] public uint Param_InstanceCount;
   [FieldOffset(8)] public uint Param_StartVertex;
   [FieldOffset(12)] public uint Param_StartInstance;
   /// Standard binding slot of B12
   public static int BindingSlot { get { return 12; } }
}

Bind both the Append Buffer and the ‘count’ constant buffer to the shader

   // The uavInitialCount sets the starting point for any Append operations.
   uavInitialCount: 0);


Bind all the shader stages and execute the Compute Shader

this.MonitoredGraphicsDevice.BindVertexShader(null, null,


// It's defined as being a 32x32 thread shader, so we need to execute that
// multiple times. Since we want to span our entire texture,
// we need to run Width/32 thread groups in each dimension - so for a
// texture of width 2048 we want to run 64x64 groups.

   drapeTexture.Texture.Description.Width / 32,
   drapeTexture.Texture.Description.Width / 32, 1);

The generation Compute Shader looks like this;

// The buffer containing information about the drape texture and other controlling variables
cbuffer PerDrapeBuffer : register(b11)
{
  float2 Param_DrapeSmallestXZCorner;
  float Param_DrapeWorldSize;
  float Param_DrapeTextureSize;

  float Param_BladeMaxHeight;
  float Param_TextureAtlasTileCount;
  float Param_FoliageRadiusFromCentre;
  float _filler4;

  // SmallestXZ, LargestXZ of the visible box.
  float4 Param_VisibleBox;
};

// The data we will create for each particle that we will elaborate into a billboard/imposter/fin later
struct FoliageFinVertex
{
  float3 Position;
  float BladeHeight;
  int BladeType;
  float Rotation;
};

// Scatter of foliage around a fin centre. 
static const float2 scatterKernel[8] =
{
  float2(0 , 0),
  float2(0.8418381f , -0.8170416f),
  float2(-0.9523101f , 0.5290064f),
  float2(-0.1188585f , -0.1276977f),
  float2(-0.207716f, 0.09361804f),
  float2(0.1588526f , 0.440437f),
  float2(-0.6105742f , 0.07276237f),
  float2(-0.09883061f , 0.4942337f)
};

// This is the top-down orthogonal view of the immediate surroundings
Texture2D Param_DrapeTexture : register(t10);
// This is the orthogonal height map - same world size and origin as the Drape texture
Texture2D Param_HeightTexture: register(t11);
// This is the Buffer we will add Fin Vertexes (particles) into.
AppendStructuredBuffer<FoliageFinVertex> Param_AppendBuffer: register(u0);

// This gives me 32x32 threads.
[numthreads(32, 32, 1)]
void GenerateFoliage(uint3 threadID : SV_DispatchThreadID)
{
  // For example if Dispatch(2, 2, 2) is called on a compute shader with numthreads(3, 3, 3) SV_DispatchThreadID will have a range of 
  // 0..5 for each dimension.

  // threadID.xy is the first two dimensions - the pixel to sample.
  float4 drapePixel = Param_DrapeTexture.Load(int3(threadID.xy,0));

  // Typically voxel size is 0.25m. To get good coverage we generate a scatter per pixel.
  float voxelSize = Param_DrapeWorldSize / Param_DrapeTextureSize;
  // Need to invert y coord when sampling from the height map
  float height = Param_HeightTexture.Load(int3(threadID.x, Param_DrapeTextureSize - threadID.y, 0)).r;
  // Generate a reference world position
  float3 worldPosition = float3(
    Param_DrapeSmallestXZCorner.x + (threadID.x * voxelSize), 
    height,
    Param_DrapeSmallestXZCorner.y + (threadID.y * voxelSize));

  // Work out where the camera must have been to generate the drape texture we see
  float2 cameraCentre = float2(
     Param_DrapeSmallestXZCorner.x + (Param_DrapeWorldSize / 2), 
     Param_DrapeSmallestXZCorner.y + (Param_DrapeWorldSize / 2));
  // Guard code to make sure we are sampling within the scope of the texture. 
  if (threadID.x < (uint)Param_DrapeTextureSize && threadID.y < (uint)Param_DrapeTextureSize)
  {
    // Make sure we are within our clipping radius
    if (length(worldPosition.xz - cameraCentre) < Param_FoliageRadiusFromCentre)
    {
      // Make sure we are within the visible box (x,y = smallest XZ; z,w = largest XZ)
      if (worldPosition.x >= Param_VisibleBox.x && worldPosition.x <= Param_VisibleBox.z &&
          worldPosition.z >= Param_VisibleBox.y && worldPosition.z <= Param_VisibleBox.w)
      {
        // In a real example we wouldn't just create a Fin for every pixel - we'd sample the colour of the 
        // drape and other aspects which would control the planting decision. Here we're just creating a 
        // grass patch for every pixel

        // Create a scatter around the single pixel sampled from the drape.
        for (int scatterKernelIndex = 0; scatterKernelIndex < 8; scatterKernelIndex++)
        {
          float3 finPosition = float3(worldPosition.x + (scatterKernel[scatterKernelIndex].x * voxelSize), 
                                    worldPosition.y,
                                    worldPosition.z + (scatterKernel[scatterKernelIndex].y * voxelSize));

          FoliageFinVertex fin = (FoliageFinVertex)0;
          fin.Position = finPosition;
          // We should vary the grass patch height randomly or via Simplex Noise - but for the example we'll leave them constant
          fin.BladeHeight = Param_BladeMaxHeight ;
          fin.BladeType = Param_TextureAtlasTileCount - 1;
          // Again; we'd rotate based on simplex, but for the sake of example we'll leave it at 0.
          fin.Rotation = 0;
          // Add the fin location to the append buffer
          Param_AppendBuffer.Append(fin);
        }
      }
    }
  }
}
The Counting Step

Great – we now have a memory buffer on the GPU filled with thousands of FoliageFinVertex structures – so now we want to render them via a VS/PS pair. But hold on a second – we don’t have a Vertex Buffer – how do we call Draw ? It takes a parameter which is the number of vertexes; how do we know how many we’ve created ?

We could drag the entire AppendBuffer back to the CPU via a Map command and count it, but this would cause a massive pipeline stall.

Luckily there is a pipeline command which allows us to copy the number of outputs generated to a much smaller GPU memory buffer. There is a fixed format for this small memory buffer – it must be uint4. We can make it less abstract by listing each field separately, showing what each uint means, by creating a Constant Buffer as follows.

// Constant Buffer into which the append buffer data has been copied
[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct VertexCountConstants 
{
  [FieldOffset(0)] public uint Param_VertexCountPerInstance;
  [FieldOffset(4)] public uint Param_InstanceCount;
  [FieldOffset(8)] public uint Param_StartVertex;
  [FieldOffset(12)] public uint Param_StartInstance;

  /// <summary>
  /// Standard binding slot of B12
  /// </summary>
  public static int BindingSlot { get { return 12; } }
}


We create a Constant Buffer to hold this counting data and call it CountConstantBuffer.

Immediately after our Dispatch call to fill the AppendBuffer we call the CopyStructureCount method, which interrogates the buffer’s size and writes the result to the CountConstantBuffer.

   dstBufferRef: foliageEffect.CountConstantBuffer.Buffer, 
   dstAlignedByteOffset: 0, 
   srcViewRef: foliageEffect.VertexAppendBuffer.UnorderedAccessView);

Now we have a really small buffer on the GPU holding the data about how much was created by our AppendBuffer call. This is useful, but getting it back to the CPU would again stall the pipeline, and it still wouldn’t give us the number of vertexes we want to create – we want to create 6 Vertexes per particle to form our Quad billboard.

The key to using this data is to use the DrawInstancedIndirect pipeline method. This method is the same as “DrawInstanced” and takes the same parameters, but instead of the parameters being passed by the program in the draw call, the GPU is instructed to read them from an existing GPU memory buffer.

The format of the buffer for passing these parameters is particular – it must be a uint4 ; here’s mine;

uint[] bufferData = new uint[4];
SharpDX.Direct3D11.Buffer drawIndirectArgumentsBuffer =
    bindFlags: SharpDX.Direct3D11.BindFlags.UnorderedAccess |
    data: bufferData,
    usage: SharpDX.Direct3D11.ResourceUsage.Default,
    accessFlags: SharpDX.Direct3D11.CpuAccessFlags.None,
    structureByteStride: sizeof(uint));

SharpDX.Direct3D11.UnorderedAccessView drawIndirectArgumentsBufferView = 
  new SharpDX.Direct3D11.UnorderedAccessView(
    new SharpDX.Direct3D11.UnorderedAccessViewDescription()
       Format = SharpDX.DXGI.Format.R32_UInt,
       Dimension = SharpDX.Direct3D11.UnorderedAccessViewDimension.Buffer,
       Buffer = new
         FirstElement = 0,
         ElementCount = bufferData.Length,
         Flags = SharpDX.Direct3D11.UnorderedAccessViewBufferFlags.None

So how do we get our existing VertexCountConstants constant buffer into the DrawIndirectArgumentsBuffer, especially when we need to multiply it up by 6 ? Back to Compute Shaders again. Just pass both buffers to a compute shader and let it fill the DrawIndirectArgumentsBuffer from the data in the VertexCountConstants;

// Release the height and drape textures so they can be written to again.


// This causes the counter from the vertex append buffer to be written to 
// the appendCountConstantBuffer
  uavInitialCount: 0);

// Now call the CS again to write the constant buffer variables into 
// the parameter buffer. We want to multiply up the number of 
// vertexes generated, so we need to render this data again.
   1, 1, 1);

Here’s the corresponding compute shader;

[numthreads(1, 1, 1)]
void CountVertexes(uint3 id : SV_DispatchThreadID)
{
   if (id.x == 0 && id.y == 0 && id.z == 0)
   {
     // We multiply by 6 because we want two triangles rendered for 
     // this position.
     Param_DispatchIndirectArguments[0] = Param_VertexCountPerInstance * 6;
     // InstanceCount
     Param_DispatchIndirectArguments[1] = 1;
     // StartVertex
     Param_DispatchIndirectArguments[2] = 0;
     // StartInstance
     Param_DispatchIndirectArguments[3] = 0;
   }
}

Eventually, we will come onto actually rendering the billboards. I won’t post my billboard VS/PS pair here as they are really standard code. The key is how this VS/PS pair is now called;

// Pass vertex buffers - none in this case.
    new SharpDX.Direct3D11.VertexBufferBinding[] { }
// No indexes either
this.MonitoredGraphicsDevice.Indices = null;
// Need to use DrawIndirect here.
  (foliageEffect.DrawIndirectArgumentsBuffer.Buffer, 0);

Because you are not passing any VertexBuffer data into the VS it looks a little different to normal;

// The vertex shader code. Simple VSPS pair
psIn_FoliagePatch vsFoliagePatch_FromAppendBuffer(uint vertexID:  SV_VertexID)

  // Get the correct vertex definition. Since we are creating 6 vertexes 
  // for each item in the vertexes StructuredBuffer we should divide 
  // the vertexID by 6 (an integer division, so no rounding is needed).
  uint foliageFinID = vertexID / 6; 
  // finVertexID is the n'th vertex for the specific fin, from 0->5
  // later code in the VS can create an offset from the particle location
  // based on the finVertexID for the top-left, bottom-left etc vertex
  // relative positions.
  uint finVertexID = vertexID - (foliageFinID * 6);
  FoliageFinVertex vsInput = Param_VertexReadBuffer[foliageFinID];
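
To make the arithmetic on both sides of the indirect draw concrete, here is a Python sketch of it. The `QUAD_CORNERS` table (which corner each of the 6 vertexes maps to) and the function names are assumptions for illustration; the post does not show the corner ordering used by the real VS.

```python
# Two triangles per quad, as (u, v) corners; this ordering is an assumption.
QUAD_CORNERS = [(0, 0), (0, 1), (1, 0), (1, 0), (0, 1), (1, 1)]

def draw_indirect_args(particle_count):
    """VertexCountPerInstance, InstanceCount, StartVertex, StartInstance."""
    return [particle_count * 6, 1, 0, 0]

def decode_vertex_id(vertex_id):
    """Split SV_VertexID into (which fin, which of its 6 vertexes)."""
    fin_id = vertex_id // 6
    fin_vertex_id = vertex_id - fin_id * 6
    return fin_id, fin_vertex_id

def corner_offset(fin_vertex_id, width, height):
    """Offset of a quad corner from the particle position at the fin's base."""
    u, v = QUAD_CORNERS[fin_vertex_id]
    return ((u - 0.5) * width, v * height)
```

So 1000 appended particles produce a draw of 6000 vertexes, and vertex 13 belongs to fin 2 as its second vertex.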


This whole method can be used to replace an existing CPU based method for passing VertexBuffers into an existing Imposter/fin renderer. Its advantages over CPU variants are;

  1. Making use of an existing orthogonal heightmap and drape
  2. Uses no CPU resources at all, and no data transfer back from the GPU
  3. Can be tuned really easily in various places

Geometry Shader Woes

So, generating lots of billboards and grass; Geometry Shader right ?

Depends on the GPU. I have a nice implementation of GS generated grass and billboards based on instanced GS from 2D points passed in a VB. Expansion of the points is done in two stages; by an instance buffer on the CPU giving the world location of a mesh of points, and then by the GPU GS instancing.

The results on a powerful GPU are good; less transfer of data to the GPU and substantially smaller models. Also the elimination of unneeded grass vertexes in the GS (depending on the result of a fractal noise calc) meant no need to pass in and eliminate degenerate triangles.

On a slightly lower spec card it tanked. Performance was awful. Changing the code to a (much) larger model grass patch and passing it through a simple VS/PS shader was much faster, by a factor of 4. There was a large number of degenerate triangles discarded between the VS and PS, which in the GS would never even have been generated, but still “the stopwatch never lies”.

Vertex Buffer Normals or HeightMap Generated ?

Simple answer; Vertex Buffer is faster.

Calculating normals in the pixel or vertex shader requires a minimum 3-tap linear interpolated sample for each normal. Plus the cost of keeping and passing in a height map texture. Sampling a texture is one of the more expensive calculations. For the cost of 12 bytes per vertex, I found that passing in a precalculated normal was faster, especially on less powerful cards.

Since it was necessary to carry out a state change for a landscape tile to pass in the tile’s VB it seemed a good bet that passing in a new Heightmap texture wouldn’t hurt performance that much, but it did.

The CPU->GPU bandwidth required for the heightmap is 4 bytes per vertex, and with the Vertex Buffer it’s 12 bytes, but in the latter the interpolation is done on the VS->PS boundary by the automatic interpolation whereas with the Heightmap method you have to do the calcs yourself, with the additional cost of the texture sampling.
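
For reference, the 3-tap calculation the shader would have to do per normal looks roughly like this. It is a Python sketch using forward differences on a plain 2D list of heights; the function name and the `heights[z][x]` layout are assumptions, and a shader version would replace the three list reads with three texture samples.

```python
import math

def normal_from_heights(heights, x, z, cell_size):
    """Surface normal from three height samples (centre, +X, +Z)."""
    h0 = heights[z][x]
    hx = heights[z][x + 1]
    hz = heights[z + 1][x]
    # For a surface y = f(x, z) the normal is (-df/dx, 1, -df/dz),
    # scaled here by cell_size to avoid two divisions.
    nx, ny, nz = -(hx - h0), cell_size, -(hz - h0)
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)
```

Three samples, two subtractions and a normalise per normal is not much on its own, but it is paid for every vertex or pixel, which fits the observation that the precalculated 12 byte normal won on weaker cards.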



Global World

Long time with no posts …

I wondered how big I could make my world. Could I make it global ? What about the world generation time, how long would that take ?

What if it was nil, and the world was entirely procedurally generated ?

The components are already there;

  • Diamond square height generation
  • Trees, grass, rivers and erosion
  • DirectX11

There were two key things to overcome

  1. The system, once running, must have a baseline memory allocation that does not grow – consequently all items must be able to be regenerated on demand.
  2. The performance of key items like the diamond-square fractal must be really fast.

Regenerating Landscape

The first problem is solved by having a clear dependency concept behind any renderable object; small tiles need bigger tiles so they can get generated from the parent heights; trees need a heightmap so they can be located in the world. Rivers need heightmaps (and need to be able to deform the heightmaps). Linking all this up with the scene composition engine which had previously been able to assume all dependencies were available in the pre-generated landscape store was a big engineering challenge. The important structural changes were;

  • No component can demand a world resource, they can only request a world resource
  • Code must be resilient to a resource not being available
  • Resource requests may cascade a number of further resource requests which may be carried out over multiple frames

Heightmap Generation Performance

I need the heightmap data on the CPU so I can query the meshes for runtime generation of trees, and pretty much anything else that needs to be height dependent, including the generation of a tile’s vertex buffer. The CPU performance of the fractal based diamond square algorithm was just about OK, but the real issue came when trying to manipulate the resultant heightfield to overlay deformations (rivers, roads, building area platforms etc). The time required to query every height map point against a large set of deformation meshes was not acceptable.

The answer, like all things DirectX, was to use the shader to implement my diamond square fractal height generation. The steps to being able to implement this were;

  1. Read the undeformed parent height field on the CPU.
  2. Prepare a texture for the child heightfield from one of the quadrants of the parent undeformed heightfield with every other pixel left empty, to be filled in the shader.
  3. Call the height generation shader passing the child height texture, and execute the diamond square code that fills in the missing pixels by reference to adjacent pixels.
  4. Record the output texture and read the data into the child tile class as the undeformed height map
  5. From another data structure describing landscape features like roads and rivers, obtain a vertex buffer which contains the deformations that the feature requires in terms of heightmap offsets
  6. Render the deformation vertex buffers over the top of the child heightmap
  7. Read back the newly deformed heightmap to the child tile CPU, to be used as the ‘true’ heightmap for all subsequent height queries and mesh generation.
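
Steps 2 and 3 are the heart of it, so here is a CPU-side Python sketch of one refinement pass. It is not the shader: the function name `refine`, the `roughness` parameter and the string-seeded `jitter` helper are assumptions chosen so that the same tile always regenerates to the same heights.

```python
import random

def refine(parent, roughness, seed=0):
    """Expand an n x n parent quadrant into a (2n-1) x (2n-1) child."""
    n = len(parent)
    size = 2 * n - 1
    child = [[None] * size for _ in range(size)]
    # Step 2: every other pixel is copied straight from the parent.
    for z in range(n):
        for x in range(n):
            child[2 * z][2 * x] = parent[z][x]

    def jitter(x, z):
        # Deterministic per-pixel offset so a tile can be rebuilt on demand.
        rng = random.Random(f"{seed}:{x}:{z}")
        return rng.uniform(-roughness, roughness)

    # Diamond step: the centre of each parent cell from its four corners.
    for z in range(1, size, 2):
        for x in range(1, size, 2):
            corners = (child[z - 1][x - 1], child[z - 1][x + 1],
                       child[z + 1][x - 1], child[z + 1][x + 1])
            child[z][x] = sum(corners) / 4.0 + jitter(x, z)

    # Square step: remaining pixels from their in-bounds edge neighbours.
    for z in range(size):
        for x in range(size):
            if child[z][x] is None:
                near = [child[nz][nx]
                        for nz, nx in ((z - 1, x), (z + 1, x), (z, x - 1), (z, x + 1))
                        if 0 <= nz < size and 0 <= nx < size]
                child[z][x] = sum(near) / len(near) + jitter(x, z)
    return child
```

A 65×65 quadrant of the parent becomes a 129×129 child, which matches the tile size mentioned below. A real implementation would key the jitter on world coordinates so that neighbouring tiles agree along their shared edges.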

All tiles have both a deformed and an undeformed heightmap data array stored. It took a long while to get to this solution; ultimately the problem was that the diamond square algorithm can only produce a new value with reference to the existing parent values – so it generates a very pleasant ‘random’ landscape, but it doesn’t allow for erosion, rivers, linear features, or any other ability to create absolute changes in height.

By storing the raw output of the diamond square algorithm, any deformations I need can be applied over the top of the raw heightfield and get the same perceived results at any resolution. Since my tile heightfields are only 129×129 pixels it’s not a lot of memory.

I immediately hit the problem of pipeline stalling when reading back the rendered heightfield data to the CPU, but a 2 frame delay injected between rendering the heightfield and reading it back was sufficient to remove the stutter. This problem is well documented and relates to the underlying architecture and programming models of GPUs – although the programmer issues commands to the GPU these are stored in a queue and only executed when the GPU gets to them – often several frames later than the programmer thinks. If the programmer reads data from the GPU back to the CPU then the CPU has to wait while the GPU executes all the stored commands so that it can retrieve the required data, losing all the benefits of the parallel execution of the GPU with respect to the CPU. There is no DirectX API for calling back the CPU when a given GPU resource is available for reading, so most programmers just wait two frames and then retrieve it – it seems to work for me.
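
The bookkeeping for that delay can be very small. Below is a Python sketch of the idea; the class name `DelayedReadback` and its methods are hypothetical, and the actual Map/copy call that would happen when a resource comes out of the queue is left to the caller.

```python
from collections import deque

class DelayedReadback:
    """Hold readback requests until a fixed number of frames has passed."""

    def __init__(self, delay_frames=2):
        self.delay_frames = delay_frames
        self.pending = deque()  # (frame at which it is safe to read, resource)

    def request(self, frame, resource):
        """Called on the frame the heightfield render was issued."""
        self.pending.append((frame + self.delay_frames, resource))

    def ready(self, frame):
        """Return the resources that can be read back on this frame."""
        done = []
        while self.pending and self.pending[0][0] <= frame:
            done.append(self.pending.popleft()[1])
        return done
```

Each frame the render loop would call `ready(frame)` and only map the resources it returns, so the CPU never asks for data the GPU is unlikely to have finished producing.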






Shading Foliage

The majority of tree foliage is made of simplistic models where the leaves are simple quads of leaf textures. This enables the creation of lots of leaves with very few quads. A typical tree model is shown below, with face normals projecting from each leaf quad.


This gives a reasonable effect but when lit with a directional light, things go wrong. Firstly, these models are all rendered unculled, so we don’t need separate geometry for the back face of each leaf plane, which saves a huge bunch of vertex data. Unfortunately that means each plane has only one normal, so seen from the reverse it has the same lighting reflectivity as when seen from the top.

To counter this it’s normal to check the face orientation in the pixel shader, and to invert the normal before calculating the light. In the pixel shader definition we can make use of the SV_ (system-value) semantics, which always exist but must be declared as pixel shader inputs to be examined. You don’t have to pass these SV_ values out of the vertex shader to accept them into the pixel shader.

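PixelToSurface psTextureBump(psMeshIn psInput, bool isFrontFace : SV_IsFrontFace)
{
    // ... the normal is obtained from psInput as usual ...
    // Reverse the normal if it's facing away from us.
    if (!isFrontFace)
        normal *= -1.0f; // rendering the back face, so I invert the normal
    // ... lighting continues with the corrected normal ...
}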

So now we have ‘accurate’ normals. But wait … there’s a problem; if we use typical directional lighting code the underside of the leaves will now be dark with the upper face light. This isn’t what happens when we look at trees – they are somewhat translucent.
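To make the effect of the flipped normal concrete, here is a small Python sketch of a standard clamped N·L diffuse term. The function names are hypothetical, and the light vector is assumed to point from the surface towards the light. It shows that the underside of a horizontal leaf lit from above receives no light at all, which is why the result looks wrong for translucent foliage.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def diffuse(normal, to_light, is_front_face):
    """Clamped N.L term, flipping the normal for back faces."""
    if not is_front_face:
        normal = tuple(-c for c in normal)  # same flip as in the pixel shader
    return max(0.0, dot(normal, to_light))


leaf_normal = (0.0, 1.0, 0.0)   # leaf quad lying flat, normal pointing up
to_light = (0.0, 1.0, 0.0)      # light directly overhead

top = diffuse(leaf_normal, to_light, is_front_face=True)      # fully lit
underside = diffuse(leaf_normal, to_light, is_front_face=False)  # unlit
```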

A second problem is that normal based lighting models are based on reflectivity – a plane facing the light is lit 100%; a plane whose normal is 45 degrees from the light direction receives about 71% of the light (the cosine of the angle), a plane at 60 degrees receives 50%, and the value falls to zero at 90 degrees. When you look at the tree image above it’s clear that, for a light source anywhere near horizontal, most of the leaf quads are at an extreme angle to the light source and are rendered very dark – in fact only the very few nearly vertical ones are rendered with realistic levels of reflectivity.
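Under the usual Lambert diffuse model the lit fraction is the cosine of the angle between the surface normal and the direction to the light. A short Python sketch (the function name is hypothetical) makes the falloff easy to check:

```python
import math


def lambert_factor(angle_degrees):
    """Fraction of light received when the normal is angle_degrees away from the light."""
    return max(0.0, math.cos(math.radians(angle_degrees)))


for angle in (0, 45, 60, 80, 90):
    print(angle, round(lambert_factor(angle), 3))
```

A mostly horizontal leaf quad under a low sun sits at the 80 to 90 degree end of this range, which is why so much of the tree renders dark.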

When you consider a tree it’s clear that the leaves, although arranged on branches, don’t align with the branch in a way that is easily described by a normal. In fact a tree acts as a fractal in terms of reflection – pretty much all of the visible area of a tree has a good percentage of its leaf cover facing the viewer, no matter what the angle of the branches is.

To get a completely accurate lighting model would require the leaves to be individually rendered with normals, an impossible task.

A good approximation of tree lighting can be done by simply shading a tree’s pixels based on their depth from the light source – the ‘back’ of a tree is darker than the ‘front’ of a tree when viewed from the light source. To calculate this, we need a model which is centered on 0,0 and extends one unit in each direction (i.e. fits in a box from -1 to +1). I happen to store all my models scaled to this value anyway, so I can scale them at runtime without reference to their ‘natural’ size.

The following code snippet shows how I get the depth of a model:


// Rotate a vector around the Y axis.
float3 rotateY(float3 v, float angle)
{
    float s = sin(angle);
    float c = cos(angle);
    return float3(c*v.x + s*v.z, v.y, -s*v.x + c*v.z);
}

// Assuming a model whose XZ extent is centred on 0,0 and spans -1 to +1, this returns the distance from the edge of the model when seen from the light's point of view.
// The value can be used to depth-shade a model. Returns 0 for the nearest point, 1 for the most distant point.
float modelDepthFromLight(float3 modelPosition)
{
    // Rotate further so that the model faces the light source. The light direction is a normalized vector,
    // assumed here to be the direction the light travels; the signs are chosen so that rotateY maps it onto -Z.
    float lightAngle = atan2(Param_DirectionalLightDirection.x, -Param_DirectionalLightDirection.z);
    modelPosition = rotateY(modelPosition, lightAngle);
    // reverseLerp is a helper not shown here; for this range it is equivalent to (modelPosition.z + 1.0f) * 0.5f.
    float distToOrigin = reverseLerp(modelPosition.z, -1.0f, 1.0f);
    return 1.0f - distToOrigin;
}

By applying this as an input to the pixel shader I can choose to darken pixels which are further away from the light within the model’s bounding sphere. Although this isn’t ‘proper’ lighting, it does a good job of simulating self-shadowing, and works best on trees that are very symmetrical. If they are asymmetrical you can notice that a branch at the back of a tree is darkened but with no obvious tree cover in front of it to create that shadow.
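The depth calculation is easy to check on the CPU. The Python sketch below mirrors the shader functions under two assumptions that are mine rather than the post’s: the light direction is passed in as an argument and is the direction the light travels, and the rotation angle is chosen so that this direction ends up pointing along -Z. The `reverse_lerp` helper is a guess at the usual inverse-lerp definition.

```python
import math


def rotate_y(v, angle):
    """Rotate the vector v = (x, y, z) around the Y axis."""
    x, y, z = v
    s, c = math.sin(angle), math.cos(angle)
    return (c * x + s * z, y, -s * x + c * z)


def reverse_lerp(value, lo, hi):
    """Map value from the range lo..hi onto 0..1 (assumed helper definition)."""
    return (value - lo) / (hi - lo)


def model_depth_from_light(model_position, light_direction):
    """0 for the point of the unit model nearest the light, 1 for the farthest."""
    lx, _, lz = light_direction
    light_angle = math.atan2(lx, -lz)  # maps the light's travel direction onto -Z
    rotated = rotate_y(model_position, light_angle)
    return 1.0 - reverse_lerp(rotated[2], -1.0, 1.0)


# Light travelling along +X: the -X side of the model faces the light.
lit_side = model_depth_from_light((-1.0, 0.0, 0.0), (1.0, 0.0, 0.0))
shadow_side = model_depth_from_light((1.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

This is equivalent to projecting the position onto the horizontal light direction, so a shader could use a dot product instead of the atan2 and rotation if the trigonometry proves expensive.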


Here are fir trees with the light source coming from top right. The trees are clearly darker on the left than the right, and this can be seen nicely when viewed from behind.


Pleasingly some of the edge leaves are picking up the light, but the mass of the tree is in darkness. This is a much cheaper way of simulating self-shadowing and works within reason – however for very asymmetric models it does give some artefacts.

Better Models and Z-Fighting

I’ve added some better models now from TurboSquid (they do a nice range of <£5.00 medieval buildings) and added them to the landscape. Still not convincing but much better on the eye than the previous ones from the Sketchup store.


One consequence of using professionally constructed models was immediate and extensive Z-fighting. This is the well-known problem that co-planar surfaces interfere with each other, flipping backwards and forwards between frames, because there is not enough world distance between them.

The problem is that model developers will tend to overlay textured quads on top of the basic model shape to create things like windows and doors, and roof textures. It takes fewer vertices to add a simple slim box on the side of a building and paste the window texture on it, than to insert a window frame into the house mesh. Unfortunately the underlying house box model still exists beneath the window box. The diagram shows how a thin window has been added to the house box (seen from the side).

This looks great in a modelling application but using a real-world rendering system there just isn’t enough differentiation between the plane of the window and the plane of the house. The consequence is that the window keeps flicking between window and wall. This happens because the depth buffer, which keeps track of how far away from the camera a particular pixel is, is stored as a 24-bit number, and that number represents the proportional distance between the near clipping plane and the far clipping plane that the pixel lies on. In a modelling application the near and far planes are going to encompass a very short distance; in a real-world application it could be up to 20,000 meters.

This proportional distance is not stored linearly: after the perspective divide it follows a reciprocal (1/z) curve, so most of the available precision is spent close to the near clipping plane, on the reasonable basis that the further something is from the camera, the less likely any depth differences are going to be visible to the end user. There is a huge amount of literature explaining Z-fighting on the web. The fixes for it are either:
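To get a feel for how unevenly the depth values are spread, here is a small Python sketch of the depth produced by a standard Direct3D-style perspective projection (0 at the near plane, 1 at the far plane). The far plane of 20,000 comes from the text; the near plane of 0.1 and the function names are assumptions made for the example.

```python
def stored_depth(z, near, far):
    """Depth value for view-space distance z: 0 at near, 1 at far."""
    return far * (z - near) / (z * (far - near))


def world_size_of_one_step(z, near, far, bits=24):
    """Approximate world distance covered by one depth-buffer step at distance z."""
    step = 1.0 / (2 ** bits - 1)                 # smallest change a fixed-point buffer can store
    slope = far * near / (z * z * (far - near))  # derivative of stored_depth with respect to z
    return step / slope


NEAR, FAR = 0.1, 20000.0  # NEAR is an assumed value

half_range = stored_depth(0.2, NEAR, FAR)              # about half the depth range is used up by z = 0.2
at_1km = world_size_of_one_step(1000.0, NEAR, FAR)     # roughly 0.6 world units per step
```

With these numbers, two surfaces a kilometre away that are closer together than about 0.6 units can land on the same depth value, which is exactly the situation of a thin window box sitting on a wall.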

  1. Avoid using the hardware depth buffer entirely: write out your own calculation for depth into a texture buffer, and feed that texture buffer back in on each render so the pixel shader can check the current pixel depth by sampling it.
  2. Do something else.

(1) is not recommended because hardware depth buffering is extremely fast and often takes place before your pixel shader is run, eliminating whole triangles where all three of their vertices are further away than the recorded pixel depth.

So that leaves (2).

The simplest way I found to achieve a massive reduction in Z-fighting was the technique of reversing the depth buffer. There are three stages to this:

  1. When calculating the projection matrix, simply pass in the Far and Near clip values in reverse order.
  2. When clearing the depth buffer, clear it to 0,1 not 1,0 (the first value is the depth, and the far plane now maps to 0).
  3. When setting the depth comparison function, use GREATER, not LESS (Comparison.Greater in SharpDX).

Step 1

float nearClip = ReverseDepthBuffer ? this.FarClippingDistance : this.NearClippingDistance;
float farClip = ReverseDepthBuffer ? this.NearClippingDistance : this.FarClippingDistance;

if (this.CameraType == enumCameraType.Perspective)
{
    // fieldOfView and aspectRatio are placeholder names for the camera's own values.
    return Matrix.PerspectiveFovLH(fieldOfView, aspectRatio, nearClip, farClip);
}

Step 2

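if (reverseDepthBuffer)
{
    // Reversed: the far plane maps to depth 0, so clear the depth to 0.
    this.Context.ClearDepthStencilView(dsv, SharpDX.Direct3D11.DepthStencilClearFlags.Depth, 0, 1);
}
else
{
    // Standard: the far plane maps to depth 1, so clear the depth to 1.
    this.Context.ClearDepthStencilView(dsv, SharpDX.Direct3D11.DepthStencilClearFlags.Depth, 1, 0);
}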

I found this entirely eliminated my Z-fighting by more smoothly distributing the available depths over the field of view. This is very nicely illustrated in this NVIDIA article.
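The benefit is easiest to demonstrate with a 32-bit floating-point depth buffer, which is the case reversed depth is usually described for; that format is an assumption here, as are the near plane of 0.1 and the function names. A fixed-point format such as the 24-bit one mentioned earlier quantizes evenly across 0 to 1, so it would not behave the same way. The Python sketch below rounds depth values to 32-bit floats and compares two surfaces one unit apart at a distance of 10,000.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest 32-bit float, as a float depth buffer would."""
    return struct.unpack("f", struct.pack("f", x))[0]


def standard_depth(z, near, far):
    """Standard mapping: 0 at the near plane, 1 at the far plane."""
    return far * (z - near) / (z * (far - near))


def reversed_depth(z, near, far):
    """Reversed mapping: the same formula with near and far swapped (1 at near, 0 at far)."""
    return standard_depth(z, far, near)


NEAR, FAR = 0.1, 20000.0  # NEAR is an assumed value

standard_pair = (to_float32(standard_depth(10000.0, NEAR, FAR)),
                 to_float32(standard_depth(10001.0, NEAR, FAR)))
reversed_pair = (to_float32(reversed_depth(10000.0, NEAR, FAR)),
                 to_float32(reversed_depth(10001.0, NEAR, FAR)))
```

With the standard mapping both surfaces round to the same stored value just below 1, so the depth test cannot order them. With the reversed mapping the values sit close to 0, where floats are much more finely spaced, and the nearer surface keeps the larger value as the GREATER comparison expects.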