Geometry Shader Woes

So, generating lots of billboards and grass: Geometry Shader, right?

Depends on the GPU. I have a nice implementation of GS-generated grass and billboards, based on an instanced GS expanding 2D points passed in a vertex buffer. Expansion of the points is done in two stages: an instance buffer prepared on the CPU gives the world location of a mesh of points, and then GS instancing on the GPU does the rest.
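For reference, a minimal host-side sketch of how the two-stage expansion might be wired up (the semantic names, formats and draw parameters here are illustrative, and the GS-side [instance()] expansion isn't shown):

    #include <d3d11.h>

    // Sketch only: two-slot input layout for the point expansion.
    // Slot 0 carries the 2D grass points, slot 1 carries a per-instance
    // world position stepped once per patch (names are illustrative).
    static const D3D11_INPUT_ELEMENT_DESC kGrassLayout[] =
    {
        // Per-vertex: 2D point within the patch (slot 0)
        { "POSITION", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 0,
          D3D11_INPUT_PER_VERTEX_DATA,   0 },
        // Per-instance: world location of the patch (slot 1)
        { "WORLDPOS", 0, DXGI_FORMAT_R32G32B32_FLOAT, 1, 0,
          D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    };

    // Each instance is one patch of points; the GS expands every point
    // into billboard or grass triangles.
    // context->DrawInstanced(pointCount, patchCount, 0, 0);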

The results on a powerful GPU are good: less data transferred to the GPU and substantially smaller models. Eliminating unneeded grass vertices in the GS (depending on the result of a fractal noise calculation) also meant there was no need to pass in and then discard degenerate triangles.

On a slightly lower-spec card it tanked; performance was awful. Changing the code to a (much) larger grass patch model and passing it through a simple VS/PS pipeline was faster by a factor of four. A large number of degenerate triangles were discarded between the VS and the PS, triangles which the GS version would never even have generated, but still: “the stopwatch never lies”.


Vertex Buffer Normals or Heightmap Generated?

Simple answer: the Vertex Buffer is faster.

Calculating normals in the pixel or vertex shader requires a minimum of three linearly interpolated texture samples per normal, plus the cost of keeping and passing in a heightmap texture, and sampling a texture is one of the more expensive operations. For the cost of 12 bytes per vertex, I found that passing in a precalculated normal was faster, especially on less powerful cards.
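For a sense of what the shader has to do per normal, here is the same maths written as plain C++ (a sketch only; the three height values stand in for the three linearly interpolated texture fetches, and gridSpacing is the distance between heightmap samples):

    #include <cmath>

    struct Vec3 { float x, y, z; };

    // Sketch of the per-normal cost: hC, hR and hU stand in for the three
    // heightmap fetches (centre, right neighbour, up neighbour).
    Vec3 NormalFromHeights(float hC, float hR, float hU, float gridSpacing)
    {
        Vec3 n { hC - hR, gridSpacing, hC - hU };   // unnormalised normal (y-up)
        float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
        return Vec3 { n.x / len, n.y / len, n.z / len };
    }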

Since it was already necessary to carry out a state change for each landscape tile to pass in the tile’s VB, it seemed a good bet that also passing in a new heightmap texture wouldn’t hurt performance that much, but it did.

The CPU->GPU bandwidth required for the heightmap is 4 bytes per vertex, and with the vertex buffer it’s 12 bytes, but with the latter the interpolation is done for free at the VS->PS boundary, whereas with the heightmap method you have to do the calculations yourself, with the additional cost of the texture sampling.
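As a rough illustration of the 12-byte cost, the per-vertex layout looks something like this (the field names and the other fields are illustrative, not the exact layout used):

    #include <DirectXMath.h>

    // Rough shape of the per-vertex data implied above (illustrative only):
    // the precalculated normal costs 12 bytes per vertex, but needs no
    // texture sample and is interpolated for free across the VS->PS boundary.
    struct TileVertex
    {
        DirectX::XMFLOAT3 position;   // 12 bytes
        DirectX::XMFLOAT3 normal;     // 12 bytes - the cost discussed above
        DirectX::XMFLOAT2 texcoord;   //  8 bytes
    };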


Global World

Long time with no posts …

I wondered how big I could make my world. Could I make it global? And what about the world generation time; how long would that take?

What if it were nil, and the world was entirely procedurally generated?

The components are already there:

  • Diamond square height generation
  • Trees, grass, rivers and erosion
  • DirectX11

There were two key things to overcome:

  1. The system, once running, must have a baseline memory allocation that does not grow – consequently all items must be able to be regenerated on demand.
  2. Key items like the diamond-square fractal must be really fast.

Regenerating Landscape

The first problem is solved by having a clear dependency concept behind every renderable object: small tiles need bigger tiles so they can be generated from the parent heights; trees need a heightmap so they can be located in the world; rivers need heightmaps (and need to be able to deform them). Linking all this up with the scene composition engine, which had previously been able to assume all dependencies were available in the pre-generated landscape store, was a big engineering challenge. The important structural changes were (a sketch of the request-only idea follows the list):

  • No component can demand a world resource; it can only request one
  • Code must be resilient to a resource not being available
  • Resource requests may cascade into further resource requests, which may be carried out over multiple frames
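A hypothetical sketch of what the request-only rule can look like in code (all of the names here are illustrative, not the actual engine types):

    #include <memory>

    // Callers never block on a resource; they ask for it and must cope
    // with it not being ready yet.
    class HeightmapTile;   // generated over one or more frames

    class WorldResourceStore
    {
    public:
        // Returns the tile if it has already been generated; otherwise queues
        // a generation request (which may cascade to the parent tile) and
        // returns nothing - the caller simply tries again on a later frame.
        std::shared_ptr<HeightmapTile> RequestTile(int x, int y, int lod)
        {
            if (auto tile = FindReady(x, y, lod))
                return tile;
            QueueGeneration(x, y, lod);   // may itself request the parent tile
            return nullptr;
        }

    private:
        std::shared_ptr<HeightmapTile> FindReady(int x, int y, int lod);
        void QueueGeneration(int x, int y, int lod);
    };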

Heightmap Generation Performance

I need the heightmap data on the CPU so I can query the meshes for runtime generation of trees, and pretty much anything else that needs to be height dependent, including the generation of a tile’s vertex buffer. The CPU performance of the fractal-based diamond-square algorithm was just about OK, but the real issue came when trying to manipulate the resultant heightfield to overlay deformations (rivers, roads, building platforms etc.). The time required to query every heightmap point against a large set of deformation meshes was not acceptable.

The answer, as with most things DirectX, was to use a shader to implement my diamond-square fractal height generation. The steps to implement this were (a rough host-side sketch of step 2 follows the list):

  1. Read the undeformed parent height field in CPU.
  2. Prepare a texture for the child heightfield from one of the quadrants of the parent undeformed heightfield with every other pixel left empty, to be filled in the shader.
  3. Call the height generation shader passing the child height texture, and execute the diamond square code that fills in the missing pixels by reference to adjacent pixels.
  4. Record the output texture and read the data into the child tile class as the undeformed heightmap.
  5. From another data structure describing landscape features like roads and rivers, obtain a vertex buffer which contains the deformations that the feature requires in terms of heightmap offsets.
  6. Render the deformation vertex buffers over the top of the child heightmap.
  7. Read back the newly deformed heightmap to the child tile CPU, to be used as the ‘true’ heightmap for all subsequent height queries and mesh generation.
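Here is that host-side sketch of step 2: seeding the child heightfield from a parent quadrant, with every other pixel left empty for the shader to fill in (all names are illustrative; the 129×129 tile size is the one mentioned below):

    #include <vector>

    // quadX/quadY (0 or 1) select the parent quadrant; size is the tile
    // width in pixels (129 here). The in-between child pixels stay at zero
    // for the diamond-square shader to fill in.
    std::vector<float> SeedChildFromParent(const std::vector<float>& parent,
                                           int size, int quadX, int quadY)
    {
        std::vector<float> child(size * size, 0.0f);
        const int half = size / 2;                    // 64 for a 129-wide tile
        for (int y = 0; y <= half; ++y)
            for (int x = 0; x <= half; ++x)
            {
                const int px = quadX * half + x;      // source pixel in the parent
                const int py = quadY * half + y;
                child[(y * 2) * size + (x * 2)] = parent[py * size + px];
            }
        return child;
    }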

All tiles store both a deformed and an undeformed heightmap data array. It took a long while to get to this solution; ultimately the problem was that the diamond-square algorithm can only produce a new value with reference to the existing parent values – so it generates a very pleasant ‘random’ landscape, but it doesn’t allow for erosion, rivers, linear features, or any other absolute change in height.

By storing the raw output of the diamond-square algorithm, any deformations I need can be applied over the top of the raw heightfield, giving the same perceived results at any resolution. Since my tile heightfields are only 129×129 pixels, it’s not a lot of memory.

I immediately hit the problem of pipeline stalls when reading the rendered heightfield data back to the CPU, but a two-frame delay injected between rendering the heightfield and reading it back was sufficient to remove the stutter. This problem is well documented and relates to the underlying architecture and programming model of GPUs: although the programmer issues commands to the GPU, these are stored in a queue and only executed when the GPU needs them to be – often several frames later than the programmer thinks. If the programmer reads data from the GPU back to the CPU, this forces the GPU to execute all the stored commands so that it can retrieve the required data, losing all the benefits of the GPU executing in parallel with the CPU. There is no DirectX API for calling the CPU back when a given GPU resource is available for reading, so most programmers just wait two frames and then retrieve it – it seems to work for me.
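For completeness, a hedged sketch of how the two-frame delayed readback can be structured in D3D11 (CopyResource into a staging texture, then Map with D3D11_MAP_READ two frames later; the surrounding types and function names are illustrative):

    #include <d3d11.h>
    #include <cstdint>
    #include <cstring>

    struct PendingReadback
    {
        ID3D11Texture2D* staging = nullptr;   // D3D11_USAGE_STAGING, CPU_ACCESS_READ
        int              framesRemaining = 0;
    };

    // Called on the frame the heightfield was rendered.
    void QueueReadback(ID3D11DeviceContext* ctx, ID3D11Texture2D* renderedHeightfield,
                       PendingReadback& pending)
    {
        ctx->CopyResource(pending.staging, renderedHeightfield);
        pending.framesRemaining = 2;
    }

    // Called once per frame; returns true once the data has been copied out.
    bool TryCompleteReadback(ID3D11DeviceContext* ctx, PendingReadback& pending,
                             float* dstHeights, int size)
    {
        if (pending.staging == nullptr)
            return false;
        if (pending.framesRemaining > 0)
        {
            --pending.framesRemaining;                // wait out the two-frame delay
            return false;
        }

        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (FAILED(ctx->Map(pending.staging, 0, D3D11_MAP_READ, 0, &mapped)))
            return false;

        const uint8_t* src = static_cast<const uint8_t*>(mapped.pData);
        for (int y = 0; y < size; ++y)                // rows may be padded, copy per row
            std::memcpy(dstHeights + y * size, src + y * mapped.RowPitch,
                        size * sizeof(float));

        ctx->Unmap(pending.staging, 0);
        return true;
    }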