
Drawing points properly in WebGL

The one true way to draw sprites fast. Ever wanted WebGL / OpenGL to draw a quad given only a single point and size? Set color, rotation and texture per sprite, not per vertex?

Like this (click for fireworks):


We should burninate gl.POINTS sprites. They have various problems compared to quads:

  • Limited size (depending on hardware as low as 63 pixels per side).
  • No rotation, fixed aspect ratio.
  • Cannot be drawn partially offscreen (on some hardware) unless the center is visible.

The main benefit of points is only transferring their coordinates once, because sending data from CPU to GPU is expensive. Meanwhile rectangles have four corners with different positions.

But can we generate rectangle corners from one point per sprite, on the GPU? For multiple independent sprites per draw call? Yes we can.

The buffer swizzling trick 🎉

It works even on WebGL 1.0, no extensions needed.

Let's store the coordinates of 3 sprites in a WebGL buffer of 32-bit floats arranged like this:

.. .. .. .. x1 y1 z1 w1 x2 y2 z2 w2 x3 y3 z3 w3 .. .. .. ..

Per sprite we have 4 coordinate components and need 4 corners. If we are forced to use the buffer above as is, how do we do it? Observe the following:

  • The vertex shader can have at least 8 different vec4 attributes.
  • For every vertex, each vec4 attribute will contain 4 consecutive numbers from the buffer.
  • For every attribute of the first vertex, we can freely set the location in the buffer it comes from.
  • For subsequent vertices, the location for each attribute changes. We can choose by how much ("stride"), separately for each attribute.

Clearly the first sprite needs x1 ... w1 and the second sprite needs x2 ... w2, so over the 4 vertices of a sprite we need to advance by 4 locations in the buffer, or 1 location per vertex.

So the 4 vertices of the first sprite see:

1st x1 y1 z1 w1
2nd y1 z1 w1 x2
3rd z1 w1 x2 y2
4th w1 x2 y2 z2
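
The table above can be reproduced on the CPU. This is a minimal sketch that imitates how a vec4 attribute with a stride of one float walks through the buffer. `fetchAttribute` is a hypothetical helper made up for illustration (not a WebGL call), and string labels stand in for the float values.

```typescript
// Hypothetical CPU-side imitation of a vec4 attribute fetch.
// Offset and stride are counted in floats here, not bytes.
function fetchAttribute(
	buffer: readonly string[],
	vertex: number,
	offset: number,
	stride: number
): string[] {
	const start = offset + vertex * stride;
	return buffer.slice(start, start + 4);
}

// Labels stand in for the 32-bit floats in the buffer.
const layout = [
	"..", "..", "..", "..",
	"x1", "y1", "z1", "w1",
	"x2", "y2", "z2", "w2",
	"x3", "y3", "z3", "w3"
];

// Offset 4 skips the leading unused slots, stride 1 advances one float per vertex.
const seen: string[] = [];
for (let vertex = 0; vertex < 4; ++vertex) {
	seen.push(fetchAttribute(layout, vertex, 4, 1).join(" "));
}
console.log(seen.join("\n"));
```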

But the 4th vertex also needs x1 ... z1, combined with the w1 it already has. What now?

  • Read the previous 4 numbers into another attribute!
  • Leave the first 4 locations of the buffer unused to avoid reading out of bounds.
  • Shuffle (swizzle) the required components into a local vec4 variable. This is especially easy on GPUs.

But shaders execute in parallel without additional input or communication between vertices. If a shader receives w1 in the x component of a vec4 attribute, how does it know where it belongs? It needs to distinguish between the 1st and 4th vertex of a sprite. Another input, another buffer! Bytes with numbers modulo 4 are enough:

0 1 2 3 0 1 2 3 0 1 2 3 .. .. .. ..

This buffer is accessed more conventionally: first byte for first vertex, second byte for second vertex and so on. Note that it never needs updating once initialized.

The remaining 6 bits of each byte can also be used for whatever per-sprite integer data there may be, but should probably be constant for all vertices of the same sprite.
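
A sketch of filling that buffer, assuming the per-sprite integer data goes into the upper 6 bits so that the low 2 bits remain the vertex number. `buildFlags` and `spriteData` are hypothetical names for this example.

```typescript
// Build one flag byte per vertex: low 2 bits = vertex number 0-3,
// upper 6 bits = optional per-sprite value (0-63), identical for all 4 vertices.
function buildFlags(spriteData: readonly number[]): Uint8Array {
	const flags = new Uint8Array(spriteData.length * 4);
	for (let i = 0; i < flags.length; ++i) {
		const sprite = i >> 2;
		flags[i] = ((spriteData[sprite] & 63) << 2) | (i & 3);
	}
	return flags;
}

// Two sprites, the second one carrying the per-sprite value 5.
const flags = buildFlags([0, 5]);
console.log(Array.from(flags));
```

Taking a byte modulo 4 yields the vertex number, and dividing it by 4 (rounding down) recovers the per-sprite value.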

We can also store colors or any other necessary data in groups of 4 numbers per sprite, and handle them just like the coordinates:

.. .. .. .. r1 g1 b1 a1 r2 g2 b2 a2 .. .. .. ..

For every input vector we write 4 components on the CPU, read 8 components into two vec4 attributes on the GPU, swizzle to discard 4 junk components and extract the correct input.
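
One way to picture this: the two attributes together form a window of 8 consecutive floats, and vertex number k of a sprite finds its data starting k places before the second attribute. The sketch below follows that reading on the CPU. `readSprite` is a hypothetical helper, and the buffer is assumed to have 4 unused floats at each end, as in the layout shown earlier.

```typescript
// Emulate both attribute reads for a vertex and pick out the sprite's 4 components.
function readSprite(buffer: Float32Array, vertex: number): number[] {
	const k = vertex % 4; // vertex number within the sprite
	// First and second attribute: 8 consecutive floats starting at this vertex.
	const pair = Array.from(buffer.subarray(vertex, vertex + 8));
	// Keep the 4 wanted components, discard the 4 junk ones.
	return pair.slice(4 - k, 8 - k);
}

const floats = new Float32Array([
	0, 0, 0, 0, // unused
	1, 2, 3, 4, // sprite 1
	5, 6, 7, 8, // sprite 2
	0, 0, 0, 0  // unused
]);

const results: string[] = [];
for (let vertex = 0; vertex < 8; ++vertex) {
	results.push(readSprite(floats, vertex).join(" "));
}
console.log(results.join("\n"));
```

Every vertex of the first sprite ends up with 1 2 3 4 and every vertex of the second with 5 6 7 8, even though each one read a different window.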

Buffer swizzling code

Branchless vertex shader:

attribute float aFlags;

attribute vec4 aPos1;
attribute vec4 aPos2;

// Shift components of the vector pair a, b rightwards and return
// the 4 components that end up in b's position.
// Exactly one mask component is 1, selecting the shift amount:
// 1000 => 0, 0100 => 1, 0010 => 2, 0001 => 3.

vec4 swizzleRight4(vec4 a, vec4 b, vec4 mask) {
	vec4 result = b * mask.x;

	b.w = a.w; result += b.wxyz * mask.y;
	b.z = a.z; result += b.zwxy * mask.z;
	b.y = a.y; result += b.yzwx * mask.w;

	return(result);
}

void main() {
	float mod4 = mod(aFlags, 4.0);

	vec4 mask = vec4(
		step(mod4, 0.0),
		step(mod4, 1.0) - step(mod4, 0.0),
		step(2.0, mod4) - step(3.0, mod4),
		step(3.0, mod4)
	);

	vec4 pos = swizzleRight4(aPos1, aPos2, mask);

	// Rest of the code follows...
}
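
To check that the mask has exactly one component set for each vertex number, here is a CPU port of the two pieces. It is a sketch for verifying the arithmetic only: `step` mirrors the GLSL built-in, while `Vec4` and `vertexMask` are names made up for this example.

```typescript
type Vec4 = [number, number, number, number];

// GLSL step(edge, x): 0.0 if x < edge, else 1.0.
const step = (edge: number, x: number): number => (x < edge ? 0 : 1);

function vertexMask(flags: number): Vec4 {
	const mod4 = flags % 4;
	return [
		step(mod4, 0),
		step(mod4, 1) - step(mod4, 0),
		step(2, mod4) - step(3, mod4),
		step(3, mod4)
	];
}

// Line-by-line port of the shader function.
function swizzleRight4(a: Vec4, b: Vec4, mask: Vec4): Vec4 {
	let [x, y, z, w] = b;
	const out: Vec4 = [x * mask[0], y * mask[0], z * mask[0], w * mask[0]];
	const add = (v: Vec4, m: number) => v.forEach((c, i) => { out[i] += c * m; });
	w = a[3]; add([w, x, y, z], mask[1]); // b.wxyz
	z = a[2]; add([z, w, x, y], mask[2]); // b.zwxy
	y = a[1]; add([y, z, w, x], mask[3]); // b.yzwx
	return out;
}

// 4th vertex of a sprite holding 1 2 3 4: a = (junk, 1, 2, 3), b = (4, next sprite...).
const fourth = swizzleRight4([9, 1, 2, 3], [4, 5, 6, 7], vertexMask(3));
console.log(fourth);
```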

Corresponding TypeScript code:

const enum POINT {
	FLOAT_BYTES = 4,
	COORD_DIMENSION = 4,
	COORD_STRIDE = FLOAT_BYTES,
	COORD_SIZE = COORD_DIMENSION * FLOAT_BYTES,
	// Quads take 2 triangles, 3 elements each.
	ELEMENT_COUNT = 6,
	// Maximum 2^14 (16-bit indices address 2^16 vertices, 4 needed per point).
	MAX_COUNT = 16384
}

// Initialize data here...

gl.bindBuffer(gl.ARRAY_BUFFER, anchorBuffer);
gl.bufferSubData(gl.ARRAY_BUFFER, 0, anchorData);

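// aPos1 + i assumes aPos2 was bound to the location right after aPos1
// (for example with gl.bindAttribLocation before linking).
for(let i = 0; i < 2; ++i) {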
	gl.vertexAttribPointer(
		aPos1 + i,
		POINT.COORD_DIMENSION,
		gl.FLOAT,
		false,
		POINT.COORD_STRIDE,
		POINT.COORD_SIZE * i
	);
}

gl.drawElements(gl.TRIANGLES, anchorCount * POINT.ELEMENT_COUNT, gl.UNSIGNED_SHORT, 0);
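
The "Initialize data here" step is not shown above. One possible way to fill `anchorData` is sketched below, assuming 4 unused floats at each end of the buffer as in the layout shown earlier. `packAnchors` and the `Anchor` type are hypothetical names, not part of the original code.

```typescript
type Anchor = [number, number, number, number];

// Lay out anchors as: 4 unused floats, 4 floats per sprite, 4 unused floats.
// The padding keeps both attribute reads inside the buffer for every vertex.
function packAnchors(anchors: readonly Anchor[]): Float32Array {
	const data = new Float32Array((anchors.length + 2) * 4);
	anchors.forEach((anchor, i) => data.set(anchor, (i + 1) * 4));
	return data;
}

const anchorData = packAnchors([
	[10, 20, 0, 1],
	[30, 40, 0, 1]
]);
console.log(anchorData.length);
```

With this layout the first attribute starts at byte offset 0 and the second one 16 bytes later, matching the offsets `POINT.COORD_SIZE * i` used in the loop above.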

Transforming and joining vertices

The vertex shader receives for each vertex its number 0-3 and the quad's center position. The vertex number's two bits also represent corner coordinates of a unit square. We can transform the unit square to match the position and add additional attributes for other transformations, colors etc. Here's a shader snippet for generating the corner vertex positions:

// pos is initialized in an earlier snippet.

float s = sin(angle);
float c = cos(angle);

mat2 rotation = mat2(c, s, -s, c);

float x = mod(mod4, 2.0);
float y = (mod4 - x) * 0.5;
vec2 corner = vec2(x, y) - 0.5;

// Scale corners by rectangle size.
vPos = corner * size;

gl_Position = uTransform * vec4(pos.xy + rotation * vPos, 0, 1);
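
The same corner math can be tried out on the CPU. This is a sketch with a hypothetical `cornerOffset` helper that returns the rotated offset from the sprite center, that is, the `rotation * vPos` term before `pos.xy` and `uTransform` are applied.

```typescript
// CPU version of the corner math: vertex number 0-3 -> offset from the sprite center.
function cornerOffset(
	vertex: number,
	size: [number, number],
	angle: number
): [number, number] {
	const x = vertex % 2;         // low bit of the vertex number
	const y = (vertex - x) * 0.5; // high bit of the vertex number
	// Center the unit square on the origin and scale it to the rectangle size.
	const vx = (x - 0.5) * size[0];
	const vy = (y - 0.5) * size[1];
	const s = Math.sin(angle);
	const c = Math.cos(angle);
	// mat2(c, s, -s, c) is column-major: columns are (c, s) and (-s, c).
	return [c * vx - s * vy, s * vx + c * vy];
}

for (let vertex = 0; vertex < 4; ++vertex) {
	console.log(vertex, cornerOffset(vertex, [40, 20], 0));
}
```

For an unrotated 40 by 20 rectangle the four vertices land at (-20, -10), (20, -10), (-20, 10) and (20, 10) relative to the center.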

After we have coordinates for 4 vertices per sprite waiting on the GPU, they can be used to draw triangles. A single rectangle per point requires 2 triangles, or 6 vertices in total, but 2 of them are shared between both triangles. We can use an element array to connect the vertices like this:

0 1
2 3

Drawing clockwise, we can form triangles by connecting points 0-1-2 and 3-2-1. The following quad would use 4-5-6 and 7-6-5. Here's one way to fill the buffer:

for(let i = 0; i < POINT.MAX_COUNT * POINT.ELEMENT_COUNT; ++i) {
	const j = i % 6;
	indexData[i] = (i - j) / 6 * 4 + 3 - Math.abs(j - 3);
}

In WebGL 1.0 without extensions, element index arrays can only hold numbers between 0 and 65535, giving us a maximum of 16384 quads per draw call. The largest useful element array then contains 98304 constant indices, which only need to be transferred to the GPU once.
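
The limits are easy to verify. The sketch below builds the same index pattern with a lookup table instead of the `Math.abs` formula and checks the quad count up front. `buildQuadIndices` and `MAX_QUADS` are hypothetical names for this example.

```typescript
// 16-bit indices address 65536 vertices, 4 per quad.
const MAX_QUADS = 65536 / 4;

// Build the shared element array: 0 1 2 3 2 1, then 4 5 6 7 6 5, and so on.
function buildQuadIndices(quadCount: number): Uint16Array {
	if (quadCount > MAX_QUADS) {
		throw new RangeError("16-bit indices address at most " + MAX_QUADS + " quads");
	}
	const pattern = [0, 1, 2, 3, 2, 1];
	const indices = new Uint16Array(quadCount * 6);
	for (let i = 0; i < indices.length; ++i) {
		indices[i] = Math.floor(i / 6) * 4 + pattern[i % 6];
	}
	return indices;
}

const all = buildQuadIndices(MAX_QUADS);
console.log(all.length, all[all.length - 3]);
```

The full array has 16384 * 6 = 98304 entries, and the largest index in it is 65535, the last value an unsigned 16-bit integer can hold.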

Could we do more work per draw call? The options are:

  • Enable the OES_element_index_uint extension or WebGL 2, and use Uint32Array for indices.
  • Avoid element arrays, transfer all points twice and draw two triangles per point.
  • Avoid element arrays and use triangles instead of quads for the points.

With the latter two options, we still transfer only one third of the data required by an identical implementation without the buffer swizzling trick.

Filling triangles

Drawing bitmap sprites using textures is pretty trivial. Let's instead use a fragment shader inspired by signed distance functions (SDF) to draw circles, rectangles and rounded rectangles with nice, anti-aliased borders. The result is of higher quality than native WebGL anti-aliasing (which is based on multi-sampling), so the native one can be turned off for additional speed.

Once we have a distance measure in pixels, it's easy to switch from fill to border to exterior color at specific distances, with a √2 pixels long ramp using linear interpolation between them. For diagonal edges √2 looks slightly better than 1 pixel, and linear interpolation is simpler but looks just as good as smoothstep.

Vertex shader

After the buffer swizzling trick introduced earlier, we set up some varyings. The vPos works like "texture coordinates". It's in pixel units for easier anti-aliasing because for markers of all sizes, edges need a √2 pixels wide linear gradient. Origin is at the marker's center, and the coordinate system rotates together with the quad so rectangle edges always remain axis-aligned within the texture coordinate system.

// vPos is initialized in an earlier snippet.

// Compare distances to the shorter side.
vOuter = min(size.x, size.y) * 0.5;
vInner = vOuter - border;
vSquare = vOuter - radius;

// Difference of sides from the shorter side.
vDiff = size * 0.5 - vOuter;
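
A worked example may help with what each varying means. The sketch below repeats the same arithmetic on the CPU. `markerVaryings` and the field names are hypothetical, chosen to mirror the varyings above.

```typescript
interface MarkerVaryings {
	outer: number;          // vOuter: half of the shorter side
	inner: number;          // vInner: distance where the border starts
	square: number;         // vSquare: distance where corner rounding starts
	diff: [number, number]; // vDiff: extra half-length of each side over the shorter one
}

function markerVaryings(
	size: [number, number],
	border: number,
	radius: number
): MarkerVaryings {
	const outer = Math.min(size[0], size[1]) * 0.5;
	return {
		outer,
		inner: outer - border,
		square: outer - radius,
		diff: [size[0] * 0.5 - outer, size[1] * 0.5 - outer]
	};
}

const varyings = markerVaryings([40, 20], 2, 5);
console.log(varyings);
```

For a 40 by 20 marker with a 2 pixel border and a corner radius of 5, this gives vOuter = 10, vInner = 8, vSquare = 5 and vDiff = (10, 0).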

Fragment shader

We calculate a distance measure len from the marker's midpoint and draw it with bright colors for debugging:

gl_FragColor = vec4(
	mod(floor(len / 2.0), 3.0) * 0.5,
	mod(floor(len / 4.0), 3.0) * 0.375,
	mod(floor(len / 8.0), 3.0) * 0.375,
	1.0
);

For the final result with nice anti-aliased edges we use linear interpolation and clamp:

gl_FragColor = mix(
	vStroke,
	vFill,
	clamp((vInner - len) * uBlur, 0.0, 1.0)
) * clamp((vOuter - len) * uBlur, 0.0, 1.0);
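
The blending is easy to follow on the CPU. In the sketch below, `uBlur` is assumed to be 1 / √2 so that each ramp spans √2 pixels. The article does not state its value, and `shade`, `clamp01` and `Color` are hypothetical names.

```typescript
type Color = [number, number, number, number];

const clamp01 = (v: number): number => Math.min(Math.max(v, 0), 1);

// GLSL mix(a, b, t) = a * (1 - t) + b * t, applied per component.
const mix = (a: Color, b: Color, t: number): Color =>
	a.map((v, i) => v * (1 - t) + b[i] * t) as Color;

// Assumption: blur = 1 / sqrt(2), giving a sqrt(2) pixels long ramp.
function shade(
	len: number,
	inner: number,
	outer: number,
	stroke: Color,
	fill: Color,
	blur = Math.SQRT1_2
): Color {
	// Fill inside vInner, stroke between vInner and vOuter.
	const color = mix(stroke, fill, clamp01((inner - len) * blur));
	// Fade everything, alpha included, to zero past vOuter.
	const coverage = clamp01((outer - len) * blur);
	return color.map(v => v * coverage) as Color;
}

const red: Color = [1, 0, 0, 1];
const blue: Color = [0, 0, 1, 1];
console.log(shade(0, 8, 10, blue, red));  // deep inside the fill
console.log(shade(9, 8, 10, blue, red));  // inside the border, 1 pixel from the edge
console.log(shade(11, 8, 10, blue, red)); // outside the marker
```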

Now we need a parameterized distance measure len that produces a rounded rectangle. Then with corner radius zero we get sharp corners, and for squares with corner radius equal to half the side we get perfect circles. Let's design it step by step:


A circle has probably the simplest distance measure.

vec2 pos = vPos;
// Euclidean distance.
float len = length(pos);

Make it a square...

vec2 pos = abs(vPos);
// Manhattan distance.
float len = pos.x + pos.y;

Rotate to match the quad.

vec2 pos = abs(vPos);
// Manhattan distance on a grid rotated 45 degrees.
float len = (pos.x + pos.y + abs(pos.x - pos.y)) * 0.5;
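
The expression in this snippet is the maximum of the two components in disguise (also known as Chebyshev distance), which is why its contours are axis-aligned squares. A short derivation, writing x and y for pos.x and pos.y:

```latex
\[
\frac{x + y + |x - y|}{2} =
\begin{cases}
\dfrac{x + y + (x - y)}{2} = x & \text{if } x \ge y \\[1ex]
\dfrac{x + y + (y - x)}{2} = y & \text{if } x < y
\end{cases}
\;=\; \max(x, y)
\]
```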

Fix the proportions.

// Subtract length difference from the longer side.
vec2 pos = (abs(vPos) - vDiff);
// Manhattan distance on a grid rotated 45 degrees.
float len = (pos.x + pos.y + abs(pos.x - pos.y)) * 0.5;

For the final step, combine everything above.

// Calculate a distance measure from the rounded square's midpoint.
// First subtract length difference from the longer side.
vec2 pos = (abs(vPos) - vDiff);

// Up to the centers of circles forming the rounded corners,
// use Manhattan distance on a grid rotated 45 degrees.
float len = min((pos.x + pos.y + abs(pos.x - pos.y)) * 0.5, vSquare);

// Use Euclidean distance within the rounded corners and edges in between.
pos -= min(pos, vSquare);
len += length(pos);
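
To check the measure by hand, here is a CPU port of the final snippet. It is a sketch: `roundedRectDistance` is a hypothetical name, and the sample values assume a 40 by 20 marker with corner radius 5, so vDiff = (10, 0), vSquare = 5 and vOuter = 10.

```typescript
// CPU port of the final distance measure.
function roundedRectDistance(
	vPos: [number, number],
	vDiff: [number, number],
	vSquare: number
): number {
	// Subtract the length difference from the longer side.
	let px = Math.abs(vPos[0]) - vDiff[0];
	let py = Math.abs(vPos[1]) - vDiff[1];
	// Square-shaped distance, capped where the rounded corners begin.
	let len = Math.min((px + py + Math.abs(px - py)) * 0.5, vSquare);
	// Euclidean distance for whatever sticks out past the cap.
	px -= Math.min(px, vSquare);
	py -= Math.min(py, vSquare);
	len += Math.hypot(px, py);
	return len;
}

const diff: [number, number] = [10, 0];
console.log(roundedRectDistance([0, 0], diff, 5));   // marker center
console.log(roundedRectDistance([20, 0], diff, 5));  // middle of the right edge
console.log(roundedRectDistance([0, 10], diff, 5));  // middle of the top edge
console.log(roundedRectDistance([20, 10], diff, 5)); // corner of the bounding box
```

The center gives 0 and both edge midpoints give exactly vOuter = 10, while the bounding box corner gives more than 10 and therefore falls outside the rounded shape.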

Note that the resulting shaders are branchless thanks to the arithmetic on the vertex number's two bits that yields the quad corner coordinates, and a suitably designed distance measure. This means the GPU can execute everything in parallel without complications.

Further research

You can see a working animated demo with editable source code at the top of this article. Take and adapt it for your own projects, the license is as friendly as it could possibly be.

While the tricks introduced here are somewhat useful for drawing points, they're absolutely magical for polylines. Stay tuned...

Photo by Melissa Poole on Unsplash.
This article is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
All code examples and article comments are licensed under Creative Commons Zero 1.0 Universal (CC0 1.0).