Compute Shaders in Processing

I’m working on a gradient library for Processing and I am exploring the use of colour spaces other than RGB to use as the basis to interpolate between different colors (to generate the gradient).

One of these alternative colour spaces is LAB, which generally provides much better gradient results than interpolating between the R,G,B components of two RGB colours.

The downside is that one must convert the colours in the gradient from RGB to LAB, interpolate using the LAB components to produce a new LAB colour, and then convert that back into RGB (to give to Processing to draw).

The interpolation is fairly inexpensive, but the RGB–>LAB and LAB–>RGB conversion is very expensive. In testing, using the LAB colorspace (with these additional 2 conversions for every pixel)
is about 15x slower than working in RGB only.

Therefore, I’ve thought about translating the RGB<–>LAB Java methods into a glsl compute shader to get the GPU to do the heavy lifting. These functions are maths-only which I believe can be ported into glsl quite well. However, I don’t have experience writing a computer shader, nor using one in Processing.

So, I have two questions:

  • Has anyone used compute shaders in Processing?
  • Where does one begin translating a maths-heavy Java method into a compute shader?
1 Like

I do not think that you need compute shaders here because a simple fragment shader would do it for generating gradients. I have added you a simple example which currently uses RGB values to calculate the gradient. I leave it up to you to extend it with the lab conversion. But as a hint, check out this shadertoy example: https://www.shadertoy.com/view/wt23Rt

PShader gradientShader;
PGraphics canvas;

void setup() {
  size(500, 500, P2D);

  canvas = createGraphics(400, 400, P2D);

  gradientShader = createShader();
}

void draw() {
  background(55);

  gradientShader.set("resolution", canvas.width, canvas.height);
  gradientShader.set("startColor", 1.0, 0, 0); // red
  gradientShader.set("endColor", 0, 0, 1.0); // blue
  canvas.filter(gradientShader);

  imageMode(CENTER);
  image(canvas, width / 2, height / 2);
}

public PShader createShader() {
  String[] vert = new String[] {
    "#version 150", 

    "uniform mat4 transform;", 

    "in vec4 position;", 
    "in vec4 color;", 

    "out vec4 vertColor;", 

    "void main() {", 
    "  gl_Position = transform * position;", 
    "  vertColor = color;", 
    "}"
  };

  String[] frag = new String[] {
    "#version 150",

    "in vec4 vertColor;", 
    "out vec4 fragColor;", 
    
    "uniform vec2 resolution;",
    
    "uniform vec3 startColor;",
    "uniform vec3 endColor;",

    "void main() {",
    "  vec2 coordinates = gl_FragCoord.xy / resolution;",
    "  vec3 value = mix(startColor, endColor, coordinates.x);",
    "  fragColor = vec4(value.xyz, 1.0);",
    "}"
  };

  return new PShader(this, vert, frag);
}

Preview

1 Like

Thanks for the reply.

It seems that this technique would tie the main sketch to OpenGL mode since the PShader() constructor needs an OpenGL PApplet, and not simply an OpenGL PGraphics object. With a library, this is what I was trying to avoid.

Is there a way to get around this: using PShaders without the need for the parent sketch to be in OpenGL mode? Alternatively, circumventing PShaders and going via JOGL directly?

1 Like

Of course you can run it directly via JOGL, but then you would have to dig inside the framework and risk the break with the API of processing in future versions. Maybe an openCL kernel could help here, but again, you would have to include JOCL and get into it (also a bit tricky in the beginning). Also you would have to do something with the returned data (set each pixel value), which again can take a lot of time because it first has to be moved back to CPU memory and pixel access is slow in java.

I would suggest that you share a simple example (one-page-sketch), which shows the slow performance and we will have a look at it. Because regarding the profiling you showed: Math.pow method takes a long time to calculate, have you tried replacing it with value * value? And what kind of precision are you using for your calculations? How do you iterate over the pixels and so on…

2 Likes

For a linear gradient, I’m iterating pixels like this:

for (int i = 0, y = 0, x; y < height; ++y) {
	for (x = 0; x < width; ++x, ++i) {
		float step = Functions.project(ox, oy, dx, dy, x, y);
		gradientPG.pixels[i] = gradient.eval(step);
	}
}

This iteration can be multi-threaded on the CPU quite easily (where each thread writes to a portion of the PGraphics pixel array):

Code...
final int threadCount = 4;
final int pixels = gradientPG.pixels.length / threadCount;
ExecutorService exec = Executors.newFixedThreadPool(threadCount);

gradientPG = p.createGraphics((int) dimensions.x, (int) dimensions.y);
gradientPG.beginDraw();
gradientPG.loadPixels();

HashSet<LinearThread> threads = new HashSet<>(threadCount);
for (int i = 0; i < threadCount; i++) {
	threads.add(new LinearThread(i * pixels, pixels, ox, oy, dx, dy, gradient));
}

try {
	exec.invokeAll(threads); // run threads, block until all finished
} catch (InterruptedException e) {
	e.printStackTrace();
}

gradientPG.updatePixels();
gradientPG.endDraw();

Where a LinearThread is this:

Code...
public class LinearThread implements Callable<Boolean> {

	final int offset, pixels;
	final float ox, oy, dx, dy;
	final Gradient gradient;

	public LinearThread(int offset, int pixels, float ox, float oy, float dx, float dy,
			Gradient gradient) {
		this.offset = offset;
		this.gradient = new Gradient();
		this.dx = dx;
		this.dy = dy;
		this.ox = ox;
		this.oy = oy;
		this.pixels = pixels;
	}

	@Override
	public Boolean call() {
		for (int i = offset; i < offset + pixels; i++) {
			float step = Functions.project(ox, oy, dx, dy, i % p.width, (i - i % p.width) / p.height);
			gradientPG.pixels[i] = gradient.eval(step);
		}
		return true;
	}

}

The Gradient class (a wrapper for a gradient’s colour and colour stops) eval() method looks like this. Given a position (step) in the gradient, it calculates the color at that point using the current colour space. However, the time this method takes is dominated by the maths methods during LAB conversion.

Code...
int eval(final float step) {
	final int size = colorStops.size();

//		 Exit from the function early whenever possible.
	if (size == 0) {
		return 0x00000000;
	} else if (size == 1 || step < 0.0) {
		return colorStops.get(0).clr;
	} else if (step >= 1.0) {
		return colorStops.get(size - 1).clr;
	}

	ColorStop currStop;
	ColorStop prevStop;
	float currPercent, scaledst;
	for (int i = 0; i < size; i++) {
		currStop = colorStops.get(i);
		currPercent = currStop.percent;

		if (step < currPercent) {

			// These can be declared within the for-loop because
			// if step < currPercent, the function will return
			// and no more iterations will be executed.
			float[] originclr = new float[4];
			float[] destclr = new float[4];
			float[] rsltclr = new float[4];

			// If not at the first stop in the gradient (i == 0),
			// then get the previous.
			prevStop = colorStops.get(i - 1 < 0 ? 0 : i - 1);

			scaledst = step - currPercent;
			float denom = prevStop.percent - currPercent;
			if (denom != 0) {
				scaledst /= denom;
			}

			// Assumes that color stops' colors are ints. They could
			// also be float[] arrays, in which case they wouldn't
			// need to be decomposed.
			switch (colorSpace) {
				// TODO CALC STEP HERE ?
				case HSB :
					Functions.rgbToHsb(currStop.clr, originclr);
					Functions.rgbToHsb(prevStop.clr, destclr);
					Functions.smootherStepHsb(originclr, destclr, scaledst, rsltclr);
					return Functions.composeclr(Functions.hsbToRgb(rsltclr));
			
				case RGB :
					Functions.decomposeclr(currStop.clr, originclr); // decompose int color to [R,G,B,A] (0-1)
					Functions.decomposeclr(prevStop.clr, destclr); // decompose int color to [R,G,B,A] (0-1)
					Functions.smootherStepRgb(originclr, destclr, scaledst, rsltclr); // lerp between colors?
					return Functions.composeclr(rsltclr);
					
				case LAB : 
					double[] rsltclr1 = new double[4];
					double[] labA = LAB.rgb2lab(Functions.decomposeclr(currStop.clr)); // RGB->LAB
					double[] labB = LAB.rgb2lab(Functions.decomposeclr(prevStop.clr));
					LAB.interpolate(labA, labB, scaledst, rsltclr1); // stil in LAB
					float[] inter = LAB.lab2rgb(rsltclr1);
					return Functions.composeclr(inter[0], inter[1], inter[2]);

				case FAST_LAB :
					double[] rsltclr2 = new double[4];
					double[] labC = FastLAB.rgb2lab(Functions.decomposeclr(currStop.clr)); // RGB->LAB
					double[] labD = FastLAB.rgb2lab(Functions.decomposeclr(prevStop.clr));
					FastLAB.interpolate(labC, labD, scaledst, rsltclr2); // stil in LAB
					float[] inter1 = FastLAB.lab2rgb2(rsltclr2);
					return Functions.composeclr(inter1[0], inter1[1], inter1[2]);
				default :
					break;
			}
		}
	}
	return colorStops.get(size - 1).clr;
}

Regarding the profiling, I’ve since replaced the Processing methods with jafama.FastMath, and I’m now working in doubles only until the very end (so only a single cast to float, rather than casting in every method) – these changes have doubled the speed.

I’ve developed a version that uses the optimised approximate methods pow(), ln() and exp() methods from here; these are indeed much faster (x5), but sadly too imprecise, often making a noticeable impact in the gradient results.

Example: Left gradient uses FastMath methods; right gradient uses approximate methods – this is a more severe example.

The LAB<–>RGB conversion uses fraction exponents, logs, etc (that’s why it’s so heavy) Here’s the conversion class:

Code...
final class LAB {

	// D65 standard referent
	private static final double Xn = 0.950470;
	private static final double Yn = 1.0;
	private static final double Zn = 1.088830;

	private static final double t0 = 0.137931034; // 4 / 29
	private static final double t1 = 0.206896552; // 6 / 29
	private static final double t2 = 0.12841855; // 3 * t1 * t1
	private static final double t3 = 0.008856452; // t1 * t1 * t1

	private static final double third = 1 / 3.0f;

	/**
	 * sRGB (0-255) to LAB (CIE-L*ab)
	 * 
	 * @param rgb
	 * @return
	 */
	public static double[] rgb2lab(float[] rgb) {
		double[] xyz = rgb2xyz(rgb[0], rgb[1], rgb[2]);
		double l = 116 * xyz[1] - 16;
		if (l < 0) {
			return new double[0]; // todo?
		}
		return new double[] { l, 500 * (xyz[0] - xyz[1]), 200 * (xyz[1] - xyz[2]) };
	}

	private static double rgb_xyz(double r) {
		if ((r /= 255) <= 0.04045)
			return r / 12.92;
		return FastMath.pow((r + 0.055) / 1.055, 2.4);
	}

	private static double xyz_lab(double t) {
		if (t > t3) {
			return FastMath.pow(t, third);
		} else {
			return t / t2 + t0;
		}
	}

	private static double[] rgb2xyz(double r, double g, double b) {
		r = rgb_xyz(r);
		g = rgb_xyz(g);
		b = rgb_xyz(b);
		final double x = xyz_lab((0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / Xn);
		final double y = xyz_lab((0.2126729 * r + 0.7151522 * g + 0.0721750 * b) / Yn);
		final double z = xyz_lab((0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / Zn);
		return new double[] { x, y, z };
	}

	// --------------------------------------------------------------------------------------------

	static final float Xn2 = 95.04f;
	static final float Yn2 = 100;
	static final float Zn2 = 108.89f;

	public static float[] lab2rgb(double[] lab) {
		return xyz2rgb(lab2xyz(lab));
	}

	/**
	 * https://github.com/Heanzy/bupt-wechatapp/tree/master/src/main/java/com/buptcc/wechatapp/utils
	 * @param Lab
	 * @return
	 */
	private static double[] lab2xyz(double[] Lab) {
		double[] XYZ = new double[3];
		final double L, a, b;
		final double fx, fy, fz;

		L = Lab[0];
		a = Lab[1];
		b = Lab[2];

		fy = (L + 16) / 116;
		fx = a / 500 + fy;
		fz = fy - b / 200;

		if (fx > 0.2069) {
			XYZ[0] = (Xn2 * FastMath.pow(fx, 3));
		} else {
			XYZ[0] = Xn2 * (fx - 0.1379f) * 0.1284f;
		}

		if ((fy > 0.2069) || (L > 8)) {
			XYZ[1] = (Yn2 * FastMath.pow(fy, 3));
		} else {
			XYZ[1] = Yn2 * (fy - 0.1379f) * 0.1284f;
		}

		if (fz > 0.2069) {
			XYZ[2] = (Zn2 * FastMath.pow(fz, 3));
		} else {
			XYZ[2] = Zn2 * (fz - 0.1379f) * 0.1284f;
		}

		return XYZ;
	}

	private static float[] xyz2rgb(double[] XYZ) {
		final float[] sRGB = new float[3];
		final double X, Y, Z;
		double dr, dg, db;
		X = XYZ[0];
		Y = XYZ[1];
		Z = XYZ[2];

		dr = 0.032406 * X - 0.015371 * Y - 0.0049895 * Z;
		dg = -0.0096891 * X + 0.018757 * Y + 0.00041914 * Z;
		db = 0.00055708 * X - 0.0020401 * Y + 0.01057 * Z;

		if (dr <= 0.00313) {
			dr = dr * 12.92;
		} else {
			dr = (FastMath.exp(FastMath.log(dr) / 2.4) * 1.055 - 0.055);
		}
		

		if (dg <= 0.00313) {
			dg = dg * 12.92;
		} else {
			dg = (FastMath.exp(FastMath.log(dg) / 2.4) * 1.055 - 0.055);
		}

		if (db <= 0.00313) {
			db = db * 12.92;
		} else {
			db = (FastMath.exp(FastMath.log(db) / 2.4) * 1.055 - 0.055);
		}

		dr = dr * 255; // 255 here TODO
		dg = dg * 255;
		db = db * 255;

		dr = FastMath.min(255, dr);
		dg = FastMath.min(255, dg);
		db = FastMath.min(255, db);

		sRGB[0] = (int) (dr + 0.5);
		sRGB[1] = (int) (dg + 0.5);
		sRGB[2] = (int) (db + 0.5);

		return sRGB;
	}

	/**
	 * originclr, destclr, scaledst, rsltclr
	 * 
	 * @param a   LAB col 1
	 * @param b   LAB col 2
	 * @param st  step
	 * @param out new col
	 * @return
	 */
	public static double[] interpolate(double[] a, double[] b, float step, double[] out) {
		step = Functions.calcStep(step);
		out[0] = a[0] + step * (b[0] - a[0]);
		out[1] = a[1] + step * (b[1] - a[1]);
		out[2] = a[2] + step * (b[2] - a[2]);
		return out;
	}
}

Looking at it just now, I think these two lines:

double[] labA = LAB.rgb2lab(Functions.decomposeclr(currStop.clr));
double[] labB = LAB.rgb2lab(Functions.decomposeclr(prevStop.clr));

… can be computed just once for the color stop (rather than for every pixel!!!)

2 Likes

Update:

Turns out jafama provides __Quick versions of the methods with less accuracy. The speed up is roughly the same speed up as the optimised approximate methods used before, but the quality is much better.

.

Better quality: not identical, but less banding than prior quick methods!

Another optimisation was refactoring the 3 cases of integer powers to multiplication:
XYZ[0] = (Xn2 * FastMath.pow(fx, 3)); to
XYZ[0] = (Xn2 * fx * fx * fx);

And the big one: calculating the LAB value of a colorstop during its construction (once) and not recalculating every pixel. Now, for each pixel, we call lab2rgb() only, rather than a call to lab2rgb and two calls to rgb2lab(). The new LAB part in eval() looks like this:

double[] rsltclr1 = new double[4];
LAB.interpolate(currStop.labclr, prevStop.labclr, smoothStep, rsltclr1);
return LAB.lab2rgb(rsltclr1);

With all these optimisations, on a single thread, I’ve gone from ~1fps to ~23fps at 800x800 pixels (In contrast, RGB interpolation runs at ~37fps).


Another trick I’ve just thought of is rendering every nth pixel, which seems to work very well for gradients:

Modifying main loop to this:

int recentPixel = 0;
for (int i = 0, y = 0, x; y < height; ++y) {
	for (x = 0; x < width; ++x, ++i) {
		if (i % n == 0) {
			float step = Functions.project(ox, oy, dx, dy, x, y);
			recentPixel = gradient1.eval(step);
		}
		gradient.pixels[i] = recentPixel;
	}
}

Results:

n=1 (render every pixel, as before)


n=2

n=4

n=8 (banding is finally noticeable)

2 Likes