Wide neural networks are difficult to implement efficiently on a CPU because the compute required by a dense layer scales quadratically with the width.
A dense neural layer of width 256 needs 256 × 256 = 65,536 fused multiply-adds per forward pass, and it only gets worse from there.
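To put numbers on that, here is a quick cost comparison. It assumes the fast transform in question costs on the order of n·log2(n) add/subtract operations (true of, for example, a fast Walsh-Hadamard transform), which is my reading of the technique, not a figure from the post:

```python
import math

width = 256
dense_macs = width * width                      # multiply-adds for one dense layer
transform_ops = width * int(math.log2(width))   # add/subtracts for one fast transform of the same width

print(dense_macs)     # 65536
print(transform_ops)  # 2048
```

At width 256 the fast transform does roughly 32× less arithmetic, and the gap widens as the width grows.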
Using Switch Net, the real limit is the L2 cache size of the CPU you are using. Beyond that, the fast transform it relies on slows down sharply because it repeatedly has to fetch data from the DRAM chips rather than from on-chip CPU memory. Within that limit you are sort of getting GPU-type performance from a CPU.
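A rough way to estimate the width limit that cache constraint implies. The cache size and per-element byte count here are hypothetical assumptions (a 1 MiB L2 cache, float32 activations plus two float32 parameter arrays per layer), not figures from the post:

```python
cache_bytes = 1 << 20       # assumed 1 MiB L2 cache (hypothetical figure)
bytes_per_element = 4 * 3   # assumed float32 activations + two float32 slope arrays

# Largest power-of-two width whose per-layer working set still fits in cache.
width = 1
while (width * 2) * bytes_per_element <= cache_bytes:
    width *= 2

print(width)  # 65536
```

Under those assumptions a width of tens of thousands still fits, far beyond what an O(n²) dense layer could handle at interactive speeds on a CPU.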
I think it is worth experimenting with; you might find a use for it.
Here is a sketch of the Switch Net neural network code:
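A minimal sketch, under two assumptions on my part: that the fast transform is a fast Walsh-Hadamard transform, and that each "switch" is a per-element function with two learned slopes (one for non-negative inputs, one for negative). The class and parameter names are mine, not from the linked code:

```python
import numpy as np

def fwht(x):
    """In-place fast Walsh-Hadamard transform: O(n log n) add/subtracts.
    The length of x must be a power of two."""
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    x *= 1.0 / np.sqrt(n)  # normalize so the transform preserves vector length
    return x

class SwitchLayer:
    """Per-element two-slope switch: y = a*x for x >= 0, y = b*x otherwise.
    The slopes are the layer's only adjustable parameters (O(n) of them)."""
    def __init__(self, n, rng):
        self.pos = rng.normal(1.0, 0.1, n)  # slopes for non-negative inputs
        self.neg = rng.normal(1.0, 0.1, n)  # slopes for negative inputs
    def forward(self, x):
        return np.where(x >= 0.0, self.pos * x, self.neg * x)

class SwitchNet:
    """Alternate fixed fast transforms (mixing, no parameters)
    with cheap learned per-element switches."""
    def __init__(self, n, depth, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [SwitchLayer(n, rng) for _ in range(depth)]
    def forward(self, x):
        x = np.array(x, dtype=np.float64)
        for layer in self.layers:
            fwht(x)               # O(n log n) mixing step
            x = layer.forward(x)  # O(n) learned step
        return x

net = SwitchNet(n=8, depth=3)
out = net.forward(np.ones(8))
print(out.shape)  # (8,)
```

Each layer costs O(n log n) for the transform plus O(n) for the switch, versus O(n²) for a dense layer. For the working implementation, see the Java/Processing code linked below.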
At the end of this blog post about the subject, in the comments, there is a link to some regular (Java) Processing code: https://ai462qqq.blogspot.com/2023/04/switch-net.html