This seems to be the key part of your concept. By default, a set of instructions in Processing / p5.js all applies to a single frame (~1/60th sec) – there is no automatic temporal dimension / animation the way there is in, say, Logo, and no built-in hierarchy the way there is in, say, Context Free.
So, this tree-like random windmill code describes a trunk, branches, and leaves which evolve randomly. However, by default it redraws every element every frame, so all the associated sounds would play at once – all of them, every frame, at 60fps.
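To make that concrete, here is a minimal model of the problem in plain JavaScript (no p5.js dependency; `playTone` is a hypothetical stand-in for a real sound call):

```javascript
// Minimal model of the problem: draw() re-runs the whole scene
// description every frame, so a sound call inside it fires ~60
// times per second rather than once per visual event.
let plays = 0;
function playTone() { plays++; }   // stand-in for an actual sound call
function draw() {                  // p5.js calls this every frame
  playTone();                      // part of drawing a branch, say
}
for (let frame = 0; frame < 60; frame++) draw(); // one second at 60fps
console.log(plays); // 60
```

One second of animation means sixty triggers, which is why naively attaching a sound to a draw call produces a wall of noise.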
So, in order to align sounds with visual instructions in a sensible way – animating a tree with different tones for trunk, branches, and leaves – you would need some kind of framework to infer that animation; otherwise the sketches themselves become quite complex, and not appropriate for learners.
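One possible shape for such a framework (a sketch, not a definitive design – the names `emit` and `step` are hypothetical): record the tree's draw instructions once, then replay them one per frame, so each element gets its own visual moment and its own sound.

```javascript
// Record the scene's instructions once, then consume one per frame.
const instructions = [];
function emit(shape, sound) {
  instructions.push({ shape, sound });
}

// Build the scene once (placeholders for trunk / branch / leaf).
emit("trunk", "boom");
emit("branch", "bang");
emit("leaf", "ding");

let cursor = 0;
const played = [];
// In p5.js this body would live inside draw(): one instruction per frame.
function step() {
  if (cursor < instructions.length) {
    const { shape, sound } = instructions[cursor++];
    played.push(sound); // stand-in for drawing the shape + playing the sound
  }
}

// Simulate three frames.
step(); step(); step();
console.log(played.join(" ")); // boom bang ding
```

The learner's sketch would only ever call `emit`; the frame-by-frame playback is the framework's job, which keeps the beginner-facing code declarative.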
Depending on your animation framework's concepts, that would give you the sound attachment points. So, if you were doing recursive descent and trying to animate a tree, you need to know whether to play each branch and then its own leaves (bang ding ding, bang ding ding) or every branch and then every leaf (bang bang, ding ding ding ding).
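Those two orderings are just depth-first vs. breadth-first traversal of the tree. A sketch of both, on a hypothetical toy tree (trunk plus two branches with two leaves each) standing in for the windmill sketch's random tree:

```javascript
// Toy tree: a trunk, two branches, each branch with two leaves.
const tree = {
  sound: "boom",                              // trunk
  children: [
    { sound: "bang", children: [              // branch
        { sound: "ding", children: [] },      // leaf
        { sound: "ding", children: [] } ] },
    { sound: "bang", children: [
        { sound: "ding", children: [] },
        { sound: "ding", children: [] } ] },
  ],
};

// Depth-first: each branch followed immediately by its own leaves.
function depthFirst(node, out = []) {
  out.push(node.sound);
  for (const child of node.children) depthFirst(child, out);
  return out;
}

// Breadth-first (level order): all branches first, then all leaves.
function breadthFirst(root) {
  const out = [];
  const queue = [root];
  while (queue.length) {
    const node = queue.shift();
    out.push(node.sound);
    queue.push(...node.children);
  }
  return out;
}

console.log(depthFirst(tree).join(" "));
// boom bang ding ding bang ding ding
console.log(breadthFirst(tree).join(" "));
// boom bang bang ding ding ding ding
```

Same tree, same sounds – the framework's choice of traversal order is what decides which rhythm the learner hears.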
It is an interesting problem! Would love to hear more ideas about it.