How to get pitch of human voice

I’m trying to make a sketch that finds the pitch of a human voice (speaking or singing). I thought I was headed in the right direction, but I’m getting stuck.

Here is my strategy:

  • Loop through all the frequencies in the range of the typical human voice (about 50 Hz to 400 Hz).
  • Use getEnergy() to find the “energy” of each of these frequencies.
  • From here, find the frequency with the highest energy. This, I assume, is the primary frequency, or pitch, of the voice at that moment in time.
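The strategy above can be sketched in plain JavaScript, outside p5. Here `energyAt()` is a hypothetical stand-in for whatever returns the spectrum’s energy at a given frequency (in p5 it would be `fft.getEnergy()`):

```javascript
// Scan candidate frequencies and keep the one with the most energy.
// energyAt(freq) is a placeholder for a spectrum lookup.
function findPitch(energyAt, lowHz = 50, highHz = 400, stepHz = 2) {
  let best = { freq: 0, energy: 0 };
  for (let freq = lowHz; freq <= highHz; freq += stepHz) {
    const energy = energyAt(freq);
    if (energy > best.energy) best = { freq, energy };
  }
  return best;
}

// Fake spectrum peaking at 200 Hz, just to demonstrate:
const fake = (f) => 255 * Math.exp(-((f - 200) ** 2) / 500);
console.log(findPitch(fake).freq); // → 200
```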

This almost works, but the frequencies seem to be grouped in a weird way. I’m not sure how to describe it precisely. I think this picture can explain better:

I am playing a tone of 200 Hz on the right. My code generated the chart on the left. Notice how the frequencies group themselves in bands of about 50 Hz. This makes it hard to get a precise pitch!

Here is my code:

var mic;
var fft;
var maxes = [];
const SENSITIVITY = 50; // How fast the avg moves
const LOWER_BOUND = 150; // Hz
const UPPER_BOUND = 300; // Hz
var gap = 2; // Granularity - gap between frequencies that are tested.

function setup() {
	createCanvas(640, 600);
	mic = new p5.AudioIn();
	mic.start();       // Start capturing audio from the mic
	fft = new p5.FFT();
	fft.setInput(mic); // Analyze the mic input, not the master output
}

function draw() {
	background(255); // Clear the previous frame
	const micLevel = mic.getLevel();

	var max = {freq: 0, energy: 0};
	for (let freq = LOWER_BOUND; freq < UPPER_BOUND; freq += gap) {
		// Labels
		if (freq % 10 === 0) {
			const y = getY(freq);
			text(freq, 550, y);
		}

		// Get energy of this freq, and draw it
		const energy = fft.getEnergy(freq, freq + gap);
		drawFreqLine(freq, 5, energy);

		// Find primary frequency
		max = (energy > max.energy) ? {freq, energy} : max;
	}

	// Rolling average of the primary frequency. This makes the graph less jumpy.
	if (micLevel > 0.001) {
		maxes.push(max);
	}
	if (maxes.length > SENSITIVITY) {
		maxes.splice(0, 1);
	}
	const sumOfMaxFrequencies = maxes.reduce((s, m) => s + m.freq, 0);
	const rollingAvgFreq = sumOfMaxFrequencies / maxes.length;
	drawFreqLine(rollingAvgFreq, 10, 10);
}

function drawFreqLine(freq, thickness = 10, energy = 255) {
	const line = {
		x: 0,
		y: getY(freq),
		w: width * energy / 255,
		h: thickness,
	};
	fill(0, 50, 250);
	rect(line.x, line.y, line.w, line.h);
}

function getY(freq) {
	return map(freq, LOWER_BOUND, UPPER_BOUND, height, 0);
}


Suspicious. Are you certain that the tone you are generating is actually a distributed set of values on a curve? The repeat values could be a mathematical artifact of how the test tone is being generated – perhaps a compression artifact could also do this. A quick first thing to do is to test with a microphone – or digital audio file playback – and see if your graph still has steps.

Perhaps test with audio of people speaking or singing.

Here’s another screenshot. This one is a human voice (Nina Simone singing). It is much less “artificial”, and much less “clean” than the generated tone. I still see the stepped effect.

I recorded the same clip and analyzed it in Adobe Audition. It shows a pitch of about 190Hz, so my chart is roughly correct. (But a little too rough.)

Your suggestion that the stepping might be a result of some sort of compression artifact is interesting. It made me wonder if perhaps there’s something funny going on because the mic and audio source (speaker) are both in my laptop. I tested it many times with my voice as the source, and I see the same stepped effect.

I’d like to break this problem down into a simpler case. I’m not sure exactly how to do that, though I’ll keep playing around and see if I can find an angle. Advice is most appreciated.

Sounds like a good idea. This is @Kevin’s go-to advice for such situations.

Unfortunately I’m not an audio expert, particularly not with p5.js, so I don’t have a lot to offer.

I haven’t inspected the code, but other things to possibly look for:

  1. rounding errors or integer division (although your data doesn’t look perfectly stepped, only mostly stepped)
  2. async timing problems, where the same slice is being sampled and then it jumps forward. If you run your sample measurement while accessing audio in non-realtime does it have the same behavior?

But these are just guesses.


I believe your bin size is determined by your buffer length (inversely proportional).


For example, if N is chosen to be 1024, the spectrum’s continuous curve will be sampled 513 times between 0 and 0.5. If N is chosen to be much larger than the window length, the samples in the frequency domain will be close enough that the peaks and valleys of the continuous curve will be preserved in the new spectrum. If N is made the same as the window length, the fewer number of samples in the spectrum results in the regular pattern of peaks and valleys turning into irregular tails, depending on where the samples happen to fall.

That’s from a text book on DSP, but the p5.js doc mentions this as well.

FFT analyzes a very short snapshot of sound called a sample buffer. It returns an array of amplitude measurements, referred to as bins. The array is 1024 bins long by default. You can change the bin array length, but it must be a power of 2 between 16 and 1024 in order for the FFT algorithm to function correctly. The actual size of the FFT buffer is twice the number of bins, so given a standard sample rate, the buffer is 2048/44100 seconds long.
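The numbers in that quote imply a bin width you can compute directly. Assuming the standard 44100 Hz sample rate it mentions:

```javascript
// Frequency resolution of an FFT is sampleRate / fftSize.
// With p5's defaults (1024 bins, i.e. a 2048-sample buffer at 44100 Hz),
// each bin spans roughly 21.5 Hz - getEnergy() cannot distinguish
// frequencies closer together than that.
const sampleRate = 44100;
const bins = 1024;        // p5.FFT default
const fftSize = 2 * bins; // the buffer is twice the bin count
const binWidthHz = sampleRate / fftSize;
console.log(binWidthHz.toFixed(1)); // → "21.5"
```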


Thank you, @Uncle_Jonno! This set me on a track to what seems to be a solution.

I initialized the FFT with many more bins, like this:

  const smoothing = 0.8 // The p5 default
  const bins = 8192 // Way more than the default, which is 1024
  fft = new p5.FFT(smoothing, bins)

More bins means smaller bins, and smaller bins mean finer frequency resolution. Indeed, the stepping got more and more granular as I increased the number of bins.
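This matches the arithmetic. Assuming a 44100 Hz sample rate and p5’s buffer being twice the bin count, the resolution scales like this:

```javascript
// How bin count affects frequency resolution, assuming 44100 Hz
// sample rate and an FFT buffer of 2 * bins (as in p5.sound).
for (const bins of [1024, 2048, 4096, 8192]) {
  const widthHz = 44100 / (2 * bins);
  console.log(`${bins} bins -> ~${widthHz.toFixed(1)} Hz per bin`);
}
// At 8192 bins, each bin is only a few Hz wide - fine-grained
// enough to track vocal pitch.
```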

Here’s a weird thing: The p5 docs say that the maximum number of bins is 1024. I gave it 8192 bins (2^13). It gave me the result I want and seems to work just fine.

I do wish I had a better understanding of exactly how the FFT works, but this seems to do the trick for my purposes.

If anyone is interested, the pitch tracker I made using this code is here:
