Has anybody implemented, or does anybody know of resources for, generating a unique image signature? The signature would be based on image content (pixels). Here is my case: I have a folder with a bunch of images, and I want to go through all of them and check which ones are repeated. Bonus points if the signature is independent of image size… OK, maybe this is not that easy (I am getting ahead of myself), but the idea is that if two images differ only by a scale factor, I can still match them because the generated signatures are the same “or close”.
In other words, have you done anything like this before? Any algorithm that you would recommend?
Well, the images themselves are their own signature… each is assured to be unique from all the others, unless they are the same image.
I think what you’re really trying to ask is whether you can generate a hash value from an image. Two images that are the same always generate the same hash value, and two images that are different will usually generate different hash values.
So, if you generate two hash values for two images and they are different, you know the images are different. If they have the same hash value, it might be the case that the images are the same - but you would still have to check the actual images (not their hashed values) to be sure.
One simple approach for hashing, and the one I will use as an example, is to just add up the R, G, and B values of every pixel in the image and then take that sum modulo 256. This results in a number between 0 and 255, so for two images that are different there is roughly a 1/256 chance that their hash values collide (assuming the sums are spread evenly).
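Here is a minimal sketch of that sum-and-modulo idea in C. The function name `sum_hash` and the packed 0xAARRGGBB pixel layout (the layout Processing uses in `pixels[]`) are my assumptions, not something fixed by the description above. It reduces modulo 256 so the result fits in one byte, 0 to 255.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds up R, G and B of every pixel and reduces mod 256.
 * Assumption: each pixel is a packed 0xAARRGGBB value; alpha is ignored. */
uint8_t sum_hash(const uint32_t *pixels, size_t count)
{
    assert(pixels != NULL || count == 0);
    uint32_t sum = 0;   /* may wrap around, which is harmless: 256 divides 2^32 */
    for (size_t i = 0; i < count; i++) {
        uint32_t p = pixels[i];
        sum += (p >> 16) & 0xFF;   /* red   */
        sum += (p >> 8) & 0xFF;    /* green */
        sum += p & 0xFF;           /* blue  */
    }
    return (uint8_t)(sum & 0xFF);  /* same as sum % 256 */
}
```

Note that a plain sum ignores pixel order, so an image and a shuffled or mirrored copy of it get the same value. That is one more reason to treat a matching hash only as a hint and still compare the actual pixels.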
Other hashing algorithms are certainly possible. Consider treating each image as an array of pixels, and taking 625 pixels from that array at even intervals. This results in a 25 by 25 pixel image that sort of represents the image as a whole. Of course, two images might end up generating the same 625 pixel image, but that is unlikely… You might also consider averaging the colors of the pixels in 625 ranges of equal length to get a hashed image.
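The range-averaging variant could look roughly like the sketch below. The names `range_average_signature` and `signature_distance` are made up for illustration, pixels are assumed to be packed 0xAARRGGBB values, and the distance function is an extra assumption of mine so that two signatures can be compared as "close" rather than only "equal".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIG_LEN 625   /* 25 x 25, as described above */

/* Hypothetical helper: averages the colors over SIG_LEN ranges of (nearly) equal length.
 * Assumption: packed 0xAARRGGBB pixels and count >= SIG_LEN. */
void range_average_signature(const uint32_t *pixels, size_t count, uint32_t sig[SIG_LEN])
{
    assert(pixels != NULL && count >= SIG_LEN);
    for (size_t k = 0; k < SIG_LEN; k++) {
        /* Range k covers [start, end); integer division spreads any remainder. */
        size_t start = k * count / SIG_LEN;
        size_t end = (k + 1) * count / SIG_LEN;
        uint64_t r = 0, g = 0, b = 0;
        for (size_t i = start; i < end; i++) {
            r += (pixels[i] >> 16) & 0xFF;
            g += (pixels[i] >> 8) & 0xFF;
            b += pixels[i] & 0xFF;
        }
        size_t n = end - start;   /* at least 1 because count >= SIG_LEN */
        sig[k] = 0xFF000000u | ((uint32_t)(r / n) << 16)
               | ((uint32_t)(g / n) << 8) | (uint32_t)(b / n);
    }
}

/* Sum of absolute per-channel differences; 0 means identical signatures. */
uint32_t signature_distance(const uint32_t a[SIG_LEN], const uint32_t b[SIG_LEN])
{
    uint32_t d = 0;
    for (size_t k = 0; k < SIG_LEN; k++) {
        for (int shift = 0; shift <= 16; shift += 8) {
            int ca = (int)((a[k] >> shift) & 0xFF);
            int cb = (int)((b[k] >> shift) & 0xFF);
            d += (uint32_t)(ca > cb ? ca - cb : cb - ca);
        }
    }
    return d;
}
```

Two images would then count as candidates for "the same" when the distance falls under some threshold you pick by experiment. Keep in mind that this treats the image as a flat array, so it only approximates a real 25 by 25 thumbnail.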
Do you need example code written for these approaches?
Why not just compare the pixels directly, if that’s your actual goal?
@Kevin I was thinking of generating a hash, loading it into a DB, and letting SQL do the work for me. I was looking into writing my own algorithm, but I wanted to check with the community first in case anybody had done something like this before. The images could be almost the same but not exactly. That is part of the challenge, and a hash algorithm is probably too strict for this task. However, I will go ahead and implement one just for fun.
@TfGuy44 Thxs Thomas. No need for code, unless you have one handy. Otherwise I will just code it myself. I just wanted to have some input. Thxs for your ideas. As a matter of fact, I was considering one of your approaches:
This results in a 25 by 25 pixel image […] You might also consider averaging the colors of the pixels in 625 ranges
For hash values, I was considering a 16-bit CRC or maybe implementing md5sum. For the CRC, it would be something along the lines of this:
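#include <stdint.h>

/* Bitwise CRC-16 using the reflected polynomial 0x8408 and an initial value of 0. */
uint16_t calcCRC(const void *pBuffer, uint16_t bufferSize)
{
    const uint8_t *pBytesArray = (const uint8_t *)pBuffer;
    const uint16_t poly = 0x8408;
    uint16_t crc = 0;
    for (uint16_t j = 0; j < bufferSize; j++) {
        crc ^= pBytesArray[j];
        for (int i_bits = 0; i_bits < 8; i_bits++) {
            uint16_t carry = crc & 1;   /* bit about to be shifted out */
            crc >>= 1;
            if (carry)
                crc ^= poly;            /* apply the polynomial only on carry */
        }
    }
    return crc;
}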
The trouble with traditional hashing algorithms is that a small change in the input value causes a drastic change in the output hash - for recognizing images that are the same, this might not be what you want.
Is near-duplicate detection / fuzzy matching a goal here? Or bitwise-exact matches only?
If fuzzy, are the images all one file type (jpeg, png)?
Near-duplication ideas are very welcome. I guess near-duplication and fuzzy matching are the same concept? However, my first kick at the can is going to be signature generation based on mod-like algorithms, as suggested by Thomas.
For now, the source images are JPEG. However, I will play with different formats. If I load the images in P3 and they come in two or more flavors (bmp, jpeg, png, tiff, etc.), would they match if they originated from the same base image file?
A key term here is “perceptual hash” – I haven’t done this in a while, and never in Java, but I did a quick roundup of some StackOverflow and Github resources on Java-based perceptual hashing here:
I do not recommend writing your own implementation – use a library if at all possible. This is an extremely well studied problem that is also very fiddly and can have lots of weird edge cases. Published solutions are dramatically better than naive implementations.
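Just to make the term concrete, here is a toy sketch of one of the simplest schemes in this family, usually called "average hash". It is an illustration only and not a substitute for a published library. The names `average_hash` and `hamming_distance`, the 8 by 8 grid, and the 8-bit grayscale input are all assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy average hash: shrink to 8 x 8 by block averaging, then set one bit per
 * cell that is brighter than the mean of all cells.
 * Assumption: gray is a row-major 8-bit grayscale image, width and height >= 8. */
uint64_t average_hash(const uint8_t *gray, size_t width, size_t height)
{
    assert(gray != NULL && width >= 8 && height >= 8);
    uint32_t cell[64];
    uint32_t total = 0;
    for (size_t cy = 0; cy < 8; cy++) {
        size_t y0 = cy * height / 8, y1 = (cy + 1) * height / 8;
        for (size_t cx = 0; cx < 8; cx++) {
            size_t x0 = cx * width / 8, x1 = (cx + 1) * width / 8;
            uint64_t sum = 0;
            for (size_t y = y0; y < y1; y++)
                for (size_t x = x0; x < x1; x++)
                    sum += gray[y * width + x];
            cell[cy * 8 + cx] = (uint32_t)(sum / ((y1 - y0) * (x1 - x0)));
            total += cell[cy * 8 + cx];
        }
    }
    uint32_t mean = total / 64;
    uint64_t hash = 0;
    for (int i = 0; i < 64; i++)
        if (cell[i] > mean)
            hash |= (uint64_t)1 << i;   /* bit i describes cell i */
    return hash;
}

/* Number of differing bits between two hashes; small means "looks similar". */
int hamming_distance(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    int bits = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        bits++;
    }
    return bits;
}
```

Unlike a CRC or MD5, a rescaled or slightly recompressed copy tends to land at a small Hamming distance from the original rather than at a completely different value. How small counts as "the same image" is a threshold you would have to tune on your own data.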