Auto de-duplicate and rename images

Some time ago I gathered a list of links about image fingerprinting, similarity and perceptual hashing.
I recently used an algorithm like that to automatically rename and de-duplicate images on my blog, and I think it’s interesting enough to share.

There are two interesting Python libraries that I checked: D-Hash and Block-Hash.

Both of them read images and generate a long number (or string) that represents the image in such a way that visually similar images have very similar values.
The name of the image file doesn’t matter; this is about the content of the image.
It also doesn’t matter whether the image is exported at full quality or at lower quality (JPEG compression).
As a bonus, resizing the image generates very similar values too.

Both algorithms work in a similar way. They shrink the image down to a small grid, walk over the cells in order and turn each brightness comparison into one bit, until the bits add up to a very big number.
D-Hash uses the image luminosity levels and generates 2 numbers (one from comparing horizontal neighbors, one from comparing vertical neighbors) that are later joined together.
Block-Hash uses all the color channels to build a matrix of block values, which are then turned into bits and joined.
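
To make the idea more concrete, here is a minimal sketch of a difference hash in plain Python. It is not the code of either library. It assumes the image was already converted to grayscale and shrunk to a 9 × 9 grid; `grid`, `dhash_row_col` and `join_hashes` are hypothetical names used only for this illustration.

```python
def dhash_row_col(grid):
    """Return (row_hash, col_hash) for a (size+1) x (size+1) grayscale grid.

    Each bit records whether a cell is darker than its right neighbor
    (row hash) or darker than the cell below it (column hash).
    """
    size = len(grid) - 1
    row_hash = 0
    col_hash = 0
    for y in range(size):
        for x in range(size):
            row_bit = grid[y][x] < grid[y][x + 1]
            col_bit = grid[y][x] < grid[y + 1][x]
            row_hash = (row_hash << 1) | int(row_bit)
            col_hash = (col_hash << 1) | int(col_bit)
    return row_hash, col_hash


def join_hashes(row_hash, col_hash, size=8):
    """Join the two numbers into one hex string of 2 * size * size bits."""
    bits = size * size
    hex_digits = 2 * bits // 4
    return format((row_hash << bits) | col_hash, "0{}x".format(hex_digits))


# Made-up example: a horizontal gradient that gets brighter to the right.
gradient = [[x for x in range(9)] for _ in range(9)]
row_hash, col_hash = dhash_row_col(gradient)
print(join_hashes(row_hash, col_hash))
# prints ffffffffffffffff0000000000000000
```

Because only the comparisons between neighbors are stored, making the whole grid brighter or darker by the same amount leaves the hash unchanged, which is one reason small edits and recompression barely move the value.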

Other libraries generate useful representations of images in a similar way; one worth mentioning is BlurHash. I might write a blog post about it another time.

Evaluation

I started with one image that I took in Dublin a year ago:

Dublin pyramids

I made a few copies of the same image like so:

  • the original image (4032 × 3024px) (11.7 MB on disk)
  • optimized with JPEG optim 90% (4032 × 3024px) (2.9 MB on disk)
  • resized to full HD, starting from the optimized file (1920 × 1440px) (868 KB on disk)
  • resized to 1024px, starting from the optimized file (1024 × 768px) (270 KB on disk)
  • resized to a small thumb (256 × 192px) (25 KB on disk)
  • the 1024px image, edited in Luminar v3 to change some colors and the sky (1024 × 768px) (352 KB on disk)
  • the 1024px image, edited in Photoshop v21 with Auto-Tone and Auto-Contrast (1024 × 768px) (356 KB on disk)

I also added a similar image of the same building, not edited, full size, taken on the same day from a slightly different angle (4032 × 3024px) (10.7 MB on disk).

The values from both libraries can be represented as a binary string, a number, or a hex string. For this comparison I used the hex representation.
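
As a small illustration of those representations, the snippet below converts one hex value between the three forms using only built-in Python functions. The variable names are just for this example.

```python
hex_hash = "b0cd0f57f633270ee381080000ffff7c"  # a 32-character D-Hash value

as_number = int(hex_hash, 16)            # one big integer
as_binary = format(as_number, "0128b")   # 128 characters of "0" and "1"
back_to_hex = format(as_number, "032x")  # 32 hex characters again

print(len(as_binary))           # prints 128
print(back_to_hex == hex_hash)  # prints True
```

Each hex character stands for 4 bits, so the 32-character D-Hash values are 128 bits long and the 64-character Block-Hash values are 256 bits long.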

Here are the results:

DHash_hexes = [
  'b0cd0f57f633270ee381080000ffff7c', # Original img
  'b0cd0f57f633270ee381080000ffff7c', # Optimized
  'b0cd0f57f633270ee381080000ffff7c', # full HD size
  'b0cd0f57f633070ee381080000ffff7c', # 1024 px
  'b0cd0f57e633074ee381081000ffff7c', # Thumbnail
  'b0cd0f57f233374ee3c1083000ffff7c', # Luminar edit
  'b0cd0f57f633070ee381080000ffff7c', # Photoshop edit
  'b1cc8f17f6f1270ee3e3080000fbff7e', # the similar image
]

BlockHash_hexes = [
  'f5ead4aa9482d4aaf4aafdfe0000d5aaf5ea808af5ba84aafdfe802afdfe0000', # Original img
  'f5ead4aa9482d4aaf4aafdfe0000d5aaf5ea808af5ba84aafdfe802afdfe0000', # Optimized
  'f5ead4aa9482d4aaf4aafdfe0000d5aaf5fa808af5ba80aafdfe802afdfe0000', # full HD size
  'f5ead4aa9482d4aad4aafdfe0000d5eaf5fa808af5ba842afdfe808afdfe0000', # 1024 px
  'f5ead4aa9482d4aad4aafdfe0000d5eaf5fa808af5ba842afdfe808afdfe0000', # Thumbnail
  'd5ead4aa9482f4aad4aafdfe0000d5eaf5fa848af5aa848afdfe808afdfe0000', # Luminar edit
  'f5ead4aa9482d4aad4aafdfe0000d5eaf5fa848af5ba802afdfe808afdfe0000', # Photoshop edit
  'd5ead5aa9482f4aaf4aafdfe0000d5aaf5fa948af5aa808afdfe900afdfe0000', # the similar image
]

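The differences are hard to see, so I created a matrix of the Hamming distance between the values, counted as the number of hex characters that differ. Rows and columns follow the same image order as the lists above: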

D-Hash diffs

--------------------------------------
0  | 0  | 0  | 1  | 4  | 5  | 1  | 10 | <- Original img
--------------------------------------
0  | 0  | 0  | 1  | 4  | 5  | 1  | 10 | <- Optimized
--------------------------------------
0  | 0  | 0  | 1  | 4  | 5  | 1  | 10 | <- full HD size
--------------------------------------
1  | 1  | 1  | 0  | 3  | 5  | 0  | 11 | <- 1024 px
--------------------------------------
4  | 4  | 4  | 3  | 0  | 5  | 3  | 14 | <- Thumbnail
--------------------------------------
5  | 5  | 5  | 5  | 5  | 0  | 5  | 14 | <- Luminar edit
--------------------------------------
1  | 1  | 1  | 0  | 3  | 5  | 0  | 11 | <- Photoshop edit
--------------------------------------
10 | 10 | 10 | 11 | 14 | 14 | 11 | 0  | <- Similar image
--------------------------------------

Block-Hash diffs

--------------------------------------
0  | 0  | 2  | 5  | 5  | 9  | 7  | 11 | <- Original img
--------------------------------------
0  | 0  | 2  | 5  | 5  | 9  | 7  | 11 | <- Optimized
--------------------------------------
2  | 2  | 0  | 5  | 5  | 9  | 5  | 9  | <- full HD size
--------------------------------------
5  | 5  | 5  | 0  | 0  | 5  | 2  | 12 | <- 1024 px
--------------------------------------
5  | 5  | 5  | 0  | 0  | 5  | 2  | 12 | <- Thumbnail
--------------------------------------
9  | 9  | 9  | 5  | 5  | 0  | 5  | 7  | <- Luminar edit
--------------------------------------
7  | 7  | 5  | 2  | 2  | 5  | 0  | 10 | <- Photoshop edit
--------------------------------------
11 | 11 | 9  | 12 | 12 | 7  | 10 | 0  | <- Similar image
--------------------------------------
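
The matrices above can be reproduced with a few lines of Python. This is a sketch, not the script I actually used: `hex_distance` and `bit_distance` are names made up for this example. The numbers in the tables match a count of differing hex characters; a count of differing bits is shown next to it for comparison and can be slightly higher.

```python
dhash_hexes = [
    "b0cd0f57f633270ee381080000ffff7c",  # Original img
    "b0cd0f57f633270ee381080000ffff7c",  # Optimized
    "b0cd0f57f633270ee381080000ffff7c",  # full HD size
    "b0cd0f57f633070ee381080000ffff7c",  # 1024 px
    "b0cd0f57e633074ee381081000ffff7c",  # Thumbnail
    "b0cd0f57f233374ee3c1083000ffff7c",  # Luminar edit
    "b0cd0f57f633070ee381080000ffff7c",  # Photoshop edit
    "b1cc8f17f6f1270ee3e3080000fbff7e",  # the similar image
]


def hex_distance(a, b):
    """Number of positions where two equal-length hex strings differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def bit_distance(a, b):
    """Number of differing bits between two hex strings."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")


matrix = [[hex_distance(a, b) for b in dhash_hexes] for a in dhash_hexes]
for row in matrix:
    print(" | ".join(f"{d:<2}" for d in row))

# Original vs Luminar edit: 5 hex characters differ, but 6 bits differ.
print(hex_distance(dhash_hexes[0], dhash_hexes[5]))  # prints 5
print(bit_distance(dhash_hexes[0], dhash_hexes[5]))  # prints 6
```

Counting bits is the more common definition of Hamming distance for perceptual hashes; for this comparison the ranking of the images stays the same either way.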

Conclusion

In the case of D-Hash, the first 3 images are considered absolutely identical. Even when resized down to 1024px, the difference is just 1.
The most “dramatic” change seems to be the Luminar edit, because I changed a lot in the colors and the sky.

In the case of Block-Hash, the first 2 images are considered identical, but the full HD one has a difference of 2 from the original image.
I was surprised to see that Block-Hash considers the 1024px image and the thumbnail to be identical: it finds a difference of 5 between the original image and the 1024px one, and D-Hash finds a difference of 3 between the 1024px image and the thumbnail.

For my blog, I decided to use D-Hash because the code is much smaller and simpler to understand, and it works just as well. If you need better precision, you can still work from a higher-resolution image and generate longer hashes.
I encoded the hash to Base32 to keep the image file name shorter.
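
A minimal sketch of that encoding step, using only the standard `base64` module. Stripping the `=` padding and lowercasing the result are assumptions made for this example, and the function names are hypothetical.

```python
import base64


def hex_to_base32(hex_hash):
    """Encode a hex hash as a lowercase Base32 string without padding."""
    raw = bytes.fromhex(hex_hash)
    return base64.b32encode(raw).decode("ascii").rstrip("=").lower()


def base32_to_hex(name):
    """Reverse of hex_to_base32: restore the padding and decode."""
    padded = name.upper() + "=" * (-len(name) % 8)
    return base64.b32decode(padded).hex()


hex_hash = "b0cd0f57f633270ee381080000ffff7c"
name = hex_to_base32(hex_hash)
print(len(hex_hash), len(name))  # prints 32 26
```

Base32 stores 5 bits per character instead of 4, so a 128-bit hash shrinks from 32 to 26 characters, and the lowercase alphabet stays safe for file names on case-insensitive file systems.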

I wrote a short Python script that first resizes the image, optimizes it with JPEG optim, then calculates the hash and renames the image with the Base32 hash.
I will publish it soon, but I need to clean up the code a little 😅
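
Until then, here is a rough sketch of only the renaming step. It is not the actual script: `rename_by_hash` is a hypothetical helper, the resize and optimize steps are left out, and `hash_func` stands for whatever function turns an image file into its hash string.

```python
from pathlib import Path


def rename_by_hash(folder, hash_func):
    """Rename every .jpg in `folder` to '<hash>.jpg'.

    `hash_func` is any callable that takes a Path and returns a string.
    Files whose hash-based name is already taken are left untouched and
    returned as likely duplicates.
    """
    duplicates = []
    for path in sorted(Path(folder).glob("*.jpg")):
        target = path.with_name(hash_func(path) + path.suffix)
        if target == path:
            continue  # already named after its hash
        if target.exists():
            duplicates.append(path)  # same hash as an existing file
            continue
        path.rename(target)
    return duplicates
```

This only catches images with exactly the same hash. Catching near-duplicates, like the thumbnail or the edited copies in the test above, would need a distance threshold on top of it.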

@articles #image #similarity #hashing