Cumulation is the current WIP title for the image/video sorting software I’m working on.
I have a big collection of images, screenshots, video clips, etc. spread across many, many folders and hard drives, and the plan is to use this software to properly sort and manage that collection.
I also plan to add duplicate detection, and since that step is probably going to be one of the more complex ones, I decided to start with it.
Duplicate Detection
What I’ve been doing so far is setting up a Tauri project and sending an array of folder paths to the Rust backend of the application. This works without issues, using just a Vec<String> on the Rust side.
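For illustration, the receiving end can be as simple as a Tauri command like this (the add_folders name is made up for the example):

```rust
// Hypothetical sketch of the command receiving the folder paths.
#[tauri::command]
fn add_folders(paths: Vec<String>) {
    for path in &paths {
        println!("received folder: {path}");
    }
    // from here the paths get handed off to the directory walker
}
```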
I’ve implemented the directory walking via walkdir, as follows:
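Roughly, it boils down to something like this (a simplified sketch; the collect_files helper is illustrative rather than my exact code):

```rust
use walkdir::WalkDir;

// Simplified sketch: walk every folder the frontend sent over and
// collect the paths of all regular files found along the way.
fn collect_files(folders: Vec<String>) -> Vec<std::path::PathBuf> {
    let mut files = Vec::new();
    for folder in folders {
        // curr_path is the folder currently being walked
        let curr_path = std::path::PathBuf::from(folder);
        for entry in WalkDir::new(&curr_path)
            .into_iter()
            .filter_map(Result::ok)
            .filter(|e| e.file_type().is_file())
        {
            files.push(entry.path().to_path_buf());
        }
    }
    files
}
```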
curr_path here is simply the current path, converted from the String that the frontend sent over.
Next I looked at VideoDuplicateFinder’s code to find some inspiration on how exactly I would implement image duplicate detection. There it’s done via a method called GetGrayscaleValues, which takes in an Image and a darkness threshold percentage and returns an array of bytes.
videoduplicatefinder/VDF.Core/Utils/GrayBytesUtils.cs:41
This method basically does the following in sequence:
- Take in an Image with 16x16 pixel size
- Initialize a byte[] with a size of 16x16 (256)
- Loop over the raw pixel data (1024 bytes) in steps of 4, i.e. one pixel at a time
- Ignore the alpha channel but extract the red/green/blue values
- Calculate the perceived (relative) luminance of the pixel with some slightly modified weights
- Add the red value of the pixel to the buffer
- Check if the pixel is bright enough to be perceived (value over 0x20 (32))
- Return the buffer if enough pixels (more than the threshold set by the parameter) are perceivable
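To make that sequence concrete, here is a rough Rust equivalent. This is only a sketch: the image crate usage and the plain Rec. 601 luminance weights are stand-ins (VDF uses slightly modified weights):

```rust
use image::imageops::FilterType;

// Sketch of the GetGrayscaleValues sequence. Returns None when too
// few pixels clear the darkness threshold.
fn gray_values(img: &image::DynamicImage, threshold_pct: f32) -> Option<[u8; 256]> {
    // scale down to 16x16; every pixel keeps its four RGBA channels
    let small = img.resize_exact(16, 16, FilterType::Triangle).to_rgba8();
    let mut buffer = [0u8; 256];
    let mut perceivable = 0usize;
    for (i, px) in small.pixels().enumerate() {
        let [r, g, b, _a] = px.0; // alpha is ignored
        // perceived (relative) luminance, standard Rec. 601 weights
        let lum = 0.299 * r as f32 + 0.587 * g as f32 + 0.114 * b as f32;
        buffer[i] = r; // like the original, the red value is what gets stored
        if lum > 32.0 {
            perceivable += 1; // bright enough to be perceived (> 0x20)
        }
    }
    (perceivable as f32 >= 256.0 * threshold_pct / 100.0).then_some(buffer)
}
```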
In my humble opinion this should be called GetRedValues, since I don’t think the image is converted to grayscale anywhere, even before running through this method, but hey, what do I know.
It’s copied as Bgra32, which according to the documentation:

> Gets the Bgra32 pixel format. Bgra32 is a sRGB format with 32 bits per pixel (BPP). Each channel (blue, green, red, and alpha) is allocated 8 bits per pixel (BPP).

In other words, it does nothing more than allocate 8 bits (1 byte) per channel per pixel.
Now my version of this looks a bit uglier, but it does the same thing. There’s just one issue…
Accuracy of the algorithm
Now, VideoDuplicateFinder works just fine for videos, and probably for pictures too, although I haven’t tested this extensively. However, this kind of algorithm is suited a lot better to actual photographs; video game screenshots and artworks fare a lot worse with it.
To detect differences between two images, all it does is take those byte[] buffers and compute the absolute difference between the first and second image’s red values. It also does some checking for available CPU features to make this process a bit faster, but we’ll ignore all that optimization stuff for now.
videoduplicatefinder/VDF.Core/Utils/GrayBytesUtils.cs:83
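Stripped of those optimizations, the comparison boils down to something like this sketch (normalizing the summed difference to a percentage is my assumption about how the threshold is applied):

```rust
// Sketch: compare two 256-byte buffers by summed absolute difference,
// normalized to a percentage of the maximum possible difference.
fn percent_difference(a: &[u8; 256], b: &[u8; 256]) -> f32 {
    let total: u32 = a
        .iter()
        .zip(b.iter())
        .map(|(&x, &y)| x.abs_diff(y) as u32)
        .sum();
    // worst case: all 256 bytes differ by the full 255
    total as f32 / (256.0 * 255.0) * 100.0
}
```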
I put two distinctly different images through it and got an absolute difference of about 20%, so it’s safe to say this isn’t the most accurate method for detecting differences, I assume. However, the default threshold for differences is set to < 4%, so maybe in the very top percentages an overlap is highly unlikely?
Anyway, I’ll look into more sophisticated duplicate detection algorithms later; for now this seems to be a good starting point, as does this on a lower level.
Reading the implementation of a pHash algorithm today almost gave me an aneurysm, so I’ll write a bit about that.
https://github.com/aetilius/pHash/blob/master/src/pHash.cpp#L305
In here we perform a discrete cosine transform (DCT) on the image values and then derive the pHash from the result.
To do this we first construct a DCT matrix of size 32x32.
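That matrix is the standard orthonormal DCT-II basis: a constant first row, cosines everywhere else. Translated into a Rust sketch (my own rendering, not pHash’s actual C++):

```rust
use std::f64::consts::PI;

// Sketch: build an n x n DCT-II matrix. Row 0 is the constant
// 1/sqrt(n); row u, column j is sqrt(2/n) * cos(pi * u * (2j + 1) / (2n)).
fn dct_matrix(n: usize) -> Vec<Vec<f64>> {
    let scale = (2.0 / n as f64).sqrt();
    (0..n)
        .map(|u| {
            (0..n)
                .map(|j| {
                    if u == 0 {
                        1.0 / (n as f64).sqrt()
                    } else {
                        scale
                            * (PI * u as f64 * (2.0 * j as f64 + 1.0)
                                / (2.0 * n as f64))
                                .cos()
                    }
                })
                .collect()
        })
        .collect()
}
```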
Then we try to load the image from the file path provided:
And after successfully loading the image, we first construct an instance of CImg<float> called meanfilter. Let me explain the values here, since this took me a minute. CImg defines multiple possible constructors for the class, one of which takes exactly 5 values:
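The relevant declaration in CImg.h looks roughly like this (reproduced here since it’s where the parameter names below come from):

```cpp
CImg(const unsigned int size_x, const unsigned int size_y,
     const unsigned int size_z, const unsigned int size_c,
     const T& value);
```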
Now most of this is understandable: size_x is the width and size_y is the height. The next two took a bit of digging, though. size_z is the depth of the image, not the bit depth but the number of slices along the z-axis, since CImg also supports volumetric images; for a flat matrix like the one we construct specifically for convolution, it’s simply 1. size_c is the spectrum of the image, i.e. the number of channels, and lastly &value is just what to initially fill every position with.
So all this does is construct a matrix that looks as follows:
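In pHash’s case the call is CImg<float> meanfilter(7, 7, 1, 1, 1), so what we get is simply a 7x7 matrix of ones:

$$
\begin{pmatrix}
1 & 1 & \cdots & 1 \\
1 & 1 & \cdots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \cdots & 1
\end{pmatrix}
$$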
Next we declare an empty image and then check the spectrum (number of channels per pixel) of our original image: 3 would be RGB, while 4 could be RGBA or CMYK; apparently it doesn’t differentiate between the two.
We then convert those to YCbCr, a family of color spaces that is basically defined as:
- Y = luminance
- Cb = blue difference
- Cr = red difference
We then convolve the 0th channel of that color space, that being Y (luminance), with our previously constructed meanfilter.
Convolving matrices, for everyone uninitiated, basically means that you take a kernel, our meanfilter in this case, overlay it on top of the image’s matrix of luminance values, multiply every pair of overlapping values, and sum the products into the output pixel under the kernel’s center. Then you slide the kernel over by one step and repeat for every position. Since the meanfilter is all ones, each output pixel ends up holding the sum of the luminance values around it, which is proportional to their mean or average.
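As a naive Rust sketch of that idea (ignoring the image borders; not pHash’s actual implementation):

```rust
// Naive sketch: convolve a grayscale image with a k x k all-ones
// kernel (the "meanfilter"), skipping the borders for brevity.
// k is assumed to be odd, like pHash's 7x7 filter.
fn box_convolve(img: &[Vec<f32>], k: usize) -> Vec<Vec<f32>> {
    let (h, w) = (img.len(), img[0].len());
    let half = k / 2;
    let mut out = vec![vec![0.0; w]; h];
    for y in half..h - half {
        for x in half..w - half {
            let mut sum = 0.0;
            for dy in 0..k {
                for dx in 0..k {
                    // every kernel entry is 1, so multiply-and-sum
                    // reduces to summing the overlapped pixels
                    sum += img[y + dy - half][x + dx - half];
                }
            }
            out[y][x] = sum; // divide by (k * k) here for the true mean
        }
    }
    out
}
```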