Duplicate files, different names – Python

The Problem

We have a bunch of files, and some of them have the exact same contents but different names.

Here’s what we do.

We open each file, read its contents and compute the md5 of those contents. We then use each md5 as a dictionary key, so files with identical contents end up grouped under the same key.

Example

We have three files, file1.txt, file2.txt and file3.txt, where file2.txt and file3.txt have the exact same contents.

So we open each file and get its md5, which will look a bit like this:

file1.txt c1146cdcef57dd17a1f11903887fa9e8
file2.txt 89954dfcfbaada50b21e6a7ddc188f9a
file3.txt 89954dfcfbaada50b21e6a7ddc188f9a

As you can see, file2.txt and file3.txt have the exact same md5.
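To see why this works, here is a small sketch that hashes three byte strings standing in for the file contents. The contents are made up for illustration, so the digests will not match the ones listed above; the point is only that equal contents give equal md5s.

```python
from hashlib import md5

# Hypothetical contents standing in for the three example files.
contents = {
    'file1.txt': b'hello\n',
    'file2.txt': b'world\n',
    'file3.txt': b'world\n',
}

# Compute the md5 hex digest of each piece of content.
digests = {name: md5(data).hexdigest() for name, data in contents.items()}

# Same contents, same md5; different contents, different md5.
print(digests['file2.txt'] == digests['file3.txt'])  # True
print(digests['file1.txt'] == digests['file2.txt'])  # False
```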

What we want to end up with is a dictionary like this:

{'c1146cdcef57dd17a1f11903887fa9e8' : ['file1.txt'],
 '89954dfcfbaada50b21e6a7ddc188f9a' : ['file2.txt','file3.txt']}
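
The grouping step on its own can be sketched like this, starting from the file name and md5 pairs listed above. The function names group_by_digest and only_duplicates are made up for this illustration, and the second one is an optional extra step that keeps only the md5s shared by more than one file.

```python
# (file name, md5) pairs copied from the example listing.
pairs = [
    ('file1.txt', 'c1146cdcef57dd17a1f11903887fa9e8'),
    ('file2.txt', '89954dfcfbaada50b21e6a7ddc188f9a'),
    ('file3.txt', '89954dfcfbaada50b21e6a7ddc188f9a'),
]

def group_by_digest(pairs):
    """Build a dictionary mapping each md5 to the list of names that share it."""
    groups = {}
    for name, digest in pairs:
        # setdefault creates the empty list the first time a digest is seen.
        groups.setdefault(digest, []).append(name)
    return groups

def only_duplicates(groups):
    """Keep only the entries where more than one file has the same md5."""
    return {digest: names for digest, names in groups.items() if len(names) > 1}

groups = group_by_digest(pairs)
print(only_duplicates(groups))
# {'89954dfcfbaada50b21e6a7ddc188f9a': ['file2.txt', 'file3.txt']}
```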

Here’s how we do it.

The Code

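from os import listdir
from hashlib import md5

if __name__ == '__main__':
    # Maps each md5 hex digest to the list of file names with that content
    dict_files = {}
    for str_file in listdir('.'):
        if str_file.endswith('.txt'):
            # Read as bytes: md5() needs bytes in Python 3, and binary mode
            # compares the exact contents of the file
            with open(str_file, 'rb') as fr_data:
                bytes_data = fr_data.read()
            str_md5 = md5(bytes_data).hexdigest()

            # First time we see this md5: start a new list for it
            if str_md5 not in dict_files:
                dict_files[str_md5] = []

            dict_files[str_md5].append(str_file)

    print(dict_files)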
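
The code above reads each file into memory in one go, which is fine for small text files. For larger files, one possible variation is to feed the contents to md5 in chunks. This is only a sketch and not part of the script above: the function name md5_of_file and the chunk size of 65536 bytes are choices made for illustration.

```python
from hashlib import md5

def md5_of_file(str_path, int_chunk_size=65536):
    """Return the md5 hex digest of a file, reading it in fixed-size chunks."""
    obj_md5 = md5()
    with open(str_path, 'rb') as fr_data:
        # iter() with a sentinel keeps calling read() until it returns b''
        for bytes_chunk in iter(lambda: fr_data.read(int_chunk_size), b''):
            obj_md5.update(bytes_chunk)
    return obj_md5.hexdigest()
```

In the script, calling md5_of_file(str_file) would take the place of the read() and md5() lines, and the result should be the same digest as hashing the whole file at once.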
