The Problem
We have a bunch of files, and some of them have the exact same contents but different names.
Here’s what we do.
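We open each file, read its contents and compute the md5 of those contents. We then use the md5 as a dictionary key and collect the names of all files that share it, so files with identical contents end up in the same list.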
Example
We have three files, file1.txt, file2.txt and file3.txt, where file2.txt and file3.txt have the exact same contents.
So we open each file and compute its md5, which will look a bit like this:
file1.txt  c1146cdcef57dd17a1f11903887fa9e8
file2.txt  89954dfcfbaada50b21e6a7ddc188f9a
file3.txt  89954dfcfbaada50b21e6a7ddc188f9a
As you can see, file2.txt and file3.txt have the exact same md5.
What we want to end up with is a dictionary that maps each md5 to the list of files that share it, like this:
{'c1146cdcef57dd17a1f11903887fa9e8' : ['file1.txt'], '89954dfcfbaada50b21e6a7ddc188f9a' : ['file2.txt','file3.txt']}
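As a minimal sketch of just this grouping step, assuming the md5 values have already been computed, the helper below builds that dictionary from (file name, md5) pairs. The name group_by_digest is hypothetical and is not part of the script in this post.

```python
def group_by_digest(pairs):
    """Group file names by digest; `pairs` is an iterable of (name, digest)."""
    groups = {}
    for name, digest in pairs:
        # setdefault creates the empty list the first time a digest is seen.
        groups.setdefault(digest, []).append(name)
    return groups


# The three example files and their md5 values from the text above.
pairs = [
    ('file1.txt', 'c1146cdcef57dd17a1f11903887fa9e8'),
    ('file2.txt', '89954dfcfbaada50b21e6a7ddc188f9a'),
    ('file3.txt', '89954dfcfbaada50b21e6a7ddc188f9a'),
]
groups = group_by_digest(pairs)
print(groups)
```

Any list in the result that holds more than one name is a set of files with identical contents.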
Here’s how we do it.
The Code
from os import listdir
from hashlib import md5

if __name__ == '__main__':
    # Maps each md5 hex digest to the list of file names with that content.
    dict_files = {}
    for str_file in listdir('.'):
        if str_file.endswith('.txt'):
            # Read as bytes: md5() hashes bytes, not decoded text.
            with open(str_file, 'rb') as fr_data:
                bytes_data = fr_data.read()
            str_md5 = md5(bytes_data).hexdigest()
            if str_md5 not in dict_files:
                dict_files[str_md5] = []
            dict_files[str_md5].append(str_file)
    print(dict_files)
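The script above reads each file into memory in one go, which may be undesirable for large files. The following is a sketch of a variation rather than the script itself: it hashes each file in chunks and returns only the groups that actually contain duplicates. The names md5_of_file and find_duplicates are hypothetical, and the 64 KiB chunk size is an arbitrary assumption.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the md5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        # iter() with a sentinel keeps calling read() until it returns b''.
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(directory='.', suffix='.txt'):
    """Map md5 -> file names, keeping only digests shared by 2+ files."""
    groups = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and os.path.isfile(path):
            groups.setdefault(md5_of_file(path), []).append(name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}
```

Calling find_duplicates('.') in the directory from the example would return a dictionary with a single entry listing file2.txt and file3.txt, since file1.txt has no duplicate.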