Create MD5 Hash of a File in Python

MD5 was designed as a standardized one-way function that takes input data of any form and maps it to a fixed-size output: a 128-bit digest, usually written as a 32-character hexadecimal string, irrespective of the size of the input.

Though it was created as a cryptographic hash function, it has since been found to suffer from serious vulnerabilities, most notably practical collision attacks.

The hash function always generates the same output hash for the same input. This means you can use the hash to validate files, text, or any other data, whether you pass it across the network or not. MD5 can act as a stamp for checking whether the data is intact.

For example:

Input String                               Output Hash
hi                                         49f68a5c8493ec2c0bf489821c21fc3b
debugpointer                               d16220bc73b8c7176a3971c7f73ac8aa
satvik                                     18457af9e2ed5d80e3d946810e189f71
computer science is amazing! I love it.    f3c5a497380310d828cdfc1737e8e2a3
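
The two properties described above (the same input always gives the same hash, and the output size is fixed) can be checked directly. The snippet below is a minimal sketch; the helper name md5_hex is an assumption made for illustration and is not part of hashlib.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 hex digest of a string (hypothetical helper, not a hashlib API)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


short_hash = md5_hex("hi")
long_hash = md5_hex("computer science is amazing! I love it.")

# Hashing the same input again yields the same digest.
print(short_hash == md5_hex("hi"))

# Both digests have the same length, regardless of input size.
print(len(short_hash), len(long_hash))
```

Both digests are 32 hexadecimal characters long because MD5 always produces 128 bits of output.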

Check this out if you are looking for the MD5 hash of a string.

Create MD5 hash of a file in Python

An MD5 hash can be created using hashlib, a module in Python's standard library.

Incorrect Way to create MD5 Hash of a file in Python

Note, however, that you cannot create the hash of a file by just passing the name of the file, like this:

# this is NOT correct
import hashlib
print(hashlib.md5("filename.jpg".encode('UTF-8')).hexdigest())

Output of the above code:

03e6eda992afdeda6b2acaed17722515

The above value is NOT the MD5 hash of the file. It is the MD5 hash of the string filename.jpg itself.

Correct Way to create MD5 Hash of a file in Python

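You have to read the contents of the file to create the MD5 hash of the file itself. It's simple: open the file in binary mode, read its contents, and create the hash from those bytes. Binary mode matters because hashlib.md5() accepts bytes, not str.

import hashlib

file_name = 'filename.jpg'

# Open in binary mode ("rb") so that read() returns bytes
with open(file_name, "rb") as f:
    data = f.read()
    md5hash = hashlib.md5(data).hexdigest()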

MD5 Hash of Large Files in Python

The above code has one problem: it reads the whole file into memory at once. If the file is, say, 10 GB (a large log file, a traffic dump, or a game like FIFA), computing its MD5 hash this way would probably chew up your memory. Here is a memory-optimised way of computing the MD5 hash, where we read the file in chunks of 4096 bytes (the chunk size can be adjusted to suit your requirements, your system, and the size of your file). Each chunk is read in sequence and used to update the hash.

import hashlib

# A utility function that can be used in your code
def compute_md5(file_name):
    hash_md5 = hashlib.md5()
    with open(file_name, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
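
As a quick sanity check, the sketch below suggests that chunked hashing gives the same result as hashing all the bytes at once, whatever the chunk size. The function name compute_md5_chunked, its chunk_size parameter, and the temporary test file are assumptions introduced only for this illustration.

```python
import hashlib
import os
import tempfile


def compute_md5_chunked(file_name, chunk_size=4096):
    """Hash a file chunk by chunk; chunk_size is an illustrative parameter."""
    hash_md5 = hashlib.md5()
    with open(file_name, "rb") as f:
        # iter() keeps calling f.read(chunk_size) until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


# Write random bytes to a throwaway file; 10,000 is deliberately
# not a multiple of 4096, so the last chunk is a partial one.
payload = os.urandom(10_000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    chunked = compute_md5_chunked(path)
    chunked_big = compute_md5_chunked(path, chunk_size=1024 * 1024)
finally:
    os.remove(path)

# Hash of the same bytes computed in a single call, for comparison.
one_shot = hashlib.md5(payload).hexdigest()
print(chunked == one_shot == chunked_big)
```

The chunk size only affects memory use and the number of reads, not the resulting hash.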

Compare and Verify MD5 hash of a file using Python

Here you can compare the original MD5 value published by the source with the MD5 value that you compute yourself.

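import hashlib

file_name = 'filename.jpg'

# The MD5 value published by the source of the file
original_md5 = '5d41402abc4b2a76b9719d911017c592'

# Open in binary mode so that read() returns bytes, as hashlib.md5() requires
with open(file_name, "rb") as f:
    data = f.read()
    md5_returned = hashlib.md5(data).hexdigest()

if original_md5 == md5_returned:
    print("MD5 verified.")
else:
    print("MD5 verification failed.")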
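
For large files, the comparison can be combined with chunked reading. The following is a minimal sketch under some assumptions: the function name verify_md5 is hypothetical, and normalising the expected value with strip() and lower() is a design choice made here because published checksums are sometimes upper-case or carry stray whitespace. The temporary file containing the bytes hello exists only to make the example self-contained.

```python
import hashlib
import os
import tempfile


def verify_md5(file_name, expected_md5, chunk_size=4096):
    """Return True if the file's MD5 matches expected_md5 (hypothetical helper)."""
    hash_md5 = hashlib.md5()
    with open(file_name, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hash_md5.update(chunk)
    # hexdigest() is lower-case, so normalise the expected value before comparing
    return hash_md5.hexdigest() == expected_md5.strip().lower()


# Throwaway file whose contents are the bytes b"hello"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    path = tmp.name

try:
    matches = verify_md5(path, "5D41402ABC4B2A76B9719D911017C592")
    mismatch = verify_md5(path, "0" * 32)
finally:
    os.remove(path)

print(matches)   # True: this is the MD5 of b"hello", given in upper-case
print(mismatch)  # False: the expected value does not match
```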