Stefano Cappellini

AI, Deep Learning, Machine Learning, Software Engineering

Python hashlib is twice as fast as Java MessageDigest

Written on , in: ,

This is something unexpected. Although comparing different programming languages is hard, Java usually tends to be faster than Python. Benchmarks are pretty rare but the ones available tend to agree all on this (see, for example, this).

If you are in a hurry, feel free to jump to the conclusion ;)

Let’s see this strange behaviour in action! (All the code is available on this gist).

First: hashing big files using multiple update calls

The test code (Java)

public static double comparison(int desiredSize, int bufferSize){
    int size = desiredSize / bufferSize * bufferSize;
    byte data[] = new byte[size];
    new Random().nextBytes(data);
    long startTime = 0;
    long endTime = 0;
    int nRounds = 10;
    double elapsed = 0;
    for(int round = 0; round < nRounds; round++){
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        startTime = System.nanoTime();
        for(int i = 0; i < size; i += bufferSize)
            sha.update(data, i, bufferSize);
        sha.digest();
        endTime = System.nanoTime();
        elapsed += ((endTime - startTime) / 1000000000.0);
    }
    return elapsed / nRounds;
}

The test code (Python)

def comparison(desired_size, buffer_size):
    size = (desired_size // buffer_size) * buffer_size
    data = np.random.bytes(size)
    n_rounds = 10
    elapsed = 0
    for round in range(n_rounds):
        sha = hashlib.sha256()
        start = time.time()
        for i in range(0, size, buffer_size):
            sha.update(data[i:i+buffer_size])
        sha.digest()
        end = time.time()
        elapsed += (start - end)
    return elapsed / n_rounds

A note

  • The size is computed in that strange way in order have all full buffers. The Java MessageDigest’ update methods throws an exception if the offset used goes beyond the size of the byte array considered.
  • I chose to replicate the Java code in Python (that is why I did not use timeit or something like that).

Results

I tested both the implementation on different array sizes (1MB, 10MB, 100MB, 1GB) and using different sizes for the buffer (128byte, 1KB, 4KB, 64KB). The results obtained are quite impressive and shown in the following image.

  • The computational time increases linearly with the size of the array: this is true both for Python and for Java.
  • Java performance is completely unaffected by the size of the buffer considered
  • On the contrary, Python code strongly benefits from big buffers. Changing the buffer size from 128byte to 64KB reduced the computational time by approximately 40%
  • In all the tests done, Python was always faster than Java.
  • Using 64KB buffers, Python was twice as fast as Java.

Second: comparing Java and Python loop

Are Python’s for loop less efficient than Java’s? This could explain why Python code improved with the increase of the buffer size. I benchmarked a simple and plain for loop (I know, there are more efficient way of looping in Python) doing nothing and these are the results obtained:

As expected, Java is waaay faster than Python (again, I know that there are more efficient way of looping in Python, but let’s stick with it).

This could explain why the Python code presented before perfomed better with larger buffers.

Third: evaluating hashing alone

To evaluate the implementation of the hashing functions in isolation, I decided to test them without any buffering or chunking. This is the code I use to benchmark the two languages:

public static void comparisonAlone(int size){
        byte data[] = new byte[size];
        new Random().nextBytes(data);
        int nRounds = 10;
        double total = 0;
        for(int round = 0; round < nRounds; round++) {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            long start = System.nanoTime();
            sha.digest(data);
            long end = System.nanoTime();
            total += ((end - start) / 1000000000.);
        }
        System.out.println(total / nRounds);
}
def comparison_alone(size):
    data = np.random.bytes(size)
    n_rounds = 10
    total = 0
    for _ in range(n_rounds):
        sha = hashlib.sha256()
        start = time.time()
        sha.update(data)
        sha.digest()
        end = time.time()
        total += (end - start)
    print(total / n_rounds)

and these are the results obtained:

Conclusion

  • Yes, at least on my Mac machine, Python hashlib is twice as fast as Java MessageDigest.
  • Python loops are slow. So prefer big chunks/buffers to big loops.

So if you need to perform lots of hashing on big files, probably Python is a good choice to consider.

Test setting

To perform these tests I used my Mac Pro (OS 10.13.4) with 22GB of RAM.

comments powered by Disqus