Stefano Cappellini

AI, Deep Learning, Machine Learning, Software Engineering

How to (efficiently) generate a huge number of pseudorandom bytes using Python

Yesterday I was trying to benchmark some Python hashlib hash functions, so I needed a way to generate some large dummy content. To speed up the process I decided to work directly with bytes, avoiding expensive conversions. And that is where the fun began! How do you (efficiently) generate a huge number of pseudorandom bytes?

  • 13 January 2019:
    • Corrected a subtle bug that could have affected the benchmarks performed for the solutions 3, 10 and 11
    • Changed the benchmark procedure
    • Switched from TensorFlow 1.9 to 1.10.1
    • Switched from Numpy 1.4.5 to 1.15.1
    • Repeated all the benchmarks

Quick answer

If you are looking for a quick answer, here it is: the fastest way to generate a lot of pseudorandom bytes in Python, without resorting to ad hoc solutions or custom/complex code, is the numpy function random.bytes, like this:

import numpy as np
def generate(number_of_bytes):
    return np.random.bytes(number_of_bytes)

The code

Below you can find the code I used and the various solutions I tried. The code is also available here. As you can see:

  • I tried 11 different solutions/implementations
  • Some were inspired by this gorgeous SO post (worth reading). Thanks a lot for the inspiration!
  • Benchmark strategy (original, now superseded): each solution was benchmarked using the timeit module on ten iterations, in order to get more stable results. For each solution, the average required time, expressed as seconds per iteration, was reported. This was obtained by dividing the overall time (that is, the time required to complete all 10 iterations) by 10. (Update 10 January 2019): Each solution is now benchmarked using the timeit module (in particular, using the repeat function of a timeit.Timer object) in the following way:
    • The solution is first benchmarked on ten iterations, obtaining the overall execution time. That score is then divided by 10 to get the average execution time (expressed as seconds per iteration)
    • This process is then repeated ten times, obtaining a vector of ten average execution times
    • The minimum of that vector is the score reported

Enough talking, let’s dive into the code.

import os
import random
import timeit
import numpy as np
from Crypto.Cipher import AES
from subprocess import check_output
import tensorflow as tf

def generate_one(size):
    return bytes((random.randint(0, 255) for _ in range(size)))

def generate_two(size):
    return bytes((np.random.randint(0, 256) for _ in range(size)))

def generate_three(size):
    return bytes(np.random.randint(0, 256, size, dtype=np.uint8))

def generate_four(size):
    return np.random.bytes(size)

def generate_five(size):
    return os.urandom(size)

def generate_six(size):
    return bytes((random.getrandbits(8) for _ in range(size)))

# Same as 6, optimized loop
def generate_seven(size):
    return bytes(map(random.getrandbits, (8,) * size))

def generate_eight(size):
    return check_output(['openssl', 'rand',  str(size)])

def generate_nine(size):
    enc = AES.new(b"secretkeysecretk", AES.MODE_OFB, b'a' * 16)
    return enc.encrypt(b' ' * size)

# TensorFlow GPU
def generate_ten(size):
    x = tf.random_uniform((size,), 0, 256, dtype=tf.int32)
    return bytes(tf.Session().run(x).astype(np.uint8))

# TensorFlow CPU
def generate_eleven(size):
    x = tf.random_uniform((size,), 0, 256, dtype=tf.int32)
    config = tf.ConfigProto(
        device_count={'GPU': 0}
    )
    return bytes(tf.Session(config=config).run(x).astype(np.uint8))

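# Code used to benchmark the various solutions
def benchmark_code(fn, size, iterations=10):
    timer = timeit.Timer(lambda: fn(size))
    # Each of the 10 repeats times `iterations` calls; divide the best total
    # by `iterations` to get the average seconds per iteration
    return min(timer.repeat(repeat=10, number=iterations)) / iterations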
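
As a usage illustration, here is a minimal, standard-library-only sketch of how the benchmark procedure described above can be driven and how a score in seconds per iteration converts to MB/s. The names benchmark, throughput_mb_per_s, with_urandom and with_getrandbits are hypothetical helpers introduced for this sketch, the conversion assumes 1 MB = 1048576 bytes, and the size is kept deliberately small so that it runs quickly.

```python
import os
import random
import timeit


def benchmark(fn, size, iterations=10, repeats=10):
    """Best average seconds per call over `repeats` rounds of `iterations` calls."""
    timer = timeit.Timer(lambda: fn(size))
    totals = timer.repeat(repeat=repeats, number=iterations)
    return min(totals) / iterations


def throughput_mb_per_s(size, seconds_per_iteration):
    """Convert a per-call time into MB/s (assumption: 1 MB = 1048576 bytes)."""
    return (size / 1048576) / seconds_per_iteration


def with_urandom(size):
    # Same idea as generate_five
    return os.urandom(size)


def with_getrandbits(size):
    # Same idea as generate_six
    return bytes(random.getrandbits(8) for _ in range(size))


size = 64 * 1024  # small on purpose; the post uses 1 MB and 1 GB
for fn in (with_urandom, with_getrandbits):
    avg = benchmark(fn, size, iterations=3, repeats=3)
    print(f"{fn.__name__}: {avg:.6f} s/it, {throughput_mb_per_s(size, avg):.2f} MB/s")
```

Taking the minimum over the repeats, rather than the mean, reduces the influence of other processes that happen to run during one of the rounds.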


To perform the benchmark I used my Ubuntu (16.04 LTS) PC:

  • 32 GB RAM
  • GTX 1080 Ti
  • Scipy 1.1
  • Numpy 1.15.1 (previously 1.4.5)
  • TensorFlow 1.10.1 (previously 1.9)
  • Python 3.6.4


These are the results I obtained generating 1048576 bytes (1MB).

Solution | Avg s/it | MB/s

I then decided to push the same solutions further (using the same approach as before), generating 1GB of data. These are the results I got:

Solution | Avg s/it | MB/s

Note that this time I did not consider the solutions generate_one and generate_two: they were simply hopeless! As you can see, the rates dropped but the ordering remained almost the same. The only big difference concerns TensorFlow: with lots of data it starts to pull ahead and achieves very good performance.

Update (13 January 2019)

I decided to push the two top solutions of the 1GB benchmark even further, generating truly huge amounts of data. In particular, the two solutions (generate_four and generate_ten) were used to produce 1GB, 5GB and 7GB of data. The results are presented below, as always expressed in average seconds per iteration.

Size | generate_four | generate_ten
1GB | 1.282772 s/it | 1.321982 s/it
5GB | 6.410168 s/it | 13.701987 s/it
7GB | 8.981011 s/it | 42.629495 s/it

We can clearly see that the true winner is, again, solution number four, generate_four. The running time of the TensorFlow solution, in particular the one exploiting the GPU, blew up soon after 1GB of data. I do not know why; it is probably due to the limited memory of the GPU used for these benchmarks (11GB, versus 32GB of RAM available).
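
To make the memory hypothesis a bit more concrete, here is a rough back-of-the-envelope sketch. It assumes that 1GB means 2**30 bytes and that the tf.int32 tensor produced by tf.random_uniform takes 4 bytes per element before it is cast to uint8; it ignores allocator overhead and any other buffers, so it is only an estimate and not a measurement. The function name estimate_gib is hypothetical.

```python
# Rough memory estimate for the TensorFlow solutions (assumption: 1GB = 2**30 bytes)
GIB = 2 ** 30
INT32_BYTES = 4  # size of one tf.int32 element
UINT8_BYTES = 1  # size of one element after .astype(np.uint8)


def estimate_gib(output_gib):
    """Return (int32 tensor size, uint8 copy size) in GiB for a given output size."""
    n_elements = output_gib * GIB  # one element per output byte
    int32_gib = n_elements * INT32_BYTES / GIB
    uint8_gib = n_elements * UINT8_BYTES / GIB
    return int32_gib, uint8_gib


for out in (1, 5, 7):
    int32_gib, uint8_gib = estimate_gib(out)
    print(f"{out} GiB output: int32 tensor ~{int32_gib:.0f} GiB, uint8 copy ~{uint8_gib:.0f} GiB")
```

Under these assumptions the intermediate int32 tensor alone is four times the size of the requested output, so the 5GB and 7GB runs would need roughly 20 GiB and 28 GiB for it, well above the 11GB of GPU memory. Whether this is the actual cause of the slowdown is not verified here.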

Plots of the results


So, what’s the best (fastest) way to generate lots of pseudorandom bytes in Python?

  • Best overall solution: generate_four, the one using the numpy.random.bytes function, is the clear winner: it outperformed all the other solutions. This is the best choice.
  • Best plain Python solution: generate_five, the one using the os.urandom function, is a very good choice.

Further considerations

  • I repeated these benchmarks on multiple UNIX machines (both Linux and Mac) and obtained almost the same ranking, which is indeed very stable (obviously the timings I got were quite different, because they are machine dependent).
  • The performance of solution 5, the one using os.urandom, tends to vary a lot: on my Ubuntu 16.04 LTS machine it performed very well, while on another machine running Ubuntu 17.04 and on Macs it was a lot slower! I do not know why; maybe it is due to how each operating system manages the /dev/urandom special file.
  • As said before, I benchmarked generate_one and generate_two only up to 1MB of data. Beyond that size, a test would have been simply useless.
  • (Update 13 January 2019) TensorFlow 1.10.1 is faster than 1.9 (used in a previous version of this post) - at least on my machine - in creating huge amounts of data (1GB). I do not know if this is due to the bugfix, to the newer version of Numpy I installed or to TensorFlow itself. I should investigate further.
  • (Update 13 January 2019) On the contrary, with smaller amounts of data (~ 1MB), TensorFlow 1.9 was faster than version 1.10.1.
  • (Update 13 January 2019) Forcing TensorFlow to use only the CPU made it run faster than its GPU counterpart (generate_eleven vs generate_ten) when generating small amounts of data (~ 1MB). It is only with very large amounts of data (1GB) that the GPU code starts to shine.
  • (Update 13 January 2019) I also considered the function token_bytes made available by the Python secrets module. However, it internally makes use of the os.urandom function (see here), already covered by the solution generate_five, making it a complete duplicate: this is why it is not listed here.
  • By looking at the plots:
    • The solutions based on TensorFlow (10, 11) and the one using OpenSSL (8) tend to have a relatively high computational time when generating moderate amounts of data (see plot 3). This is probably due to the setup overhead (TensorFlow has to build the graph, OpenSSL is a separate shell utility)
    • If you have to generate small amounts of data (something around 10K elements), approaches 3, 4, 5, 6, 7, 8 and 9 obtain comparable performance (see plot 5).
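
On the secrets.token_bytes point in the list above, here is a small standard-library sketch showing that both calls hand back a bytes object of the requested length, which is why a separate benchmark entry would add little. The helper name same_shape is hypothetical, and the sketch only checks the shape of the results, not how either function is implemented.

```python
import os
import secrets


def same_shape(n):
    """Check that secrets.token_bytes and os.urandom both return n bytes."""
    a = secrets.token_bytes(n)
    b = os.urandom(n)
    return isinstance(a, bytes) and isinstance(b, bytes) and len(a) == len(b) == n


print(same_shape(1024))
```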


This post is far from being an objective and rigorous test. It is just a review of some benchmarks I ran on my own. In particular:

  • Changing the architecture used for testing would probably change the order of the results
  • The numbers obtained have to be taken with care