
How do I use random.shuffle() on a generator without initializing a list from the generator? Is that even possible? If not, how else should I use random.shuffle() on my list?

>>> import random
>>> random.seed(2)
>>> x = [1,2,3,4,5,6,7,8,9]
>>> def yielding(ls):
...     for i in ls:
...             yield i
... 
>>> for i in random.shuffle(yielding(x)):
...     print i
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/random.py", line 287, in shuffle
    for i in reversed(xrange(1, len(x))):
TypeError: object of type 'generator' has no len()

Note: random.seed() is used so that the same output is produced on each script run.

  • That does not really make sense, as the point of a generator is that you don't know the elements in advance and can only access them in order.
    – njzk2
    Commented Jan 17, 2014 at 13:25
  • Because the seed is supposed to be customizable; in this case n=2, hence random.seed(2). The seed could also be some other number.
    – alvas
    Commented Jan 17, 2014 at 13:26
  • I can't imagine any canonical method to shuffle a sequence of unknown length. And note that random.shuffle shuffles in place.
    – alko
    Commented Jan 17, 2014 at 13:28
  • Instead of a whole generator function, you could have used iter(x). Commented Jan 17, 2014 at 13:31
  • I would suggest using a Poisson distribution for a positive random look-ahead. Then (lazily or not) ignore that element from the iterated object, then repeat.
    – mnish
    Commented May 27, 2018 at 5:40

7 Answers


In order to shuffle the sequence uniformly, random.shuffle() needs to know how long the input is. A generator cannot provide this; you have to materialize it into a list:

lst = list(yielding(x))
random.shuffle(lst)
for i in lst:
    print i

You could, instead, use sorted() with random.random() as the key:

for i in sorted(yielding(x), key=lambda k: random.random()):
    print(i)

but since this also produces a list, there is little point in going this route.

Demo:

>>> import random
>>> x = [1,2,3,4,5,6,7,8,9]
>>> sorted(iter(x), key=lambda k: random.random())
[9, 7, 3, 2, 5, 4, 6, 1, 8]
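As an aside, the standard library's random.sample() returns a new shuffled list rather than shuffling in place, so it can serve the same purpose; like the other approaches, you still have to materialize the generator first (a sketch):

```python
import random

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def yielding(ls):
    for i in ls:
        yield i

# sampling k elements from a population of size k returns a full shuffled copy
shuffled = random.sample(list(yielding(x)), k=len(x))
print(shuffled)
```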
  • @thefourtheye: no, it just sorts the output of the yielding(x) generator using random values. Commented Jan 17, 2014 at 13:29
  • I would have expected sorted to rely on the key function being deterministic; if not, there is probably a caching mechanism, which, on second thought, makes sense given that sorted does not know the complexity of the key function.
    – njzk2
    Commented Jan 17, 2014 at 13:30
  • The first thing sorted() does is store all elements of the generator in a list, before even starting to compute the keys and sort it. Commented Jan 17, 2014 at 13:31
  • @njzk2: no, the key function is called only once for each value. Commented Jan 17, 2014 at 13:32
  • @njzk2: the key function is used in a decorate-sort-undecorate style sort: it sorts a list of (keyfunc(value), None, value) tuples, then extracts value from them again. Commented Jan 17, 2014 at 13:33

Depending on the case, if you know how much data you have ahead of time, you can index the data and compute/read from it based on a shuffled index. This amounts to: 'don't use a generator for this problem', and without specific use-cases it's hard to come up with a general method.
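A minimal sketch of that indexing idea, assuming the data is cheap to address by position (the data list here is just a stand-in for whatever indexable or recomputable source you have):

```python
import random

data = [10, 20, 30, 40, 50]  # stand-in for an indexable/recomputable source

order = list(range(len(data)))
random.shuffle(order)

# read/compute each item lazily, in shuffled index order
shuffled_items = (data[i] for i in order)
for item in shuffled_items:
    print(item)
```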

Alternatively... If you need to use the generator...

it depends on 'how shuffled' you want the data. Of course, like folks have pointed out, generators don't have a length, so you need to at some point evaluate the generator, which could be expensive. If you don't need perfect randomness, you can introduce a shuffle buffer:

from itertools import islice

import numpy as np


def shuffle(generator, buffer_size):
    while True:
        buffer = list(islice(generator, buffer_size))
        if len(buffer) == 0:
            break
        np.random.shuffle(buffer)  # shuffle the current chunk in place
        for item in buffer:
            yield item


shuffled_generator = shuffle(my_generator, 256)

This will shuffle data in chunks of buffer_size, so you can avoid memory issues if that is your limiting factor. Of course, this is not a truly random shuffle, so it shouldn't be used on something that's sorted, but if you just need to add some randomness to your data this may be a good solution.
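If you'd rather not depend on numpy, the same chunked approach works with the standard library's random.shuffle (a sketch; shuffle_chunks is a made-up name):

```python
import random
from itertools import islice

def shuffle_chunks(iterable, buffer_size):
    # chunked approximate shuffle: each chunk of buffer_size elements
    # is shuffled independently, then yielded
    it = iter(iterable)
    while True:
        buffer = list(islice(it, buffer_size))
        if not buffer:
            break
        random.shuffle(buffer)
        yield from buffer

sample = list(shuffle_chunks(range(1000), 256))
print(sample[:10])
```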

  • I took your answer and improved it a bit... see the alternative below, streamed rather than chunked. That way elements can migrate far beyond the buffer_size window. Commented Aug 28, 2019 at 18:37

It's not possible to randomize the yield of a generator without temporarily saving all the elements somewhere. Luckily, this is pretty easy in Python:

tmp = list(yielding(x))
random.shuffle(tmp)
for i in tmp:
    print i

Note the call to list() which will read all items and put them into a list.

If you don't want to or can't store all elements, you will need to change the generator to yield in a random order.

  • +1 for "change the generator to yield in a random order." This is what I did when I had the OP's problem. You can add an optional bool param: def yielding(ls, randomize=False). Commented Jul 6, 2022 at 16:07
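The optional parameter mentioned in the comment could look something like this (a hypothetical sketch, not the OP's code):

```python
import random

def yielding(ls, randomize=False):
    # hypothetical variant of the question's generator: materialize and
    # shuffle a copy first when randomize=True
    if randomize:
        ls = list(ls)      # copy, so the caller's list is untouched
        random.shuffle(ls)
    for i in ls:
        yield i

x = [1, 2, 3, 4, 5]
print(list(yielding(x, randomize=True)))
```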

You could sample from arbitrary yielded results, generating a not fully random but somewhat shuffled set within a range. Similar to @sturgemeister's code above, but streamed rather than chunked: there are no fixed randomness boundaries.

For example:

import random

def scramble(gen, buffer_size):
    buf = []
    i = iter(gen)
    while True:
        try:
            e = next(i)
            buf.append(e)
            if len(buf) >= buffer_size:
                # swap a randomly chosen buffered element to the end and yield it
                choice = random.randint(0, len(buf) - 1)
                buf[-1], buf[choice] = buf[choice], buf[-1]
                yield buf.pop()
        except StopIteration:
            # source exhausted: shuffle and drain whatever remains
            random.shuffle(buf)
            yield from buf
            return

The results should be fully random within the buffer_size window:

for e in scramble(itertools.count(start=0, step=1), 1000):
    print(e)

For an arbitrary 1000 elements of this stream, they seem random. But looking at the overall trend (beyond 1000 elements), it is clearly increasing.

To test, check that this yields each of the 1000 input elements exactly once:

seen = list(scramble(range(1000), 100))
assert sorted(seen) == list(range(1000))
  • Nice implementation and decent performance, in my tests around 10X slower than iterating over a standard list() object. I also wasn't aware of the functionality of yield from, interesting!
    – Scholar
    Commented Oct 8, 2020 at 8:27

I needed a solution to this problem so I could get expensive-to-compute elements in shuffled order, without wasting computation by generating unneeded values. This is what I came up with for your example. It involves making another function that indexes the first array.

You will need numpy installed

pip install numpy

The Code:

import numpy as np

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def shuffle_generator(lst):
    return (lst[idx] for idx in np.random.permutation(len(lst)))

def yielding(ls):
    for i in ls:
        yield i

# for i in random.shuffle(yielding(x)):
#     print i
for i in yielding(shuffle_generator(x)):
    print(i)

For very large sequences, if you know the sequence size in advance:

from random import random

class subset_iterator:
    """
    An iterator that returns K random samples from another sequence
    that has no random access. Requires the sequence length as input.
    Similar to random.sample.

    :param it: iterator over the sequence
    :param seqlen: length of the sequence behind :param it:
    :param K: output sequence size (number of samples in the subset)
    """

    def __init__(self, it, seqlen, K):
        self.it = it
        self.N = seqlen
        self.K = K

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            r = random()
            nextitem = next(self.it)
            if r <= float(self.K) / self.N:
                # keep this item with probability K/N (selection sampling)
                self.K -= 1
                self.N -= 1
                return nextitem
            else:
                self.N -= 1

A generator follows a sequential access pattern. Shuffled data requires the exact opposite, a random access pattern.

In many applications, we can get away with local perturbations only, which relaxes the problem quite a lot.

Here is an example of an in-memory shuffle buffer.

from random import randint, shuffle

domain = (0, 1000)
stream = iter(range(*domain))

# pre-fill the buffer from the stream itself
buffer = [next(stream) for _ in range(50)]

for element in stream:
    # swap the incoming element with a randomly chosen buffered one
    idx = randint(0, len(buffer) - 1)
    element, buffer[idx] = buffer[idx], element
    print(element)

# drain what is left in the buffer
shuffle(buffer)
for element in buffer:
    print(element)
