
How do I use random.shuffle() on a generator without initializing a list from the generator? Is that even possible? If not, how else should I use random.shuffle() on my list?

>>> import random
>>> random.seed(2)
>>> x = [1,2,3,4,5,6,7,8,9]
>>> def yielding(ls):
...     for i in ls:
...             yield i
... 
>>> for i in random.shuffle(yielding(x)):
...     print i
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/random.py", line 287, in shuffle
    for i in reversed(xrange(1, len(x))):
TypeError: object of type 'generator' has no len()

Note: random.seed() is used so that the same output is produced on each script run.

  • That does not really make sense, as the point of a generator is that you don't know the elements in advance and can only access them in order.
    – njzk2
    Commented Jan 17, 2014 at 13:25
  • Because the seed is supposed to be customizable; in this case n=2, hence random.seed(2). The seed could also be some other number.
    – alvas
    Commented Jan 17, 2014 at 13:26
  • I can't imagine any canonical method to shuffle a sequence of unknown length. And note that random.shuffle shuffles in place.
    – alko
    Commented Jan 17, 2014 at 13:28
  • Instead of a whole generator function, you could have used iter(x). Commented Jan 17, 2014 at 13:31
  • I would suggest using a Poisson distribution for a positive random look-ahead. Then (lazily or not) ignore that element from the iterated object, then repeat.
    – mnish
    Commented May 27, 2018 at 5:40

7 Answers


In order to shuffle the sequence uniformly, random.shuffle() needs to know how long the input is. A generator cannot provide this; you have to materialize it into a list:

lst = list(yielding(x))
random.shuffle(lst)
for i in lst:
    print i

You could, instead, use sorted() with random.random() as the key:

for i in sorted(yielding(x), key=lambda k: random.random()):
    print(i)

but since this also produces a list, there is little point in going this route.

Demo:

>>> import random
>>> x = [1,2,3,4,5,6,7,8,9]
>>> sorted(iter(x), key=lambda k: random.random())
[9, 7, 3, 2, 5, 4, 6, 1, 8]
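As an aside, the standard library's random.sample() returns a new shuffled list rather than shuffling in place, so it can serve the same purpose; like the other approaches, you still have to materialize the generator first (a sketch):

```python
import random

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def yielding(ls):
    for i in ls:
        yield i

# sampling k elements from a population of size k returns a full shuffled copy
shuffled = random.sample(list(yielding(x)), k=len(x))
print(shuffled)
```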
  • @thefourtheye: no, it just sorts the output of the yielding(x) generator using random values. Commented Jan 17, 2014 at 13:29
  • I would have expected sorted to rely on the key function being deterministic; if not, there is probably a caching mechanism, which, on second thought, makes sense given that sorted does not know the complexity of the key function.
    – njzk2
    Commented Jan 17, 2014 at 13:30
  • The first thing sorted() does is store all elements of the generator in a list, before even starting to compute the keys and sort it. Commented Jan 17, 2014 at 13:31
  • @njzk2: no, the key function is called only once for each value. Commented Jan 17, 2014 at 13:32
  • @njzk2: the key function is used in a decorate-sort-undecorate style sort: it sorts a list of (keyfunc(value), None, value) tuples, then extracts value from them again. Commented Jan 17, 2014 at 13:33

Depending on the case, if you know how much data you have ahead of time, you can index the data and compute/read from it based on a shuffled index. This amounts to: 'don't use a generator for this problem', and without specific use-cases it's hard to come up with a general method.
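A minimal sketch of that indexing idea, assuming the data is cheap to address by position (the data list here is just a stand-in for whatever indexable or recomputable source you have):

```python
import random

data = [10, 20, 30, 40, 50]  # stand-in for an indexable/recomputable source

order = list(range(len(data)))
random.shuffle(order)

# read/compute each item lazily, in shuffled index order
shuffled_items = (data[i] for i in order)
for item in shuffled_items:
    print(item)
```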

Alternatively... If you need to use the generator...

it depends on 'how shuffled' you want the data. Of course, like folks have pointed out, generators don't have a length, so you need to at some point evaluate the generator, which could be expensive. If you don't need perfect randomness, you can introduce a shuffle buffer:

from itertools import islice

import numpy as np


def shuffle(generator, buffer_size):
    while True:
        buffer = list(islice(generator, buffer_size))
        if len(buffer) == 0:
            break
        np.random.shuffle(buffer)  # shuffle the current chunk in place
        for item in buffer:
            yield item


shuffled_generator = shuffle(my_generator, 256)

This will shuffle data in chunks of buffer_size, so you can avoid memory issues if that is your limiting factor. Of course, this is not a truly random shuffle, so it shouldn't be used on something that's sorted, but if you just need to add some randomness to your data this may be a good solution.
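If you'd rather not depend on numpy, the same chunked approach works with the standard library's random.shuffle (a sketch; shuffle_chunks is a made-up name):

```python
import random
from itertools import islice

def shuffle_chunks(iterable, buffer_size):
    # chunked approximate shuffle: each chunk of buffer_size elements
    # is shuffled independently, then yielded
    it = iter(iterable)
    while True:
        buffer = list(islice(it, buffer_size))
        if not buffer:
            break
        random.shuffle(buffer)
        yield from buffer

sample = list(shuffle_chunks(range(1000), 256))
print(sample[:10])
```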

  • I took your answer and improved it a bit... see the alternative below, streamed rather than chunked. That way elements can migrate far beyond the buffer_size window. Commented Aug 28, 2019 at 18:37

It's not possible to randomize the yield of a generator without temporarily saving all the elements somewhere. Luckily, this is pretty easy in Python:

tmp = list(yielding(x))
random.shuffle(tmp)
for i in tmp:
    print i

Note the call to list() which will read all items and put them into a list.

If you don't want to or can't store all elements, you will need to change the generator to yield in a random order.

  • +1 for "change the generator to yield in a random order." This is what I did when I had the OP's problem. You can add an optional bool param: def yielding(ls, randomize=False). Commented Jul 6, 2022 at 16:07
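The optional parameter mentioned in the comment could look something like this (a hypothetical sketch, not the OP's code):

```python
import random

def yielding(ls, randomize=False):
    # hypothetical variant of the question's generator: materialize and
    # shuffle a copy first when randomize=True
    if randomize:
        ls = list(ls)      # copy, so the caller's list is untouched
        random.shuffle(ls)
    for i in ls:
        yield i

x = [1, 2, 3, 4, 5]
print(list(yielding(x, randomize=True)))
```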

You could sample from arbitrary yielded results, generating a not fully random but somewhat shuffled set within a range. Similar to @sturgemeister's code above, but streamed rather than chunked: there are no fixed randomness boundaries.

For example:

import random

def scramble(gen, buffer_size):
    buf = []
    i = iter(gen)
    while True:
        try:
            e = next(i)
            buf.append(e)
            if len(buf) >= buffer_size:
                # swap a randomly chosen buffered element to the end and yield it
                choice = random.randint(0, len(buf) - 1)
                buf[-1], buf[choice] = buf[choice], buf[-1]
                yield buf.pop()
        except StopIteration:
            # source exhausted: shuffle and drain whatever remains
            random.shuffle(buf)
            yield from buf
            return

The results should be fully random within the buffer_size window:

for e in scramble(itertools.count(start=0, step=1), 1000):
    print(e)

For an arbitrary 1000 elements of this stream, they seem random. But looking at the overall trend (beyond 1000 elements), it is clearly increasing.

To test, check that this yields each of the 1000 input elements exactly once:

seen = list(scramble(range(1000), 100))
assert sorted(seen) == list(range(1000))
  • Nice implementation and decent performance, in my tests around 10X slower than iterating over a standard list() object. I also wasn't aware of the functionality of yield from, interesting!
    – Scholar
    Commented Oct 8, 2020 at 8:27

I needed a solution to this problem so I could get expensive-to-compute elements in shuffled order, without wasting computation by generating unneeded values. This is what I came up with for your example. It involves making another function that indexes the first array.

You will need numpy installed

pip install numpy

The Code:

import numpy as np

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def shuffle_generator(lst):
    return (lst[idx] for idx in np.random.permutation(len(lst)))

def yielding(ls):
    for i in ls:
        yield i

# for i in random.shuffle(yielding(x)):
#     print i
for i in yielding(shuffle_generator(x)):
    print(i)

For very large sequences, if you know the sequence size in advance:

from random import random

class subset_iterator:
    """
    An iterator that returns K random samples from another sequence
    that has no random access. Requires the sequence length as input.
    Similar to random.sample.

    :param it: iterator over the sequence
    :param seqlen: length of the sequence behind :param it:
    :param K: output sequence size (number of samples in the subset)
    """

    def __init__(self, it, seqlen, K):
        self.it = it
        self.N = seqlen
        self.K = K

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            r = random()
            nextitem = next(self.it)
            if r <= float(self.K) / self.N:
                # keep this item with probability K/N (selection sampling)
                self.K -= 1
                self.N -= 1
                return nextitem
            else:
                self.N -= 1

A generator follows a sequential access pattern. Shuffled data requires the exact opposite, a random access pattern.

In many applications, we can get away with local perturbations only, which relaxes the problem quite a lot.

Here is an example of an in-memory shuffle buffer.

from random import randint, shuffle

domain = (0, 1000)
stream = iter(range(*domain))

# pre-fill the buffer from the stream itself
buffer = [next(stream) for _ in range(50)]

for element in stream:
    # swap the incoming element with a randomly chosen buffered one
    idx = randint(0, len(buffer) - 1)
    element, buffer[idx] = buffer[idx], element
    print(element)

# drain what is left in the buffer
shuffle(buffer)
for element in buffer:
    print(element)
