Data Streams
In Russian programming jargon, the word “поток” (potok) is used for several different concepts: thread, stream, and flow. This time we’ll skip threads and talk about streams and flows, that is, I/O streams and data flows. Accordingly, I’ll be using the word “stream” frequently below, specifically in that sense.
In tech, we increasingly have to analyze or process data streams on the fly. And that’s only becoming more common as the modern world generates staggering volumes of information—so much that entire compute clusters are built in data centers just to store and process it. Yes, those same stdin, stdout, and stderr you first learned about in school are tirelessly shuttling bits and bytes back and forth.
If we set aside the poetry and get practical, the simplest examples of data streams are network traffic, a sensor’s signal, or, say, real-time stock quotes. There’s already a whole and growing class of stream-processing algorithms for such tasks. When you type something like "cat file.txt | sed s/foo/bar/g" at the command line, you’re manipulating a data stream: the pipe (the vertical bar) passes cat’s stdout into sed’s stdin.
This brings us to the concept of the iterator and what’s known as dataflow programming. To handle a continuous stream of incoming data, you break it into chunks and work on those chunks. That’s where the iterator comes in.
info
An iterator is an abstraction that lets you pull elements one by one from a source—anything from stdin to a large container—while only keeping track of the element it’s currently positioned at. In C/C++ terms, you can think of it as a pointer you move across a container. For these languages, I recommend watching Andrei Alexandrescu’s talk.
Iterating over items using an iterator
Now, onto Python. If the previous paragraphs felt obvious and dull, don’t worry—code is coming up very soon.
Let’s start with the terminology. In Python (and not only in Python), there are two terms that sound almost the same but mean different things: iterator and iterable. An iterator is an object that implements the interface described above, while an iterable is a container that can serve as a data source for an iterator.
You probably remember that for an instance of a class to be usable in a for loop, the class must implement the iterator protocol: the __iter__() and __next__() methods (next() in Python 2). You can iterate over a list, but the list itself doesn’t track where you are in the traversal. That state is managed by a separate list-iterator object, which is returned by the built-in iter() and consumed by constructs like the for loop or by functions such as map(). When the collection runs out of items, a StopIteration exception is raised.
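To make the protocol concrete, here’s a minimal sketch of what a for loop does behind the scenes when you hand it a list:

mylist = [1, 2, 3]
it = iter(mylist)  # ask the iterable for its iterator
print(next(it))    # 1
print(next(it))    # 2
print(next(it))    # 3
next(it)           # raises StopIteration: nothing left to hand out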
In most cases, except where explicitly noted, I’ll try to write code that works with both Python 2 and Python 3. The six module will help with that: it provides an Iterator base class, so you don’t have to worry about which next method to implement on your object; it’s always __next__().
from __future__ import print_function
from six import Iterator

class MyIterator(Iterator):
    def __init__(self, step=5):
        self.step = step

    def __iter__(self):
        return self

    def __next__(self):
        self.step -= 1
        # Iterator stop condition so it doesn't run forever
        if not self.step:
            raise StopIteration()
        return self.step

myiterator = MyIterator()
for item in myiterator:
    print(item, end=" ")
# 4 3 2 1
It’s pretty straightforward. The MyIterator class implements the interface described above for iterating (more precisely, generating) elements and raises an exception when the current step value reaches zero. Note its laziness: the iterator only does work when the for loop asks for the next item (i.e., on each iteration advance).
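You can observe that laziness directly by driving the same iterator by hand instead of with a for loop; each value is computed only at the moment you ask for it:

myiterator = MyIterator()
# Nothing has been computed yet; work happens only on demand
print(next(myiterator))  # 4
print(next(myiterator))  # 3
# The remaining values are never produced unless we keep asking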
Syntactic Sugar
But do we really need to overengineer with classes when all we want is to iterate over a collection? Let’s recall list comprehensions.
with open("file.txt", "r") as f:
    mylist = [l for l in f if "foo" in l]
    for item in mylist:
        print(item)
# This prints all lines from file.txt that contain the substring "foo"
Now we’re getting somewhere. Three and a half lines of code (easily reducible to two, but I want to illustrate the idea) that build a list—and as we know, lists already come with an iterator. But there’s a small problem. Actually, a big one. More precisely, it depends on how many elements end up in the collection (in this case, how many lines in the file contain the substring “foo”, and therefore how many go into the list).
You didn’t forget that a Python list is a contiguous block of memory (an array of pointers, much like a C array), right? What’s more, when you append to the end (which is exactly what we’re doing), there’s a chance the new element won’t fit into the currently allocated block, and the interpreter will have to request a bigger block and copy all the existing elements over: an O(n) operation. And what if the file is large and there are lots of matching lines?
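A hedged way to watch this growth in action is sys.getsizeof(), which reports the size of the list object itself (the pointer block, not the elements); the exact numbers vary across Python versions:

import sys

mylist = []
for i in range(20):
    mylist.append(i)
    # The reported size grows in jumps: CPython over-allocates, and an
    # append that doesn't fit triggers a copy into a bigger block
    print(len(mylist), sys.getsizeof(mylist))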
Bottom line: don’t use the code as written. So what should we do instead?
with open("file.txt", "r") as f:
    mylist = (l for l in f if "foo" in l)
    for item in mylist:
        print(item)
# Here too, prints all lines from file.txt that contain the substring "foo"
Honestly, I was surprised to see, while reading the code of several large open-source projects, that people love list comprehensions but completely forget about generator expressions. To me, that’s like using range() instead of xrange() in Python 2 just to iterate over numbers—overlooking the fact that it gobbles up memory to store the entire sequence of results.
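If the memory point feels abstract, compare the two forms directly; the size of a generator object doesn’t depend on how many items it will eventually produce (the numbers below are illustrative and platform-dependent):

import sys

squares_list = [x * x for x in range(100000)]  # all the elements live in memory
squares_gen = (x * x for x in range(100000))   # only the iteration state does

print(sys.getsizeof(squares_list))  # hundreds of kilobytes on a typical CPython build
print(sys.getsizeof(squares_gen))   # on the order of a hundred bytes, regardless of count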
Let’s Generate Something Useful
So what exactly is a generator expression, and what is a generator, anyway? Put simply, a generator expression is another bit of syntactic sugar in Python—the simplest way to create an object with an iterator interface without loading all items into memory (which you usually don’t need). And as for a generator as a concept…
Take functions, for example: just about everyone who has ever opened a programming book knows what they are. But functions have strictly defined behavior: a single entry point and a single return value (what Python lets you do with “return a, b” isn’t true multiple return; it’s just returning a tuple). Now, what if I told you a generator is essentially a function with multiple entry and exit points?
The key idea behind a generator is that, like an iterator, it remembers where it left off, but instead of dealing with abstract elements, it runs actual chunks of code. So while an iterator will, by default, walk through the items in a container until they’re exhausted, a generator keeps executing code until a specific return condition is met. In fact, the code shown in the first section is essentially a generator that exposes an iterator interface with a clearly defined exit condition. If you expand the generator expression from the snippet above into a full generator function, it would look roughly like this:
def my_generator():
    with open("file.txt", "r") as f:
        for l in f:
            if "foo" in l:
                yield l

myiterator = my_generator()
for item in myiterator:
    print(item)
# And here too (what a surprise) prints all lines from file.txt that contain the substring "foo"
The yield keyword acts as a delimiter between chunks of code that the generator runs each time it’s advanced—that is, on each iteration. In fact, you don’t need a loop at all; you can just write several pieces of code in a function and separate them with yield statements. The interface stays exactly the same: it’s still iterator-based.
def my_generator2():
    print("First block")
    yield 1
    print("Second block")
    yield 2

myiterator = my_generator2()
next(myiterator)
# "First block"
next(myiterator)
# "Second block"
next(myiterator)
# Traceback: StopIteration
Yield Is Smarter Than You Think
Alright, we’ve basically covered multiple exit points. But what about multiple entry points? Is next() really all we can do with a generator? As it turns out, no.
Since the days of Python 2.5 (PEP 342), generator objects have had a few more methods: .send(), .throw(), and .close(). This more or less sparked a revolution in dataflow programming. With .close() you can externally tell a generator to stop the next time it’s advanced, and with .throw() you can force it to raise an exception:
from __future__ import print_function
import itertools

def my_generator3():
    # Infinite counter
    counter = itertools.count()
    while True:
        yield next(counter)

myiterator = my_generator3()
myiterator2 = my_generator3()

for item in myiterator:
    print(item, end=" ")
    if item == 3:
        # Properly terminate the generator
        myiterator.close()

for item in myiterator2:
    print(item, end=" ")
    if item == 2:
        # This is bad
        myiterator2.throw(Exception("Everything is bad"))

# 0 1 2 3
# 0 1 2 Traceback (most recent call last):
#   File "test.py", line 28, in <module>
#     myiterator2.throw(Exception("Everything is bad"))
#   File "test.py", line 12, in my_generator3
#     yield next(counter)
# Exception: Everything is bad
And the real treat, as it turns out, is hidden in the .send() method. It lets you inject data into a generator before it runs the next chunk of code!
from __future__ import print_function
import itertools

def my_coroutine():
    # Infinite counter
    counter = itertools.count()
    while True:
        y = (yield next(counter))
        if y:
            # Change the starting point
            counter = itertools.count(y)

myiterator = my_coroutine()
for item in myiterator:
    print(item, end=" ")
    if item == 1:
        # Send a number to the generator
        myiterator.send(4)
    elif item == 6:
        myiterator.close()
# 0 1 5 6
There’s a reason I called the generator here a coroutine. A coroutine is exactly that thing developers whisper about in meeting rooms when discussing gevent, Tornado, and eventlet. You can read more on Wikipedia, but I’ll focus on the fact that coroutines in this form are most often used in Python for stream processing, enabling cooperative multitasking.
The point is that when yield is called (as with return in the general case), control is transferred. A coroutine decides for itself when to hand off execution—say, to another coroutine. This lets you build elegant, branching trees for processing data streams, implement MapReduce, or even pipe the current bytes through a socket to another node. Moreover, coroutines can be implemented in virtually any language, much like the Linux command-line utilities I mentioned at the very beginning.
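To make that less abstract, here’s a minimal push-style pipeline sketch in the spirit of David Beazley’s coroutine examples; the coroutine() priming decorator and the stage names are my own illustration, not a standard API:

from __future__ import print_function
from functools import wraps

def coroutine(func):
    # Prime a new coroutine by advancing it to the first yield,
    # so it's immediately ready to receive values via .send()
    @wraps(func)
    def primed(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return primed

@coroutine
def grep(pattern, target):
    # First "gear": forward only the lines containing pattern
    while True:
        line = (yield)
        if pattern in line:
            target.send(line)

@coroutine
def printer():
    # Last "gear": a sink that just prints whatever it receives
    while True:
        line = (yield)
        print(line, end="")

pipeline = grep("foo", printer())
for line in ["foo bar\n", "baz\n", "another foo\n"]:
    pipeline.send(line)
# foo bar
# another foo

Each stage keeps its own state between .send() calls, and swapping grep() for any other primed coroutine doesn’t disturb the rest of the pipeline.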
A Tiny Practical Example
For someone without prior exposure, code written in a coroutine style can be quite hard to read. It’s like a set of asymmetric gears that somehow take turns driving one another, all while preserving their last state.
In general, this “gear-driven” setup has a great advantage: any “gear” can be moved elsewhere or swapped out for another that does the job better. What I mean is that in Python, all you really need to know about an object is that it exposes an iterator interface the interpreter understands; the language it’s written in and what it does under the hood matter much less.
I’ll give an example from David Beazley’s presentation (see the sidebar). Here are a few sample Apache log entries:
23.34.139.80 - ... "GET /categories/ HTTP/1.1" 200 6394
23.34.139.80 - ... "GET /favicon.ico HTTP/1.1" 404 807
23.34.139.80 - ... "GET /static/img/logo.gif HTTP/1.1" 200 41526
23.34.139.80 - ... "GET /news/story.html HTTP/1.1" 200 6223
23.34.139.80 - ... "GET /about/example.html HTTP/1.1" 200 1223
42.77.100.21 - ... "GET /index.html HTTP/1.1" 200 7774
Let’s sum the last column to find out how many bytes were transferred over the network. I’ll leave implementing it with a regular loop as homework for you. Instead, we’ll jump straight to expressing it with generator expressions.
wwwlog = open("access-log")
bytecolumn = (line.rsplit(None, 1)[1] for line in wwwlog)
bytes_sent = (int(x) for x in bytecolumn if x != '-')
print("Total", sum(bytes_sent))
It’s pretty obvious what each line does, isn’t it? If we need to filter specific lines or do some extra counting along the way, we can roll it all into a single pipeline—just add a few more generator stages in the right places, like meshing gears.
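For example, here’s a hedged sketch of one extra “gear”: counting only the traffic of successful requests by splicing a filter stage into the same pipeline (the status-code check below is a simplification keyed to the sample log format above):

wwwlog = open("access-log")
# New stage: keep only requests that returned HTTP 200
ok_lines = (line for line in wwwlog if line.rsplit(None, 2)[1] == "200")
bytecolumn = (line.rsplit(None, 1)[1] for line in ok_lines)
bytes_sent = (int(x) for x in bytecolumn if x != '-')
print("Total", sum(bytes_sent))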
So what do we do with all that?
Understanding how iterators and generators work in programming languages is one of the first steps toward mastering the sequential processing of massive data streams. This includes fields like trading and technical analysis—areas where, in today’s world, many people make a fortune.
Even without aiming for the moon, your script will end up using several times fewer resources. The Python interpreter does try to avoid unnecessary data copying, but here its hands are tied: if the code says to build a list, it has to materialize the entire list in memory.
All in all, wishing you fast, elegant code. See you around!