Enum vs Stream

November 15, 2019

The Enum and Stream modules have some of the same functions like each/2, map/2, filter/2 and more. What’s the difference? To help see the difference, let’s look at a simple example. First we’ll look at how Enum works and then compare it to Stream.

Enum and piping a list

When we pipe a list using the Enum.map/2 function, we can peek into the process by using IO.inspect/2 to see the order of execution and what each step returns.

This example uses IO.inspect/2 inside the functions being executed and between each step. With this we can better visualize what is happening.

defmodule Playing do
  def peeking_into_enum() do
    [1, 2, 3, 4, 5]
    |> IO.inspect(label: "ORIGINAL DATA")
    |> Enum.map(fn(num) -> IO.inspect(num + 10) end)
    |> IO.inspect(label: "STEP 1 RESULT")
    |> Enum.map(fn(num) -> IO.inspect(num * 2) end)
    |> IO.inspect(label: "STEP 2 RESULT")
    |> Enum.map(fn(num) -> IO.inspect(to_string(num)) end)
    |> IO.inspect(label: "STEP 3 RESULT")
  end
end
Playing.peeking_into_enum
#=> ORIGINAL DATA: [1, 2, 3, 4, 5]
#=> 11
#=> 12
#=> 13
#=> 14
#=> 15
#=> STEP 1 RESULT: [11, 12, 13, 14, 15]
#=> 22
#=> 24
#=> 26
#=> 28
#=> 30
#=> STEP 2 RESULT: [22, 24, 26, 28, 30]
#=> "22"
#=> "24"
#=> "26"
#=> "28"
#=> "30"
#=> STEP 3 RESULT: ["22", "24", "26", "28", "30"]

This probably works exactly like you expected. Each Enum.map/2 call visits every element in the list and runs the function on the value. At the end of each step, we have a new list with the transformation applied.

This is called eager evaluation.

In eager evaluation, an expression is evaluated as soon as it is bound to a variable.
https://en.wikipedia.org/wiki/Eager_evaluation

This was probably the behavior you expected because it is common in most traditional programming languages.

When is “eager” a problem?

Eager evaluation is the default approach used in Elixir. Why would I want a different strategy? When would “eager” be a problem?

Eager evaluation causes problems when the data is very large, possibly even unbounded. Every Enum step builds its complete result in RAM before the next step starts. When working with large data sets this can be a problem.

The Stream module gives us an elegant way to do lazy evaluation.

Lazy evaluation, or call-by-need is an evaluation strategy which delays the evaluation of an expression until its value is needed.
https://en.wikipedia.org/wiki/Lazy_evaluation
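
To make "unbounded" concrete, here is a minimal sketch you can paste into IEx. Stream.iterate/2 is part of the standard library; the variable names are only for illustration. The sequence never ends, so an eager Enum.map/2 over it would never return, while the lazy version only computes the three values that are asked for.

```elixir
# Stream.iterate/2 describes an infinite sequence: 1, 2, 3, ...
naturals = Stream.iterate(1, &(&1 + 1))

first_three =
  naturals
  |> Stream.map(&(&1 * 10))
  |> Enum.take(3)

IO.inspect(first_three)
#=> [10, 20, 30]
```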

Stream and piping a list

Let’s adapt the Enum code from above to now use Stream and see the difference.

defmodule Playing do
  def peeking_into_stream() do
    [1, 2, 3, 4, 5]
    |> IO.inspect(label: "ORIGINAL DATA")
    |> Stream.map(fn(num) -> IO.inspect(num + 10) end)
    |> IO.inspect(label: "STEP 1 RESULT")
    |> Stream.map(fn(num) -> IO.inspect(num * 2) end)
    |> IO.inspect(label: "STEP 2 RESULT")
    |> Stream.map(fn(num) -> IO.inspect(to_string(num)) end)
    |> IO.inspect(label: "STEP 3 RESULT")
  end
end
Playing.peeking_into_stream
#=> ORIGINAL DATA: [1, 2, 3, 4, 5]
#=> STEP 1 RESULT: #Stream<[
#=>   enum: [1, 2, 3, 4, 5],
#=>   funs: [#Function<48.51129937/1 in Stream.map/2>]
#=> ]>
#=> STEP 2 RESULT: #Stream<[
#=>   enum: [1, 2, 3, 4, 5],
#=>   funs: [#Function<48.51129937/1 in Stream.map/2>,
#=>    #Function<48.51129937/1 in Stream.map/2>]
#=> ]>
#=> STEP 3 RESULT: #Stream<[
#=>   enum: [1, 2, 3, 4, 5],
#=>   funs: [#Function<48.51129937/1 in Stream.map/2>,
#=>    #Function<48.51129937/1 in Stream.map/2>,
#=>    #Function<48.51129937/1 in Stream.map/2>]
#=> ]>

Wow. That output looks really different! A big thing to note is it hasn’t actually executed any of the functions!

At each step, instead of returning a transformed list where each element had the function applied to it, we get a data structure.

The Stream data structure contains the thing to enumerate as enum and it builds up a list of functions that we want applied in funs. Notice that at each step, the only thing that changes is that another function was added to the list.

How do we make a Stream actually do work?

The Stream data structure also implements the Enumerable protocol. This means we can hand our lazy definition to any Enum function, which demands values from the stream and forces it to evaluate. To do this, we can add one more step that pipes our stream into Enum.to_list/1.

defmodule Playing do
  def peeking_into_stream() do
    [1, 2, 3, 4, 5]
    |> IO.inspect(label: "ORIGINAL DATA")
    |> Stream.map(fn(num) -> IO.inspect(num + 10) end)
    |> IO.inspect(label: "STEP 1 RESULT")
    |> Stream.map(fn(num) -> IO.inspect(num * 2) end)
    |> IO.inspect(label: "STEP 2 RESULT")
    |> Stream.map(fn(num) -> IO.inspect(to_string(num)) end)
    |> IO.inspect(label: "STEP 3 RESULT")
    |> Enum.to_list()
  end
end
Playing.peeking_into_stream
#=> ORIGINAL DATA: [1, 2, 3, 4, 5]
#=> STEP 1 RESULT: #Stream<[
#=>   enum: [1, 2, 3, 4, 5],
#=>   funs: [#Function<48.51129937/1 in Stream.map/2>]
#=> ]>
#=> STEP 2 RESULT: #Stream<[
#=>   enum: [1, 2, 3, 4, 5],
#=>   funs: [#Function<48.51129937/1 in Stream.map/2>,
#=>    #Function<48.51129937/1 in Stream.map/2>]
#=> ]>
#=> STEP 3 RESULT: #Stream<[
#=>   enum: [1, 2, 3, 4, 5],
#=>   funs: [#Function<48.51129937/1 in Stream.map/2>,
#=>    #Function<48.51129937/1 in Stream.map/2>,
#=>    #Function<48.51129937/1 in Stream.map/2>]
#=> ]>
#=> 11
#=> 22
#=> "22"
#=> 12
#=> 24
#=> "24"
#=> 13
#=> 26
#=> "26"
#=> 14
#=> 28
#=> "28"
#=> 15
#=> 30
#=> "30"
#=> ["22", "24", "26", "28", "30"]

Adding a call to an Enum function forced our stream to evaluate. Notice that it applied every function in the sequence to the first element in the list before moving on to the second element.

Here’s a different way to visualize what just happened.

Each element in the list is piped through the sequence of functions we defined in our stream and the final value is used in the resulting list. Using this approach we never create the intermediate lists. Only 1 new list containing the final values is created.
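
Another way to picture it, as a rough sketch: the stream behaves as if the three functions were combined into one function that is applied once per element. This is only a mental model, not a description of how Stream is implemented internally, and the variable names are made up for illustration.

```elixir
data = [1, 2, 3, 4, 5]

# The lazy pipeline from above, without the IO.inspect/2 calls.
lazy =
  data
  |> Stream.map(&(&1 + 10))
  |> Stream.map(&(&1 * 2))
  |> Stream.map(&to_string(&1))
  |> Enum.to_list()

# One eager pass where each element goes through all three steps in turn.
single_pass =
  Enum.map(data, fn num ->
    num
    |> Kernel.+(10)
    |> Kernel.*(2)
    |> to_string()
  end)

IO.inspect(lazy == single_pass)
#=> true
```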

When working with very large data sets, this can make a big difference to the memory consumption of your application!

When to use Enum vs Stream?

An obvious question to ask is “when do I choose one approach over the other?” Unfortunately, the answer isn’t always obvious. There are some clear occasions where Stream is the best option. Other times Enum performs better. There are plenty of scenarios where they are so similar that it doesn’t matter. There isn’t an absolute rule to follow here. As you play with it, you get a feel for it.

These next exercises aren’t problems to “solve”. They are opportunities to “play”. I set up some scenarios that give you a chance to play in IEx and develop your own feel for how these approaches compare. I’ll suggest some things to try and experiments to run. Feel free to tweak and run your own experiments as well!

Play time!

The playground equipment

First, let’s make sure you are comfortable with the playground equipment we will use here.

Location of the playground

Our playground is located in the lib/streams.ex file. Open it in your editor and look around.

Using IEx on the project

In a terminal window in the directory of the downloaded project file, enter the following:

iex -S mix

This starts an IEx session and loads the mix project into it so all the code is available to play with.

Executing a function and giving it some initial data looks like this:

CodeFlow.Streams.experiment_1_enum([1, 2, 3, 4, 5])

CodeFlow.Streams.experiment_1_enum(1..1_000)

Tweaking the code

As you are playing and experimenting, you will likely have an “I wonder…” moment and want to tweak the code. Feel free! That’s what this is for!

Rather than killing and restarting the IEx session, you can trigger Elixir to pick up your code changes and recompile it while your IEx session is still running. Use the recompile function.

recompile

Try it out. When you run recompile and no code changes were made, it returns :noop, meaning “no operation” was performed. If you made a code change, it returns :ok. Using recompile keeps you in the flow and experimenting.

Measuring the experiments

Let’s take a look at a simple experiment to see what it’s made up of.

  def experiment_1_enum(data) do
    simple_measurements(fn ->
      data
      |> Enum.map(&(&1 * 2))
      |> Enum.map(&(&1 + 1))
      |> Enum.map(&(&1 + 2))
      |> Enum.map(&(&1 + 3))
      |> Enum.map(&(&1 + 4))
      |> Enum.map(&(&1 + 5))
      |> Enum.map(&(&1 + 6))
      |> Enum.map(&(&1 + 7))
      |> Enum.map(&(&1 + 8))
      |> Enum.map(&(&1 + 9))
      |> Enum.map(&(&1 - 10))
      |> Enum.to_list()
    end)
  end

This defines a series of Enum.map/2 calls. Why so many? Each call to Enum.map/2 creates an intermediate list with the results for that step. Having so many steps helps exaggerate the differences. Tweaking that can be part of your experiments!

Note that the experimental code is wrapped inside the simple_measurements function. The experiment to run is passed in as an anonymous function. The simple_measurements function does the following things for us:

Forces a system-wide garbage collection to give us a standard baseline
Prints out the amount of memory our process is consuming before running the experiment
Tracks the time at start
Runs the anonymous function that performs our experiment
Tracks the time at stop
Prints out the amount of memory our process is consuming after running the experiment
Prints out the elapsed time in milliseconds
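
The real simple_measurements lives in lib/streams.ex. If you don't have the project open, the steps listed above could be sketched roughly like this. The MeasureSketch module name and the print_memory helper are assumptions for illustration, not the project's actual code.

```elixir
defmodule MeasureSketch do
  # Hypothetical stand-in for the project's simple_measurements function.
  def simple_measurements(fun) when is_function(fun, 0) do
    # Ask every running process to garbage collect for a comparable baseline.
    Enum.each(Process.list(), &:erlang.garbage_collect/1)

    print_memory()
    start = System.monotonic_time(:millisecond)

    # Run the experiment itself. Its return value is discarded.
    fun.()

    stop = System.monotonic_time(:millisecond)
    print_memory()
    IO.puts("#{stop - start} msec")
    :ok
  end

  # Prints the memory used by the calling process, in megabytes.
  defp print_memory do
    {:memory, bytes} = Process.info(self(), :memory)
    IO.puts("#{Float.round(bytes / 1_048_576, 2)} MB")
  end
end

MeasureSketch.simple_measurements(fn ->
  1..1_000 |> Enum.map(&(&1 * 2)) |> Enum.to_list()
end)
```

Measuring only the current process keeps the numbers focused on the experiment, at the cost of ignoring memory used elsewhere in the system.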

Here’s an example of the output.

CodeFlow.Streams.experiment_1_enum([1, 2, 3])
#=> 0.01 MB
#=> 0.01 MB
#=> 0 msec
#=> :ok

Executing the function runs the experiment. I passed in a simple 3 element list of [1, 2, 3]. It printed out the starting RAM and ending RAM followed by the elapsed time. With such a small list, it doesn’t really even register. The experiments are designed to make it easy for you to play with different sizes of data and see the impact.

Also note that these are not proper scientific benchmarks. Run the same operation multiple times and you will see variations between the runs. The goal with this setup is to give you enough feedback from your experiments that you get a feel for how it behaves.

Now that you have been introduced to the playground, it’s time to try something!

Experiment #1

There are two functions for experiment #1. An Enum and a Stream version. Try running them both and compare.

CodeFlow.Streams.experiment_1_enum(1..1_000)
#=> 0.01 MB
#=> 0.14 MB
#=> 1 msec
#=> :ok

CodeFlow.Streams.experiment_1_stream(1..1_000)
#=> 0.01 MB
#=> 0.08 MB
#=> 1 msec
#=> :ok

With a 1,000 item list, they both perform very quickly. The Enum version creates intermediate lists with all those steps. The RAM difference is noticeable but minor.

What do you think? Does it matter at this point which approach you’d choose?

Now try it with much larger lists. Try these out:

1..1_000_000
1..10_000_000

What differences did you observe? Notice that the elapsed time is about the same. What about the difference in RAM?

Experiment #2

In the previous experiment, our function returns a full list of however many items you said should be in the list. What if the work we do doesn’t return a list but a computed result? What does that do?

In experiment #2 we change the last line from Enum.to_list/1 to Enum.sum/1. Instead of returning the full list it sums all the numbers and returns the summed value.
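
As a small sketch of that idea (not the project's experiment code, and with fewer steps), here are both approaches ending in Enum.sum/1. The SumSketch module name is made up for illustration.

```elixir
defmodule SumSketch do
  # Eager: each Enum.map/2 builds a full list before Enum.sum/1 runs.
  def enum_sum(data) do
    data
    |> Enum.map(&(&1 * 2))
    |> Enum.map(&(&1 + 1))
    |> Enum.sum()
  end

  # Lazy: Enum.sum/1 pulls one value at a time through the stream.
  def stream_sum(data) do
    data
    |> Stream.map(&(&1 * 2))
    |> Stream.map(&(&1 + 1))
    |> Enum.sum()
  end
end

IO.inspect(SumSketch.enum_sum(1..1_000))
#=> 1002000
IO.inspect(SumSketch.stream_sum(1..1_000))
#=> 1002000
```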

Working with a list of 1,000 elements might look like this:

CodeFlow.Streams.experiment_2_enum(1..1_000)  
#=> 0.01 MB
#=> 0.05 MB
#=> 1 msec
#=> :ok

CodeFlow.Streams.experiment_2_stream(1..1_000)
#=> 0.01 MB
#=> 0.04 MB
#=> 1 msec
#=> :ok

Now try it again with very large lists. Here are the suggested sizes again:

1..1_000_000
1..10_000_000

How did returning a computed value differ from returning a list?

Try using a very small list [1, 2] and using IO.inspect/2 inside the Stream.map/2 calls to peer into what happens.

If you’d like a shortcut, here is one way to tweak the experiment:

  def experiment_2_stream(data) do
    simple_measurements(fn ->
      data
      |> Stream.map(&IO.inspect(&1 * 2))
      |> Stream.map(&IO.inspect(&1 + 1))
      |> Stream.map(&IO.inspect(&1 + 2))
      |> Stream.map(&IO.inspect(&1 + 3))
      |> Stream.map(&IO.inspect(&1 + 4))
      |> Stream.map(&IO.inspect(&1 + 5))
      |> Stream.map(&IO.inspect(&1 + 6))
      |> Stream.map(&IO.inspect(&1 + 7))
      |> Stream.map(&IO.inspect(&1 + 8))
      |> Stream.map(&IO.inspect(&1 + 9))
      |> Stream.map(&IO.inspect(&1 - 10))
      |> Enum.sum()
    end)
  end

You should see that the Stream version never builds a list of the values at all! That’s how the memory difference can be so large when working with large lists.

Experiment #3

Here’s another function to play with. Enum.take/2 takes a desired number of elements from an enumerable and returns them in a list.

What happens when I process a very large list but only want the first 5 elements in the result? That’s the idea we play with in this experiment.

Try it for yourself!

CodeFlow.Streams.experiment_3_enum(1..10_000_000)

CodeFlow.Streams.experiment_3_stream(1..10_000_000)

Did you see a difference? Do you know why they behaved differently? Think about that for a minute before reading on.

Enum functions are eager. At every step they build a full list of the results. The last step is “take the first 5”. All the work creating the millions of list entries was completely wasted.

Stream functions are lazy. They only compute what is requested when it is requested. Nothing in the stream is evaluated until the last line when the first 5 are requested. So it only ever computes the first 5.

To see this in action, try this out in IEx.

defmodule Testing do
  def take_5() do
    1..10_000_000
    |> Stream.map(&IO.inspect(&1 * 2, label: "Num #{&1} * 2"))
    |> Enum.take(5)
  end
end
Testing.take_5()

It only performs the operation on the first 5 elements.
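
If you would rather count the work than read printed lines, here is a rough sketch that does so. It uses Erlang's :counters module purely as a call counter, and a smaller range so the eager version finishes instantly. The variable and function names are assumptions for illustration.

```elixir
# One counter per approach, each with a single slot.
enum_calls = :counters.new(1, [])
stream_calls = :counters.new(1, [])

# Builds a mapping function that bumps the given counter on every call.
double_and_count = fn counter ->
  fn num ->
    :counters.add(counter, 1, 1)
    num * 2
  end
end

eager = 1..1_000 |> Enum.map(double_and_count.(enum_calls)) |> Enum.take(5)
lazy = 1..1_000 |> Stream.map(double_and_count.(stream_calls)) |> Enum.take(5)

IO.inspect(eager == lazy, label: "same result")
IO.inspect(:counters.get(enum_calls, 1), label: "Enum.map calls")
IO.inspect(:counters.get(stream_calls, 1), label: "Stream.map calls")
```

Both pipelines return the same five values, but the eager one calls the function once for every element in the range.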

A Range is lazy too. It is not a Stream struct, but writing 1..100 only stores the bounds of the range. It doesn’t expand to the full set of numbers.

This also shows that a Stream can be stopped before visiting all the possible elements.
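
The note about ranges is easy to check in IEx. This is just a quick sketch, and the variable name is arbitrary.

```elixir
range = 1..10_000_000

# A range stores its bounds, not ten million integers.
IO.inspect(range.first)
#=> 1
IO.inspect(range.last)
#=> 10000000

# Values are produced only as they are asked for.
IO.inspect(Enum.take(range, 3))
#=> [1, 2, 3]
```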

Experiment #4

The downloaded project includes a file in test/support/lorem.txt that contains 13MB of generated lorem-ipsum text.

The lorem.txt file gives us a chance to play with reading a large-ish file. Using the Enum approach, the entire file gets loaded into memory and operated on. Using the Stream approach we can read chunks of data or even just a line of text at a time.

This experiment takes each line of text and splits it into a list of words. It then counts the number of words on each line. Finally, it sums the per-line counts to get the total number of words in the file.
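
The project's versions are in lib/streams.ex. As a rough sketch of the two approaches described above, assuming a made-up WordCountSketch module and a tiny temporary file instead of lorem.txt, it might look like this:

```elixir
defmodule WordCountSketch do
  # Eager: read the whole file into memory, then work on the full list of lines.
  def count_eager(path) do
    path
    |> File.read!()
    |> String.split("\n")
    |> Enum.map(&String.split/1)
    |> Enum.map(&length/1)
    |> Enum.sum()
  end

  # Lazy: File.stream!/1 hands over one line at a time.
  def count_lazy(path) do
    path
    |> File.stream!()
    |> Stream.map(&String.split/1)
    |> Stream.map(&length/1)
    |> Enum.sum()
  end
end

# A small throwaway file so the sketch runs anywhere.
path = Path.join(System.tmp_dir!(), "word_count_sketch.txt")
File.write!(path, "lorem ipsum dolor\nsit amet\n")

eager_total = WordCountSketch.count_eager(path)
lazy_total = WordCountSketch.count_lazy(path)
File.rm!(path)

IO.inspect({eager_total, lazy_total})
#=> {5, 5}
```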

Run both and see how they compare. Run it a few times until the times level out.

CodeFlow.Streams.experiment_4_enum()

CodeFlow.Streams.experiment_4_stream()

What did you see happen? Why do you think it came out that way?

Your times will vary! This is how my computer performed each.

CodeFlow.Streams.experiment_4_enum()
#=> 0.01 MB
#=> Total words counted: 1954512
#=> 13.08 MB
#=> 108 msec
#=> :ok

CodeFlow.Streams.experiment_4_stream()
#=> 0.01 MB
#=> Total words counted: 1954512
#=> 0.03 MB
#=> 219 msec
#=> :ok

Notice that the Stream version was 2x slower. Also notice that the Enum version used much more RAM. The exact size varies with the version of Erlang and your platform. The Enum version loads the full file into memory and operates on it there. The Stream version used very little memory.

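Why is the Stream version slower? Disk access is slower than memory access. The Stream version goes back to the file for every line, and all of those small reads (even when the OS caches the file) add up to more time than reading the file once and doing the rest of the work in RAM.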

Which is better? It depends.

Which is better when you need to process 100 files all 5K in size? What if the files were all 250GB in size? Do you care more about processing multiple files in parallel or the raw speed on a single file at a time?

Which is better depends on the problem you are solving and how it will be used.

Built-in ways to start a stream

After playing with the experiments, hopefully you have a better sense of when to use Enum vs Stream. The next question you might ask is “What can be a stream?” Elixir’s standard library comes with some built-in functions for creating streams.

Here are a few built-in ways to create a stream without using the Stream module.

File.stream!/3 – Returns a stream for reading a file. Used in Experiment #4.
IO.stream/2 – Converts an IO device into a stream.
Task.async_stream/5 – Spawns concurrent tasks to process the elements of an enumerable. Options let you tune how many tasks run at once, etc. A good option when the work for each element is expensive and running in parallel makes sense.
Ecto.Repo.stream/2 – The Ecto database library can stream query results. Works well for processing potentially very large result sets.
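
As a quick taste of one of these, here is a minimal sketch using Task.async_stream/3, the variant that takes an anonymous function instead of a module, function, and arguments. The squaring function and the max_concurrency value are arbitrary choices for illustration.

```elixir
results =
  1..5
  |> Task.async_stream(fn num -> num * num end, max_concurrency: 2)
  |> Enum.to_list()

# Each result is wrapped in an {:ok, value} tuple, in the original order.
IO.inspect(results)
#=> [ok: 1, ok: 4, ok: 9, ok: 16, ok: 25]
```

Like any other stream, nothing runs until an Enum function such as Enum.to_list/1 asks for the results.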

Anything can be a stream

Using Stream.resource/3, potentially anything can become a stream. The resource/3 function takes in 3 functions to do the following:

Set up the resource
Get the next value from it
Close or clean up the resource
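
Here is a minimal sketch of those three functions working together. It fakes a paged service with an in-memory map, so the PagesSketch module, its page data, and the page numbering are all assumptions made up for illustration. A real version would fetch each page over the network in the "next" function.

```elixir
defmodule PagesSketch do
  # Hypothetical stand-in for an external service: page number => items.
  @pages %{1 => ["a", "b"], 2 => ["c", "d"], 3 => ["e"]}

  def stream do
    Stream.resource(
      # Set up: start at page 1.
      fn -> 1 end,
      # Next: emit the items on the current page, or halt when there are no more pages.
      fn page ->
        case Map.fetch(@pages, page) do
          {:ok, items} -> {items, page + 1}
          :error -> {:halt, page}
        end
      end,
      # Clean up: nothing to close for an in-memory map.
      fn _page -> :ok end
    )
  end
end

IO.inspect(Enum.take(PagesSketch.stream(), 3))
#=> ["a", "b", "c"]
IO.inspect(Enum.to_list(PagesSketch.stream()))
#=> ["a", "b", "c", "d", "e"]
```

Because the stream is lazy, asking for three items only ever looks at the first two pages.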

Examples of what you could use this to do:

Fetch and process pages of JSON data from an external service. The “next” function can fetch the next page of data.
Lazily parse a large CSV file – which is what the NimbleCSV library does.
Stream and process very large files over the web – as this blog series demonstrates.

Recap

You took a different approach here. Instead of making failing tests pass, you spent time playing and experimenting with Enum and Stream to get a feel for how a Stream is different.

In many ways, a Stream acts like Enum. Both are enumerable, have functions for processing data, and more. However a Stream is different because it uses lazy evaluation.

You may not need to use a Stream often, but knowing what it is, how to use it, why it is different, and getting a feel for the kinds of problems it helps solve was our goal.

Consider using a Stream when you are working with data that would otherwise consume a lot of memory to process. Examples are:

large lists
large files
large database result sets
processing a potentially unbounded source