Arek Nawo

20 May 2019

17 min read

Node.js file streams explained!

In today’s post, we’re continuing the discovery of Node.js (v10.15.3 LTS) APIs! Last time, we’ve discussed the File System (FS) API used to read and write files, but not all of it. We haven’t yet talked about quite a few things, including streams, which present a great, alternative way of reading and writing data. Instead of doing everything at once (even if it’s done asynchronously), streaming data is much more efficient and performant way - especially when working with large files. Your files are read or written chunk by chunk, rather than all at once. They have a few FS API methods dedicated to them, but also a whole API of their own called Stream API. And it’s all that goodness that we’ll explore in this post!

Streams

Before going further into FS-related file streaming, we first should learn a bit more about Node.js Stream API itself. At its core, a Stream is an interface based on [EventEmitter](https://nodejs.org/dist/latest-v10.x/docs/api/events.html#events_class_eventemitter) class, that is implemented by multiple methods from across Node.js API. Examples of these include HTTP requests and here-mentioned File System operations. The EventEmitter on its own is a very simple class that many other entities use or inherit from. It’s responsible for listening and emitting events, with methods like .on() and .emit(). I think the documentation provides all the information in a clear and readable way.

Streams can be readable, writable or both. Most of the “stream” module API as a whole, is targeted towards creating implementations of Stream interface, which isn’t the focal point of this article. With that said, here, we’ll make a nice overview of readable and writable streams interfaces only, with “consumer use” in mind.

By default, streams operate only on strings and buffers, which happens to be the same form of data that we used to write and read files in the previous post. You can make your stream work with other types of values by setting objectMode property when creating a stream. A stream that is in “object mode” support all possible types of values, except null, which serves special purposes. This trick shouldn’t really be needed when working with FS API however.

createReadableStreamSomehow({ objectMode: true });

Readable

Readable streams are the ones from which data can be read. They’re defined by the stream.Readable class and operate in 2 different reading modes (not to be misunderstood with “object mode”). These are “flowing” and “paused”. All newly-created Streams are in “paused mode” by default and thus, they require the user to explicitly request another chunk of streamed data. “Flowing mode”, on the other hand, makes data “flow” automatically, with you just having to handle - consume or ignore - incoming data.

Buffering

Whatever the mode you’re streaming your data with is, it’ll first have to be buffered. For this purpose, readable streams internally use .readableBuffer property, whereas writable streams - .writableBuffer. The size limit for those buffers is determined by highWaterMark property passed to stream constructor’s config. It’s considered either as the highest number of bytes (16 KB by default) or the highest number of objects (if in “object mode” - 16 by default) stored.

createReadableStreamSomehow({ highWaterMark: 8192 });

Different kinds of streams handle buffering differently. In the case of readable streams, data is constantly read and placed in the buffer, until it reaches the provided limit. Then, data reading is stopped, until data inside the buffer will be consumed, effectively freeing it.

Pause mode

Consuming streamed data highly depends on your current reading mode. When in “paused mode” - the default one - you’ll have to manually request the next chunk of data. For that, you’ll have to use the .read() method. The whole data from the internal buffer will be returned, unless you pass an argument, specifying the size limit for data to read.

// ...
readable.read();

In “object mode”, a single object will always be returned, regardless of the size argument.

Switching

Switching from the “paused mode” doesn’t require much work. The simplest way to do it would be to add a handler for the “data” event. Other ways include calling .resume() method, which resumes the emission of “data” event, or by piping a writing stream (more on that later).

// ...
readable.on("data", dataChunk => {
    // code
});
// or
readable.resume();

If for whatever reason, you want to go back to the “paused mode”, you can do this in two ways. You can either use .pause() method to stop emitting “data” event, or, if you’ve previously used the .pipe() method, use the .unpipe() to… unpiped piped writing stream.

// ...
readable.pause();

There’s an event called “readable”, which, if listened to, can make you stuck in “paused mode” and thus, make calling methods like .pause() and .resume() useless. It’s emitted when the new portion of data is available to read from the buffer and before the stream’s ending, when read data will be equal to null. After the event handler is removed, everything comes back to normal.

// ...
const handler = () => {
  readable.read();
  // handle reading manually
}
readable.on("readable", handler);
readable.off("readable", handler);

Flowing mode

“Flowing mode” is definitely a bit more complex in its nature. Here, the .read() method is called automatically, leaving you only with consuming given data within the “data” event, emitted right after .read() call, with a fresh data chunk.

// ...
readable.on("data", dataChunk => {
    // code
});

Furthermore, “flowing mode” has a safeguard built-in, that prevents the data from being automatically read, if a proper handler isn’t available. So, only when you add your “data” event handler, data will start flowing. As mentioned earlier, this also makes a switch from “paused” to “flowing” mode take place. You still need to be cautious though! Calling .resume() method without “data” event handler, or removing the handler, won’t stop the reading process and will result in data loss!

Events

Beyond “readable” and “data” events, readable streams can emit 3 more - “end”, “close” and “error”. The “end” event is emitted when the stream ends and all data has been consumed.

// ...
readable.on("end", () => {
    console.log("Stream ended");
});

The “close” event is emitted when an underlying source has been closed. Examples of that include closing the underlying file descriptor with the fs.close() method, discussed in the previous article.

// ...
readable.on("close", () => {
    console.log("Stream ended");
});

Lastly, we have the “error” event, which is, quite frankly, emitted whenever some sort of an error happens. An error object will be passed to the callback function.

// ...
readable.on("error", err => {
    console.log(err);
});

Checks

To maintain the proper control of the stream, Node.js provides you with some additional methods and properties.

You can check if the stream is in “paused mode” by calling .isPaused() method.

// ...
readable.isPaused(); // false
readable.pause();
readable.isPaused(); // true

With our current knowledge, the output of the first .isPaused() check may surprise you. Why the readable stream isn’t paused if we haven’t yet added any “data” handler or called .resume()? The answer is that, internally, the operating mode we’re talking about is a bit more complex. What we’ve discussed is just an abstraction over the state of reading stream, dictated by internal .readableFlowing property which you shouldn’t mess with. It can have one of 3 values - null, true or false. And, while true and false can be somewhat compared to our “paused” and “flowing” mode, null cannot. So, as the internal state is null just after the stream is created (it can be changed later by likes of .pause() or “data” event handlers), it’s not paused. That’s why the first invoke of .isPaused() returns false.

The official Node.js documentation provides you with 3 more metadata properties. .readable informs you if .read() can be called safely (in Node.js code it’s documented as a legacy feature though), .readableHighWaterMark provides you with your buffer size limit, and .readableLength indicates the current buffer size. Both of these can indicate the number of bytes or the number of objects, depending on whether “object mode” is turned on. Of course, Stream instances have a lot more internal properties you can access, but, unless you’re a creating your own Stream implementation, you shouldn’t really do, or even need to do this.

// ...
readable.readable; // true
readable.readableHighWaterMark; // 16384 by default
readable.readableLength; // number of bytes currently in buffer

Changes

Interaction with readable streams, apart from a standard workflow, is kind-of limited. This isn’t an issue though, as streams don’t really require much of that stuff.

.destroy() method does exactly what its name indicates - it destroys the stream, releasing internal resources (buffered data) and emitting “error” and “close” events. You can optionally pass an error object, that will be retrieved later in an “error” event handler.

// ...
readable.destroy();

With the .setEncoding() method you can change the encoding in which your data is read. By default, it’s equal to “buffer”. We’ve discussed encodings a bit deeper in the previous post.

// ...
readable.setEncoding("utf8");

Know that most stream implementations allow passing a config object that can be provided with encoding property, effectively setting it right from the start.

In scenarios, where you don’t want to consume all the streamed data linearly but in some different way, the .unshift() method may prove to be helpful. It literally puts the retrieved chunk of data back into the internal buffer. It can be called at any time, except after the “end” event. Still, you need to remember that when .unshift() is done, your data will be back inside your internal buffer, ready to be read again, with the first upcoming .read() call.

// ...
readable.setEncoding("utf8");

readable.on("readable", () => {
  let data = readable.read();
  
  // Let's say our streamed data is a string - "Hello World!";
  while (data === "Hello World!") {
    // Infinite loop!
    readable.unshift(data);
    data = readable.read();
  }
});

Piping

The process of piping brings us into the writable streams territory. All things that the .pipe() method does is simply piping (passing or connecting) the readable stream to the writable one. In this way, you can e.g. transfer the data from one file to another with ease!

const readable = createReadableStreamSomehow();
const writable = createWritableStreamSomehow();

readable.pipe(writable);

Like I’ve mentioned earlier when talking about operation modes, the .pipe() method automatically switches the readable stream to “flowing mode”. It also seamlessly manages the data flow and, in the end, returns the passed writable stream. In this way, you can use bidirectional streams (not discussed in this article), like ones implemented by Node.js ZLIB (compression), to create chainable, continuous flow.

The .pipe() method automatically closes the writable stream (no more data can be written), when “end” event from readable stream happens. You can change this behavior by passing an optional config object with end property in the form of boolean.

// ...
readable.pipe(writable, {end: false});

If you want to detach the piped stream(s), you can easily call .unpipe() method to do that. It detaches all piped streams if no writable stream is passed, or only the provided one otherwise. If the operating mode was set through the use of the .pipe() method, it will go back to the previous “paused mode”.

Writable

Even if a writable stream may seem to serve a bit more complex task of writing data, have a much simpler API. It favors the use of methods over events, but generally is quite similar to what we’ve seen with readable streams. There’s also no complex concepts of operation modes and all that stuff. Generally, it shouldn’t be hard for you to learn writable streams if you already know how to use the readable ones.

const writable = createWritableStreamSomehow();

Buffering

As writing is much different from reading, the buffering process is different too! In writable streams, every time you call .write() method, the data to be written is added to the buffer.

// ...
let bufferNotFull = writable.write("Hello World!", "utf8", () => {
    // code
});

The .write() method is pretty complex and can take 1 up to 3 arguments. The first should contain the data to be written - string or buffer. If it’s a string, then you can provide an optional second argument, indicating the encoding of the passed data, if you don’t want to use the default encoding of the given writable stream. Finally, you can pass a callback function to be invoked after data is written to the buffer.

The result of the .write() method will be a boolean, indicating whether there’s still some space left in the internal buffer. If it’s full (the return value is false) you should stop writing your data and wait for the “drain” event, to start writing again. Not following this practice can result in high memory usage, errors and thus, crashes.

// ...
writable.on("drain", () => {
    console.log("You can continue the writing process!");
});

Handling of .write() and “drain” event is done automatically and efficiently, when used through .pipe(). Thus, for more demanding scenarios, it’s recommended to wrap your data within a readable stream form if possible.

Similarities

Like I’ve mentioned earlier, writable streams share many similarities with readable ones. By now we know that there’s an internal buffer, which size can be set through the highWaterMark property of config object.

const writable = createWritableStreamSomehow({
    highWaterMark: true
});

Writable stream object config also accepts a number of other options. One of which is encoding. Just like in the readable streams, it sets the default encoding to be used throughout the whole stream. The same can be set using .setDefaultEncoding() method. The difference in naming (“default” part) comes from the fact that it can be freely altered in every .write() call you make.

// ...
writable.setDefaultEncoding("utf8");

Beyond the “drain” event, writable streams emit a few more. Two from which you already know - “error” and “close”. They’re emitted on an error and e.g. on file descriptor close or .destroy() (also available for writable streams) method call respectively.

// ...
writable.on("error", err => {
    console.log(err);
});

writable.on("close", () => {
    console.log("No more operations will be performed!");
});

writable.destroy();

Writable streams also implements a few more properties similar to readable streams, but with slightly altered naming. Instead of “readable”, the “writable” phrase is used, for obvious reasons.
Such alteration can be seen in .writable property, which indicates if .write() method is safe to call, .writableHighWaterMark, and .writableLength, providing metadata about internal buffer size limit and it’s current size.

// ...
writable.writable; // true
writable.writableHighWaterMark; // 16384 by default
writable.writableLength; // number of bytes currently in buffer

Ending

Stream-writing data isn’t an endless process. To end it, you’ll need to call .end() method. It behaves just like the .write() method, just for allowing you to write your last chunk of data. The optional callback function can be treated as a handler for “finish” event, which is called directly after the stream ends. After all that, no more data can be written using the given stream and attempt of doing this will result in an error.

writable.end("The last chunk", "utf8", () => {
     console.log("Writable stream ended!");
     // Just like writable.on("finish", ...);
});

Piping

The .pipe() on the side of the writable stream doesn’t make much sense. That’s why the only reminiscents of the piping process here are “pipe” and “unpipe” events. Events occur when .pipe() and .unpipe() methods are called on readable stream side. For both callbacks, the piped readable stream is provided.

// ...
writable.on("pipe", readable => {
    console.log("Piped!");
});

Corks

Too many calls to the .write() method, when providing small chunks of data, may result in decreased performance. For such scenarios, writable streams provide .cork() and .uncork() method. After calling the .cork() method, all data written using .write() will be saved to memory instead of the buffer. In this way, the smaller data chunks can be easily batched for increased performance. You can later push the data from memory to buffer using .uncork() method. Know that these methods work linearly in somewhat LIFO-like (Last In First Out) order. The same number of .uncork() calls needs to be done as the .cork() method.

// ...
writable.cork();
writable.write("Hello");
writable.cork();
writable.write("World!");
process.nextTick(() => {
    stream.uncork();
    stream.uncork();
});

The trick with doing the .uncork() calls in the nextTick callback is yet another performance trick, which results in better performance through internal batching of .write() calls. We’ll learn a bit more about the process, together with it’s methods and properties in future posts.

File System streams

Phew… it’s been quite a ride, don’t you think? Still, we aren’t done. Remember the base examples from the overview above? I’ve used something like createReadableStreamSomehow(). It’s because I didn’t want to mess your mind with FS-related streams by then and the basic stream.Readable and stream.Writable class from “stream” module are just references for implementation that don’t handle events and other stuff properly. It’s time to fix this little error!

Read streams

FS API implements Readable Stream interface through fs.ReadStream class. It also exposes special method for instancing it - fs.createReadStream(). It takes a path to the file to be read as the first argument, and an optional config object as the second one.

const fs = require("fs");
const readStream = fs.createReadStream("file.js");

Config object accepts multiple properties. Two of which are already known to us - encoding and highWaterMark (in this implementation it defaults to 65536 ~ 64 KB). You can also pass flags string specifying FS flags and operation mode (see the previous article), although you most likely won’t use that very often. The same goes for fd property, which allows you to ignore passed path argument, and use provided file descriptor, obtained from fs.open() call.

// ...
const readStream = fs.createReadStream("file.js", {
    encoding: "utf8",
    highWaterMark: 128 * 1024
});

More interesting are the start, end and autoClose properties. Using the first two, you can specify the number of bytes from which you’d like to start and end the reading process. autoClose, on the other hand, is a boolean dictating whether the underlying file descriptor should be closed automatically (hence the name), resulting in the emission of “close” event.

// ...
const readStream = fs.createReadStream("file.js", {
    encoding: "utf8",
    end: 10
});
/* With "utf8" encoding, the "end" number of bytes, 
specifies the number of characters to read */

Of course, after the creation of a stream, the workflow remains mostly the same, as we’ve previously discussed. FS API implementation makes a few additions of its own. This involves events like “close”, “open” and “ready” - the new one - having direct connection with the underlying file descriptor. “open” fires when it’s opened, “close” - when it’s closed, and “ready” - immediately after “open” event when the stream is ready to be used. Additionally, there are some new properties - .path and .bytesRead, specifying the passed path of the read file (can be a string, buffer or URL object), and the number of bytes read by given point in time.

// ...
readStream.on("ready", () => {
    if(readStream.bytesRead === 0) { // meaningless check
        console.log(readStream.path);
    }
});

Keep in mind though, that these new additions shouldn’t affect the basic way of interacting with the stream. They exist only to provide you with more data.

Write streams

FS API write streams share many similarities with the readable ones - just like with its reference implementation. They’re created as instances of fs.WriteStream class, using fs.createWriteStream() method. It accepts almost identical config as one described previously, with the only difference being the lack of the end property, which is pointless in write streams anyway.

// ...
const writeStream = fs.createWriteStream("file.js", {
    encoding: "utf8",
    start: 10 // start writing from 10th byte
});

As for the Writable Stream implementation itself, again, very similar situation. “open”, “close” and “ready” events related to file descriptors, .path property is left untouched, and - the only difference - .bytesWritten property indicating the number of bytes already written.

// ...
writeStream.on("ready", () => {
    if(writeStream.bytesWritten === 0) { // meaningless check
        console.log(writeStream.path);
    }
});

What do you think?

I hope that this article served its purpose well - to explain a fairly complicated topic in a nice, understandable and informal way. Streams are vital to Node.js infrastructure and, thus, it’s very important concept to understand. If you like the article - I’m really happy. Remember to leave your opinion in the comments and with a reaction below! If you want, you can share it, so other people can learn the given topic faster. Also, you can follow me on Twitter, on my Facebook page, or through my weekly newsletter. This will allow you to stay up to date with this Node.js-related series and a lot of other beginner-friendly content from this blog. Again, thank you for reading this one, and I hope you’re having a great day!