Streaming data in node.js

About streams in node.js.

Based on what I understood from Jawahar's talk at the January 2020 GeekNight held at ThoughtWorks, Chennai, which I attended.

Errors, if any, are my own. Corrections appreciated.


Consider two files medium.txt and large.txt.

medium.txt is a reasonably large file and large.txt is a huge file, relative to the computer's hardware.

On a 4GB RAM computer, my medium.txt and large.txt span around 200MB and 2GB respectively.

Let's write a simple Node.js application in a file nostream.js to compress a given file.

const fs = require('fs');
const zlib = require('zlib');

const filePath = process.argv[2];

const data = fs.readFileSync(filePath);
zlib.gzip(data, (err, compressedData) => {
    if (err) throw err;
    fs.writeFileSync('out.gz', compressedData);
    console.log("Compression success!");
});

The fs module is used for reading and writing files, and zlib's gzip() does the compression.

To keep things simple, it is assumed that the input file name is given as the first argument when the application is invoked. Since the node executable and the script path occupy the first two slots of process.argv, the file name ends up as the third element. Hence the argv[2].
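
For reference, here is roughly what process.argv holds for such an invocation (the paths shown are made up; they depend on the machine):

// $ node nostream.js medium.txt
console.log(process.argv);
// [
//   '/usr/bin/node',          // argv[0]: the node executable (path is machine-specific)
//   '/home/user/nostream.js', // argv[1]: the script being run
//   'medium.txt'              // argv[2]: the first user-supplied argument
// ]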

When we try to compress the 200MB medium.txt file, everything works fine.

$ node nostream.js medium.txt
Compression success!
$ du -h out.gz medium.txt
4.0K    out.gz
219M    medium.txt

(I used a trivial text file with repeating content as input. Hence the small size of the compressed file.)

However, trying the same with the 2GB large.txt is a different story; an error that looks something like this shows up:

$ node nostream.js large.txt
fs.js:317
      throw new ERR_FS_FILE_TOO_LARGE(size);
      ^

RangeError [ERR_FS_FILE_TOO_LARGE]: File size (2162601906) is greater than possible Buffer: 2147483647 bytes
    at tryCreateBuffer (fs.js:317:13)
    at Object.readFileSync (fs.js:353:14)
    at Object.<anonymous> (/home/famubu/nostream.js:6:17)
    at Module._compile (internal/modules/cjs/loader.js:959:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:995:10)
    at Module.load (internal/modules/cjs/loader.js:815:32)
    at Function.Module._load (internal/modules/cjs/loader.js:727:14)
    at Function.Module.runMain (internal/modules/cjs/loader.js:1047:10)
    at internal/main/run_main_module.js:17:11 {
  code: 'ERR_FS_FILE_TOO_LARGE'
}

The error shows up because readFileSync() can't handle so big a file.

The reason: the application attempts to load the entire file into memory as a single Buffer before the compression even begins, and a Buffer can't be larger than the 2147483647 bytes mentioned in the error.
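
To see the limit being hit without running the compression at all, we can compare the file size against the largest Buffer Node allows (a sketch; the exact maximum can differ across Node versions and platforms):

const fs = require('fs');
const { constants } = require('buffer');

const filePath = process.argv[2];
const fileSize = fs.statSync(filePath).size;

// readFileSync() has to fit the whole file into one Buffer,
// so anything bigger than MAX_LENGTH cannot work.
console.log(`file size:       ${fileSize}`);
console.log(`max Buffer size: ${constants.MAX_LENGTH}`);
console.log(fileSize > constants.MAX_LENGTH
    ? 'too large for readFileSync'
    : 'fits in one Buffer');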

It need not be so.

This is where streams come in handy.

Streams

A stream in Node.js is an interface for working with data that keeps on coming. This means that all of the data needn't be there before we can start processing it.

That's the very nature of streaming data in general: it may even be an infinite stream (for example, temperature readings of CPU cores of a supercomputer). So we can't necessarily wait for all the data to be available.

There are four types of streams in Node.js: Readable, Writable, Duplex, and Transform.
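
All four live in the built-in stream module. As a small sketch of the "data keeps on coming" idea, here is a toy Readable that produces made-up sensor readings on demand; it has no end and runs until interrupted:

const { Readable } = require('stream');

// A toy Readable that keeps producing made-up "sensor readings" on demand.
// There is no end to this data; a consumer processes it chunk by chunk.
const readings = new Readable({
    read() {
        const value = (40 + Math.random() * 10).toFixed(2);
        this.push(`${value}\n`);
    }
});

// Flowing mode: chunks are handed to us as they become available.
readings.on('data', chunk => process.stdout.write(chunk));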

The main objective of streams is to "limit the buffering of data to acceptable levels such that sources and destinations of differing speeds will not overwhelm the available memory".

Buffering in streams

An internal buffer is associated with every Readable and Writable stream.

The size of this internal buffer depends on the highWaterMark option of the stream's constructor.

For Readable streams, data is read into the internal buffer until the limit specified by highWaterMark is reached, after which reading from the underlying source is paused. Reading resumes once enough data in the buffer has been consumed to make room for more.
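
For instance, fs.createReadStream() accepts a highWaterMark option (a sketch; the 64 KiB value here is an arbitrary choice, not a recommendation):

const fs = require('fs');

// Ask for an internal buffer of at most 64 KiB per read.
const input = fs.createReadStream(process.argv[2], { highWaterMark: 64 * 1024 });

input.on('data', chunk => {
    // Each chunk is at most highWaterMark bytes long.
    console.log(`got ${chunk.length} bytes`);
});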

Likewise for Writable streams, the data to be written out is first stored inside the internal buffer from where it is consumed later.
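
This is where backpressure shows up in code: write() returns false once the internal buffer has crossed highWaterMark, and the 'drain' event signals that it is safe to write again. A minimal sketch with a deliberately tiny, slow Writable (the sizes and delay are made up for illustration):

const { Writable } = require('stream');

// A toy Writable with a tiny internal buffer, to make backpressure easy to see.
const slowSink = new Writable({
    highWaterMark: 8,              // tiny buffer: 8 bytes
    write(chunk, encoding, callback) {
        // Pretend the destination is slow.
        setTimeout(callback, 100);
    }
});

const ok = slowSink.write('more than eight bytes');
console.log(ok);                   // false: the internal buffer is over highWaterMark
slowSink.once('drain', () => console.log('drained, safe to write again'));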

Since Duplex and Transform streams can perform both reading and writing, each of them has two separate internal buffers: one for reading and one for writing.

Example with streams

Let us try compressing the large.txt file again. This time using streams.

const fs = require('fs');
const zlib = require('zlib');

const filePath = process.argv[2];

// A readable stream for the input file, a writable stream for the output file,
// and a transform stream that gzips whatever passes through it.
const inputStream = fs.createReadStream(filePath);
const outputStream = fs.createWriteStream("out_stream.gz");
const compressTransformStream = zlib.createGzip();

// Read -> compress -> write, chunk by chunk.
inputStream
    .pipe(compressTransformStream)
    .pipe(outputStream);

This time, it works.

Let us modify this a bit and add a message using an event handler for the close event of the writable stream. This event is emitted when the file being written to is closed.

inputStream
    .pipe(compressTransformStream)
    .pipe(outputStream)
    .on('close', () => {
        console.log("Compression with streams success!"); 
    });

Running the new app,

$ node stream.js large.txt
Compression with streams success!
$ du -h out_stream.gz large.txt
2.1G    large.txt
908K    out_stream.gz

The compression works when streams are used because the data buffer need not be as big as the file itself: each chunk of 'produced' data is consumed before the entire file has been read, so only a small buffer is needed at any point. The app can start writing to the output file before the entire input file has been compressed. This way, the app is far easier on the computer's memory.
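
One thing worth noting: pipe() does not forward errors down the chain, so each stream would need its own 'error' handler. stream.pipeline() (available since Node 10) wires up error handling and cleanup for us; a sketch of the same app using it, keeping the file names from above:

const fs = require('fs');
const zlib = require('zlib');
const { pipeline } = require('stream');

const filePath = process.argv[2];

pipeline(
    fs.createReadStream(filePath),
    zlib.createGzip(),
    fs.createWriteStream('out_stream.gz'),
    err => {
        if (err) {
            console.error('Compression failed:', err);
        } else {
            console.log('Compression with streams success!');
        }
    }
);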


(Almost all of this article was originally written in 2020 before having access to the talk's video (not long before corona forced us all inside 😅). Got around to dusting it off only now.)