Parsing individual data items from huge JSON streams in Node.js
Let’s say you have a huge amount of JSON data
and you want to parse values from it in Node.js.
Perhaps it’s stored in a file on disk
or, more trickily,
it’s on a remote machine
and you don’t want to download the entire thing
just to get some data from it.
And even if it is on the local file system,
the thing is so huge that reading it into memory
and calling JSON.parse
will crash the process
with an out-of-memory exception.
Today I implemented a new method
for my async JSON-parsing lib, BFJ,
which has exactly this type of scenario in mind.
BFJ already had a bunch of methods
for parsing and serialising
large amounts of JSON en masse,
so I won’t go into those here.
The readme
is a good place to start
if you want to learn more.
Instead,
this post is going to focus on
the new method, match,
which is concerned with picking individual records
from a larger set.
match takes 3 arguments:
1. A readable stream containing the JSON input.
2. A selector argument, used to identify matches from the stream. This can be a string, a regular expression or a predicate function. Strings and regular expressions are used to match against property keys. Predicate functions are called for each item in the data and passed two arguments, key and value. Whenever the predicate returns true, that value will be pushed to the stream.
3. An optional options object.
It returns a readable, object-mode stream that will receive the matched items.
Enough chit-chat, let’s see some example code!
const bfj = require('bfj');
const fs = require('fs');
// Stream user objects from a file on disk
bfj.match(fs.createReadStream(path), 'user')
.pipe(createUserStream());
// Stream all the odd-numbered items from an array
bfj.match(fs.createReadStream(path), /[13579]$/)
.pipe(createOddIndexStream());
// Stream everything that looks like an email address from some remote resource
const request = require('request');
bfj.match(request(url), (key, value) => emailAddressRegex.test(value))
.pipe(createEmailAddressStream());
Those examples do not try to load all of the data into memory in one hit. Instead they parse the data sequentially, pushing a value to the returned stream whenever they find a match. The parse also happens asynchronously, yielding at regular intervals so as not to monopolise the event loop.
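The createUserStream, createOddIndexStream and createEmailAddressStream calls in those examples are just placeholders for whatever object-mode destination you want to pipe matches into. As a rough sketch of what one might look like, here's a hypothetical createUserStream built on a writable, object-mode stream (it's not part of BFJ, just an illustration):
const { Writable } = require('stream');
// Hypothetical factory: returns a writable, object-mode stream
// that receives each matched user object from bfj.match.
function createUserStream () {
  return new Writable({
    objectMode: true,
    write (user, encoding, callback) {
      // Do something useful with each parsed user object here
      console.log(user);
      callback();
    }
  });
}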
The approach can be used to parse items
from multiple JSON objects in a single source, too,
by setting the ndjson option to true.
For example,
say you have a log file
containing structured JSON data
logged by Bunyan
or Winston.
Specifying ndjson will cause BFJ
to treat newline characters as delimiters,
allowing you to pull out interesting values
from each line in the log:
// Stream uids from a logfile
bfj.match(fs.createReadStream(logpath), 'uid', { ndjson: true })
.pipe(createUidStream());
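For illustration, the logfile in that example might contain one JSON object per line, something like this (hypothetical records, not output from any particular logger):
{"name":"app","level":30,"uid":"abc123","msg":"user logged in"}
{"name":"app","level":30,"uid":"def456","msg":"user logged out"}
Each line is parsed as a separate JSON value and the value of every uid property is pushed to the returned stream.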
If you need to handle errors from the stream, you can do that by attaching event handlers:
const outstream = bfj.match(instream, selector);
outstream.on('data', value => {
// A matching value was found
});
outstream.on('dataError', error => {
// A syntax error was found in the JSON data
});
outstream.on('error', error => {
// An operational error occurred
});
outstream.on('end', () => {
// The end of the stream was reached
});
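If you prefer promises to wiring up event handlers directly, it's easy enough to wrap this pattern yourself. Here's a minimal sketch that collects every match into an array before resolving; the collectMatches helper is hypothetical, not part of BFJ's API, and buffering all the matches in memory only makes sense when the matched subset is small:
// Hypothetical helper: resolves with all matched values, rejects on any error
function collectMatches (instream, selector, options) {
  return new Promise((resolve, reject) => {
    const values = [];
    const outstream = bfj.match(instream, selector, options);
    outstream.on('data', value => values.push(value));
    outstream.on('dataError', reject);
    outstream.on('error', reject);
    outstream.on('end', () => resolve(values));
  });
}
// e.g. collectMatches(fs.createReadStream(path), 'user').then(users => console.log(users.length));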
There’s lots more information in the readme so, if any of this sounds interesting, I encourage you to take a look!