<- Previous Back to post index Next ->

How not to adapt sed for JSON

2023-09-15

sed is a really impressive utility. It's simple, lightweight and incredibly versatile, able to solve an enormous variety of text transforming problems, and even some beyond that. If it isn't a tool you're familiar with I highly recommend you have a look into it and add it to your shell arsenal. Here is a very thorough guide but having a look at some sed related questions on StackOverflow first would give you a taste for what it can do. Some time ago, I started thinking about a new and improved tool based on sed but capable of dealing with more complex data than lines of text, in particular, I was thinking about all of the data formats that exist for tree-like data structures (JSON, YAML, TOML, XML, HTML, even syntax trees of programming languages as an extreme).

What about jq?

So far, the closest tool I've found is jq, which is pretty good at operating on JSON in the shell. I have one big and one small problem with jq. The small problem is that it parses the whole input into memory, which is unlikely to matter for the vast majority of instances but it would still be nice if it could be avoided when possible. The bigger problem is that it is way too complicated. jq has variables, user defined functions on top of the dozens of built in ones, and even a module system. This puts in more in the class of an interpreted language than a tool/utility. One of the reasons I tend to reach for sed is that I very rarely need to check the man page as there are only a few commands to remember, which I haven't found from my times using jq. Structured data with dynamic types like JSON is inherently more complex than lines of text, so it may be that the complexity of jq is a necessity to do its job, but I suspect that's not the case, and we won't know until we try!

Let's build something new

Based on this, I set out to design and build the tool I desired. It would be called stred, the streaming tree editor, and would have these objectives:

In this post, I'll focus just on working with JSON, and once we have a design that seems to work with that we can work out how it should be extended to other formats. Keep those other formats in the back of your mind though, we don't want to land on a design that is too dependant on the JSON format.

How to do streaming with JSON?

There isn't any single obvious way to take structured JSON data and turn it into a stream. Whatever we choose will likely look like a depth first search as if you read a JSON file from start to finish then that's roughly what you have. The option I first thought of and what I want to look deeper into in the rest of this article, is splitting the JSON file into 'tokens'.

Tokens?

If you haven't done any parsing before, the word token probably doesn't mean much to you in this context. The idea is that instead of thinking about JSON data as 'values' like an object or a number, we think about it as 'tokens', each of which is relatively simple. For example, numbers are simple enough to be tokens already, along with booleans and nulls. Arrays contain other values, but if we define arrayBegin (open square bracket) and arrayEnd (close square bracket) tokens, then we can still represent arrays with a bunch of tokens bookended by these special ones. Objects can work the same way.

We could split strings up into many tokens, but that feels like it would become unwieldy as these are the equivalent of lines in a text file. Let's keep each string as a single token.

JSON:
[["Jack", "Jill"], 2023]

Tokens (space separated):
[ [ "Jack" "Jill" ] 2023 ]

When we run stred, it starts at the beginning of the file and processes the contents into a stream of tokens. For each token, it runs a list of commands provided by the user and then outputs what is left.

What about object keys?

JSON:
{"key": "value"}

We could treat object keys as tokens, meaning they would be processed in the same way as strings. This has its problems though.

We'll make the decision to leave out objects keys as tokens, but they will be included in another way later on.

Tokens (space separated):
{ "value" }

The easy decisions

With a general design sketched out, we want to pin some concrete commands. The p, d and n commands from sed can be used in stred without any changes, just note that the d command implies that the pattern space must be able to contain nothing so our pattern space should be seen as a list of tokens, and not necessarily exactly one token (although it will usually be a single token, as the sed pattern space is usually one line). The a and i commands are useful in sed, but since it's really easy to get the same effect with a substition using only a couple more characters, I decided to leave them out. It also means that where each command ends becomes predictable without needing to end them with a semicolon which I think makes life a bit easier.

A more challenging command - Substitute: s

s is easily the most powerful command in sed. It's the command that actually modifies the lines that pass through sed, so dwarfs the commands we've looked at so far in depth and complexity (though still beautifully simple). You provide s with a regex to match, a replacement string, and a set of flags that control some aspects of its behaviour. With that, it will find a substring that matches the regex and replace it with the replacement string. Replacement strings can include references to parts of the match allowing for some more advanced transformations, which give s a lot of power for its simple design.

Subex

Regular expressions have some limitations that make them a bit awkward for stred. As you can't loop in the replacement part of the substitution, repeated replacements must use the global flag which can be awkward for nested matches. Imagine a 2 dimensional array, and you want to replace each of the elements of the array at index 1 (i.e. array[1][n] for each n). Ideally s/^([.[.*)./\1 5/g would work, but unfortunately, since we have to match from the start of the whole pattern space to correctly find the location, this will still only match once. This type of situation is rare when working with strings but much more common with structured data like what we're working with. To deal with this, I opted to use subex for the substitutions.

Decomposing tokens

sed iterates through lines in a file, but the substitution command uses regular expressions that break lines down into characters. We've decided that lines are equivalent to tokens, so we need to work out what equivalent for characters should be.

Atomic tokens

startObject, endObject, startArray, endArray, null, true and false, all can't be broken down into anything smaller, so a subex has to deal with them as a whole. Each of these elements that a subex will work with will be the smallest stred can break anything down, so I'm going to call them atoms, so these atomic tokens are the first 7 types of atoms.

Strings

Strings definitely need to be split up, since that's what makes most string operations possible. It means that working with strings in stred can be pretty similar to working with lines in sed. If we said a string was just its characters, then we wouldn't know where they end, so we definitely need a marker for the end of a string. We don't technically need a marker for the start of a string, but to fit with JSON (and basically every other format) I decided to use the same marker for both the start and end of a string, which I'll call a string terminal.

Numbers

I feel like there must be some great way of splitting up numbers so a subex can operate on them with the same operations used for everything else. It would essentially need to be a bijection between the reals (to reasonable precision) and a finite sequence of discrete values. This is already tricky, but even worse it needs to be simple enough that humans can understand and reason about it, and that standard arithmetic operations can be done with subex operations.

I eventually decided to treat numbers as tokens (for now!). To handle them, I added the standard arithmetic operators that behave a lot like the store feature in subex, capturing the output from the subex before them, but instead of storing it for later they perform the arithmetic on those values and then output the result. This looks a lot like postfix arithmetic expressions (to add two values together you'd use ..+) which are totally usable.

Spaces

sed has two spaces.

Two different spaces with two different behaviours and uses, very elegant. It would still be a bit more convenient to have more than two, since it's sometimes useful to store more pieces of information (which isn't impossible in sed, just a little awkward), so I decided to have x, y and z spaces as general purpose spaces instead of the single hold space.

Line Numbers

sed allows you to selectively execute some commands based on the content of the pattern space (with regex), but also you can filter based on the line number of the input. For stred, using line numbers doesn't make sense, but once again there is a natural candidate for their replacement. While each line of a text file corresponds with a line number, each element of a JSON file corresponds with a 'path', the route taken from the root node of the file to that element.

Unfortunately, while in sed we only need to read the line numbers (since the line number of the output is determined by how many lines have already gone), if we want to output a property of an object, it needs to know the path so it can output the appropriate key. What we effectively have then, is a sequence of strings and numbers that we need to be able to read (for conditional execution) and write. Fortunately, that looks a lot like a space, so we introduce the 'path space' which stores the path for the current token. Each time stred loads a token into the pattern space, it also loads the path to that token into the path space, and when it outputs the token, if it is inside an object it uses the last token in the path space as the key.

Branching

sed has 3 commands for branching. : will define a label, and b will jump (branch) to that label. We also then have t (test), which will only jump if the last substitution succeeded. There's not really any reason why stred needs anything different to these, but we could take the opportunity to try something new anyway.

First, if we say that all labels must be a single character, we still don't need semicolons to mark the end of the label. Second, we notice that almost all substitutions fall into one of two categories, either success is guaranteed or we care about whether it succeeded. Based on that, we'll say that a substitution command always skips the next command after it if it fails. If success is guaranteed then it will always execute the next command as we would expect, but if we want to do something special if it fails then we put that in just after the s command. There is the rare occurance where a substitution might fail but we don't want to do anything if it does. In that case, we can use a noop command after the substitution.

Examples

Example 1: Extracting a value from a larger JSON file

stred -n 'S/"path""to"1"value"(.*)/p'

Explanation

Because the content of the pattern space are only printed if they start with our target path, the result is the part of the JSON input at that path.

Example 2: Appending to a list

stred 'S/"array"/s/~"my new element"~`]`/'

This is a little bit clever since we've combined the arrayEnd token match with the insertion.

Example 3: Summing a list of numbers

stred -n 'S/"numbers"/s/`[`/ {xs/~`0`~/x}  S/"numbers"%/{ Xxs/%%+/x } S/"numbers"/s/`]`/ {xp}'

Going back to my opening point, this might seem like a long command considering jq has a filter (add) that just does this, but I only know that because I checked the manual, which took longer than writing this command. stred (and sed) provide a small number of simple tools that still versatile enough to do niche tasks without needing niche features.

Disadvantages

I've realised a few disadvantages with this design, which I'll go through in a pretty subjective order from least problematic to most problematic.

Subex may be unnecessarily complex

Subex has some real advantages, and it feels like the theory behind it is better suited to the task, but on reflection, I'm really not sure. Supporting both subex and regex would be excessive additional complexity (and I think it would be confusing), but using only subex means that any matching also has to be potentially a transform, which restricts how they can be used. For example, you may have noticed that all of the selection is done with substitutions that don't change anything, which is a bit clunky.

Working with tokens can be awkward

The natural way to break down JSON is into values, so working with it in tokens is a bit awkward to wrap your head around. Additionally, brackets are often not paired, which I find makes scripts harder to read.

Character literals need to be distinguished from other kinds of literals

You probably noticed that there were a lot of backticks (`) in my examples earlier. I used them because subex needs some way to tell the difference between, say, the number 0 and the character 0. It's just another awkward bit of syntax that could be easily forgotten, breaking a script.

The N command

sed has a command, N, that reads in the next line and appends it to the pattern space. It's useful and would also be useful in stred (such as for working with a whole data structure at once). The problem is, suppose we have a token in the pattern space at the path "first_key", and we use the N command to load the next token in, what should happen to the path space? We now have two tokens in the pattern space but we can only have one path in the path space. We could try to come up with a system to keep both paths, but not without not without introducing far more complexity than I'm happy with.

It struggles to deal with more than a few tokens at once

I think the best way to explain this is with an example. Imagine I have an array, where each element is a large object, with many other nested arrays and objects inside of it. Each of these objects has a "keep" property, that is a boolean, and I want to generate an array with only the objects that have a "keep" property of true.

As stred passes over the start of a large object, it needs to keep the whole object so far in memory as it doesn't yet know whether to output the early tokens (probably in a general purpose space since the N command isn't available). I want to load the whole object into memory, but there isn't an easy way to know if the object loaded so far is complete. Additionally, once it is all loaded, I want to check the "keep" property, but need to make sure I don't accidentally find a "keep" property that is nested deeper. These specific problems can be mitigated, but that just throws up more problems and eventually you reach a state of too much complexity.

Conclusion

This design has a lot going for it, but frustratingly, it seems like no design based around tokenisation will be able to handle the difficulties of working with a larger JSON structure as a unit. I did implement these ideas into a prototype in go for experimenting, and if you want to have a look you can clone stred-go and checkout the commit e98ebbad387def55d8347adb5bf45034d542cce0 from before I did some big redesigns. My plan is to write about the redesigns too.

If you've found this interesting and want to discuss it with me, you can reach me at: shtanton at -thisdomain-