Substitute Expressions

2022-10-14

Regexes are much better at matching than they are at replacing, and I needed a similar DSL that could expand on regex to enable more powerful substitutions in a larger DSL I'm writing that will make use of it.

Regexes are really good at what they do, particularly matching. They're also great for substitutions by essentially using a regex to destructure a part of an input and then constructing a replacement for that part in the output. This is analogous to the following pseudocode:

// This is analogous to the regex matching expression
// It uses pattern matching to destructure
{
	first_name: x,
	last_name: y,
} = person

// This is analogous to the replacement expression
// It makes use of variables from the pattern
// to construct the output
output = x ++ " " ++ y

Unfortunately, I needed a language that behaved more like a standard expression where the input is destructed within the construction of the output, as this enables them to be very easily nested and combined, analogous to:

output = person.first_name ++ " " person.last_name

These different approaches both have advantages, but the former is split into two distinct sections, while the latter is a single unit. Continuing the analogy, the former has two trees, one for destructuring and the other being the AST of the expression, while the latter only has the expression AST. This is what makes nesting easier.

To cope with this additional requirement, I have been working on an alternative to regex that I call subex (Substitute Expressions). Note that subex is under active development and all of this is subject to change, I'm hoping the core concept will live on though.

The Theory

Regular expressions are famously equivalent to finite automata, which take in a string as an input and result in a boolean output. This means that using regexes for substitutions is a bit tricky as finite automata can't work like that.

Fortunately, there is a very similar concept known as a finite transducer that behaves a lot like a finite automata, but while it processes the input string it also constructs an output string.

Regular expressions are generally run by first compiling them into finite automata, and then running that finite automata on the input string. Running finite automata just consists of feeding them the input string one character at a time and then observing where the automaton ends once the whole input string has been read.

Transducers are very similar but as they read in characters they also output them. I decided to design subexes around this so the input is read left to right and the output is constructed as the input is being read.

Simple Subexes

A great starting point here is the following subex

The dot reads in any character and writes out that same character unchanged. This keeps it very close to its use in standard regex syntax. This is incredibly useful as, unlike most regex implementations, subexes always start at the beginning of the input and finish at the end meaning they either match exactly once or not at all. The dot character enables essentially leaving parts of the input unchanged writing them to the output exactly the same.

Characters, digits and most symbols will do the same as the dot but only if the input character matches, for example

will read in an 'a' and then output an 'a'. What happens if reading an 'a' is impossible? I haven't really decided yet, but I think the most sensible option is to output the input unchanged and have some way of detecting the lack of a match in the host language.

An asterisk works in the same way as in regex and repeats the preceeding expression 0 or more times. As with regex it is greedy so if there are multiple functioning options for how many times to repeat its input, it will do the most.

.*

This will read in and output the whole string unchanged.

Subexes can of course be concatenated simply by writing one and then the other. If this results in several possible ways of matching then the left expression has priority on which one is chosen, for example:

.*a.*

Will copy over as many characters as possible, then an 'a', and then as many characters as possible again. The two '.*' expressions are both greedy but the left one has the greater priority so 'a' will be matching the last 'a' in the input string.

This becomes particularly interesting when we introduce the minus.

.-a.*

This works in exactly the same way as the asterisk but will minimise the number of repetitions, meaning the 'a' in the above subex matches the first 'a' in the input instead of the last one.

Actually making substitutions

It is in fact possible to write to output without reading anything from input! This means we can add characters to a string, we simply wrap whatever we want to output in double quotes.

.-"("hello world")".*

This subex will wrap the first occurance of 'hello world' in brackets. It works by first copying across as few characters as possible, then outputting '(', then copying across specifically the string 'hello world', then outputting ')' and then the remaining input is copied.

It is also possible to read input without writing it to output. This is done using 'slots'. Using a dollar sign, followed by a character, followed by a regex will match the regex and store the matched input into a slot labelled with the character.

$_..*

This subex reads the first character (as it matches the regex for the store) and instead of outputting it, it stores it into slot '_'. It then copies across all of the remaining characters. This means the overall effect of the subex is to remove the first character from the string.

Slots can be read from using the dollar sign in an output expression.

$f..*"$f"

This subex reads the first character into slot 'f', then copies the rest across and then outputs from slot 'f', thus it moves the first character to the end of the string.

Other expressions

It is also possible to group expressions with brackets as with regex, union two expressions with a pipe symbol and much more.

(.-($_(bad)"good")|($_(good)"bad"))*.*

This final subex will replace all occurances of 'good' with 'bad' and vice versa.

Try it out

I've written a prototype in Go that seems to work well. I've given no thought to performance yet but since this is for a tiny scripting language it's not a huge concern, though I will work on ways of speeding it up as there is a lot of string copying going on at the moment.

prototype in Go

Always up for a discussion about programming language design at shtanton at shtanton.xyz