commit 2f787b18f5666c44236716a78095372ea1925aa5
parent b6402e6ed03371b29aefe0685774c66c9fd2b801
Author: Charlie Stanton <charlie@shtanton.com>
Date: Thu, 9 Sep 2021 15:33:42 +0100
Finish the second draft of the piping post
Diffstat:
1 file changed, 66 insertions(+), 76 deletions(-)
diff --git a/posts/better_than_stdio.gmi b/posts/better_than_stdio.gmi
@@ -1,93 +1,74 @@
-I was writing a post where I would design a language that would be better than the shell for running a bunch of processes in a pipeline with data flowing between them. It turns out that you can do a lot more with the a POSIX shell than I had realised. This post does assume a familiarity with the linux shell and C, but I will try to explain everything I do.
+A key feature of the POSIX shell is the pipe, allowing the output of one process to be "piped" into the input of another. This is extremely powerful, but sadly has several flaws that led me to start writing my own alternative shell-alike. Rather typically, while researching for this project I found that almost every one of my "gripes with pipes" was totally unfounded, so I have documented my discoveries here in the hope that I can save others the hours I wasted trying to best the POSIX shell.
-My gripes with pipes in the shell were as follows:
-- Only for passing string data delimited by newlines.
-- Each process only has 1 input and output stream (stdin and stdout).
-- Data streams can't be combined or duplicated.
-- Can't do cyclical data passing.
-- Only supports message passing style i.e. no shared memory or semaphores.
+This post will involve a fair amount of POSIX shell scripting and C programming, but I've explained all of it as best I can, so you should be fine with a surface-level understanding of both.
-I will now dissect and destroy my own arguments against the shell, hopefully saving you the effort of designing a DSL and then scrapping it like I did.
+### Charlie's 4 gripes with pipes
+- Pipes can only pass string data delimited by newlines
+- Each process only has 1 input and 2 output streams (stdin, stdout and stderr)
+- Pipes can't do cyclical data passing
+- All IPC must be byte streams
-## Passing any data I want
+## Gripe 1: I can only pass newline delimited strings
-I was used to writing C that looked like this (only with way better style), as this is the way stdio is used most commonly in my experience:
+Every stdio-based unix utility I've ever used both takes as input and gives as output a series of lines. I think this is because if nothing is being piped into or out of a process, then it uses the tty, which takes lines of input from the user and displays lines of output. In reality, since stdin and stdout are both byte streams, any data that can be represented as bytes (all of it) can be passed through a pipe. In the code below, the 0 and 1 passed to read and write are file descriptors referring to stdin and stdout respectively. A file descriptor is just a number that a process uses to refer to one of its open files.
-main.c: writing "hello world" to file descriptor 1 (stdout)
-```
-#include <unistd.h>
-
-int main() {
- write(1, "hello world", 11);
- return 0;
-}
-```
-
-receiver.c: reading and printing from file descriptor 0 (stdin)
-```
-#include <unistd.h>
-#include <stdio.h>
-
-int main() {
- char buf[80];
- read(0, &buf, 80);
- printf("%s\n", buf);
- return 0;
-}
-```
-
-These can be compiled into executables and run like so:
-```
-./main | ./receiver
-```
-which will output "hello world" to the shell. This is because the data ("hello world") in main is "piped" into receiver which reads it and then displays it, the pipe takes stdout from main and turns it into stdin for receiver. However, there really is nothing about this that stops us sending other data types over the pipe:
-
-main.c
+main.c: Define a person "charlie" and write the data to stdout (file descriptor 1)
```
#include <unistd.h>
#include <string.h>
-
#include "types.h"
int main() {
struct Person charlie;
strcpy(charlie.first_name, "Charlie");
strcpy(charlie.last_name, "Stanton");
+ charlie.age = 21;
write(1, &charlie, sizeof(struct Person));
return 0;
}
```
-receiver.c
+receiver.c: Read a person from stdin (file descriptor 0) and display their full name and age
```
#include <unistd.h>
#include <stdio.h>
-
#include "types.h"
int main() {
- struct Person charlie;
- read(0, &charlie, sizeof(struct Person));
- printf("%s %s\n", charlie.first_name, charlie.last_name);
+ struct Person person;
+ read(0, &person, sizeof(struct Person));
+ printf("%s %s %d\n", person.first_name, person.last_name, person.age);
return 0;
}
```
-types.h
+types.h: A person has a first name, last name and age
```
struct Person {
char first_name[30];
char last_name[30];
+ signed char age;
};
```
-Which can be compiled and run to output "Charlie Stanton" even though the data passed through the pipe wasn't a string! Just because basically every Unix utility uses string and splits the data by line when reading it, it doesn't mean we have to. There's no reason why we can't use the pipe for a live video/audio stream, cap'n proto, data coming from some input device or anything else we want. Strings separated by newlines is quite a strong convention, but I think it's safe to completely ignore when convenient.
+Then in my shell I can compile these and run:
+```
+$ ./main | ./receiver
+Charlie Stanton 21
+```
-## Multiple input and output streams to/from a process
+This is an example of a C struct containing 2 strings and a byte integer being passed between processes using a pipe, and there's no reason we couldn't go further. There are already utilities that communicate with JSON over stdio, but we could send video/audio streams, cap'n proto data or anything else we want.
-What if in my pipeline I want to do further processing on both the first names and the last names of people, but separately? Well, file descriptors 0, 1 and 2 are set aside for stdin, stdout and stderr respectively, but it turns out we can use other file descriptors and still access them from the shell.
+## Gripe 2: Each process has a set 1 input and 2 output streams
-main.c
+Every process has stdin (fd 0), stdout (fd 1) and stderr (fd 2). So those are your 3 options for IO, right? Well, other file descriptors are available, but we need some new shell syntax to use them effectively:
+```
+command 5> file
+```
+
+This opens file and gives it file descriptor 5, so when command writes to file descriptor 5, the data ends up in file. POSIX only requires shells to support file descriptors 0 to 9 in redirections, so this gives us an additional 7 inputs/outputs (3 to 9) that our processes can make use of.
+
+main.c: We have two people, the first names get written to stdout, while the last names get written to file descriptor 3
```
#include <unistd.h>
#include <string.h>
@@ -99,9 +80,11 @@ int main() {
strcpy(people[0].first_name, "Charlie");
strcpy(people[0].last_name, "Stanton");
+ people[0].age = 21;
strcpy(people[1].first_name, "John");
strcpy(people[1].last_name, "Smith");
+ people[1].age = 40;
for (int i = 0; i < 2; i++) {
write(1, people[i].first_name, strlen(people[i].first_name));
@@ -113,59 +96,66 @@ int main() {
}
```
-Here we are writing the first names to stdout (file descriptor 1) and the last names to file descriptor 3. These are two different output streams and so we can send each one to a different file:
```
-./main > first_names 3> last_names
+$ ./main > first_names 3> last_names
+$ tee < first_names
+Charlie
+John
+$ tee < last_names
+Stanton
+Smith
```
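The shell only opens file descriptor 3 when asked to with 3>, so a program that writes to it unconditionally will usually get an EBADF error when run without that redirection. As a rough sketch (not part of the original post), a program could check first and fall back to another descriptor. The names fd_is_open and write_or_fallback are made up for illustration; fcntl with F_GETFD is the standard POSIX call doing the real work.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd is an open file descriptor in this process, 0 otherwise.
 * fcntl(fd, F_GETFD) fails with EBADF when fd is not open. */
int fd_is_open(int fd) {
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Hypothetical helper: write to fd if the shell opened it for us,
 * otherwise write to fallback_fd (for example stderr, fd 2). */
ssize_t write_or_fallback(int fd, int fallback_fd, const void *buf, size_t len) {
    assert(fallback_fd >= 0);
    return write(fd_is_open(fd) ? fd : fallback_fd, buf, len);
}
```

With this, write_or_fallback(3, 2, name, strlen(name)) would send the last names to file descriptor 3 when it was redirected and to stderr otherwise.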
-We use 3> to indicate that we want to read from file descriptor 3 and write into last_names, we can do this with input too so the restriction on the number of inputs and outputs of each process is no longer a problem. This works perfectly for reading and writing files, but what if our input is coming from 2 other commands? We can use a pipe for one of them but what about the other? Fortunately, we have fifos.
-A FIFO (first in first out) is a file which doesn't behave at all like a regular file. You can read and write them, but they behave more like a queue. When you write something, it joins the end of the queue and when you read, you take something off the front of the queue. These are incredibly versatile for piping our data around, they are also known as named pipes because they are really the same, but are given a name in the filesystem so more than 2 processes can interact with them.
-To use fifos to solve our problem, we can type in the shell:
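+We can also use the extra file descriptors as inputs with 3< and the like. So far both outputs have gone to files, but with the help of redirections we can pipe one of them into a further command instead. Below, 3>&1 points file descriptor 3 at the pipe, and then > first_names sends stdout to a file: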
```
-mkfifo /tmp/first_names && mkfifo /tmp/last_names
-echo Charlie > /tmp/first_names & echo Stanton > /tmp/first_names & ./main < /tmp/first_names 3< /tmp/last_names
+$ ./main 3>&1 > first_names | sort
+Smith
+Stanton
+$ tee < first_names
+Charlie
+John
```
+But how can both outputs become inputs for other commands?
-This gives us the flexibility we had with files but with streams of data being passed between processes.
-
-## Combining and duplicating data streams
-
-Combining data streams is somewhat complicated, it is easy to merge 2 byte streams, but there is no guarantee that a message from each won't get mangled together with interleaved bytes. I think a custom executable is needed to properly combine 2 data streams which can properly take into account the type of data being transfered. Duplicating a stream is simpler, and using a fifo and the tee utility, very straightforward.
+### First In First Out (fifo)
+A fifo does what it says on the tin. You write bytes to it, and the bytes are read in the order they were written. Fifos are also known as named pipes, which hints at what we can use them for:
```
-mkfifo /tmp/fifo
-echo hello | tee /tmp/fifo | cat & cat /tmp/fifo
+$ mkfifo /tmp/first_names
+$ mkfifo /tmp/last_names
+$ ./main > /tmp/first_names 3> /tmp/last_names & sort < /tmp/first_names > first_names & sort < /tmp/last_names > last_names
```
-Here we use tee, which passes its stdin straight through to stdout, but also copies it to any files given as arguments. This allows us to duplicate the stream and provide it as an input to two different calls to cat.
+These commands create two fifos in the /tmp directory, use one for each output of main, then make each the input of its own instance of sort. This achieves our goal of a process with 2 outputs, each going into a different process. Unfortunately, it's quite a lot of typing to go down this route, and while some shells have made attempts to add syntax to alleviate this, POSIX doesn't define a better method.
-## Cyclical data passing
-
-With the help of a fifo we can also create cycles in our data dependency. We could recreate the yes utility like so:
+## Gripe 3: Cyclic data dependency is impossible
+Now that we have fifos in our toolbelt, cyclic data dependency is easy:
```
mkfifo /tmp/fifo
-echo hello > /tmp/fifo & tee /tmp/fifo < /tmp/fifo
+echo yes fifo > /tmp/fifo & tee /tmp/fifo < /tmp/fifo
```
-tee is both reading and writing /tmp/fifo but also outputting every time it does so, this creates a very simple cyclical data loop.
+This makes a fun version of the yes utility (which loops forever, repeatedly printing the same string). The tee command takes its input from /tmp/fifo but also outputs to /tmp/fifo as well as to the tty, giving us the cycle we wanted.
-## Byte streams are the only form of data passing
+## Gripe 4: Byte streams are the only form of data passing
POSIX and System V (two sets of standards that many operating systems adhere to) both define 3 types of IPC
- Message queues
- Shared memory
- Semaphores
-but in the shell we only have access to byte streams, is this a limitation?
+but in the shell we only have access to byte streams. How come? Well...
Message queues and byte streams are very similar, in fact a byte stream could easily act as a message queue. If the messages are all of the same size then they can very easily be passed and received without any encoding at all and if the length varies then prepending each with an 8 byte length which is read to tell the receiver how big the rest of the message is could be used.
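The variable-length case could be sketched like this. It is a minimal illustration, not a definitive implementation: send_message and receive_message are hypothetical names, the length is sent in the machine's native byte order (fine for a pipe or fifo between processes on the same machine), and interrupted system calls are simply treated as errors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Write exactly len bytes to fd, retrying after short writes.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read exactly len bytes from fd, retrying after short reads.
 * Returns 0 on success, -1 on error or if the stream ends early. */
static int read_all(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: send one message as an 8 byte length, then the payload. */
int send_message(int fd, const void *msg, uint64_t len) {
    assert(msg != NULL || len == 0);
    if (write_all(fd, &len, sizeof len) < 0) return -1;
    return write_all(fd, msg, (size_t)len);
}

/* Hypothetical helper: receive one message into buf, which holds up to cap bytes.
 * Returns the message length, or -1 on error, end of stream, or if it won't fit. */
int64_t receive_message(int fd, void *buf, uint64_t cap) {
    uint64_t len;
    if (read_all(fd, &len, sizeof len) < 0) return -1;
    if (len > cap) return -1;
    if (read_all(fd, buf, (size_t)len) < 0) return -1;
    return (int64_t)len;
}
```

A sender would call send_message(1, data, size) and a receiver further down the pipeline would call receive_message(0, buf, sizeof buf) in a loop until it returns -1, so message boundaries survive even though the pipe itself only carries bytes.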
Shared memory is more performant and more natural than message passing for some tasks, but in my experience there aren't that many of these and they can often be very easily translated into a message passing style. For example, if a small amount of data is being shared, then the data could be resent every time it changes, if it is a larger amount then a message could indicate a part of the data and what to set it to. Shared memory opens up a lot of potential for data races without careful consideration so I don't think it belongs on the shell.
-Semaphores can be very easily simulated by sending a single byte one way for up and the other way for down.
+Semaphores can be very easily simulated by sending a single byte one way for up and the other way for down. This is less performant but that's not really a problem on the shell.
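One way this might look in C is sketched below. It is a slight variation on the description above rather than a literal rendering of it: a single pipe is used, and the number of unread bytes in it acts as the semaphore's count. The struct pipe_sem type and its functions are hypothetical names; on the shell the same role would be played by a fifo.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical pipe-backed counting semaphore.
 * The number of unread bytes sitting in the pipe is the semaphore's value. */
struct pipe_sem {
    int read_fd;
    int write_fd;
};

/* "Up": add one token by writing a single byte. Returns 0 on success, -1 on error. */
int pipe_sem_up(struct pipe_sem *s) {
    char token = 1;
    return write(s->write_fd, &token, 1) == 1 ? 0 : -1;
}

/* "Down": take one token by reading a single byte.
 * read blocks while the pipe is empty, which is the waiting a semaphore needs. */
int pipe_sem_down(struct pipe_sem *s) {
    char token;
    return read(s->read_fd, &token, 1) == 1 ? 0 : -1;
}

/* Create the pipe and preload it with `initial` tokens.
 * Returns 0 on success, -1 on error. */
int pipe_sem_init(struct pipe_sem *s, unsigned initial) {
    int fds[2];
    assert(s != NULL);
    if (pipe(fds) < 0) return -1;
    s->read_fd = fds[0];
    s->write_fd = fds[1];
    for (unsigned i = 0; i < initial; i++) {
        if (pipe_sem_up(s) < 0) return -1;
    }
    return 0;
}
```

The count is limited by the pipe's buffer size, and every operation costs a system call, which is the performance trade-off mentioned above.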
Byte streams aren't always the most natural or the fastest solution to a problem, but they work very well most of the time and they are intuitive, easy to reason with, versatile and robust. I don't think the shell needs anything else.
-## Problems
+## Conclusion
+
+Make fifos less verbose and I would consider the shell perfect for data passing!
-I do find it annoying that the syntax for using fifos is so verbose, which is my one criticism of this system.
+Feel free to email charlie@shtanton.xyz with any feedback or if you just want to start an email conversation with someone very opinionated about programming and operating systems.