Part III - Improving Your Workflow
Process Combination
Unix-like systems provide a number of facilities for composing processes. As we'll see, this makes it possible to break down complex problems into distinct pieces and then solve each part with generic utilities.
vm$ cat process-wiring.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| standard out O--- --> O arguments |
| | \ / | |
| standard error O----+-----+---> O environment variables |
| | / \ | |
| exit status O--- --> O standard input |
| | | |
+----------------+ +-----------------------+
vm$
We're going to learn how we can "wire together" processes, drawing on our understanding of input and output from Chapter 11 - Process Boundaries.
Capturing Output with Command Substitution
vm$ cat process-wiring-cmd-sub.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| standard out O--- --> O arguments |
| | \ / | |
| standard error O----+-----+---> O environment variables |
| | | |
| O O |
| | command | |
+----------------+ substitution +-----------------------+
vm$
Command substitution is a technique that allows us to capture the data a process writes to standard output.
vm$ user=`whoami`
vm$ echo $user
vagrant
vm$ user=`sudo whoami`
vm$ echo $user
root
vm$
By using the backtick character, we can place commands in locations where, up to now, we have only placed values. Before executing the "outer" command, the shell will intervene by creating a new process for the "nested" command. It will gather up all the data the "nested" process writes to the standard output stream, and when that process exits, it will insert that data in place of the nested command.
vm$ find /media/shared -user `whoami`
/media/shared/music/Cake - You Part the Waters.mp3
/media/shared/movies/The Royal Tenenbaums.mkv
vm$ find /media/shared -user `sudo whoami`
/media/shared/documents/boring-admin-protocol.txt
/media/shared/documents/usage-statistics.txt
vm$
We're not limited to variable assignment, either. We can use this syntax to feed the output of one command into the options of another.
vm$ wc `find /media/documents -user `sudo whoami``
find: missing argument to `-user'
wc: sudo: No such file or directory
wc: whoami: No such file or directory
vm$
On rare occasions, you may need to use a command substitution within another command substitution. In these cases, the backtick character can't help us.
The shell interprets the command in this example by reading from left to right:
- the command wc
- the command find /media/documents -user (which produces the first error)
- the options sudo and whoami (which are passed to wc, producing the second and third errors)
- an empty command
vm$ wc $(find /media/documents -user $(sudo whoami))
90 510 2816 /media/shared/documents/boring-admin-protocol.txt
29 174 839 /media/shared/documents/usage-statistics.txt
119 684 3655 total
vm$
In cases like these, we'll want to use an alternate syntax for command substitution: the sequence $( to "open" the nested command, and the close-parenthesis character ) to "close" it.
Nesting works as intended because the difference in the "open" and "close" sequences avoids the ambiguity of backticks.
Generally, it's a good idea to stick to this slightly more verbose syntax, even when you aren't nesting. This makes it easier to re-factor and re-use commands later. You'll find examples of both approaches on the web, so it's important to be aware of each of them.
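For instance, the two commands below should behave identically; only the substitution syntax differs. This is a small sketch that reuses the whoami command from the earlier examples:
vm$ echo "Hello, `whoami`"
Hello, vagrant
vm$ echo "Hello, $(whoami)"
Hello, vagrant
vm$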
Forwarding Exit Status Codes
vm$ cat process-wiring-logical.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| O --> O arguments |
| | / | |
| O +-----+---> O environment variables |
| | / | |
| exit status O--- O |
| | logical cntrl | |
+----------------+ operators +-----------------------+
vm$
We actually already discussed the most direct means of capturing process exit status--the $? variable.
Admittedly, the contents of the standard output and standard error streams are usually more helpful than exit codes for human users. In Chapter 13 - Scripting, we'll see how this value is essential to controlling execution flow while automating tasks.
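As a preview, here is a minimal sketch of reading $? and of the logical control operators (&& and ||) pictured in the diagram above, which run a second command only when the first succeeds or fails. The directory name here is hypothetical, and the exact error text may differ on your system:
vm$ mkdir /root/backups
mkdir: cannot create directory '/root/backups': Permission denied
vm$ echo $?
1
vm$ whoami && echo "the previous command succeeded"
vagrant
the previous command succeeded
vm$ mkdir /root/backups || echo "the previous command failed"
mkdir: cannot create directory '/root/backups': Permission denied
the previous command failed
vm$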
Connecting Streams with Pipes
vm$ cat process-wiring-pipes.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| standard out O--- O |
| | \ | |
| standard error O----+-----+ O |
| | \ | |
| O --> O standard input |
| | | |
+----------------+ pipes +-----------------------+
vm$
In Unix-like systems, the word "pipe" describes a connection between the output of one process and the input of another. This extends the analogy of data "streams"--just as in the physical sense of the terms, a Unix pipe can direct the flow of a stream between two places.
vm$ grep papayawhip src/style.css | wc -l
2
vm$
The standard output of one process can be "piped" into the standard input of another by separating two commands with a vertical bar character (|). The entire string of commands is known as a "pipeline."
This example uses grep to locate all the lines in a file that contain the text papayawhip, forwarding the result to the wc program, which (thanks to the -l option) counts the number of lines.
vm$ cat src/style.css | grep papayawhip | wc -l
2
vm$
Pipelines may contain any number of separate commands.
We can extend the previous pipeline by first reading the file using cat. Although grep happens to have the ability to read files directly, it can also operate on the standard input stream. This usage makes grep's role as a "filter" more clear.
vm$ cat pipeline.txt
+------------------------+ +---------------------+ +-------+
| cat src/style.css | +-> | grep papayawhip | +-> | wc -l |
+------------------------+ | +---------------------+ p +-------+
| p | i |
v i v p v
body { p color: papayawhip; e 2
color: papayawhip; e background: papayawhip; -+
} |
header { -+
background: papayawhip;
color: #333;
}
vm$
Pipelines are a powerful way to apply the various "filtering" utilities. wc is built without any knowledge of grep--it only operates on the input stream it is provided. More importantly, understanding wc does not require understanding grep, meaning that you can learn and incorporate new tools slowly over time.
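Because every filter reads standard input and writes standard output, nothing stops us from chaining further utilities onto the pipeline. The sketch below assumes the same src/style.css file pictured above and introduces two utilities we haven't discussed--cut (which splits each line on a delimiter) and sort; the whitespace in the output will mirror the stylesheet's own indentation:
vm$ grep papayawhip src/style.css | cut -d ':' -f 1 | sort
  background
  color
vm$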
Gathering Streams Into Options
vm$ cat process-wiring-pipes-and-xargs.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| standard out O--- --> O arguments |
| | \ / | |
| standard error O----+-----+ O |
| | | |
| O O |
| | | |
+----------------+ pipes + xargs +-----------------------+
vm$
Sometimes we'd like to use the contents of the standard output stream as an option for another command.
vm$ find documents -type f -newermt 'last week'
documents/year plan.txt
documents/questionnaire.txt
vm$
Since find is built to locate files and write their names to standard output, it is commonly used in this way. Here, we're using the utility to find all the files within the documents/ directory that have been changed in the past week.
vm$ find documents -type f -newermt 'last week' | sed 's/hot dog/banana/g'
documents/year plan.txt
documents/questionnaire.txt
vm$
If we wanted to replace words in just those files, a pipe on its own wouldn't help us. This command attempts to replace the words in the list of file names, not the content of the files.
vm$ find documents -type f -newermt 'last week' | cat | sed 's/hot dog/banana/g'
# Year plan
January: learn to use the Terminal
February: eat 50 bananas
Inserting cat in the pipeline partially solves the problem. Now sed receives a stream containing the contents of the files, and it replaces the text as specified.
The problem is the format of the result. We end up with a single stream of text on standard output. The original files are unmodified, and there is no good way to split the output back into separate pieces.
We really need to provide files to sed in this case (not a stream), because we want to modify each source file in-place.
vm$ sed -i 's/hot dog/banana/g' $(find documents -type f -newermt 'last week')
sed: can't read documents/year: No such file or directory
sed: can't read plan.txt: No such file or directory
vm$ sed -i 's/hot dog/banana/g' documents/year plan.txt documents/questionnaire.txt
sed: can't read documents/year: No such file or directory
sed: can't read plan.txt: No such file or directory
vm$
We might be tempted to use command substitution for this task. This approach works in some cases, but it falls apart if any of the files contain white space. The sed program receives a list of options, but the fact that one of the spaces is part of a file name (and not a separator) is lost.
If you are sure that all the possible files have "normal" names, then command substitution is perfectly acceptable. In other cases, though, a more robust approach is necessary.
xargs
Build commands from standard input
vm$ man xargs
XARGS(1) General Commands Manual XARGS(1)
NAME
xargs - build and execute command lines from standard input
Manual page xargs(1) line 1 (press h for help or q to quit)
The xargs utility is designed for exactly this use case. It executes a command on our behalf by combining the options it receives with the standard input.
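Before applying it to our problem, here is a hypothetical sketch of the basic behavior: xargs reads whitespace-separated words from its standard input and appends them to the command it is given. Substituting echo for that command is an easy way to preview what would actually be run (the file names below are made up):
vm$ echo 'one.txt two.txt three.txt' | xargs echo wc -l
wc -l one.txt two.txt three.txt
vm$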
vm$ find documents -type f -newermt 'last week' -print0 | xargs -0 sed -i 's/hot dog/banana/g'
vm$
This command has become quite long!
We've added -print0 as a new option to the invocation of find. That new option instructs find to use the special "null byte" as a separator between files.
We're piping this value into the new xargs utility. Note that we're specifying -0 as an option for xargs. This means, "split the standard input into pieces for every 'null byte' character."
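To see why the null byte matters, compare how xargs splits its input with and without these options. By default it treats any whitespace as a separator, so a name like documents/year plan.txt is broken in two. The sketch below uses an abbreviated find expression, plus the -n 1 option (which tells xargs to run the command once per argument) purely to make the splitting visible:
vm$ find documents -name '*plan*' | xargs -n 1 echo
documents/year
plan.txt
vm$ find documents -name '*plan*' -print0 | xargs -0 -n 1 echo
documents/year plan.txt
vm$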
"I never want to type that again."
"Kuba sleeping on keyboard" by Stefan Zdzialek is licensed under CC BY-ND 2.0
The "robust" solution is admittedly a hassle to type. As noted earlier, if you are in control of the input files, and you consistently avoid names with special characters, then the more direct "command substitution" approach is fine.
Thankfully, there are great methods for storing (and documenting) complex commands--we'll cover those in Chapter 13 - Scripting.
The main takeaways from this final example are that output streams can be supplied as command options, and that xargs is available when the streams might contain special characters.
In Review
- Command substitution
  - Purpose: capture the data a process writes to standard output
  - Syntax: backticks (`) or dollar-sign-with-parentheses ($( and ))
- Exit status codes
  - Purpose: programmatically determine whether a command succeeded or failed
  - Syntax: the $? variable
- Pipes
  - Purpose: connect the standard output stream of one process to the standard input stream of another
  - Syntax: the vertical bar character (|)
Exercise
- Use command substitution to list the contents of the directory that contains the tree utility. (Hint: you'll need a few new tools to do this; check out the man pages for which and dirname.) Remember that the shell performs command substitution when it encounters either of two separate syntaxes. Try to express your solution in two forms--one for each syntax. Is it possible to write a solution that uses both syntaxes at the same time?
- The virtual machine includes a utility named booboo that simply writes a dynamic value to standard error.
  vm$ booboo
  fe2c245c8cb742e854faef9b7a3970063583b5cd
  vm$ booboo
  e1f18ddfe4698796e2f3178ce92c131e06b3ccb0
  vm$
  The virtual machine also includes a utility named fixer that expects to be invoked with the value from booboo as its only option. Can you satisfy fixer by using a pipe? What about by using command substitution?
- As we've seen, pipelines are a powerful way to "wire together" independent processes.
  vm$ ls movies | grep squirrel
  movies/get-squirrely.mp4
  movies/squirrel-boy.mp4
  movies/squirrels.mp4
  It's also possible to set up a similar "wiring" using only input and output redirection (i.e. without using the "pipe" operator). Do you know how this could be done? Are there any performance considerations to be made?
Solution
- Let's break this down into parts:
  - List the contents of some directory X.
  - Find the directory name of some file Y.
  - Find the path to the tree utility.
  To solve this, we'll need to start with the last step and work backwards:
  - Find the path to the tree utility. The instructions mentioned a utility named which. The man page describes it tersely: "locate a command."
    vm$ which tree
    /usr/bin/tree
  - Find the directory name of some file Y. The instructions also mentioned a utility named dirname. The man page for this one is a little more descriptive: "strip last component from file name." We'll apply it to the path we just found:
    vm$ dirname /usr/bin/tree
    /usr/bin
  - List the contents of some directory X. The ls utility was one of the first we learned about--it lists directory contents.
    vm$ ls /usr/bin
  We've technically found the solution, but the prompt requires that we use command substitution. We can re-write these three commands, feeding the output of one into the input of the next, as follows:
  vm$ ls $(dirname $(which tree))
  As discussed in this chapter, the shell will also expand commands we write within "backtick" characters (`), but we can't "nest" commands written that way. If we really want to use it, we could split the command into separate variable assignments:
  vm$ treepath=`which tree`
  vm$ treedir=`dirname $treepath`
  vm$ ls $treedir
  Even though the backtick syntax itself is shorter, using it here requires a lot more text. There are some merits to the longer form, though--here, the intermediate values are stored in variable names that may help the reader understand what is going on.
  The substitution operations are completely independent, so we can use both if we wish:
  vm$ ls $(dirname `which tree`)
  vm$ ls `dirname $(which tree)`
  ...although it's not clear why this would be desirable.
- Whenever we want to use the output of one process as the options to another, we should think of xargs. Because booboo writes to standard error, direct application won't work:
  vm$ booboo | xargs fixer
  fe2c245c8cb742e854faef9b7a3970063583b5cd
  Expected exactly 1 argument but received 0.
  Usage: fixer "value-from-booboo"
  Invoke this program with the value that the 'booboo' program writes to the standard error stream.
  vm$
  The first line of the output above is the value from the booboo process; this was written to the standard error stream, which was not redirected into the pipeline. xargs, receiving no input on its standard input stream, simply invoked fixer without any options.
  To correct this, we'll need to redirect booboo's standard error stream into its standard output before creating the pipeline:
  vm$ booboo 2>&1 | xargs fixer
  You got it!
  Using command substitution is also possible, but once again, we'll have to account for the source of the value by redirecting standard error to standard output:
  vm$ fixer $(booboo)
  fe2c245c8cb742e854faef9b7a3970063583b5cd
  Expected exactly 1 argument but received 0.
  Usage: fixer "value-from-booboo"
  Invoke this program with the value that the 'booboo' program writes to the standard error stream.
  vm$ fixer $(booboo 2>&1)
  You got it!
  vm$
- In a previous chapter, we discussed redirecting input and output to a file. We can simulate a pipeline by writing one process's output stream to a file, and then using that file as the input stream for a subsequent process.
  vm$ ls movies > movies.txt
  vm$ grep squirrel < movies.txt
  movies/get-squirrely.mp4
  movies/squirrel-boy.mp4
  movies/squirrels.mp4
  The end result is equivalent, but this approach is less efficient for a few reasons:
  - Writing to a file involves transmitting data to the hard drive and waiting for the write operation to complete. Even with today's fancy solid state drives, this takes more time than a pipe (which buffers data in memory).
  - The entire directory listing has to be created and stored before the grep operation can even begin. Directory listings are typically so small as to not present a problem, but this could be a more severe issue in other applications, where many gigabytes of data may pass through the stream.