Part III - Improving Your Workflow
Process Combination
Unix-like systems provide a number of facilities for composing processes. As we'll see, this makes it possible to break down complex problems into distinct pieces and then solve each part with generic utilities.
vm$ cat process-wiring.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| standard out O--- --> O arguments |
| | \ / | |
| standard error O----+-----+---> O environment variables |
| | / \ | |
| exit status O--- --> O standard input |
| | | |
+----------------+ +-----------------------+
vm$
We're going to learn how we can "wire together" processes, drawing on our understanding of input and output from Chapter 11 - Process Boundaries.
Capturing Output with Command Substitution
vm$ cat process-wiring-cmd-sub.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| standard out O--- --> O arguments |
| | \ / | |
| standard error O----+-----+---> O environment variables |
| | | |
| O O |
| | command | |
+----------------+ substitution +-----------------------+
vm$
Command substitution is a technique that allows us to capture the data a process writes to standard output.
vm$ user=`whoami`
vm$ echo $user
vagrant
vm$ user=`sudo whoami`
vm$ echo $user
root
vm$
By using the backtick character, we can place commands in locations where, up to now, we have only placed values. Before executing the "outer" command, the shell will intervene by creating a new process for the "nested" command. It will gather up all the data the "nested" process writes to the standard output stream, and when that process exits, it will insert that data in place of the nested command.
vm$ find /media/shared -user `whoami`
/media/shared/music/Cake - You Part the Waters.mp3
/media/shared/movies/The Royal Tenenbaums.mkv
vm$ find /media/shared -user `sudo whoami`
/media/shared/documents/boring-admin-protocol.txt
/media/shared/documents/usage-statistics.txt
vm$
We're not limited to variable assignment, either. We can use this syntax to feed the output of one command into the options of another.
vm$ wc `find /media/documents -user `sudo whoami``
find: missing argument to `-user'
wc: sudo: No such file or directory
wc: whoami: No such file or directory
vm$
On rare occasions, you may need to use a command substitution within another command substitution. In these cases, the backtick character can't help us.
The shell interprets the command in this example by reading from left to right:
- the command wc
- the command find /media/documents -user (which produces the first error)
- the options sudo and whoami (which are passed to wc, producing the second and third errors)
- an empty command
vm$ wc $(find /media/documents -user $(sudo whoami))
90 510 2816 /media/shared/documents/boring-admin-protocol.txt
29 174 839 /media/shared/documents/usage-statistics.txt
119 684 3655 total
vm$
In cases like these, we'll want to use an alternate syntax for command substitution: the sequence $( to "open" the nested command, and the close-parenthesis character ) to "close" it.
Nesting works as intended because the difference in the "open" and "close" sequences avoids the ambiguity of backticks.
Generally, it's a good idea to stick to this slightly more verbose syntax, even when you aren't nesting. This makes it easier to re-factor and re-use commands later. You'll find examples of both approaches on the web, so it's important to be aware of each of them.
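For instance, the two commands below should behave identically; only the substitution syntax differs. This is a small sketch that reuses the whoami command from the earlier examples:
vm$ echo "Hello, `whoami`"
Hello, vagrant
vm$ echo "Hello, $(whoami)"
Hello, vagrant
vm$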
Forwarding Exit Status Codes
vm$ cat process-wiring-logical.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| O --> O arguments |
| | / | |
| O +-----+---> O environment variables |
| | / | |
| exit status O--- O |
| | logical cntrl | |
+----------------+ operators +-----------------------+
vm$
We actually already discussed the most direct means of capturing process exit status--the $? variable.
Admittedly, the contents of the standard output and standard error streams are usually more helpful than exit codes for human users. In Chapter 13 - Scripting, we'll see how this value is essential to controlling execution flow while automating tasks.
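As a preview, here is a minimal sketch of reading $? and of the logical control operators (&& and ||) pictured in the diagram above, which run a second command only when the first succeeds or fails. The directory name here is hypothetical, and the exact error text may differ on your system:
vm$ mkdir /root/backups
mkdir: cannot create directory '/root/backups': Permission denied
vm$ echo $?
1
vm$ whoami && echo "the previous command succeeded"
vagrant
the previous command succeeded
vm$ mkdir /root/backups || echo "the previous command failed"
mkdir: cannot create directory '/root/backups': Permission denied
the previous command failed
vm$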
Connecting Streams with Pipes
vm$ cat process-wiring-pipes.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| standard out O--- O |
| | \ | |
| standard error O----+-----+ O |
| | \ | |
| O --> O standard input |
| | | |
+----------------+ pipes +-----------------------+
vm$
In Unix-like systems, the word "pipe" describes a connection between the output of one process and the input of another. This extends the analogy of data "streams"--just as in the physical sense of the terms, a Unix pipe can direct the flow of a stream between two places.
vm$ grep papayawhip src/style.css | wc -l
2
vm$
The standard output of one process can be "piped" into the standard input of another by separating two commands with a vertical bar character (|). The entire string of commands is known as a "pipeline."
This example uses grep to locate all the lines in a file that contain the text papayawhip, forwarding the result to the wc program, which (thanks to the -l option) counts the number of lines.
vm$ cat src/style.css | grep papayawhip | wc -l
2
vm$
Pipelines may contain any number of separate commands.
We can extend the previous pipeline by first reading the file using cat. Although grep happens to have the ability to read files directly, it can also operate on the standard input stream. This usage makes grep's role as a "filter" more clear.
vm$ cat pipeline.txt
+------------------------+ +---------------------+ +-------+
| cat src/style.css | +-> | grep papayawhip | +-> | wc -l |
+------------------------+ | +---------------------+ p +-------+
| p | i |
v i v p v
body { p color: papayawhip; e 2
color: papayawhip; e background: papayawhip; -+
} |
header { -+
background: papayawhip;
color: #333;
}
vm$
Pipelines are a powerful way to apply the various "filtering" utilities. wc is built without any knowledge of grep--it only operates on the input stream it is provided. More importantly, understanding wc does not require understanding grep, meaning that you can learn and incorporate new tools slowly over time.
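Because every filter reads standard input and writes standard output, nothing stops us from chaining further utilities onto the pipeline. The sketch below assumes the same src/style.css file pictured above and introduces two utilities we haven't discussed--cut (which splits each line on a delimiter) and sort; the whitespace in the output will mirror the stylesheet's own indentation:
vm$ grep papayawhip src/style.css | cut -d ':' -f 1 | sort
  background
  color
vm$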
Gathering Streams Into Options
vm$ cat process-wiring-pipes-and-xargs.txt
Process A Process B
+----------------+ +-----------------------+
| (outputs) | | (inputs) |
| standard out O--- --> O arguments |
| | \ / | |
| standard error O----+-----+ O |
| | | |
| O O |
| | | |
+----------------+ pipes + xargs +-----------------------+
vm$
Sometimes we'd like to use the contents of the standard output stream as an option for another command.
vm$ find documents -type f -newermt 'last week'
documents/year plan.txt
documents/questionnaire.txt
vm$
Since find is built to locate files and write their names to standard output, it is commonly used in this way. Here, we're using the utility to find all the files within the documents/ directory that have been changed in the past week.
vm$ find documents -type f -newermt 'last week' | sed 's/hot dog/banana/g'
documents/year plan.txt
documents/questionnaire.txt
vm$
If we wanted to replace words in just those files, a pipe on its own wouldn't help us. This command attempts to replace the words in the list of file names, not the content of the files.
vm$ find documents -type f -newermt 'last week' | cat | sed 's/hot dog/banana/g'
# Year plan
January: learn to use the Terminal
February: eat 50 bananas
Inserting cat in the pipeline partially solves the problem. Now sed receives a stream containing the contents of the files, and it replaces the text as specified.
The problem is the format of the result. We end up with a single stream of text on standard output. The original files are unmodified, and there is no good way to split the output back into separate pieces.
We really need to provide files to sed in this case (not a stream), because we want to modify each source file in-place.
vm$ sed -i 's/hot dog/banana/g' $(find documents -type f -newermt 'last week')
sed: can't read documents/year: No such file or directory
sed: can't read plan.txt: No such file or directory
vm$ sed -i 's/hot dog/banana/g' documents/year plan.txt documents/questionnaire.txt
sed: can't read documents/year: No such file or directory
sed: can't read plan.txt: No such file or directory
vm$
We might be tempted to use command substitution for this task. This approach works in some cases, but it falls apart if any of the files contain white space. The sed program receives a list of options, but the fact that one of the spaces is part of a file name (and not a separator) is lost.
If you are sure that all the possible files have "normal" names, then command substitution is perfectly acceptable. In other cases, though, a more robust approach is necessary.
xargs
Build commands from standard input
vm$ man xargs
XARGS(1) General Commands Manual XARGS(1)
NAME
xargs - build and execute command lines from standard input
Manual page xargs(1) line 1 (press h for help or q to quit)
The xargs utility is designed for exactly this use case. It executes a command on our behalf by combining the options it receives with the standard input.
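Before applying it to our problem, here is a hypothetical sketch of the basic behavior: xargs reads whitespace-separated words from its standard input and appends them to the command it is given. Substituting echo for that command is an easy way to preview what would actually be run (the file names below are made up):
vm$ echo 'one.txt two.txt three.txt' | xargs echo wc -l
wc -l one.txt two.txt three.txt
vm$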
vm$ find documents -type f -newermt 'last week' -print0 | xargs -0 sed -i 's/hot dog/banana/g'
vm$
This command has become quite long!
We've added -print0 as a new option to the invocation of find. That new option instructs find to use the special "null byte" as a separator between files.
We're piping this value into the new xargs utility. Note that we're specifying -0 as an option for xargs. This means, "split the standard input into pieces for every 'null byte' character."
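To see why the null byte matters, compare how xargs splits its input with and without these options. By default it treats any whitespace as a separator, so a name like documents/year plan.txt is broken in two. The sketch below uses an abbreviated find expression, plus the -n 1 option (which tells xargs to run the command once per argument) purely to make the splitting visible:
vm$ find documents -name '*plan*' | xargs -n 1 echo
documents/year
plan.txt
vm$ find documents -name '*plan*' -print0 | xargs -0 -n 1 echo
documents/year plan.txt
vm$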
"I never want to type that again."
"Kuba sleeping on keyboard" by Stefan Zdzialek is licensed under CC BY-ND 2.0
The "robust" solution is admittedly a hassle to type. As noted earlier, if you are in control of the input files, and you consistently avoid names with special characters, then the more direct "command substitution" approach is fine.
Thankfully, there are great methods for storing (and documenting) complex commands--we'll cover those in Chapter 13 - Scripting.
The main takeaways from this final example are that output streams can be supplied as command options, and that xargs is available when the streams might contain special characters.
In Review
- Command substitution
  - Purpose: capture the data a process writes to standard output
  - Syntax: backticks (`) or dollar-sign-with-parentheses ($( and ))
- Exit status codes
  - Purpose: programmatically determine whether a command succeeded or failed
  - Syntax: the $? variable
- Pipes
  - Purpose: connect the standard output stream of one process to the standard input stream of another
  - Syntax: the vertical bar character (|)
Exercise
- Use command substitution to list the contents of the directory that contains the tree utility. (Hint: you'll need a few new tools to do this; check out the man pages for which and dirname.) Remember that the shell performs command substitution when it encounters either of two separate syntaxes. Try to express your solution in two forms--one for each syntax. Is it possible to write a solution that uses both syntaxes at the same time?
- The virtual machine includes a utility named booboo that simply writes a dynamic value to standard error.
  vm$ booboo
  fe2c245c8cb742e854faef9b7a3970063583b5cd
  vm$ booboo
  e1f18ddfe4698796e2f3178ce92c131e06b3ccb0
  vm$
  The virtual machine also includes a utility named fixer that expects to be invoked with the value from booboo as its only option. Can you satisfy fixer by using a pipe? What about by using command substitution?
- As we've seen, pipelines are a powerful way to "wire together" independent processes.
  vm$ ls movies | grep squirrel
  movies/get-squirrely.mp4
  movies/squirrel-boy.mp4
  movies/squirrels.mp4
  It's also possible to set up a similar "wiring" using only input and output redirection (i.e. without using the "pipe" operator). Do you know how this could be done? Are there any performance considerations to be made?
Solution
- Let's break this down into parts:
  - List the contents of some directory X.
  - Find the directory name of some file Y.
  - Find the path to the tree utility.
  To solve this, we'll need to start with the last step and work backwards:
  - Find the path to the tree utility. The instructions mentioned a utility named which. The man page describes it tersely: "locate a command."
    vm$ which tree
    /usr/bin/tree
  - Find the directory name of some file Y. The instructions also mentioned a utility named dirname. The man page for this one is a little more descriptive: "strip last component from file name." We'll apply it to the path we just found:
    vm$ dirname /usr/bin/tree
    /usr/bin
  - List the contents of some directory X. The ls utility was one of the first we learned about--it lists directory contents.
    vm$ ls /usr/bin
  We've technically found the solution, but the prompt requires that we use command substitution. We can re-write these three commands, feeding the output of one into the input of the next, as follows:
  vm$ ls $(dirname $(which tree))
  As discussed in this chapter, the shell will also expand commands we write within "backtick" characters (`), but we can't "nest" commands written that way. If we really want to use it, we could split the command into separate variable assignments:
  vm$ treepath=`which tree`
  vm$ treedir=`dirname $treepath`
  vm$ ls $treedir
  Even though the backtick syntax itself is shorter, using it here requires a lot more text. There are some merits to the longer form, though--here, the intermediate values are stored in variable names that may help the reader understand what is going on.
  The substitution operations are completely independent, so we can use both if we wish:
  vm$ ls $(dirname `which tree`)
  vm$ ls `dirname $(which tree)`
  ...although it's not clear why this would be desirable.
- Whenever we want to use the output of one process as the options to another, we should think of xargs. Because booboo writes to standard error, direct application won't work:
  vm$ booboo | xargs fixer
  fe2c245c8cb742e854faef9b7a3970063583b5cd
  Expected exactly 1 argument but received 0.
  Usage: fixer "value-from-booboo"
  Invoke this program with the value that the 'booboo' program writes to the standard error stream.
  vm$
  The first line of the output above is the value from the booboo process; this was written to the standard error stream, which was not redirected into the pipeline. xargs, receiving no input on its standard input stream, simply invoked fixer without any options.
  To correct this, we'll need to redirect booboo's standard error stream into its standard output before creating the pipeline:
  vm$ booboo 2>&1 | xargs fixer
  You got it!
  Using command substitution is also possible, but once again, we'll have to account for the source of the value by redirecting standard error to standard output:
  vm$ fixer $(booboo)
  fe2c245c8cb742e854faef9b7a3970063583b5cd
  Expected exactly 1 argument but received 0.
  Usage: fixer "value-from-booboo"
  Invoke this program with the value that the 'booboo' program writes to the standard error stream.
  vm$ fixer $(booboo 2>&1)
  You got it!
  vm$
- In a previous chapter, we discussed redirecting input and output to a file. We can simulate a pipeline by writing one process's output stream to a file, and then using that file as the input stream for a subsequent process.
  vm$ ls movies > movies.txt
  vm$ grep squirrel < movies.txt
  movies/get-squirrely.mp4
  movies/squirrel-boy.mp4
  movies/squirrels.mp4
  The end result is equivalent, but this approach is less efficient for a few reasons:
  - Writing to a file involves transmitting data to the hard drive and waiting for the write operation to complete. Even with today's fancy solid state drives, this takes more time than a pipe (which buffers data in memory).
  - The entire directory listing has to be created and stored before the grep operation can even begin. Directory listings are typically so small as to not present a problem, but this could be a more severe issue in other applications, where many gigabytes of data may pass through the stream.