Help File Library: Globbing on the Command Line
Written By: Devin Carraway
This article explains the details of referring to multiple files on the
command line using wildcards -- particularly the places where the UNIX
wildcards differ from the simpler wildcards used by DOS and Windows.
Assumptions
This article assumes you know how to access the command line, and are
familiar with the meanings of files and directories. The commands ls(1),
cp(1) and mv(1) (list files, copy files, and move files, respectively) are
used as examples -- it's helpful if you know what they do, or have read
their manpages. The differences between the Linux shell and the DOS
command line are explained, but DOS experience isn't really required.
If you'd like to try out the examples given here, you'll find the touch(1)
and mkdir(1) commands useful -- see their manpages or run 'touch --help' or
'mkdir --help' to find out how to use them. You can use touch to create
empty files with the names you choose, and work through the examples with
them. The mkdir command can be used to make the directories for the copy
and move examples.
The examples given here work with both bash and csh, the most common shells
on Linux systems.
"Globbing" is like a lot of terms you hear from UNIX users -- it sounds
amusing, but its meaning isn't necessarily apparent. Actually, globbing
isn't something you yourself need to do -- it's something you get to tell
the shell (e.g. bash or csh) to do for you. Globbing happens when you use
"wildcard" characters on the commandline to refer to several files at once
with one short pattern, instead of typing out each name (see the Jargon File
definition of "glob" at http://www.tuxedo.org/~esr/jargon/html/entry/glob.html).
If you've used DOS before, you probably remember the expression "*.*"
(usually pronounced "star-dot-star.") That was how you told DOS that
whatever command you'd just given, you wanted it to apply to all the files
in the directory. It had to be written that way because DOS can only
conceive of files with names like MYFILE.TXT -- up to eight letters, then a
period, then up to three letters. It's a limitation that got put there a
long time ago.
Linux, like other kinds of UNIX, doesn't have the kinds of limitations that
DOS does. (You may notice that most of the DOS/Windows history has led to
limitations, like 8.3 filenames, whereas most of the Linux/UNIX heritage has
led to neat tools, like bash. If not, hang on and we'll see.) Under a
Linux shell, we use the * character (usually pronounced as "star") to mean
"anything or nothing" in a filename -- pretty much the same as DOS, but
without that confusing .* thing.
Using characters like * to indicate a list of files is a time-saving
measure; it's useful because it's much easier to write "ls myfile*" than it
is to write "ls myfile1.txt myfile2.txt myfile47.jpeg" and so forth. When
you use a * in that way, it means "all files beginning with 'myfile'" -- or,
put another way, "all files beginning with 'myfile' and having any other
characters, or none, after that." Okay, so the first way is clearer, but
the shell thinks of it the second way. To the shell, "*" means "zero or
more characters." So when you say "ls myfile*", before it runs the ls(1)
command, it goes looking for files whose names begin with "myfile". It
assembles a list of them, and then gives that list to ls -- so even though
you aren't typing "ls myfile1.txt myfile2.txt" and so on, that's what ls
sees -- the shell saved you the work.
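If you'd like to watch the shell do this, here's a small sketch to try. It assumes bash and the mktemp(1) command, and the scratch directory and file names are made up to follow the text. Putting echo in front of a command prints the commandline the way the shell hands it over, without actually running it:

```shell
# Work in a throwaway directory so no real files are touched
# (the "scratch" variable is just a name chosen for this sketch).
scratch=$(mktemp -d)
cd "$scratch"

# Create three empty files using the names from the text.
touch myfile1.txt myfile2.txt myfile47.jpeg

# echo only prints its arguments, so this shows the list the shell built
# from myfile* before any command got to see it.
echo ls myfile*
# prints: ls myfile1.txt myfile2.txt myfile47.jpeg
```

When you're done experimenting, cd somewhere else and remove the scratch directory with rm -r "$scratch".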
By the way, the * in a commandline is called a "meta-character." That's
just a way of saying that it doesn't literally mean an asterisk, but that it
stands for something else. The meta-characters you'll probably find most
useful are, in order, *, ?, [] and {} (ah, you say -- but [] and {} are two
characters each -- true enough, but that will become clearer in a moment).
Because of their special meanings, many programs will try to discourage you
from using those characters when you name files -- because they have special
meanings to the shell, they can be tricky to access when you have to type in
their names.
Unlike in DOS, a * can be used anywhere in a filename -- at the beginning, at
the end, or in the middle. Also unlike in DOS, you can use as many *s as
you like, and the meta-characters can be combined in almost any way.
We might as well also explain that the ? meta-character means "any single
character." ? matches A, or q, or 6. However, unlike *, it doesn't match
nothing at all -- myfile* would match "myfile" and "myfile2", but myfile?
would only match myfile2.
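The difference between * and ? is easy to check for yourself. This is only a sketch, assuming bash and mktemp(1); the file myfile23 is an extra made-up name added to show that each ? stands for exactly one character:

```shell
# Throwaway directory and sample files (names are illustrative).
scratch=$(mktemp -d)
cd "$scratch"
touch myfile myfile2 myfile23

# * matches zero or more characters, so all three names qualify.
echo myfile*
# prints: myfile myfile2 myfile23

# ? matches exactly one character: "myfile" is too short, "myfile23" too long.
echo myfile?
# prints: myfile2

# Two ?s demand exactly two more characters.
echo myfile??
# prints: myfile23
```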
Let's say you had a lot of files; some were named "letter_to_mom-NNN.txt",
where NNN was some number (you write to Mom a lot). Also suppose you had
some pictures named "photo_for_mom-NNN.jpeg" (again NNN being some number),
and some more files named things like "letter_to_dad-NNN.txt". Now suppose
you wanted to copy (using the cp(1) command) all the letters to a directory
called myletters:
cp letter* myletters/
That one's pretty simple -- when you say letter* to the shell, it means "all
the files whose names begin with 'letter'", and so all the letter* files
will get copied to the myletters directory. Now suppose you were making a
directory of correspondence with Mom -- now you need to copy not only the
letters to her, but also the photos you've been collecting. One simple way
would be:
cp letter_to_mom* mom/
cp photo_for_mom* mom/
... but that's way too much typing. A somewhat shorter way would be:
cp letter_to_mom* photo_for_mom* mom/
... that works too, because cp(1) can copy any number of files at once, so
long as the last thing on the commandline tells cp into what directory you'd
like the files put. But there's an even shorter way:
cp *_mom-* mom/
... this copies all the files that have "_mom-" anywhere in their names
into the mom directory. The _ (underscore) and - are there because they
appear in the original filenames just before and after the name -- that way
your correspondence with Cardamom and Electro-Mom won't get mixed in.
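Before copying anything, it can be worth previewing what a pattern will pick up. The sketch below assumes bash and mktemp(1), uses made-up file numbers in place of NNN, and invents a letter_to_cardamom file to stand in for the Cardamom correspondence. It keys the pattern on "_mom-", since in names like letter_to_mom-NNN.txt the name is followed by a hyphen:

```shell
# Throwaway directory with a few sample files (all names are illustrative).
scratch=$(mktemp -d)
cd "$scratch"
touch letter_to_mom-1.txt photo_for_mom-1.jpeg
touch letter_to_dad-1.txt letter_to_cardamom-1.txt

# The underscore and hyphen pin the match to the whole word "mom".
echo *_mom-*
# prints: letter_to_mom-1.txt photo_for_mom-1.jpeg

# Without them, anything containing "mom" is swept in as well.
echo *mom*
# prints: letter_to_cardamom-1.txt letter_to_mom-1.txt photo_for_mom-1.jpeg
```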
Now, let's suppose that in your long-running correspondence with Mom,
sometimes you'd saved your files beginning with "letter_to_mom" and other
times with "letter_to_Mom" -- the difference being that capital M. Now, in
UNIX, filenames are case-sensitive; that means that 'a' and 'A' aren't the
same. If you try to specify the files on the commandline as
'letter_to_mom*', you'll miss the ones that have the capital M. In such a
case, the ? and [] meta-characters are useful:
mv letter_to_?om* mom/
... the ? means that any single character can be there -- m and M included --
while the *, as before, means "anything or nothing." Thus, the
capitalization problem is quickly avoided. An even neater way to do the
same thing would be this:
mv letter_to_[mM]om* mom/
... in this case, [mM] means "either 'm' or 'M'." It's called a "character
class," or more simply a "character list" (actually, you can call it
whatever you like). When you use [], you can put as many letters in it as
you'd like -- even spaces or most punctuation -- and it will match any of
the letters you've listed. This is useful because it gives you more exact
control -- in the ? example earlier, ? matched M or m in Mom, but it also
would have matched "dom" and "tom" -- whereas [Mm] only matches "Mom" and
"mom", and that's it. You can also use "ranges" with [] -- that's where you
say "any character between these two characters" -- the easy example is
[0-9], which means "any digit." You might also see [a-z], which means "any
lowercase letter." Any number of ranges can be included in a [], even
alongside other characters you've put in there -- [a-zA-Z] is another common
one, meaning "any upper- or lower-case letter"; [a-z13579] means "any
lowercase letter, or the digit 1, or 3, or 5, or 7, or 9." You'll probably
find the [] most useful when you want to extract a fairly strictly limited set
of files out of a long list. Returning to our correspondence example, you
might want to get letters to Mom #3 through #6 -- so, you'd use this:
mv letter_to_mom-[3-6].txt mom/
... in this example, the [3-6] means "3, 4, 5 or 6" -- thus
letter_to_mom-4.txt will be moved to the mom directory, but
letter_to_mom-2.txt will not.
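Here is a sketch that puts ?, a character list and a range side by side. It assumes bash and mktemp(1), and the file numbers are made up for illustration:

```shell
# Throwaway directory with sample letters (names are illustrative).
scratch=$(mktemp -d)
cd "$scratch"
touch letter_to_mom-2.txt letter_to_mom-4.txt letter_to_mom-7.txt
touch letter_to_Mom-5.txt letter_to_dom-3.txt

# ? accepts any single character, so the letter to "dom" is listed too
# (all five files match).
ls letter_to_?om*

# [mM] accepts only m or M, so letter_to_dom-3.txt is left out
# (four files match).
ls letter_to_[mM]om*

# [3-6] accepts one digit from 3 to 6; the capital-M letter #5 doesn't
# count because this pattern spells "mom" in lowercase.
echo letter_to_mom-[3-6].txt
# prints: letter_to_mom-4.txt
```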
Note that we've used - to indicate a range -- while the '-' character
generally acts like any other, inside a [] it's a meta-character. If you'd
like a literal '-' in a character class, you can "escape" it with a backslash
(\) character (many places on the commandline and elsewhere, \ means "take
the next character literally"). So if you had two files, "myfile-1" and
"myfile_2", you could match them both with myfile[_\-]? -- the _ is a normal
character in the character class, the \ indicates that the - should be
treated as one also (that is to say, it isn't being used to indicate a
range), and the trailing ? matches the final digit. In bash, putting the -
first or last inside the brackets, as in myfile[_-]?, has the same effect.
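A quick way to convince yourself that the hyphen is being taken literally is sketched below, assuming bash and mktemp(1). The file myfileX3 is a made-up extra that should not match, and the ? at the end of each pattern is there to cover the digit that follows the - or _ in the names:

```shell
# Throwaway directory and sample files (names are illustrative).
scratch=$(mktemp -d)
cd "$scratch"
touch myfile-1 myfile_2 myfileX3

# The backslash makes the - an ordinary member of the class.
# Lists myfile-1 and myfile_2, but not myfileX3.
ls myfile[_\-]?

# Placing the - last in the brackets works the same way in bash,
# since there is nothing after it to form a range with.
ls myfile[_-]?
```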
One other trick about character classes -- they can be "negated" if the
first character is a caret (^). The caret changes the meaning of the class
from "any one of these characters" to "any one character other than these."
So, while [0-9] means "any digit," [^0-9] means "any single character that
isn't a digit."
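Negation is easiest to see with a class and its opposite next to each other. This sketch assumes bash (other shells may spell negation differently) and mktemp(1); the notes files are made-up names:

```shell
# Throwaway directory and sample files (names are illustrative).
scratch=$(mktemp -d)
cd "$scratch"
touch notes1 notesA notesB

# One digit after "notes".
echo notes[0-9]
# prints: notes1

# One character after "notes" that is NOT a digit.
echo notes[^0-9]
# prints: notesA notesB
```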
The problem with character classes is that each one only refers to a single
character, and it's a pain to string several of them together, especially if
they're long and complicated. Often you just want to refer to one of a few
different words, and character classes are unwieldy. That's where the {},
or "alternative list" comes in. {} contains a list of words, separated by
commas, and the shell expands the pattern once for each word in the list.
Once again, let's say you have
your letters to Mom and Dad as letter_to_mom-NNN.txt and
letter_to_dad-NNN.txt. Also suppose you have a friend named Dominique, and
her letters are named letter_to_dom-NNN.txt. Now, if you were to use
character classes as above, you might try:
cp letter_to_[md][oa][md]* parents/ # note: this is wrong
... this is somewhat hard to read -- it means "any file whose name begins
with 'letter_to_', and then has either an 'm' or a 'd', then either an 'o'
or an 'a', then either an 'm' or a 'd', then any number of characters." It's
complicated, and beyond that, it doesn't work: while it will indeed
copy all of the letter_to_mom* and letter_to_dad* files, the character
classes also allow letter_to_dom* to match (taking d, o and m from the three
classes, just as d, a, d and m, o, m do). This is an excellent place to use
{} -- just specify {mom,dad} instead of the messy character classes, and you
have:
cp letter_to_{mom,dad}* parents/
... which is much more readable, and also has a simpler meaning -- "any file
beginning with 'letter_to_', then having either 'mom' or 'dad', then any
number of characters." You're also allowed to use the globbing characters
inside the alternative lists -- for example, suppose you did want to get
your letters to Dominique also:
cp letter_to_{[md]om,dad}* correspondence/
... This is the same as the previous example, except that instead of
matching 'mom', the shell will match either an 'm' or a 'd' followed by
'om'. Likewise, you're allowed to use * in alternative lists. Suppose that
once in a while you'd saved a letter to Dominique as
letter_to_dominique-NNN.txt instead of letter_to_dom-NNN.txt. (Most people
when they create a lot of files wind up trying to use some sort of
consistent scheme for naming them -- and most of those people find
themselves breaking their own scheme sometimes; shells help by making such
inconsistencies easier to cope with). If you wanted to collect your
first few letters to Mom, Dad and Dom in the correspondence directory, you
could use:
cp letter_to_{mom,dom*,dad}-[1-3].txt correspondence/
... The 'dom*' in the alternative list means "dom followed by zero or more
characters." You could also have written {mom,dom,dominique,dad}, but this
was terser.
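It may help to see what bash actually does with an alternative list. As far as bash is concerned, the braces are expanded first, producing one word per alternative, and only then are any wildcards in those words matched against files. The sketch below assumes bash and mktemp(1), with made-up file numbers:

```shell
# No files are needed here: the braces alone produce both words,
# in the order they were written.
echo letter_to_{mom,dad}-1.txt
# prints: letter_to_mom-1.txt letter_to_dad-1.txt

# Now add some files (names are illustrative) and mix braces with wildcards.
scratch=$(mktemp -d)
cd "$scratch"
touch letter_to_mom-1.txt letter_to_dad-2.txt
touch letter_to_dom-3.txt letter_to_dominique-2.txt letter_to_mom-9.txt

# Each alternative becomes its own pattern, matched in turn:
#   letter_to_mom-[1-3].txt  letter_to_dom*-[1-3].txt  letter_to_dad-[1-3].txt
echo letter_to_{mom,dom*,dad}-[1-3].txt
# prints: letter_to_mom-1.txt letter_to_dom-3.txt letter_to_dominique-2.txt letter_to_dad-2.txt
```

Notice that the results come out grouped in the order of the alternatives rather than alphabetically. One consequence worth knowing: with bash's default settings, an alternative whose pattern matches no file is passed along unchanged, so cp may complain about a file that doesn't exist.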
One final quickie for holding out this long -- ~. Almost anytime you move
around a UNIX system, you'll be moving into or out of your home directory
(which is usually something like /home/yourname/). When used at the
beginning of a pathname, ~ means "my home directory." Suppose you had a
directory "myfiles" in your home directory, and wanted to move some files
there from /tmp. If you had already cd'ed to /tmp, you could then move the
files with a command such as:
mv letter_to_* ~/myfiles/
... in this example, ~ is replaced by the shell with the full path to your
home directory (e.g. /home/yourname/). Under the bash and tcsh shells, the
~ character can be followed by a username, in which case it will refer to
the home directory of that user, rather than your own home directory. For
example, if you were copying some files from a floppy disk as root to your
normal user home directory, you might use:
cp /mnt/floppy/* ~bob
... when the shell gets this commandline, it replaces ~bob with the path to
bob's home directory (e.g. /home/bob/).
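You can preview a ~ the same way as any other wildcard, with echo. In this sketch (assuming bash), HOME is set inside parentheses purely so the output is predictable; /home/yourname is the stand-in path used in the text, not a real account, and your own shell will print your real home directory instead:

```shell
# The parentheses run the commands in a subshell, so the HOME setting
# here does not leak into the rest of your session.
( HOME=/home/yourname; echo ~/myfiles/ )
# prints: /home/yourname/myfiles/

# Quoting switches the expansion off, so the ~ stays a plain character.
echo "~/myfiles/"
# prints: ~/myfiles/
```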
Conclusion
This is pretty much all there is to using shell wildcards -- taken together,
wildcards set a good balance between simplicity and power in identifying
files precisely and quickly according to fairly straightforward rules.
These guidelines are summarized in the "Pathname Expansion" section of the
bash(1) manpage, and the "Filename substitution" section of the tcsh(1)
manpage.
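Wildcards are a simpler cousin of what are called "regular expressions" --
often abbreviated "regexps," these are extremely powerful devices for text
matching (more powerful than are usually required for moving files around).
Be careful not to mix the two up, though: in a regular expression, * means
"zero or more of whatever came just before it," not "zero or more characters."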
Regular expressions are very important in some areas of Linux, UNIX and the
Internet, especially if you find yourself learning UNIX or CGI programming.
For more on regexps, see
http://www.tuxedo.org/~esr/jargon/html/entry/regexp.html, the
sed(1) and Perl documentation, or O'Reilly's book, Mastering Regular Expressions.