justlinux.com
Mon, 13-Feb-2012 12:34:50 GMT

Help File Library: Globbing on the Command Line

Written By: Devin Carraway

This article explains the details of referring to multiple files on the command line using wildcards -- particularly the places where the UNIX wildcards differ from the simpler wildcards used by DOS and Windows.

Assumptions

This article assumes you know how to access the command line, and are familiar with the meanings of files and directories. The commands ls(1), cp(1) and mv(1) (list files, copy files, and move files, respectively) are used as examples -- it's helpful if you know what they do, or have read their manpages. The differences between the Linux shell and the DOS command line are explained, but DOS experience isn't really required.

If you'd like to try out the examples given here, you'll find the touch(1) and mkdir(1) commands useful -- see their manpages or run 'touch --help' or 'mkdir --help' to find out how to use them. You can use touch to create empty files with the names you choose, and work through the examples with them. The mkdir command can be used to make the directories for the copy and move examples.

The examples given here work with both bash and csh, the most common shells on Linux systems.

"Globbing" is like a lot of terms you hear from UNIX users -- it sounds amusing, but its meaning isn't necessarily apparent. Actually, globbing isn't something you yourself need to do -- it's something you get to tell the shell (e.g. bash or csh) to do for you. Globbing happens when you use "wildcard" patterns on the commandline to refer to several files at once in a shorter form (see the Jargon File definition of "glob" at http://www.tuxedo.org/~esr/jargon/html/entry/glob.html).

If you've used DOS before, you probably remember the expression "*.*" (usually pronounced "star-dot-star"). That was how you told DOS that whatever command you'd just given, you wanted it to apply to all the files in the directory. It had to be written that way because DOS can only conceive of files with names like MYFILE.TXT -- up to eight letters, then a period, then up to three letters. It's a limitation that got put there a long time ago.

Linux, like other kinds of UNIX, doesn't have the kinds of limitations that DOS does. (You may notice that most of the DOS/Windows history has led to limitations, like 8.3 filenames, whereas most of the Linux/UNIX heritage has led to neat tools, like bash. If not, hang on and we'll see.) Under a Linux shell, we use the * character (usually pronounced as "star") to mean "anything or nothing" in a filename -- pretty much the same as DOS, but without that confusing .* thing.

Using characters like * to indicate a list of files is a time-saving measure; it's useful because it's much easier to write "ls myfile*" than it is to write "ls myfile1.txt myfile2.txt myfile47.jpeg" and so forth. When you use a * in that way, it means "all files beginning with 'myfile'" -- or, put another way, "all files beginning with 'myfile' and having any other characters, or none, after that." Okay, so the first way is clearer, but the shell thinks of it the second way. To the shell, "*" means "zero or more characters." So when you say "ls myfile*", before it runs the ls(1) command, it goes looking for files whose names begin with "myfile". It assembles a list of them, and then gives that list to ls -- so even though you aren't typing "ls myfile1.txt myfile2.txt" and so on, that's what ls sees -- the shell saved you the work.
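If you'd like to see the expansion for yourself, a small sketch along these lines may help. It assumes bash and a scratch directory made with mktemp(1); otherfile.txt is an extra name added here purely for contrast. Since echo just prints whatever arguments it receives, it shows the list the shell built.

```shell
# Work in a scratch directory so nothing real is touched.
cd "$(mktemp -d)" || exit 1
touch myfile1.txt myfile2.txt myfile47.jpeg otherfile.txt

# echo prints its arguments, so this reveals what the shell hands over.
expanded=$(echo myfile*)
echo "$expanded"    # → myfile1.txt myfile2.txt myfile47.jpeg

# Quoting switches globbing off: the command gets the literal text.
literal=$(echo 'myfile*')
echo "$literal"     # → myfile*
```

The second echo is a reminder that the expansion is the shell's doing, not the command's: once the * is quoted, no list is built.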

By the way, the * in a commandline is called a "meta-character." That's just a way of saying that it doesn't literally mean an asterisk, but that it stands for something else. The meta-characters you'll probably find most useful are, in order, *, ?, [] and {} (ah, you say -- but [] and {} are two characters each -- true enough, but that will become clearer in a moment). Because these characters have special meanings to the shell, many programs will try to discourage you from using them when you name files -- files with such names can be tricky to access when you have to type their names in.

Unlike DOS, a * can be used anywhere -- at the beginning of a filename, the end of one, or in the middle. Also unlike DOS, you can use as many *s as you like. Meta-characters are available to you in most any combination. We might as well also explain that the ? meta-character means "any single character." ? matches A, or q, or 6. However, unlike *, it doesn't match nothing at all -- myfile* would match "myfile" and "myfile2", but myfile? would only match myfile2.
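The difference between * and ? is easy to check with a few empty files. This is only a sketch, assuming bash; myfile23 is a made-up name added to show a * in the middle of a pattern.

```shell
cd "$(mktemp -d)" || exit 1
touch myfile myfile2 myfile23

star=$(echo myfile*)       # zero or more characters after "myfile"
question=$(echo myfile?)   # exactly one character after "myfile"
middle=$(echo m*f*2*)      # several stars, anywhere in the name

echo "$star"       # → myfile myfile2 myfile23
echo "$question"   # → myfile2
echo "$middle"     # → myfile2 myfile23
```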

Let's say you had a lot of files; some were named "letter_to_mom-NNN.txt", where NNN was some number (you write to Mom a lot). Also suppose you had some pictures named "photo_for_mom-NNN.jpeg" (again NNN being some number), and some more files named things like "letter_to_dad-NNN.txt". Now suppose you wanted to copy (using the cp(1) command) all the letters to a directory called myletters:

cp letter* myletters/

That one's pretty simple -- when you say letter* to the shell, it means "all the files whose names begin with 'letter'", and so all the letter* files will get copied to the myletters directory. Now suppose you were making a directory of correspondence with Mom -- now you need to copy not only the letters to her, but also the photos you've been collecting. One simple way would be:

cp letter_to_mom* mom/
cp photo_for_mom* mom/

... but that's way too much typing. A somewhat shorter way would be:

cp letter_to_mom* photo_for_mom* mom/

... that works too, because cp(1) can copy any number of files at once, so long as the last thing on the commandline tells cp into what directory you'd like the files put. But there's an even shorter way:

cp *_mom-* mom/

... this tells cp to copy all the files that have "_mom-" in them anywhere into the mom directory. The _ (underscore) and - (hyphen) are there because they were used in the original filenames before and after the name -- that way your correspondence with Cardamom and Electro-Mom won't get mixed in.
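With files named like letter_to_mom-NNN.txt, the characters on either side of "mom" are an underscore and a hyphen, so those are what anchor the pattern. The sketch below assumes bash; cardamom-1.txt is a hypothetical file standing in for the unrelated correspondence.

```shell
cd "$(mktemp -d)" || exit 1
mkdir mom
touch letter_to_mom-1.txt photo_for_mom-1.jpeg letter_to_dad-1.txt cardamom-1.txt

loose=$(echo *mom*)       # also catches cardamom-1.txt and the mom directory
anchored=$(echo *_mom-*)  # only names containing "_mom-"
echo "$loose"
echo "$anchored"   # → letter_to_mom-1.txt photo_for_mom-1.jpeg

cp *_mom-* mom/
copied=$(cd mom && echo *)
echo "$copied"     # → letter_to_mom-1.txt photo_for_mom-1.jpeg
```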

Now, let's suppose that in your long-running correspondence with Mom, sometimes you'd saved your files beginning with "letter_to_mom" and other times with "letter_to_Mom" -- the difference being that capital M. Now, in UNIX, filenames are case-sensitive; that means that 'a' and 'A' aren't the same. If you try to specify the files on the commandline as 'letter_to_mom*', you'll miss the ones that have the capital M. In such a case, the ? and [] meta-characters are useful:

mv letter_to_?om* mom/

... the ? means that any letter can be there -- either m or M included -- while the *, as before, means "anything or nothing." Thus, the capitalization problem is quickly avoided. An even neater way to do the same thing would be this:

mv letter_to_[mM]om* mom/

... in this case, [mM] means "either 'm' or 'M'." It's called a "character class," or more simply a "character list" (actually, you can call it whatever you like). When you use [], you can put as many letters in it as you'd like -- even spaces or most punctuation -- and it will match any one of the letters you've listed. This is useful because it gives you more exact control -- in the ? example earlier, ? matched M or m in Mom, but it also would have matched "dom" and "tom" -- whereas [Mm] only matches "Mom" and "mom", and that's it.

You can also use "ranges" with [] -- that's where you say "any character between these two characters" -- the easy example is [0-9], which means "any digit." You might also see [a-z], which means "any lowercase letter." Any number of ranges can be included in a [], even alongside other characters you've put in there -- [a-zA-Z] is another common one, meaning "any upper- or lower-case letter"; [a-z13579] means "any lowercase letter, or the digit 1, or 3, or 5, or 7, or 9."

You'll probably find the [] most useful when you want to extract a fairly strictly limited set of files out of a long list. Returning to our correspondence example, you might want to get letters to Mom #3 through #6 -- so, you'd use this:

mv letter_to_mom-[3-6].txt mom/

... in this example, the [3-6] means "3, 4, 5 or 6" -- thus letter_to_mom-4.txt will be moved to the mom directory, but letter_to_mom-2.txt will not.
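Here is a small sketch comparing ?, a character list and a range on the same set of files. It assumes bash; the particular file numbers are picked only for the demonstration.

```shell
cd "$(mktemp -d)" || exit 1
touch letter_to_Mom-1.txt letter_to_mom-2.txt letter_to_mom-4.txt letter_to_dom-3.txt

# ? accepts any single character, so the letter to "dom" is swept up too.
set -- letter_to_?om*
question_count=$#
echo "$question_count"   # → 4

# [mM] accepts only the two listed characters.
either_case=$(echo letter_to_[mM]om*)
echo "$either_case"      # → letter_to_Mom-1.txt letter_to_mom-2.txt letter_to_mom-4.txt

# [3-6] is a range: one digit from 3 through 6.
range=$(echo letter_to_mom-[3-6].txt)
echo "$range"            # → letter_to_mom-4.txt
```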

Note that we've used - to indicate a range -- while the '-' character generally acts like any other, inside a [] it's a meta-character. If you'd like to use a literal '-' in a character class, you can "escape" it with a backslash (\) character (many places on the commandline and elsewhere, \ means "take the next character literally"). So if you had two files, "myfile-1" and "myfile_2", you could match them both with myfile[_\-]? -- the _ is a normal character in the character class, the \ indicates that the - should be treated as one also (that is to say, it isn't being used to indicate a range, just a normal character), and the trailing ? matches the final digit. Putting the - first or last inside the brackets, as in [_-], also makes it literal.
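Both ways of getting a literal hyphen into a character class can be tried out as follows. This is a sketch that assumes bash; myfileX3 is a made-up name included to show what is left out.

```shell
cd "$(mktemp -d)" || exit 1
touch myfile-1 myfile_2 myfileX3

# Backslash-escaped hyphen inside the brackets.
escaped=$(echo myfile[_\-]?)
# Hyphen placed last, where it cannot be read as a range.
trailing=$(echo myfile[_-]?)

echo "$escaped"    # → myfile-1 myfile_2
echo "$trailing"   # → myfile-1 myfile_2
```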

One other trick about character classes -- they can be "negated" if the first character is a caret (^). The caret changes the meaning of the class from "any one of these characters" to "any one character other than these." So, while [0-9] means "any digit," [^0-9] means "any single character that isn't a digit." (Bourne-style shells such as bash also accept ! in place of the caret, as in [!0-9].)
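A quick way to see negation at work, assuming bash; the note-* file names are invented for the example. The [!0-9] spelling is the one the POSIX standard specifies for shells, and bash treats it the same as the caret form.

```shell
cd "$(mktemp -d)" || exit 1
touch note-1 note-2 note-a note-b

digits=$(echo note-[0-9])        # one digit after the hyphen
non_digits=$(echo note-[^0-9])   # one character that is not a digit
bang_form=$(echo note-[!0-9])    # same meaning, spelled with !

echo "$digits"       # → note-1 note-2
echo "$non_digits"   # → note-a note-b
echo "$bang_form"    # → note-a note-b
```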

The problem with character classes is that they only refer to a single letter, and it's often a pain to type in more of them, especially if they're long and complicated. Often you just want to refer to one of a few different words, and character classes are unwieldy. That's where the {}, or "alternative list" comes in. {} contains a list of words, separated by commas, that should appear on any matches. Once again, let's say you have your letters to Mom and Dad as letter_to_mom-NNN.txt and letter_to_dad-NNN.txt. Also suppose you have a friend named Dominique, and her letters are named letter_to_dom-NNN.txt. Now, if you were to use character classes as above, you might try:

cp letter_to_[md][oa][md]* parents/ # note: this is wrong

... this is somewhat hard to read -- it means "any file whose name begins with 'letter_to_', and then has either an 'm' or a 'd', then either an 'o' or an 'a', then either an 'm' or a 'd', then any number of characters." It's complicated, and beyond that, it doesn't work, because while it will indeed copy all of the letter_to_mom* and letter_to_dad* files, the character classes allow letter_to_dom* to match also (d, o and m from each class, just as d, a and d and m, o and m worked). This is an excellent place to use {} -- just specify {mom,dad} instead of the messy character classes, and you have:

cp letter_to_{mom,dad}* parents/

... which is much more readable, and also has a simpler meaning -- "any file beginning with 'letter_to_', then having either 'mom' or 'dad', then any number of characters." You're also allowed to use the globbing characters inside the alternative lists -- for example, suppose you did want to get your letters to Dominique also:

cp letter_to_{[md]om,dad}* correspondence/

... This is the same as the previous example, except that instead of matching 'mom', the shell will match either an 'm' or a 'd' followed by 'om'. Likewise, you're allowed to use * in alternative lists. Suppose that once in a while you'd saved a letter to Dominique as letter_to_dominique-NNN.txt instead of letter_to_dom-NNN.txt. (Most people when they create a lot of files wind up trying to use some sort of consistent scheme for naming them -- and most of those people find themselves breaking their own scheme sometimes; shells help by making such inconsistencies easier to cope with). If you wanted to collect your first few letters to Mom, Dad and Dom, you could use:

cp letter_to_{mom,dom*,dad}-[1-3].txt correspondence/

... The 'dom*' in the alternative list means "dom followed by zero or more characters." You could also have written {mom,dom,dominique,dad}, but this was terser.
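The sketch below puts the character-class attempt and the alternative lists side by side. It assumes bash (plain POSIX sh does not expand {}), and the file numbers are chosen only for illustration. The shell rewrites each {} into separate words first and then globs each word, so the results come out in the order of the list rather than in sorted order.

```shell
cd "$(mktemp -d)" || exit 1
touch letter_to_mom-1.txt letter_to_dad-1.txt letter_to_dom-1.txt letter_to_dominique-2.txt

# Character classes: the letters to Dominique slip through.
set -- letter_to_[md][oa][md]*
too_loose_count=$#
echo "$too_loose_count"   # → 4

# {mom,dad}* becomes letter_to_mom* letter_to_dad* before globbing.
parents=$(echo letter_to_{mom,dad}*)
echo "$parents"           # → letter_to_mom-1.txt letter_to_dad-1.txt

# Wildcards are allowed inside the alternatives.
first_few=$(echo letter_to_{mom,dom*,dad}-[1-3].txt)
echo "$first_few"
```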

One final quickie for holding out this long -- ~. Almost anytime you move around a UNIX system, you'll be moving into or out of your home directory (which is usually something like /home/yourname/). When used at the beginning of a pathname, ~ means "my home directory." Suppose you had a directory "myfiles" in your home directory, and wanted to move some files there from /tmp. If you had already cd'ed to /tmp, you could then move the files with a command such as:

mv letter_to_* ~/myfiles/

... in this example, ~ is replaced by the shell with the full path to your home directory (e.g. /home/yourname/). Under the bash and tcsh shells, the ~ character can be followed by a username, in which case it will refer to the home directory of that user, rather than your own home directory. For example, if you were copying some files from a floppy disk as root to your normal user home directory, you might use:

cp /mnt/floppy/* ~bob

... when the shell gets this commandline, it replaces ~bob with the path to bob's home directory (e.g. /home/bob/).
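To try ~ without touching your real home directory, one option is to point HOME at a scratch location first, since the shell takes the value of ~ from HOME when it is set. This is only a sketch, assuming bash; the home/yourname and tmp directories are hypothetical stand-ins created under mktemp(1).

```shell
# Build a throwaway "home" and "tmp" so nothing real is moved.
scratch=$(mktemp -d)
HOME="$scratch/home/yourname"
mkdir -p "$HOME/myfiles" "$scratch/tmp"
cd "$scratch/tmp" || exit 1
touch letter_to_mom-1.txt letter_to_dad-1.txt

# The shell replaces ~ before the command runs; echo shows the result.
expanded=$(echo ~/myfiles/)
echo "$expanded"

mv letter_to_* ~/myfiles/
moved=$(cd ~/myfiles && echo *)
echo "$moved"   # → letter_to_dad-1.txt letter_to_mom-1.txt

# ~username is looked up in the system's user database instead of HOME
# (for root this is commonly /root on Linux, but that varies by system).
echo ~root
```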

Conclusion

This is pretty much all there is to using shell wildcards -- taken together, wildcards set a good balance between simplicity and power in identifying files precisely and quickly according to fairly straightforward rules.

These guidelines are summarized in the "Pathname Expansion" section of the bash(1) manpage, and the "Filename substitution" section of the tcsh(1) manpage.

Wildcards are a simplified form of what are called "regular expressions" -- often abbreviated "regexps," these are extremely powerful devices for text matching (more powerful than are usually required for moving files around). Regular expressions are very important in some areas of Linux, UNIX and the Internet, especially if you find yourself learning UNIX or CGI programming. For more on regexps, see http://www.tuxedo.org/~esr/jargon/html/entry/regexp.html, the sed(1) and Perl documentation, or O'Reilly's book, Mastering Regular Expressions.
