justlinux.com
Mon, 13-Feb-2012 12:35:08 GMT


Help File Library: Text Processing Pipelines


Written by Adrian J. Chung

Sure, the command line is evil, but mastering it will unlock powers of a Unix box that remain unrealized under modern graphical user interfaces. This article details the construction of text processing pipelines, using ordinary GNU utilities, to accomplish a few fairly challenging tasks.

Common word usage

Suppose that, for whatever reason, one is interested in the word usage of a piece of text, perhaps an article such as this one, downloaded from the Net. One might want to know which word is used most frequently while ignoring all the non-words, like variable names in source code or other random bits of junk. Perhaps a ranking of word frequency is required. Should one resort to writing a special word counting program in Perl? Here's how to do it using a few GNU text utilities and the assistance of that great resource /usr/dict/words (found at /usr/share/dict/words on many current systems).

First we break up the sentences in the text file so that there is no more than one word per line. The "tr" tool is useful here. This tool translates its input one character at a time. For example, here is the essential USENET tool rot13 using "tr":

% tr a-zA-Z n-za-mN-ZA-M < rot13-encrypted.txt

The two arguments specify the character translation table to use:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM

Characters in the first line are changed to the corresponding character in the second line. Unmatched characters are unaffected. "tr" can be made to reverse this behavior, changing only the unmatched characters:

% tr -c a-zA-Z '\n' < article.txt

Anything that is not a letter is changed to a newline character. This accomplishes the first step of isolating words, one per line.
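
Both uses of "tr" are easy to try on inline text. The following is a minimal sketch rather than part of the original commands; the function names rot13 and split_words are made up for illustration.

```shell
# rot13: rotate each letter 13 places; applying it twice restores the text.
rot13() {
    tr 'a-zA-Z' 'n-za-mN-ZA-M'
}

# split_words: turn every non-letter into a newline, leaving at most one
# word per line (runs of punctuation leave empty lines behind).
split_words() {
    tr -c 'a-zA-Z' '\n'
}

printf 'Hello, World\n' | rot13            # Uryyb, Jbeyq
printf 'Hello, World\n' | rot13 | rot13    # Hello, World
printf 'one, two\n' | split_words          # "one", an empty line, "two"
```

The empty lines left by punctuation are harmless here, since the later stages of the pipeline never match them against a dictionary word.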

Next we need to group the similar words together. Sorting the lines of the file will suffice. The "sort" command can also operate as a filter, and with the "-f" option the sort becomes case insensitive. Our pipeline so far:

% tr -c a-zA-Z '\n' < article.txt | sort -f

You will notice that a body of text, such as the article you are reading, will contain many non-words (e.g. "za", "tr", "txt"). We can get rid of these by checking each word against a list of valid words, /usr/dict/words for example. The "join" command implements what is known as a natural join in database terminology. It reads two text streams, both of which must be pre-sorted on the key field that it uses to match up lines from either stream. Since /usr/dict/words is already sorted, this is ideal. What is useful about natural joins is that rows of data with no matching counterpart in the other text stream are not output. Any words not found in our dictionary are thus removed:

% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words -

The "-" tells "join" to take the second text stream from standard input (i.e. the output of the preceding stage of the pipeline). The "-i" makes the join case insensitive.
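
The filtering effect of "join" can be seen in isolation with a toy dictionary. This is only a sketch under stated assumptions: the four-word dictionary, the temporary file, and the function name only_words are hypothetical stand-ins, and the C locale is forced so that "sort" and "join" agree on ordering.

```shell
# Use the C locale so that sort and join agree on byte-wise ordering.
export LC_ALL=C

# Hypothetical four-word dictionary standing in for /usr/dict/words.
dict=$(mktemp)
trap 'rm -f "$dict"' EXIT
printf '%s\n' cat mat sat the | sort -f > "$dict"

# only_words: keep the input lines whose word also appears in the dictionary.
only_words() {
    sort -f | join -i "$dict" -
}

# "tr", "txt" and "za" have no dictionary entry, so join drops them;
# "the" and "The" both match the single entry "the".
printf '%s\n' the tr cat txt The za | only_words
```

Three lines survive: one for "cat" and two for the occurrences of "the".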

The next step is to count the grouped words, and we can use the "uniq" utility to do the job. "uniq" collapses runs of consecutive repeated lines into a single line. With the "-c" option "uniq" will also output, for each unique line found, a count of how many times it was repeated. Again, "-i" for case insensitivity:

% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words - | uniq -i -c

Finally, this output needs to be sorted by frequency using "sort":

% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words - | uniq -i -c | sort -r

Normally when sorting by numerical value rather than ASCII string, the "-n" option should be given. We can get away without it because "uniq" right-justifies its numerical counts, at least for as long as every count fits within the padded width. The "-r" reverses the order of the sort so that the most frequently used words appear first.
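
Because the plain "sort -r" relies on the padding that "uniq" applies, a more defensive variant asks for a numeric sort explicitly. The sketch below covers only the counting and ranking stages, and it swaps in "sort -rn" for the "sort -r" used above; the function name rank_words is made up for illustration.

```shell
# rank_words: count one-word-per-line input and list the most frequent first.
# sort -rn compares the counts as numbers instead of relying on padding.
rank_words() {
    sort -f | uniq -i -c | sort -rn
}

# Three occurrences of "the" (in mixed case) outrank two of "cat".
printf '%s\n' the cat The cat the | rank_words
```

The exact width of the count column differs between "uniq" implementations, which is one more reason not to depend on it.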

There are still a few shortcomings such as the handling of contractions (e.g. "we'll", "it's", "can't") but one could expect similar difficulties with a specially coded word counting program.

Unordered Natural Joins

On more than one occasion I have needed to fuse two files together so that the combined file carries the information from both. For example, suppose we have the following two files:

alpha.txt

tetex-xdvi 1.0.6-11
ElectricFence 2.1-3
newt-devel 0.50.8-2
rgrep 0.98.7-5
dosfstools 2.2-4
bdflush 1.5-11
bin86 0.4-7
gnuplot 3.7.1-3
dialog 0.6-16
kernel-utils 2.2.14-5.0
termcap 10.2.7-9

beta.txt

ElectricFence 36903
bdflush 8861
bin86 74968
dialog 85955
dosfstools 106819
gnuplot 1345702
kernel-utils 292693
newt-devel 144815
rgrep 15202
termcap 625272
tetex-xdvi 1222425

A natural "join" suggests itself. However, suppose one needs to keep the lines of text in the order in which they appear in alpha.txt. "join" will fail unless both inputs are sorted on the key field, so we need to add an extra indexing field to the first file that will help to restore its original order. The "nl" command proves to be useful:

% nl alpha.txt

1 tetex-xdvi 1.0.6-11
2 ElectricFence 2.1-3
3 newt-devel 0.50.8-2
4 rgrep 0.98.7-5
5 dosfstools 2.2-4
6 bdflush 1.5-11
7 bin86 0.4-7
8 gnuplot 3.7.1-3
9 dialog 0.6-16
10 kernel-utils 2.2.14-5.0
11 termcap 10.2.7-9

Now we can sort by the key field to perform the join, then re-order by our index field to restore the original order:

% nl alpha.txt | sort +1 | join -j2 2 beta.txt -

The "-j2 2" argument tells "join" to use the 2nd field as the key field for the second input stream (beta.txt is the first input stream). The output is a bit messy, but one observes that the index to sort by is the 3rd field. "sort +2" will skip over the first two fields when comparing rows:

% nl alpha.txt | sort +1 | join -j2 2 beta.txt - | sort +2 -n
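
The "+1" and "+2" key forms are obsolete, and recent versions of GNU "sort" often treat them as file names instead. The following sketch is one possible modern spelling of the same stages, assuming the "-k" key syntax for "sort" and "-2" for "join". It works on three sample rows taken from the files above; the temporary directory and the function name join_in_order are hypothetical, and the C locale is forced because beta.txt is in byte order (capitals first).

```shell
export LC_ALL=C   # beta.txt is in byte order, so sort and join must match it

dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '%s\n' 'tetex-xdvi 1.0.6-11' 'ElectricFence 2.1-3' 'bin86 0.4-7' > "$dir/alpha.txt"
printf '%s\n' 'ElectricFence 36903' 'bin86 74968' 'tetex-xdvi 1222425' > "$dir/beta.txt"

# join_in_order SORTED_FILE ORDERED_FILE
# Number the ordered file, sort it on the key (field 2), join it against the
# sorted file, then restore the original order using the index (field 3).
join_in_order() {
    nl "$2" | sort -k 2,2 | join -2 2 "$1" - | sort -k 3,3n
}

join_in_order "$dir/beta.txt" "$dir/alpha.txt"
# tetex-xdvi 1222425 1 1.0.6-11
# ElectricFence 36903 2 2.1-3
# bin86 74968 3 0.4-7
```

Each output row carries the key first, then the remaining field from beta.txt, then the index and version from the numbered alpha.txt, which is why the index ends up in the 3rd field.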

Now get rid of the ordering field using "cut":

% nl alpha.txt | sort +1 | join -j2 2 beta.txt - | sort +2 -n | cut -f 1,2,4 -d " "

The "-f" argument gives a list of fields to include in the output. "cut" normally uses [TAB] as the field separator but this is changed using "-d". We still need to pretty up the formatting. The "pr" tool finds a use here:

% nl alpha.txt | sort +1 | join -j2 2 beta.txt - | sort +2 -n | cut -f 1,2,4 -d " " | pr -e\ 16 -T

tetex-xdvi      1222425         1.0.6-11
ElectricFence   36903           2.1-3
newt-devel      144815          0.50.8-2
rgrep           15202           0.98.7-5
dosfstools      106819          2.2-4
bdflush         8861            1.5-11
bin86           74968           0.4-7
gnuplot         1345702         3.7.1-3
dialog          85955           0.6-16
kernel-utils    292693          2.2.14-5.0
termcap         625272          10.2.7-9

"pr" normally formats output for line printers. Few people use these archaic pieces of hardware anymore but "pr" still has its uses. The "-e" expands TAB characters, replacing them with spaces. Our slightly embellished "-e" argument tells "pr" to use a single space as the TAB character and to space the tab positions 16 character widths apart. "-T" suppresses the headers, footers, and form feeds.

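The last two stages can be tried on their own with a couple of joined rows. This is a small sketch, not part of the original pipeline; the function name format_rows is made up for illustration, and the quoting -e' 16' is just another way of writing the "-e\ 16" argument.

```shell
# format_rows: drop the 3rd (index) field, then have pr expand each
# remaining space to the next tab stop, with stops 16 columns apart.
format_rows() {
    cut -d ' ' -f 1,2,4 | pr -e' 16' -T
}

printf '%s\n' 'tetex-xdvi 1222425 1 1.0.6-11' 'bin86 74968 3 0.4-7' | format_rows
```

With these rows the second column should begin at character 17 and the third at character 33, whatever the length of the package name.
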
Conclusion

The GNU text processing utilities provide a rich set of functions that can be combined, using the command line, in many different ways to accomplish tasks that otherwise would require special programs to be written. Investing a little time to learn to use the command line interface can save one a great deal of trouble in the long run.