Good evening,
The forum is always my last stop. I've been trying to sort an alphabetical
list and put the sorted list back into the same file with a one-line command,
such as: $ sort file1 > file1. This gives me a blank file as output.
Why does the command (sort file1 > file1) do this? I read that the sort command empties the first file, then sorts. That doesn't make sense.
Of course I can write: sort file1 > file2 with no problems, or I can sort
the list using a GUI (gedit) with the sort plug-in, and the name of the file will stay the same.
However, it seems that it can't be done in the terminal. I need some explanation, please.
Thanks Rich
paj12
01-16-2009, 12:33 AM
The problem is that the > operator blanks out the file before the sort command has a chance to read its contents. You can fix this by reading the file with the cat command first, then piping the contents to the sort command. Like this:
cat file1 | sort > file1
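To see the truncation for yourself, here is a small sketch (the scratch directory and the name file1 are only for illustration). In the single-command form the shell opens and empties the target of > while setting up the redirection, and only then starts sort, so sort finds an empty file:

```shell
# Scratch directory so nothing real gets overwritten (illustrative names).
dir=$(mktemp -d)
printf 'pear\napple\nmango\n' > "$dir/file1"

# $(( ... )) strips any padding that some wc implementations add.
before=$(( $(wc -c < "$dir/file1") ))

# The shell truncates file1 while setting up the redirection,
# and only then runs sort, which reads an already-empty file.
sort "$dir/file1" > "$dir/file1"

after=$(( $(wc -c < "$dir/file1") ))
echo "bytes before: $before, bytes after: $after"
```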
hlrguy
01-16-2009, 12:46 AM
You probably want to avoid complicated sorting programs that don't preserve or back up the input file. Writing file1 back into file1 is fine for all intermediate steps; however, I recommend your script always run cp file1 file1_$date first.
Take it from someone who has overwritten the original file in strange ways more times than I can count while learning. :)
hlrguy
kaykav
01-16-2009, 10:44 AM
thank you..rich
kaykav
01-16-2009, 10:55 AM
paj12,
How does a file get sorted, without getting blanked out, in this instance: $ sort file1 > file2
It looks like file1 would be blanked out before being sorted and moved to file2...
hlrguy
01-16-2009, 02:26 PM
Sort does not modify any file. If you simply run
sort file1
you would see the sorted output sent to stdout (the console you are in), but the original file is unchanged.
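A quick way to check this, as a sketch with made-up file names: capture what sort prints and compare it with the file afterwards. It also shows why redirecting to a different file is safe.

```shell
dir=$(mktemp -d)
printf 'pear\napple\nmango\n' > "$dir/file1"

# sort writes the sorted lines to standard output...
sorted=$(sort "$dir/file1")
echo "$sorted"

# ...while file1 itself still holds the lines in their original order.
original=$(cat "$dir/file1")
echo "$original"

# Redirecting to a *different* file is safe: the shell truncates file2,
# not file1, so sort still has something to read.
sort "$dir/file1" > "$dir/file2"
copy=$(cat "$dir/file2")
```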
hlrguy
kaykav
01-16-2009, 07:04 PM
So you're saying that there is no way to sort an existing file and keep the same file name (with one command)? If I could, I could run ( $ cat file ), and the result would be a sorted file. I know this seems nit-picky, but I wanted to be set straight on this subject.
thank you ....Rich
hlrguy
01-16-2009, 07:32 PM
Well, turns out you can.
sort file1 -o file1
man sort
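For reference, a small sketch of the -o form (file names are illustrative). POSIX lists options before operands, so sort -o file1 file1 is the portable spelling; GNU sort also accepts the option after the file name, as above. It works because sort opens the output file itself, and normally only after it has read all of its input.

```shell
dir=$(mktemp -d)
printf 'pear\napple\nmango\n' > "$dir/file1"

# Sort the file in place: no shell redirection is involved,
# so nothing truncates file1 before sort has read it.
sort -o "$dir/file1" "$dir/file1"

result=$(cat "$dir/file1")
echo "$result"
```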
hlrguy
lugoteehalt
01-17-2009, 07:47 AM
> The problem is that the > operator blanks out the file before the sort command has a chance to read its contents. You can fix this by reading the file with the cat command first, then piping the contents to the sort command. Like this:
> cat file1 | sort > file1
This is probably gonads but doesn't this command require that the noclobber option be not set? Otherwise it'll say 'Cannot overwrite existing file.' Very good idea to set the noclobber option I trow.
hlrguy
01-17-2009, 03:18 PM
If you look at the command
cat file1 | sort > file1
it is really two commands. You cat file1, and that will complete before the "pipe" occurs; the second command, the sort pushed by ">" into file1, will happen after that, but by then the cat has completed.
hlrguy
kaykav
01-17-2009, 10:01 PM
Ahh, that did it!! The command would be.... $ sort file1 -o file1.
Thank you very much..Rich
bwkaz
01-19-2009, 02:24 AM
> If you look at the command
> cat file1 | sort > file1
> it is really two commands.
Yes, but...
> You cat file1, that will complete before the "pipe" occurs,
No it won't. It's a race condition.
The shell sets up the entire pipeline, then lets all the processes run to completion, then waits for all of them. It sees the pipe character, and creates an anonymous pipe (using the pipe(2) system call). It then forks twice (once for the command before the pipe, and once for the command after). Both children proceed in parallel after this.
The first child will close the read half of the pipe (since it doesn't need that), then set its stdout to the write half of the pipe. Then it exec()s "cat file1", which opens file1, reads all its contents, and dumps the contents to stdout (which is now the write half of the pipe).
At the same time, the second child will close the write half of the pipe (since it doesn't need that), and set its stdin to the read half. The shell also sees the redirect-to-file1, so the second child opens and truncates file1, and sets its stdout to that file handle. Then the second child exec()s sort.
If the second child happens to truncate the file before the first child's cat process opens it and reads all of it, then you will either sort only part of the file, or you'll sort nothing.
> where the second command, the sort pushed ">" into file1 will happen, however the cat has completed.
No, see above. The first and second child execute in parallel; you have no guarantee which process will get to execute first, or which will finish first.
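One way to see that both sides of a pipeline run at the same time, rather than one after the other, is to time two sleeps joined by a pipe. This is only a rough sketch: it relies on date +%s, which is widely available but not required by POSIX, and the two-second figure is arbitrary. If the shell ran the left side to completion first, the pipeline would take about four seconds; run in parallel, it takes about two.

```shell
start=$(date +%s)

# Neither sleep reads or writes the pipe; they are simply two processes
# that the shell forks for the two halves of the pipeline.
sleep 2 | sleep 2

end=$(date +%s)
elapsed=$((end - start))
echo "pipeline took about ${elapsed}s"
```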
hlrguy
01-19-2009, 03:06 AM
bwkaz, I disagree. bash executes cat in its entirety, buffering the entire cat results, and then redirects to the pipe. I tried (unsuccessfully) to find the buffer size of cat when piping using bash, to set up an experiment where file1 is not touched until the cat command completes (or, in the case of too big, the command halts, never getting to the pipe to sort). I didn't feel like making a 10 Mbyte text file though, lol.
Maybe this is only true of Solaris and there are parallels in Linux, but in long-term experience with pipes, when the part before the pipe fails, you never get even the partial results you would expect if the part after the pipe were acting in parallel on the data being fed into the pipe.
OK, that is clear as mud. Assume I had a 10 Mbyte file and cat chokes at 8 Mbytes. You would expect 8 Mbytes of sorted data in file1 above, but you won't; you will only have the std error from the failed cat pushed into file1.
Using code, you are thinking this might apply? (I named the "buffer" /tmp/cow)
> I didn't feel like making a 10 Mbyte text file though, lol.
Then I will.
furrycat@zombiehunter:/tmp$ ls -l guineapig
-rw-r--r-- 1 furrycat users 11931793 2009-01-19 09:57 guineapig
furrycat@zombiehunter:/tmp$ wc -l < guineapig
45578
furrycat@zombiehunter:/tmp$ cat guineapig | sort > guineapig
furrycat@zombiehunter:/tmp$ ls -l guineapig
-rw-r--r-- 1 furrycat users 0 2009-01-19 09:58 guineapig
Exactly as bwkaz described.
paj12
01-19-2009, 01:47 PM
> Exactly as bwkaz described.
Right. The command I posted only works on trivially small files, like a couple of lines. I didn't test it on anything bigger than that. Sorry. :rolleyes:
cybertron
01-19-2009, 03:03 PM
I've actually run into problems piping between applications where one outputs a large amount of data and blocks before it finishes (I assume because whatever buffer the pipe is using has been filled) and the other is waiting for the first to finish before it starts reading, so I end up deadlocked. Note that this was in Python, not Bash, but they would use the same underlying APIs, so I think the same could happen if Bash tried to allow the command before the pipe to complete before sending the data.
bwkaz
01-20-2009, 12:53 AM
> bash executes cat in its entirety, buffering the entire cat results and then redirects to the pipe.
Nope. See the bash sources (since the manpage doesn't specify). From bash-3.1.17, in execute_cmd.c, see execute_disk_command. Note that this function is called once for each section of the pipeline, in the parent shell process: it's called only from execute_simple_command, which is called only from execute_command_internal; that function is called for each command in the pipeline, from execute_pipeline (and note also that execute_pipeline is what creates the pipe FD-pair for each | character in the command).
Anyway, in here, comments along the left margin are mine; others were in the original source.
execute_disk_command:
  /* If we can get away without forking and there are no pipes to deal with,
     don't bother to fork, just directly exec the command. */
  if (nofork && pipe_in == NO_PIPE && pipe_out == NO_PIPE)
    pid = 0;
  else
    pid = make_child (savestring (command_line), async);
/* note that make_child returns the result of fork(), i.e. zero in the child */
  if (pid == 0)
    {
/* omitted */
/* sets pipe FDs to stdin/stdout, as needed */
      do_piping (pipe_in, pipe_out);
/* omitted */
      if (redirects && (do_redirections (redirects, RX_ACTIVE) != 0))
        {
/* error handling omitted */
/* do_redirections is what opens the file, truncating it.
   trace through do_redirection_internal into redir_open.
   all of those are in redir.c.
   note that we're still in the child process, unsynchronized. ;) */
        }
/* more stuff omitted */
      args = strvec_from_word_list (words, 0, 0, (int *)NULL);
      exit (shell_execve (command, args, export_env));
    }
/* remainder of function (cleanup in parent) omitted */
As above, the shell creates the actual pipe FD-pair(s) before calling this function (in execute_pipeline). But execute_disk_command calls into other functions that fork, and only in the child do the redirections happen (opening the file, which truncates it), if needed.
That means that the final "> file1" happens in the last child process in the pipeline. There is no guarantee here that the first cat has finished by then: it may have, if it happens to schedule as soon as it gets forked, never gets put to sleep while exec()ing, and runs to completion, all within a single timeslice. Or it may not have, if the parent shell kept running until it forked the last child, then that child happened to schedule, and ran until it got to the redir_open function. Which of the above happens is random (and more so on a multi-processor machine).
> Maybe this is only true of Solaris and there are parallels in Linux, but long term experience with pipes, when the part before the pipe fails, you never get even partial results you would expect from parallel after the pipe acting on the data from the first part being fed into the pipe.
> OK, that is clear as mud. Assume I had a 10mbyte file and cat chokes at 8 mbytes. You would expect 8 mbytes of sorted data in file1 above, but you won't, you will only have the std error from the failed cat pushed into file1.
Um, you won't ever get stderr, since you're only redirecting stdout. But you might get 8M of the file (pipes only have a finite buffer, and that buffer is in the kernel: about 4K historically, 64K on Linux since kernel 2.6.11; if you try to write more than the pipe can hold, you get put to sleep until the other end reads from that pipe). Or you might not get exactly 8M of the file. (Now another race condition is happening: writing the file versus delivering the signal that the shell uses to shut down the pipeline. Which, I should note, will not happen if the first command exits with a failure status; it will only happen in some kind of catastrophic case, like ctrl+c or something. If the first command exits with a failure status, then its write end of the pipe merely gets closed, which the next command will see as EOF on its read end.)
But that's all irrelevant: see the source above for what bash actually does. Don't rely on what you've seen it do in the past: you can't ever test code for complete correctness in the face of multithreading / multiprocessing. Especially if you have more than one CPU running the code.
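A small sketch of the failure case described above, using an invented stand-in (failing_producer) for a command that gives up partway: it writes two lines to stdout, complains on stderr, and returns non-zero. sort simply sees EOF on the pipe and sorts what arrived; the stderr text never enters the pipe or the output file.

```shell
dir=$(mktemp -d)

# Hypothetical stand-in for "cat chokes partway through".
failing_producer() {
    printf 'pear\napple\n'
    echo 'producer: giving up' >&2
    return 1
}

# stderr is sent to its own file here only so it can be inspected afterwards.
failing_producer 2> "$dir/errors" | sort > "$dir/out"

out=$(cat "$dir/out")
errors=$(cat "$dir/errors")
echo "$out"
```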
> Using code, you are thinking this might apply? (I named the "buffer" /tmp/cow)
> cat file1 | /tmp/cow & sort /tmp/cow > file1
> I don't think, conceptually the above is what
> cat file1 | sort > file1 means
I assume you meant this:
cat file1 > /tmp/cow & sort /tmp/cow > file1
since /tmp/cow is not executable. :p
But apart from that: no, what you wrote is not exactly what happens either. The way you did it, sort might hit EOF in /tmp/cow before cat has finished (if it happens to read faster than cat is writing). With a pipe, there is a strict ordering guarantee regarding data flowing through the pipe: data that gets written will be read on the other end, in the same order as it was written. You also don't get EOF on the reader end of a pipe until the writer end is closed.
But those guarantees are as far as it goes. You don't have guarantees related to other redirections (or other pipes, even: if data is introduced into the middle of a pipeline, then it may hit the end of the pipeline before the first part of the data gets to the end; it depends on what the intermediate processes are doing).
And the processes are definitely both happening in parallel. (See the source for make_child in bash: jobs.c. Note the call to fork(). :))
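If what you want is the "finish reading first, then overwrite" behaviour that the /tmp/cow idea was reaching for, the steps have to be sequenced explicitly rather than run in the background or joined by a pipe. A minimal sketch, using mktemp in place of a fixed /tmp/cow (that substitution and the file names are my assumptions):

```shell
dir=$(mktemp -d)
printf 'pear\napple\nmango\n' > "$dir/file1"

# Temporary file in the same directory as the target.
tmp=$(mktemp "$dir/sorted.XXXXXX")

# && runs mv only after sort has finished successfully, so file1 is
# replaced only once the complete sorted copy exists.
sort "$dir/file1" > "$tmp" && mv "$tmp" "$dir/file1"

result=$(cat "$dir/file1")
echo "$result"
```

Note that mv replaces the file, so the result takes on the temporary file's permissions; cat "$tmp" > file1 followed by rm would keep the original file's permissions instead.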
hlrguy
01-20-2009, 04:34 AM
> I assume you meant this:
> cat file1 > /tmp/cow & sort /tmp/cow > file1
> since /tmp/cow is not executable. :p
> But those guarantees are as far as it goes. You don't have guarantees related to other redirections
Yes, I did mean the way you corrected my example, lol. Trying to conceptualize the pipe concept.
http://www.linuxjournal.com/article/2156
When bash examines the command line, it finds the vertical bar character | that separates the two commands. Bash and other shells run both commands, connecting the output of the first to the input of the second.
On screen, it will not appear that anything is happening, but if you run top (a command similar to ps for showing process status), you'll see that both cat programs are running like crazy copying the letter x back and forth in an endless loop.
The > back into the original file name, even though it runs second, overwrites the original too fast. Anyone got an 8086 lying around to try this? :D
Anyway, all moot since
sort file1 -o file1
works. However, is there a size limit?
hlrguy
furrycat
01-20-2009, 07:03 AM
Quoth hlrguy,
> sort file1 -o file1
> works, however, is there a size limit[?]
GNU sort will use temporary files if there isn't enough memory to sort the file in one shot. Any file too big to fit in virtual memory is plenty big so I wouldn't worry about limits...
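If you ever need to steer this, GNU sort has -S (--buffer-size) to cap the in-memory buffer and -T (--temporary-directory) to choose where any temporary files go. A sketch assuming GNU coreutils; the sizes, paths, and line count are arbitrary:

```shell
dir=$(mktemp -d)
mkdir "$dir/spill"

# Build a test file of 20000 zero-padded numbers in descending order.
i=20000
while [ "$i" -gt 0 ]; do
    printf '%05d\n' "$i"
    i=$((i - 1))
done > "$dir/big"

# Ask for a small buffer and point temporary files at $dir/spill.
sort -S 64K -T "$dir/spill" -o "$dir/big" "$dir/big"

first=$(head -n 1 "$dir/big")
last=$(tail -n 1 "$dir/big")
echo "first: $first, last: $last"
```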
lugoteehalt
01-20-2009, 01:05 PM
> This is probably gonads but doesn't this command require that the noclobber option be not set? Otherwise it'll say 'Cannot overwrite existing file.' Very good idea to set the noclobber option I trow.
lugo@fido:~$ cat testfile.txt
b
a
r
w
a
s
u
i
e
k
lugo@fido:~$ cat testfile.txt|sort > testfile.txt
bash: testfile.txt: cannot overwrite existing file
lugo@fido:~$ set|grep noclobber
SHELLOPTS=braceexpand:emacs:hashall:histexpand:interactive-comments:monitor:noclobber
You lot must be a bunch of maniacs if you don't use noclobber.:)
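For anyone who wants to try it, a short sketch of how noclobber interacts with the commands from this thread (the file name and contents are illustrative). The plain > is refused, sort -o still works because it is not a shell redirection, and >| overrides the option for a single redirection.

```shell
dir=$(mktemp -d)
f="$dir/testfile.txt"
printf 'b\na\nr\n' > "$f"

set -o noclobber   # same as: set -C

# The redirection is refused, so sort never starts and the file keeps its contents.
# The subshell and 2>/dev/null just keep the error message out of the way.
( cat "$f" | sort > "$f" ) 2>/dev/null
status=$?
after_refused=$(cat "$f")

# sort -o is not a shell redirection, so noclobber does not apply to it.
sort -o "$f" "$f"
after_sort_o=$(cat "$f")

# >| overrides noclobber for one redirection when you really mean it.
echo 'replaced' >| "$f"
after_override=$(cat "$f")

set +o noclobber
echo "refused with status $status"
```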
hlrguy
01-20-2009, 02:15 PM
> You lot must be a bunch of maniacs if you don't use noclobber
Life on the edge Dude! :D
hlrguy
justlinux.com