text processing question


bhendry
12-11-2002, 01:33 AM
I'm sure this is simple, but I can't find anything specific in the nhf pages, google, etc. on how to do this...

I have a large number of text files that need some cleaning up, and I'm sure there's a way to do this with sed, but I can't figure it out.

Each line of the file has 3 leading spaces that I'd like to remove. The paragraphs are also broken up with linefeeds, and I'd like to join those lines while leaving the paragraphs separated by blank lines (which they are already). Here's a sample of the formatting, so you'll know what I mean...

Each paragraph is indented
like this. Also, each paragraph
is separated by an empty line.

Blah blah blah, blah blah blah
blah blah blah. Blah blah blah...

bhendry
12-11-2002, 01:36 AM
Well, my example didn't work as expected, as the leading spaces are apparently filtered out by the forum. DOH!

OK - here's the example again, only with periods instead of the spaces in my actual files.

...Each paragraph is indented
...like this. Also, each paragraph
...is separated by an empty line.

...Blah blah blah, blah blah blah
...blah blah blah. Blah blah blah...

with.a.twist
12-11-2002, 07:58 AM
This is kinda simple, but it works:

cut -c 4-100 old_file | grep "^." > new_filename

OK. This command keeps only characters 4 through 100 of each line, then writes only the non-empty lines to the new file...
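
For anyone who wants to see the effect, here is a small sketch of that pipeline run on a made-up sample (the temporary directory and file names are placeholders, not from the original post). It uses the open-ended range "4-" instead of "4-100", so long lines are not cut off at column 100. Note that on this sample the blank lines between paragraphs are dropped and the lines are not joined.

```shell
# Made-up sample input in a throwaway directory (names are placeholders).
tmp=$(mktemp -d)
printf '   Each paragraph is indented\n   like this.\n\n   Blah blah blah\n   blah blah.\n' > "$tmp/old_file"

# Drop the first 3 characters of every line, then keep only non-empty lines.
# "4-" means "from character 4 to the end of the line".
cut -c 4- "$tmp/old_file" | grep "^." > "$tmp/new_filename"

cat "$tmp/new_filename"
```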

Good luck. If my explanation is confusing, tell me and I'll sum it up again.

binaryDigit
12-11-2002, 09:59 AM
Originally posted by bhendry
Well, my example didn't work as expected, as the leading spaces are apparently filtered out by the forum. DOH!


Just so you know, you can use the code tags to keep your indentation.


so that it doesn't change.
the way it's typed.

bhendry
12-11-2002, 01:38 PM
I continued my search, and I think I've found the answer I need...

This web-site has quite a lot of good things on it - the sed command below is from it.
http://www.cornerstonemag.com/sed/

So, here's my command-line

sed 's/^[ \t]*//' textfile.txt | fmt -w 1000 > finished.txt

It's imperfect in that it can't handle a paragraph over 1000 characters, but I don't think that there are any so it should work.
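
Since there are a large number of files to clean up, the same pipeline could be wrapped in a loop. This is only a rough sketch: clean_dir is a made-up helper name, and the directory layout (input *.txt files in one directory, cleaned copies in another) is an assumption. It also swaps [ \t] for [[:space:]], since not every sed treats \t inside brackets as a tab.

```shell
# clean_dir is a hypothetical helper name, not something from the sed page.
# Usage: clean_dir SRC_DIR DEST_DIR
clean_dir() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    for f in "$src"/*.txt; do
        [ -f "$f" ] || continue    # the glob matched nothing
        # Strip leading whitespace, then let fmt join each paragraph.
        sed 's/^[[:space:]]*//' "$f" | fmt -w 1000 > "$dest/$(basename "$f")"
    done
}

# Small demonstration on a throwaway directory with made-up content.
demo=$(mktemp -d)
printf '   alpha beta\n   gamma delta\n\n   second para\n' > "$demo/a.txt"
clean_dir "$demo" "$demo/cleaned"
cat "$demo/cleaned/a.txt"
```

Writing the results to a separate directory leaves the original files untouched, which makes it easy to compare before and after.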

I am, of course, open to improvements if anyone has any suggestions!

Thanks,
Ben

with.a.twist
12-12-2002, 01:10 PM
My suggestion will run faster and takes fewer keystrokes. Although, if you are interested in learning sed, I guess you've done a great job at that.

Nice job,

smokybobo
12-12-2002, 02:43 PM
While this has already been solved in a couple of ways, I'll also pipe in with a third solution, just to give another option :) :

sed 's/^[[:space:]]*//' old_file | tr '\n' ' ' > new_file

I use '[[:space:]]', but you can also use '\t' if you want or even '[ ]'

Note that this has no limit on paragraph size.
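
One thing to watch for: tr '\n' ' ' replaces every newline, including the blank lines between paragraphs, so the whole file ends up on a single line. If the paragraph breaks need to survive, one possible variation is to replace tr with awk in paragraph mode (RS=""). This is a sketch on made-up sample text in a temporary directory, not a tested drop-in for the real files.

```shell
# Made-up sample input in a throwaway directory.
work=$(mktemp -d)
printf '   Each paragraph is indented\n   like this.\n   \n   Blah blah blah\n   blah blah.\n' > "$work/old_file"

# sed strips leading whitespace, so whitespace-only lines become empty.
# awk with RS="" reads one blank-line-separated paragraph per record;
# assigning $1 = $1 rebuilds the record with single spaces between words.
sed 's/^[[:space:]]*//' "$work/old_file" |
    awk 'BEGIN { RS = ""; ORS = "\n\n" } { $1 = $1; print }' > "$work/new_file"

cat "$work/new_file"
```

Two side effects to be aware of: runs of spaces inside a line are collapsed to a single space, and the output ends with one extra blank line.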

bhendry
12-12-2002, 05:11 PM
I appreciate the responses - it's definitely been an eye-opening experience! If there's one way to do something in Unix, there's probably twelve other ways too... :)

Ben