Big Script problem


Linux_cat
01-05-2005, 11:07 AM
Right, I hope someone can help or point me in the right direction. I know that Perl would solve the problem, but I don't know it and don't have time to learn it, as this has to be done within a week or so.

I have two text files that merge into one; however, each file holds around 15,000,000 lines of data, and the merged result compresses into a 5,000,000-line file. I have recreated the rules that it matches on and have written the following script:

#!/bin/bash

YMOUT031=${1:?"requires an argument"}
YMOUT015=${2:?"requires an argument"}

# Unzip both inputs; keep only the key columns (plus the tail) of the second one.
zcat "$YMOUT031" > "YMOUT31unzipped.$$"
zcat "$YMOUT015" | cut -c7-11,18,18-24,46,47-52,75- > "YMOUT15unzipped.$$"

while IFS= read -r line_text
do
# Build the three-field key from the current line.
field=$(printf '%s\n' "$line_text" | awk -F'|' '{print $5"|"$4"|"$3}')
# -F matches the key as a fixed string rather than a regular expression.
matchedrows=$(grep -F -- "$field" "YMOUT15unzipped.$$" | tr '\n' ' ')

if [ -z "$matchedrows" ]
then
echo "<$line_text>" >> nomatches
else
echo "The row: $line_text Matches:$matchedrows " >> matches #this is where the error lies
fi

done < "YMOUT31unzipped.$$"


Although this works, it is way too slow: it takes around 4-5 seconds to grep and write to either output, which means it will take me well over half a year to complete!
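
As a rough check on that estimate, here is a small sketch of the arithmetic. It assumes one grep per line of a 15,000,000-line input and the 4-5 seconds per grep quoted above; the function name estimate_days is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: whole days needed for <lines> greps at <secs> seconds each.
estimate_days() {
  local lines=$1 secs=$2
  echo $(( lines * secs / 86400 ))   # 86400 seconds in a day, integer division
}

for secs in 4 5; do
  echo "$secs s per line: $(estimate_days 15000000 "$secs") days"
done
```

With those assumptions the loop comes out at roughly 694 to 868 days, so "well over half a year" is, if anything, optimistic.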

Does anyone know of a way in which I can speed up the script dramatically so that it will only take a couple of days, or a week at most?

Any help will be much appreciated.

evac-q8r
01-05-2005, 01:46 PM
I see you got cut working. Could you explain a little better what your script is trying to do? At least for me.

EVAC

Linux_cat
01-05-2005, 02:51 PM
I'll try,

I am basically trying to test a system which takes two input files, puts them into a DB, and then outputs another text file that has been merged and has had numerous calculations done on it.

I'm testing it by writing this script, which takes the two input files, matches them using 3 specific fields, and merges them. That gives a text file that should have the same number of lines as the one output from the DB, minus all the calculations, which I will add later once I get this working a lot faster.

I use cut to filter out the text in the fields of one of the text files so grep can match on it.

One file, YMOUT31, is fed into the loop one line at a time, and awk is used to extract the 3 fields, which are matched against YMOUT15, which has been cut and formatted. If it finds a match, it writes the output to matches; if it does not find a match for the line, it writes it to nomatches.

I am left with two text files; the one called matches should hopefully have the same number of lines as the merged file I am testing it against. I hope I've done a good job explaining this!
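
The matching described above could also be done in a single awk pass that indexes one file in memory and looks each line of the other up by key, instead of running grep once per line. The following is only a sketch under assumptions: the sample data and the file names sample31.txt, sample15.txt, matches.txt and nomatches.txt are made up, and it assumes the key sits in the first three '|'-separated fields of the cut file, which may not be how the real data is laid out. Whether an index of 15,000,000 lines fits in memory on the PCs in question is also an open question.

```shell
#!/bin/bash
# Sketch only: sample data, file names, and the key layout of the second file are assumptions.

# Tiny stand-ins for the two unzipped inputs.
printf '%s\n' 'a|b|K3|K2|K1|x' 'a|b|Z3|Z2|Z1|y' > sample31.txt
printf '%s\n' 'K1|K2|K3|payload one' 'K1|K2|K3|payload two' 'Q1|Q2|Q3|other' > sample15.txt

# First file on the command line (NR == FNR): index sample15.txt by its assumed key, fields 1-3.
# Second file: look up each sample31.txt line by $5|$4|$3, the key the original awk call builds.
awk -F'|' '
NR == FNR {
    key = $1 "|" $2 "|" $3
    rows[key] = rows[key] $0 " "      # collect every row that shares this key
    next
}
{
    key = $5 "|" $4 "|" $3
    if (key in rows) {
        out = "The row: " $0 " Matches:" rows[key]
        print out > "matches.txt"
    } else {
        out = "<" $0 ">"
        print out > "nomatches.txt"
    }
}' sample15.txt sample31.txt
```

Each input is read once, so the work grows with the total number of lines rather than with the product of the two line counts.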

However, as the files hold so many lines and grep takes around 4 seconds to search the file, this script will take over half a year to complete on the PCs that I am using. I need to rewrite it so it runs a lot, lot faster. Have any ideas?

evac-q8r
01-05-2005, 07:17 PM
Hmmmmm... 15,000,000 grep calls, each scanning up to 15,000,000 lines, comes to roughly 225,000,000,000,000 line comparisons. I know a little about your data from the other post, but unless I know exactly what it looks like and what the underlying idea is, I'm not much use. From what I understand, you cut out data from one file and check to see if it is in the other file. If I knew the characteristics of the data, I might be able to find some shortcuts. I'm not enough of an expert with shell scripts to find inefficiencies in your code either. At this stage, just keep me posted. What I could suggest (if I understand the problem correctly) is that when you do find a match in the file, you use grep to remove that line of data. Then you won't have as many lines to check.
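
Removing a matched line with grep might look something like the sketch below. The sample data and the file names remaining.txt and remaining.tmp are made up, and it assumes a line in the second file is only ever needed by one line of the first, which I can't confirm from the description.

```shell
#!/bin/bash
# Sketch only: sample data and file names are hypothetical.
printf '%s\n' 'K1|K2|K3|one' 'Q1|Q2|Q3|two' 'R1|R2|R3|three' > remaining.txt

field='Q1|Q2|Q3'   # stands in for a key that has just been matched

# -F treats the key as a fixed string; -v keeps only the lines that do NOT contain it.
grep -vF -- "$field" remaining.txt > remaining.tmp
mv remaining.tmp remaining.txt

wc -l < remaining.txt   # lines still left to search
```

Bear in mind that every removal rewrites the whole file, so this only helps if shrinking the file saves more time than the rewrite costs.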

EVAC

Linux_cat
01-06-2005, 04:52 AM
Cheers mate, you're being most helpful. I will look into removing lines....


Do you know if an awk script will get through it faster?