regular expressions


Glaurung
11-08-2000, 04:20 PM
I've just written my first program (TM) in perl and I love it, there's just one thing I don't quite get now...

If I have a URL, let's say http://www.linuxnewbie.org/cgi-bin/ubbcgi, how can I extract the last part of it ("ubbcgi")? It seems much easier when the delimiter is a space...

btw: why doesn't m/[a-zA-Z]+.".html"/i work?

YaRness
11-08-2000, 04:35 PM
#!c:\perl\bin\perl.exe -w

$url="http://www.something.com/yada/ubbcgi";

print "$url\n";

# capture everything after the last "/" into $1
if ($url =~ /.*\/([^\/]+)$/)
{
    print "$1\n";
}



this returned "ubbcgi". should be able to adapt that as needed, as it is it returns anything after the last "/" in any string.
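The btw question never gets a direct answer in the thread. A likely explanation, sketched here: inside m//, the "." is not the string concatenation operator and the double quotes are not string delimiters. The "." matches any single character and each " has to match a literal quote in the input. The test strings are made up for illustration, and "letters followed by .html" is only a guess at the intent.

```perl
use strict;
use warnings;

# Hypothetical test strings, not from the thread.
my @names = ('index.html', 'index.".html"', 'README');

# Original pattern: letters, any char, a literal ", any char, "html", a literal "
my @orig  = grep { m/[a-zA-Z]+.".html"/i } @names;

# Probable intent: letters followed by a literal ".html"
my @fixed = grep { m/[a-zA-Z]+\.html/i } @names;

print "original matches: @orig\n";
print "fixed matches:    @fixed\n";
```

So the original only matches input that really contains the quote characters; escaping the dot and dropping the quotes gives the ordinary file name match.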

YaRness
11-08-2000, 04:41 PM
that pattern above, parsed out (with m{} as the delimiters here, because a "/" inside a comment would end a // pattern early):


m{
.*        #matches anything
/         #matches a "/"
([^/]+)$  #matches any string that does not contain a "/" and runs to the end of line
}x        # /x ignores whitespace, so i can format it like this

# the () grouping lets me backreference what's in it as $1


you could of course use something other than // to do pattern matching and make it easier to work with "/" (just not ??, since m?? is a special match-once form):


#replaced m// with m||
m|.*/([^/]+)$|
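For anyone who wants to paste it into a script, here is a minimal sketch of the same pattern with /x comments and non-slash delimiters. The sub name last_segment is made up for this example and is not from the thread.

```perl
use strict;
use warnings;

# last_segment is a hypothetical helper name, not from the thread.
sub last_segment {
    my ($url) = @_;
    return $url =~ m{
        .*          # anything, greedily, up to the last slash
        /           # that last "/"
        ([^/]+)     # one or more non-slash characters, captured
        $           # through to the end of the string
    }x ? $1 : undef;
}

print last_segment('http://www.something.com/yada/ubbcgi'), "\n";
```

Because of the +, a url that ends in "/" has no last part to capture and the sub returns undef.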

TheLinuxDuck
11-08-2000, 04:42 PM
Originally posted by YaRness:

if ($url =~ /.*\/([^\/]+)$/)



Just out of curiosity, YaR, the ([^\/]+) is looking for anything that is not a /, correct? I've not used the negate ^ operator very much. I got confused when going from the beginning of a string to the negated to eating ^'s.

------------------
TheLinuxDuck
Wait... that's a penguin?!?!?
:wq

Glaurung
11-08-2000, 04:57 PM
Cool! Thanks, and thanks even more for writing that out! I got the hang of perl pretty fast, but the regular expressions are still very difficult for me...

YaRness
11-08-2000, 05:02 PM
Just out of curiosity, YaR, the ([^\/]+) is looking for anything that is not a /, correct? I've not used the negate ^ operator very much. I got confused when going from the beginning of a string to the negated to eating ^'s.


yes, it's used there as a negate operator. i wrote that code, then had to look it up in the perl docs to verify it, cuz i thought the same thing. i don't know if there is another operator that would work (! doesn't, i tried). *shrug*

i picked that way of doing it for that particular type of string (url) so it would pick up something like:
http://www.YaRness.com/yada/something1really-horrid_tomatch.html

or whatever. i dunno. i just know that (i think) there's no way you'll ever find a "/" in that last filename or cgi command or whatever, but you might find some other weird stuff.

(*cut and paste* oh yeah, something like this: <edit> i had the url for this page posted, but it was long and ugly, so just look at yer own browser </edit>)
------------------
"Assembly of Japanese bicycle require great peace of mind."
Registered Linux User #188285 http://counter.li.org/
------------------

[This message has been edited by YaRness (edited 08 November 2000).]

TheLinuxDuck
11-08-2000, 05:13 PM
YaR:

I have used this for filename/matching in a previous script. I didn't need to save the originating path, so this works pretty well:

my($path)="http://www.linuxnewbie.org/cgi-bin/ubbcgi";
$path=~ s/.*\/(.*)/$1/;
print "$path\n";


It would be just as easy to save the path as well like:

my($file)="http://www.linuxnewbie.org/cgi-bin/ubbcgi";
$file=~ s/(.*\/)(.*)/$2/;
my($path)=$1;
print "$path\n$file\n";


But, just as perl offers, TMTOWTDI.. :)
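One more way to do it, sketched under the assumption that the original string should stay untouched: in list context a successful match returns its captures, so both pieces can be pulled out without s///. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $url = 'http://www.linuxnewbie.org/cgi-bin/ubbcgi';

# In list context a successful match returns its captures,
# so $url itself is left untouched.
my ($path, $file) = $url =~ m{^(.*/)(.*)$};

print "$path\n$file\n";
```

If the string has no "/" at all, the match fails and both variables end up undefined, so a real script would want to check for that.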

kmj
11-08-2000, 05:52 PM
You should know that the ^ only negates when it's the first character inside [ ]; outside the brackets it means beginning of line. You probably already knew that.
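A quick sketch of the different roles of ^, using a made-up test string:

```perl
use strict;
use warnings;

my $str = 'abc/def';

# Outside brackets, ^ anchors the match to the start of the string.
my ($head) = $str =~ /^([a-z]+)/;     # letters at the very beginning

# As the first character inside [ ], ^ negates the character class.
my ($tail) = $str =~ /([^\/]+)$/;     # non-slash characters at the end

# Anywhere else inside [ ], ^ is just a literal caret.
my $has_caret = ('a^b' =~ /[a^]b/) ? 'yes' : 'no';

print "$head $tail $has_caret\n";
```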

jemfinch
11-08-2000, 11:47 PM
The best (read: most efficient and easy to understand, IMO) way is

m|.*/(.*)|


That puts everything after the last slash into $1, and leaves the entire string intact.

Jeremy

klamath
11-09-2000, 12:22 AM
Jemfinch, would that not be more efficient written as:


m|/(.*?)$|


I believe that would do it, but the greediness of the '*' operator can be tricky to work out.

------------------
- Klamath
Get my GnuPG Key Here (http://klamath.dyndns.org/mykey.asc)
Looking for an open source project to contribute to? Check out the BBB (http://bbb.sourceforge.net)

jemfinch
11-09-2000, 06:29 AM
Originally posted by klamath:
Jemfinch, would that not be more efficient written as:


m|/(.*?)$|


I believe that would do it, but the greediness of the '*' operator can be tricky to work out.



That actually doesn't do it; the question mark for "less greediness" only affects where a match ends, not where it starts. The regexp still matches at the first "/" it finds going left to right (the one right after "http:"), and since the pattern ends in $, the ".*?" has to stretch all the way to the end of the line anyway.

Tweed:~/src/python/my/Bookmarks $ perl -ne 'print "$1\n" if m|/(.*?)$|'
http://localhost/~jfincher/cgi-bin/bookmarks
/localhost/~jfincher/cgi-bin/bookmarks
http://arstechnica.infopop.net/OpenTopic/page?q=Y&a=frm&s=50009562&f=96509133
/arstechnica.infopop.net/OpenTopic/page?q=Y&a=frm&s=50009562&f=96509133
http://www.ecst.csuchico.edu/~beej/guide/net/
/www.ecst.csuchico.edu/~beej/guide/net/


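Also, less greedy constructs are often not the most efficient method, because of how they step forward: the regexp takes as little as it can, tries the rest of the pattern, and on failure takes one more character and tries again. (A greedy quantifier does the opposite: it sucks up all the stuff it can and then backtracks.) When the match has to run to the end of the string, as it does here, that stepping is the slower route.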
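The same experiment as a small script, side by side with the greedy pattern. The URL is the first one from the session above; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $url = 'http://localhost/~jfincher/cgi-bin/bookmarks';

# Lazy version: the match still begins at the first "/" in the string,
# and $ forces .*? to stretch to the end.
my ($lazy)   = $url =~ m|/(.*?)$|;

# Greedy version: the leading .* eats up to the last "/" first.
my ($greedy) = $url =~ m|.*/(.*)|;

print "lazy:   $lazy\n";
print "greedy: $greedy\n";
```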

Jeremy

YaRness
11-09-2000, 11:07 AM
Originally posted by jemfinch:
The best (read: most efficient and easy to understand, IMO) way is

m|.*/(.*)|


That puts everything after the last slash into $1, and leaves the entire string intact.

Jeremy

this is one of those things that bugs me about perl. the above code works. it doesn't LOOK explicit enough to work though. to me, that seems like it could match any string that comes after a slash anywhere in the pattern space. i guess it reads the string/line/pattern space and starts working back until it finds the first match? i think i would at least do


m|.*/(.*)$|


just to keep my sanity. or do a m//x and stick some comments in maybe so i recognize it later.

jemfinch
11-09-2000, 01:46 PM
Originally posted by YaRness:
this is one of those things that bugs me about perl. the above code works. it doesn't LOOK explicit enough to work though. to me, that seems like it could match any string that comes after a slash anywhere in the pattern space.


It definitely looks explicit enough to work, it just doesn't to you :)

All quantifiers (?, *, +) in the regular expression mini-language are greedy by default. They take everything they can while still leaving enough for the rest of the pattern to match. Thus, the first ".*" takes everything up to the last "/", and leaves the second ".*" everything after it (which may be nothing).
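To see the split for yourself, here is a small sketch that captures both halves. The extra parentheses around the first ".*" and the trailing-slash URL are additions for illustration, not part of the pattern discussed above.

```perl
use strict;
use warnings;

my $url = 'http://www.linuxnewbie.org/cgi-bin/ubbcgi';

# Capture both halves to see what each .* ends up with.
my ($first, $second) = $url =~ m|(.*)/(.*)|;
print "first:  $first\n";     # everything before the last "/"
print "second: $second\n";    # everything after it

# With a trailing slash the second .* is left with the empty string.
my (undef, $empty) = 'http://www.linuxnewbie.org/' =~ m|(.*)/(.*)|;
print "empty:  '$empty'\n";
```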

Feel free to add the trailing $; it won't decrease the efficiency. It will, however, add Yet Another Useless Character to the regexp, thus making it harder for other people to read. (not that that's ever much of a worry with perl code :) )

Jeremy

P.S. My explanation of why ".*?" is less efficient wasn't really good...I'm looking into a better one.