Last minute geek

last minute tech news from around the net

Thursday, Oct 18th

CodeSOD: Look Ahead. Look Out!

I'm an old person. It's the sort of thing that happens when you aren't looking. All the kids these days are writing Slack and Discord bots in JavaScript, and I remember writing my first chatbots in Perl and hooking them into IRC. Fortunately, all the WTFs in my Perl chatbots have been lost to time.

"P" has a peer who wants to scrape all the image URLs out of a Discord chat channel. Those URLs will be fetched, then passed through an image processing pipeline to organize and catalog frequently used images, regardless of their origin.

Our intrepid scraper, however, doesn't want to run the risk of trying to request a URL that might be invalid. So they need a way to accurately validate every URL.

Now, the trick to URLs, and URIs in general, is that they have a grammar that seems simple but is deceptively complex and doesn't lend itself to precise validation via regular expressions. If you were a sane person, you'd generally just ballpark it into the neighborhood and handle exceptions, or maybe copy/paste from StackOverflow and call it a day.
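As a rough illustration of "ballparking it and handling exceptions", here is a minimal sketch that leans on the runtime's built-in `URL` parser instead of a hand-written pattern. The helper name `parseHttpUrl` and the sample inputs are hypothetical and not taken from the submission; the sketch assumes a modern JavaScript runtime (Node.js or a browser) where `URL` is a global.

```javascript
// Hypothetical helper: returns a parsed URL object for http(s) links,
// or null when the input cannot be parsed or uses another scheme.
function parseHttpUrl(candidate) {
  try {
    const url = new URL(candidate); // throws a TypeError on unparseable input
    return url.protocol === "http:" || url.protocol === "https:" ? url : null;
  } catch {
    return null;
  }
}

// Hypothetical sample input, standing in for text scraped from a chat channel.
const candidates = [
  "https://example.com/cat.png",
  "Http://EXAMPLE.com/dog.jpg",
  "not a url",
  "ftp://example.com/file.gif",
];

// Keep only the candidates that parsed, in normalized form.
const usable = candidates
  .map(parseHttpUrl)
  .filter((url) => url !== null)
  .map((url) => url.href);

console.log(usable);
```

A URL that parses can still fail when it is requested, so the fetch step needs its own error handling regardless of how strict the validation is.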

This developer spent 7 hours developing their own regular expression to validate a URL. They tested it with every URL they could think of, and it passed with 100% accuracy, which sounds like the kind of robust testing we'd expect from the person who wrote this:

const regex = /((?:(http|https|Http|Https|rtsp|Rtsp):\/\/(?:(?:[a-zA-Z0-9$\-_.+!*'(),;?&=]|(?:%[a-fA-F0-9]{2})){1,64}(?::(?:[a-zA-Z0-9$\-_.+!*'(),;?&=]|(?:%[a-fA-F0-9]{2})){1,25})?@)?)?((?:(?:[a-zA-Z0-9][a-zA-Z0-9-]{0,64}\.)+(?:(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(?:biz|b[abdefghijmnorstvwyz])|(?:cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(?:edu|e[cegrstu])|f[ijkmor]|(?:gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(?:info|int|i[delmnoqrst])|(?:jobs|j[emop])|k[eghimnrwyz]|l[abcikrstuvy]|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])|(?:name|net|n[acefgilopruz])|(?:org|om)|(?:pro|p[aefghklmnrstwy])|qa|r[eouw]|s[abcdeghijklmnortuvyz]|(?:tel|travel|t[cdfghjklmnoprtvwz])|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))|(?:(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])))(?::\d{1,5})?)(\/(?:(?:[a-zA-Z0-9;/?:@&=#~\-.+!*'(),_])|(?:%[a-fA-F0-9]{2}))*)?(?:\b|$)/gi

Note each use of (?:). These are non-capturing groups, which are easy to mistake for "look ahead" matches (those are written (?=...) and (?!...), and succeed or fail depending on what comes after them). One or two non-capturing groups are harmless, but this regex nests 32 of them, taking "unreadability" to a new height, and flirting with the lesser demons which serve Zalgo. Bonus points for making the comparison case insensitive, but also checking for both http and Http, just in case.
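For readers who want to see the two constructs side by side, here is a small sketch of how JavaScript treats `(?:...)` (a non-capturing group) and `(?=...)` (a lookahead). The patterns and the input string are made up for illustration and are much simpler than the regex above.

```javascript
// (?:...) groups alternatives without creating a capture; it still consumes input.
const nonCapturing = /(?:http|https):\/\/(\w+)/;

// (?=...) is a lookahead: it asserts what follows without consuming it.
const lookahead = /\w+(?=:\/\/)/;

const input = "https://example";

const m1 = input.match(nonCapturing);
console.log(m1[0]); // the whole match, scheme included
console.log(m1[1]); // the first capture is the host, because the scheme group does not capture

const m2 = input.match(lookahead);
console.log(m2[0]); // only the scheme; "://" was checked but is not part of the match

// With the i flag, listing both "http" and "Http" in a pattern is redundant.
const caseInsensitive = /^http$/i.test("Http");
console.log(caseInsensitive);
```

The non-capturing form exists mostly to keep capture numbering tidy, so a long run of them makes a pattern harder to read without changing what it matches.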

"There's no module on NPM that can do all of this!" the developer proudly proclaimed. They then presumably uploaded it to NPM as a "microframework", used it within a few of their own modules, and then those modules got used by some other modules, and now 75% of the web depends on this regex.
