Arek Nawo
04 Dec 2018
10 min read

Hello... RegExp?!

In this post, I’ll try to explain the basics of regular expressions. Keep in mind that this sort-of tutorial is aimed at those who are just starting out or would like to learn regexps a bit better, as well as those who don’t know what regexps are at all. So, let’s get started!


So, what are these regexps?

Regular expressions (or regex/regexp for short) are special text constructs for describing search patterns. With them, you can search long texts for specified values more easily. Most often they’re used for validating data, e.g. IP and email addresses. Generally, they’re extremely useful when dealing with stuff like that. So, what’s the drawback? Well, their syntax can feel a bit messy at first, but trust me - it’s very easy to pick up!

The syntax for regular expressions doesn’t differ much across programming languages (mainly in additional functionality), so the constructs I’ll show should be portable (in most cases) to your language of choice. Anyway, for the purposes of this tutorial, I’ll be using the JavaScript implementation. I’ve divided all the constructs into groups, so they can be learned in an organized, ordered way.


Characters

To match any given character or digit, you just type it. There’s a catch though. In some cases, you may want to match a character that’s also used as a regex construct, aka a reserved character. Then you’ll have to escape it. If you’ve been coding for a while, you’ll know that this just means preceding the character with a backslash ( \ ) and that’s all. In JS the characters you have to escape are: + , * , ? , ^ , $ , \ , . , [ , ] , { , } , ( , ) , | , / (separated by commas). To give you an example:

// In JS your regexps are placed between two slashes

/Here goes your regex\. It is easy like one \+ one/
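To see the difference escaping makes, here’s a small sketch that runs the regex above against a string. The variable names are made up for illustration.

```javascript
// The escaped dot and plus match the literal characters "." and "+".
const escaped = /Here goes your regex\. It is easy like one \+ one/;
const text = "Here goes your regex. It is easy like one + one";
console.log(escaped.test(text)); // true

// Without the backslash, "." is a construct that matches (almost) any character,
// so this regex also accepts "!" where the dot should be.
const unescapedDot = /regex. It/;
console.log(unescapedDot.test("regex! It")); // true

// With the backslash, only a real dot is accepted.
const escapedDot = /regex\. It/;
console.log(escapedDot.test("regex! It")); // false
```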

By escaping certain letters or sequences, you can access regex super-powers! Let’s jump in and take a look at these available in JS:

These are the most often used ones. But there are even more! The first three, which are used almost all the time, have negative counterparts in the form of capitalized letters:
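As a quick sketch, and assuming the three constructs in question are the standard JS classes \w, \d and \s, here’s how they and their capitalized counterparts behave. The sample string is just an illustration.

```javascript
// \d matches a digit, \w a "word" character (letter, digit or underscore),
// \s a whitespace character.
console.log(/\d/.test("7"));  // true
console.log(/\w/.test("_"));  // true
console.log(/\s/.test("\t")); // true

// The capitalized versions match the exact opposite.
console.log(/\D/.test("7")); // false
console.log(/\W/.test("-")); // true
console.log(/\S/.test(" ")); // false

// Removing every non-digit (the g flag is covered in the flags section).
const sample = "Dec 4, 2018";
const digitsOnly = sample.replace(/\D/g, "");
console.log(digitsOnly); // "42018"
```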

I hope you noticed the capital letters. 🙃 In JS there are 4 more escaped characters that (at least for me) aren’t used as often as the others. To give you a glimpse of the why and the how, here they are:

I guess now you know why these aren’t really popular: they just aren’t needed very often. I think that’s enough theory - let’s see an example:

/* Let's create something that will match "December 2018" string...
   and be creative :) */
/\we\Dem\Ser\s\d\S\d8/

Well, maybe that’s not the best regex ever, but at least we’ve used almost all of the learned constructs. 😄
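If you’d like to check it yourself, here’s a minimal sketch (the variable name is mine). It also shows why this isn’t the best regex ever: the character classes are loose enough to let some odd strings through.

```javascript
const december = /\we\Dem\Ser\s\d\S\d8/;

console.log(december.test("December 2018")); // true
console.log(december.test("December 2019")); // false, the last character must be 8

// \w, \D, \S and friends accept far more than the letters we had in mind.
console.log(december.test("ze!em_er\t9x08")); // true
```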

Let’s move on to escaped sequences then. These guys are a bit tougher and more complex. With their help, you can match a wide variety of Unicode characters.

As you can see, using escaped sequences we can match Unicode characters! Consider the example below, where we match the same Unicode character, © (the copyright symbol), in 4 different ways.

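/* Match © in 4 different ways.
   The octal form (\251) is a syntax error when combined with the u flag,
   while \u{...} only works with it, so we need two regexps here.
   Leave the trailing u character alone for now. */

/\251\xA9\u00A9/
/\u{00A9}/u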
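Here’s a hedged sketch that checks each escape form on its own (the names are illustrative). Keep in mind that the octal form is a legacy feature that JS rejects in regexps carrying the u flag, whereas the \u{...} form needs that flag.

```javascript
const copyright = "©";

// Octal, hexadecimal and 4-digit Unicode escapes work without any flag.
const legacyForms = [/\251/, /\xA9/, /\u00A9/];
const allLegacyMatch = legacyForms.every((re) => re.test(copyright));
console.log(allLegacyMatch); // true

// The \u{...} form is read as a code point only when the u flag is set.
const withFlag = /\u{00A9}/u.test(copyright);
const withoutFlag = /\u{00A9}/.test(copyright);
console.log(withFlag);    // true
console.log(withoutFlag); // false
```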

And that’s it! Now you know almost all escaped constructions that you can use in JS regexps. Now let’s go to another category!


Anchors

As the name implies, anchors let us match fixed positions in the text: the beginning and end of the text and the boundaries between words. These are pretty easy.

One more thing to note though. Anchors match positions, not characters. This basically means that anchors won’t add any characters to the result of your regexp execution. Example’s coming!

/* Match ordinary "Regular expressions" string.
   Notice that even with a word boundary matched,
   we still have to match a whitespace.
   Remember, \b matches only a position between them! */

/^Regular\b\sexpressions\b$/
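To see that anchors really are zero-width, here’s a small sketch using the regex above plus a word-boundary search. The variable names and sample strings are made up.

```javascript
const anchored = /^Regular\b\sexpressions\b$/;
console.log(anchored.test("Regular expressions"));      // true
console.log(anchored.test("Regular expressions rock")); // false, $ requires the end of the text

// \b adds nothing to the result: the match contains only the word itself.
const wordMatch = "learn regexp today".match(/\bregexp\b/);
console.log(wordMatch[0]);    // "regexp"
console.log(wordMatch.index); // 6

// There is no word boundary in the middle of a word.
console.log(/\bexp\b/.test("regexp")); // false
```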

Quantifiers

Now, this is where the fun begins! With quantifiers, you can specify how many times a given character or construct should be matched. Quantifiers are really useful and easy to learn.

Quantifiers allow us to create much better and more expressive regular expressions.

/* Let's match "December 2018" a little differently this time...
   Notice the two \w constructs: the first one uses the lazy modifier (?),
   which makes \w+? match only one letter here. */

/\w+?\w+\s\d+/
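Here’s a sketch of what the lazy modifier actually does to the example above. I’ve wrapped the parts in capturing groups (explained in the next section) purely to inspect them. The other quantifiers shown are standard JS syntax, and the sample strings are made up.

```javascript
const input = "December 2018";

// The groups reveal how the two \w parts split the word between them.
const parts = input.match(/(\w+?)(\w+)\s(\d+)/);
console.log(parts[1]); // "D"
console.log(parts[2]); // "ecember"
console.log(parts[3]); // "2018"

// Greedy takes as much as it can, lazy as little as it can.
const greedy = "aaaa".match(/a+/)[0];
const lazy = "aaaa".match(/a+?/)[0];
console.log(greedy); // "aaaa"
console.log(lazy);   // "a"

// Exact count, optional character, and zero-or-more.
console.log(/^\d{4}$/.test("2018"));    // true
console.log(/^colou?r$/.test("color")); // true
console.log(/^ab*$/.test("a"));         // true
```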

Groups & sets

You’ve come a long way in learning regexp syntax. Now it’s time to learn how to organize your regex constructs with groups and sets.

Groups allow you to group (what a surprise) your regexp constructs together. 👍 There are two types of groups: capturing and non-capturing. Non-capturing groups just bundle your constructs so you can, for example, apply a quantifier to all of them at once. Capturing groups additionally let you retrieve the part of the text matched by the group after running the regex. You can also reference them later in the regex by their number. Numbering starts from 1 for the first group, and each group gets its number from the order of its opening parenthesis.

// Let's match "regex regexp" string

/(regex)\s\1p/
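A short sketch of how this looks from the JS side. The (?:...) form is the standard syntax for a non-capturing group, and the sample strings and names are only illustrative.

```javascript
// Capturing group plus a backreference to it (\1).
const repeated = /(regex)\s\1p/;
const result = "regex regexp".match(repeated);
console.log(result[0]); // "regex regexp" (the whole match)
console.log(result[1]); // "regex" (the first group)

// A non-capturing group can take a quantifier but stores nothing.
const nonCapturing = "hahaha!".match(/(?:ha)+/);
console.log(nonCapturing[0]);     // "hahaha"
console.log(nonCapturing.length); // 1, only the whole match

// Numbers follow the order of the opening parentheses.
const nested = "2018-12".match(/((\d{4})-(\d{2}))/);
console.log(nested[1]); // "2018-12"
console.log(nested[2]); // "2018"
console.log(nested[3]); // "12"
```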

Sets, on the other hand, allow you to create sets of characters to match. A negated set matches any character that is not included in it. Inside a set, most of the characters listed earlier don’t need to be escaped. The ones to watch are \ and ] (which would otherwise start an escape or close the set), - (which would otherwise create a range unless it comes first or last) and ^ (which negates the set when it comes first). Inside a set, you can also provide a range of letters or digits by connecting the first and last ones with a dash ( - ).

// Match any three uppercase letters using a range

/[A-Z]{3}/
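Here’s a quick sketch of sets in action, including a negated one. The sample strings and variable names are made up.

```javascript
// A range picks up only the characters between its two ends.
const upper = "abc XYZ".match(/[A-Z]{3}/)[0];
console.log(upper); // "XYZ"

// A negated set matches the first character that is NOT listed.
const nonDigit = "2018!".match(/[^0-9]/)[0];
console.log(nonDigit); // "!"

// Inside a set, characters like . + * are plain literals.
console.log(/[.+*]/.test("a+b")); // true
console.log(/[.+*]/.test("ab"));  // false

// Escaped - and ] are matched literally.
console.log(/[\-\]]/.test("a-b")); // true
console.log(/[\-\]]/.test("a]b")); // true
```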

Lookarounds

To keep it simple - lookarounds are constructs that allow you to check whether a given value precedes or follows another one, without including it in the result. There are 2, or rather 4, types of lookarounds:

Keep in mind that in JavaScript, lookbehinds are supported only as of the newest ES2018 standard and are available only in the latest Google Chrome browsers (at the time of writing). Now, let’s give them a shot, shall we?

/* Let's match "reg" in "regexp" using lookahead
   and "exp" using lookbehind.
   Remember that lookarounds don't include the parts inside them
   in the result */

/reg(?=exp)/
/(?<=reg)exp/
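A sketch of all 4 variants, assuming a JS engine with lookbehind support. The negative forms (?!...) and (?<!...) are standard syntax, and the sample words are made up.

```javascript
// Positive lookahead and lookbehind: the checked part stays out of the result.
const ahead = "regexp".match(/reg(?=exp)/)[0];
const behind = "regexp".match(/(?<=reg)exp/)[0];
console.log(ahead);  // "reg"
console.log(behind); // "exp"

// Negative lookahead: "reg" must NOT be followed by "exp".
console.log(/reg(?!exp)/.test("regexp"));  // false
console.log(/reg(?!exp)/.test("regular")); // true

// Negative lookbehind: "exp" must NOT be preceded by "reg".
console.log(/(?<!reg)exp/.test("regexp")); // false
console.log(/(?<!reg)exp/.test("expert")); // true
```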

Let’s end this - FLAGS

Flags are really important in regexps. They change the way regexps are interpreted. If you were paying attention, you’ve already seen them in earlier examples. In JS, flags take the form of single letters and are added directly after the closing slash. Let’s explore all the flags available in JS.

So, here you go with an example.

/* The u flag allows the use of extended Unicode escapes like \u{...}.
   Notice where the flag is located. */

/\u{FFFFF}/u
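Here’s a hedged sketch of a few commonly used JS flags at work: g, i, m and u. It’s not the full list of flags, and the sample strings and names are made up.

```javascript
// g (global): find all matches instead of stopping at the first one.
const allDigits = "a1b2c3".match(/\d/g);
console.log(allDigits.join(",")); // "1,2,3"

// i (ignore case): letters match regardless of their case.
console.log(/regexp/i.test("RegExp")); // true

// m (multiline): ^ and $ also match at the start and end of each line.
const lines = "one\ntwo".match(/^\w+$/gm);
console.log(lines.join(",")); // "one,two"

// u (unicode): the dot treats an emoji as one character, not two code units.
const emojiWithU = /^.$/u.test("😄");
const emojiWithoutU = /^.$/.test("😄");
console.log(emojiWithU);    // true
console.log(emojiWithoutU); // false

// The flags of a regexp can be read back from its flags property.
console.log(/\u{FFFFF}/u.flags); // "u"
```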

The end

Believe it or not, that’s the whole syntax of JavaScript regexps. If you feel like it’s a bit too much, don’t panic! It’s not that hard to remember all those constructs - you have to trust me on this one. Also, remember that with this knowledge you can easily write regular expressions in many other languages! I hope you’ve learned something new today, or at least that this article provided a bit of a memory refresh or was just nice to read.

If you are a JavaScripter like me and would like to write complex regexps (which can sometimes look awfully intimidating), then there’s a bonus for you! I’ve written a library which provides you with a nice, chainable API for constructing regexps. It also provides autocompletion in editors like VS Code with the help of TypeScript, so if you want - check ReX.js out!

