Hello... RegExp?!

In this post, I'll try to explain the basics of regular expressions. Keep in mind that this sort-of tutorial is aimed at those who'd like to learn regexps a bit better, those who are just starting out, and those who don't know what regexps are at all. 🙃 So, let's get started!


So, what are these regexps?

Regular expressions (regex or regexp for short) are special text constructs for describing search patterns. With them, you can search long texts for specified values more easily. Most often they're used for validating data, e.g. IP and email addresses. Generally, they're extremely useful when dealing with stuff like that. So, what's the drawback? Well, their syntax can feel a bit messy at first, but trust me - it's very easy to pick up!

The syntax for regular expressions doesn't differ much across programming languages (mainly in additional functionality), so the constructs I'll show should be portable (in most cases) to your language of choice. Anyway, for the purpose of this tutorial, I'll be using the JavaScript implementation. I've divided all the constructs into groups, so they can be learned in an organized, ordered way.


Characters

To match any given character or digit you just type it. There's a catch though. In some cases, you may want to match a character that's used as a regex construct, aka a reserved character. Then, you'll have to escape your character. If you've been coding for a while, you'll know that this just means preceding the character with a backslash ( \ ) and that's all. In JS the characters you have to escape are: + , * , ? , ^ , $ , \ , . , [ , ] , { , } , ( , ) , | , / ( separated by commas ). To give you an example:

// In JS your regexps are placed between two slashes

/Here goes your regex\. It is easy like one \+ one/
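To see the escaping in action, here's a minimal sketch that runs the pattern above with the standard test() method. The sample strings and variable names are made up for illustration.

```javascript
// The escaped dot and plus match literal "." and "+" characters.
const pattern = /Here goes your regex\. It is easy like one \+ one/;

const literal = "Here goes your regex. It is easy like one + one";
const withoutDot = "Here goes your regex! It is easy like one + one";

console.log(pattern.test(literal));    // true
console.log(pattern.test(withoutDot)); // false

// Without the backslash, "+" is a quantifier (one or more spaces here),
// so the literal plus sign no longer matches.
const unescaped = /one + one/;
console.log(unescaped.test("one + one")); // false
console.log(unescaped.test("one  one"));  // true
```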

By escaping certain letters or sequences, you can access regex super-powers! Let's jump in and take a look at these available in JS:

  • \w - "word" - matches any word character (letters, digits and underscore);
  • \d - "digit" - matches any digit;
  • \s - "whitespace" - matches any whitespace (spaces, tabs, line breaks);
  • \t - "tab" - matches a tab character ( yes, that's the one created by Tab button );
  • \n - "new line" - matches LINE FEED character which is nothing more than just move-to-new-line indicator;

These are the most often used ones. But there's even more! The first three, which are being used almost all the time, have their negative counterparts in form of capitalized letters:

  • \W - "not word" - matches any character but the word ones e.g. comma ( , );
  • \D - "not digit" - matches any character that's not a digit e.g. letter;
  • \S - "not whitespace" - matches any character that's not whitespace one;
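Here's a small sketch of how these classes and their negated counterparts behave, assuming a made-up sample string. It borrows the g flag (explained near the end of this post) just to collect every match instead of only the first one.

```javascript
// Hypothetical sample string, used only for illustration.
const sample = "Room 42, floor_3";

// \d matches single digits.
console.log(sample.match(/\d/g)); // [ '4', '2', '3' ]

// \D is the opposite: strip everything that is not a digit.
console.log(sample.replace(/\D/g, "")); // "423"

// \w covers letters, digits and underscore, so nothing is left of "floor_3".
console.log("floor_3".replace(/\w/g, "")); // ""

// \W matches anything else, here the spaces and the comma.
console.log(sample.match(/\W/g)); // [ ' ', ',', ' ' ]

// \s matches whitespace, which makes it handy for splitting.
console.log(sample.split(/\s/)); // [ 'Room', '42,', 'floor_3' ]
```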

Hope you noticed the capital letters. 😁 In JS there are 4 more escaped characters that (at least for me) aren't used as often as the others. Just to give you a glimpse of what they do, here they are:

  • \v - "vertical tab" - matches VERTICAL TAB character;
  • \f - "form feed" - matches FORM FEED character;
  • \r - "carriage return" - matches CARRIAGE RETURN character;
  • \0 - "null" - matches NULL ( char code 0 ) character;

I guess now you can see why these aren't really popular: the characters they match rarely show up in everyday text. I think that's enough theory - let's see an example:

/* Let's create something that will match "December 2018" string...
   and be creative :) */
/\we\Dem\Ser\s\d\S\d8/

Well, maybe that's not the best regex ever, but at least we've used almost all of the learned constructs. 😉

Let's move on to escape sequences then. These guys are a bit tougher and more complex. With their help, you can match a wide variety of Unicode characters.

  • \000 - "octal escape" - matches a character using the provided 3-digit octal number; 000 is the lowest possible number while 377 is the highest, matching char code 255;
    *Not allowed together with the u flag;
  • \xFF - "hexadecimal escape" - matches a character using the provided 2-digit hex number;
  • \uFFFF - "unicode escape" - matches a character using the provided 4-digit hex number;
  • \u{FFFF} - "extended unicode escape" - matches a character using the provided hex code point, which is not limited to 4 digits ( up to 10FFFF ) and thus covers all of Unicode;
    *Requires u flag - more on that later;

As you can see, using escape sequences we can match Unicode characters! Consider the example below, where we match the same Unicode character - © (copyright symbol) - in 4 different ways:

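/* Match © in 4 different ways.
   Octal escapes aren't allowed together with the u flag,
   so the braces form gets its own regex.
   Leave the u character at the end alone for now. */

/\251\xA9\u00A9/
/\u{00A9}/u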
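If you'd like to check each escape on its own, here's a minimal sketch that tests every form separately against a string holding only the copyright symbol. The variable name is just for illustration.

```javascript
const copyright = "\u00A9"; // the © character, char code 169

// Octal, hex and 4-digit Unicode escapes work without any flag.
console.log(/\251/.test(copyright));   // true (octal 251 is decimal 169)
console.log(/\xA9/.test(copyright));   // true
console.log(/\u00A9/.test(copyright)); // true

// The braces form needs the u flag.
console.log(/\u{A9}/u.test(copyright)); // true

// Without the u flag, the same pattern is read as the literal text "u{A9}".
console.log(/\u{A9}/.test(copyright)); // false
console.log(/\u{A9}/.test("u{A9}"));   // true
```

Keeping the octal escape in a flagless regex matters: with the u flag, JS rejects octal escapes with a SyntaxError.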

And that's it! Now you know almost all escaped constructions that you can use in JS regexps. Now let's go to another category!


Anchors

As the name implies, anchors let us match fixed positions in the text: the beginning and ending of the text and the boundaries between words. These are pretty easy. 😀

  • ^ - "beginning" - matches the beginning of supplied string or single line ( with m flag );
  • $ - "ending" - matches ending of supplied string or single line ( with m flag );
  • \b - "word boundary" - matches a word boundary i.e. a position between a word character ( \w ) and a non-word character ( \W ) or the beginning/ending of the string;
  • \B - "not word boundary" - matches any position that's not a word boundary;

One more thing to note though. Anchors match against positions, not characters. This basically means that anchors won't add any characters to the result of your regexp execution. Example's coming!

/* Match ordinary "Regular expressions" string.
   Notice that even with a word boundary matched,
   we still have to match a whitespace.
   Remember, \b matches only a position between them! */

/^Regular\b\sexpressions\b$/
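Here's a quick sketch that runs the regex above and pokes at the other anchors. The second, two-line string is made up for illustration.

```javascript
const text = "Regular expressions";

console.log(/^Regular\b\sexpressions\b$/.test(text)); // true

// Anchors consume nothing: the result holds only the characters around them.
console.log(text.match(/\bexpressions\b/)[0]); // "expressions"

// "press" sits in the middle of a word, so \b fails there and \B succeeds.
console.log(/\bpress/.test(text)); // false
console.log(/\Bpress/.test(text)); // true

// With the m flag, ^ and $ also match at the start and end of each line.
const lines = "first line\nsecond line";
console.log(/^second/.test(lines));  // false
console.log(/^second/m.test(lines)); // true
```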

Quantifiers

Now, this is where the fun begins! With quantifiers, you can specify how many of the preceding characters or constructs you want to match. Quantifiers are really useful and easy to learn.

  • + - "plus" - lets you match 1 or more of the preceding construct;
  • * - "star" - lets you match 0 or more of the preceding construct;
  • {1} - "quantifier" - lets you specify exactly how many of the preceding construct you want to match; you can also provide two numbers separated by a comma to indicate the lower and upper limit, like {1,3};
  • ? - "optional" - lets you mark the preceding construct as optional (no need to match);
  • ? - "lazy" - placed after another quantifier, marks it as lazy (match as few characters as possible);
  • | - "alternation" - lets you provide an alternative construct to match, something like the boolean or operator;

Quantifiers allow us to create much better and more expressive regular expressions. 😅

/* Let's match "December 2018", this time a little bit differently...
   Take a look at the two \w constructs. That's because we've used the lazy modifier.
   This makes \w+? match only one letter. */

/\w+?\w+\s\d+/
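To see who matches what, here's a sketch of the same regex with parentheses added around each part. Those are capturing groups (covered in the next section), used here only to peek at the pieces. The other sample strings are made up for illustration.

```javascript
const date = "December 2018";

// Index 0 of the result is the whole match, so we skip it.
const [, lazy, greedy, year] = date.match(/(\w+?)(\w+)\s(\d+)/);
console.log(lazy);   // "D"        the lazy \w+? takes as little as it can
console.log(greedy); // "ecember"  the greedy \w+ takes the rest
console.log(year);   // "2018"

// {n} and {n,m} set exact counts and ranges.
console.log(/^\d{4}$/.test("2018"));   // true
console.log(/^\d{2,3}$/.test("2018")); // false

// ? makes the preceding construct optional.
console.log(/^colou?r$/.test("color"));  // true
console.log(/^colou?r$/.test("colour")); // true

// | picks between alternatives.
console.log(/cat|dog/.test("I have a dog")); // true
```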

Groups & sets

You've come a long way learning regexp syntax. Now it's time to learn how to organize your regex constructs with groups and sets.

Groups allow you to group (what a surprise) your regexp constructs into groups. 😂 There are two types of groups: capturing and non-capturing. Non-capturing groups are used to just group your constructs, for example to apply a quantifier to all of them at once. Capturing groups additionally let you retrieve what the grouped constructs matched separately, after running the regex. You can also reference them later by their number. Numbering starts from 1 for the first group, and each new group gets its number from the order of its opening parenthesis.

  • (ABC) - "capturing group" - content of the group goes directly between the parentheses;
  • (?:ABC) - "non-capturing group" - content of the non-capturing group goes after the : symbol and before the closing parenthesis;
  • \1 - "captured group reference" - allows you to reference a captured group by its number;

// Let's match "regex regexp" string

/(regex)\s\1p/
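Here's a minimal sketch of how the captured values show up in the result of match(). The second and third patterns are made-up examples.

```javascript
// Index 0 is the whole match, index 1 is the first capturing group.
const match = "regex regexp".match(/(regex)\s\1p/);
console.log(match[0]); // "regex regexp"
console.log(match[1]); // "regex"

// A non-capturing group can take a quantifier but gets no number,
// so the "!" group below is still group 1.
const repeated = "ababab!".match(/(?:ab)+(!)/);
console.log(repeated[0]); // "ababab!"
console.log(repeated[1]); // "!"

// Numbers follow the order of the opening parentheses, outer group first.
const nested = "2018-12".match(/((\d+)-(\d+))/);
console.log(nested[1]); // "2018-12"
console.log(nested[2]); // "2018"
console.log(nested[3]); // "12"
```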

Sets, on the other hand, allow you to create sets of characters to match. A negated set matches any character that is not included inside of it. Inside a set you don't have to escape most of the characters listed before, only \ , ] and - ( the dash doesn't need escaping when it comes first or last ), plus ^ when it's the first character. Inside a set, you can also provide a range of letters or digits by connecting the beginning and ending ones with a dash ( - ).

  • [ABC] - "set" - matches any one of the provided characters, equal to a construction like A|B|C;
  • [^ABC] - "negated set" - matches any character other than the provided ones (A, B, C);
  • [A-D] - "range" - matches any letter from A to D;
  • [^1-3] - "negated range" - matches any character except the digits 1 to 3 (letters and other digits included);

// Match any three uppercase letters with a range

/[A-Z]{3}/
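A short sketch of sets in use, with made-up sample strings:

```javascript
// Ranges are case-sensitive: A-Z covers uppercase letters only.
console.log(/[A-Z]{3}/.test("ABC")); // true
console.log(/[A-Z]{3}/.test("abc")); // false

// A set matches exactly one character out of the listed ones.
console.log("gray grey".match(/gr[ae]y/g)); // [ 'gray', 'grey' ]

// A negated range matches every character outside the range, not just digits.
console.log("a1b2c3".replace(/[^1-3]/g, "")); // "123"
console.log(/[^1-3]/.test("x"));              // true

// Most reserved characters are literal inside a set, no backslash needed.
console.log("1+1=2?".match(/[+?.]/g)); // [ '+', '?' ]
```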

Lookarounds

To keep it simple - lookarounds are constructs that allow you to check if a given value precedes or follows another one, without including it in the result. There are 2, or rather 4, types of lookarounds:

  • (?=ABC) - "positive lookahead" - matches if the preceding value is followed by the one matched by the expression inside;
  • (?!ABC) - "negative lookahead" - matches if the preceding value is not followed by the one matched by the expression inside;
  • (?<=ABC) - "positive lookbehind" - matches if the following value is preceded by the one matched by the expression inside;
  • (?<!ABC) - "negative lookbehind" - matches if the following value is not preceded by the one matched by the expression inside;

Keep in mind that in JavaScript, lookbehinds were only added in ES2018 and (at the time of writing) are available only in the latest Google Chrome browsers. Now, let's give them a shot, shall we? 😉

/* Let's match "reg" in "regexp" using lookahead
   and "exp" using lookbehind.
   Remember that lookarounds don't include the parts inside them
   in the result */

/reg(?=exp)/
/(?<=reg)exp/
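Here's a sketch that runs both regexes and adds the negative variants. It assumes an engine with lookbehind support (ES2018), and the extra sample strings are made up.

```javascript
const word = "regexp";

// The lookaround is checked but left out of the result.
console.log(word.match(/reg(?=exp)/)[0]);  // "reg"
console.log(word.match(/(?<=reg)exp/)[0]); // "exp"

// Negative lookahead: "reg" must NOT be followed by "exp".
console.log(/reg(?!exp)/.test("regular")); // true
console.log(/reg(?!exp)/.test("regexp"));  // false

// Negative lookbehind: "exp" must NOT be preceded by "reg".
console.log(/(?<!reg)exp/.test("expert")); // true
console.log(/(?<!reg)exp/.test("regexp")); // false

// Grab only the number that directly follows a dollar sign.
console.log("cost: $42 or 38 EUR".match(/(?<=\$)\d+/)[0]); // "42"
```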

Let's end this - FLAGS

Flags are really important in regexps. They change the way regexps are interpreted. If you were paying attention, you've seen one of them earlier in the examples. In JS, we add flags (each of which is a single letter) directly after the closing slash. Let's explore the main flags available in JS.

  • i - "ignore case" - makes the whole expression case insensitive;
  • g - "global" - preserves the index of the last match, so you can find the next one instead of the same one over and over again;
  • m - "multiline" - makes anchors ^ and $ match the beginning and ending of each line instead of the text overall;
  • u - "unicode" - allows the use of \u{FFFF} (extended unicode support) with more digits than 4 ( available in newer JS implementations );
  • y - "sticky" - makes the expression match only from the last index, deactivates g flag ( available in newer JS implementations );

So, here you go with an example.

/* The u flag allows the use of extended unicodes.
   Notice where the flag is located. */

/\u{FFFFF}/u
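To round things off, here's a small sketch of what each flag changes in practice. The sample strings and variable names are made up, and lastIndex is the standard property where g and y regexes keep their position.

```javascript
// i: case no longer matters.
console.log(/regexp/i.test("RegExp")); // true

// g: every exec() call continues from lastIndex.
const globalDigit = /\d/g;
const input = "a1b2";
console.log(globalDigit.exec(input)[0], globalDigit.lastIndex); // 1 2
console.log(globalDigit.exec(input)[0], globalDigit.lastIndex); // 2 4
console.log(globalDigit.exec(input)); // null, and lastIndex goes back to 0

// m: ^ and $ work per line.
console.log("one\ntwo".match(/^\w+$/gm)); // [ 'one', 'two' ]

// y: the match has to start exactly at lastIndex.
const stickyDigit = /\d/y;
console.log(stickyDigit.test("a1")); // false, index 0 holds "a"
stickyDigit.lastIndex = 1;
console.log(stickyDigit.test("a1")); // true

// u: code points above FFFF can be written with braces.
console.log(/\u{1F600}/u.test("\u{1F600}")); // true
```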

The end

So you can believe me or not but that's the whole syntax of JavaScript regexps. If you feel like it's a bit too much, then don't panic! It's not that hard to remember all those constructs - you have to trust me on this. 😎 Also, remember that with this knowledge you can easily write regular expressions in many other languages! Hope you've learned something new today or at least that this article provided a bit of memory refresh or was just nice to read.

If you are a JavaScripter like me and would like to write complex regexps (those can sometimes look awfully complex for some) then there's a bonus for you! I've written a library which provides you with a nice, chainable API for constructing regexp. It also provides autocompletion in editors like VS Code with help of TypeScript, so if you want - check ReX.js out!
