Skip to main content

Blog entry by Dyass Khalid

RegEx 101 - A beginners guide to understand regular expressions

RegEx 101 - A beginners guide to understand regular expressions

A regular expression is a sequence of characters that specifies a search pattern in a text. Learn more about Its common uses in this regex 101 guide.

Regular Expressions Infographic

What are regular expressions:

A regular expression, or regex for short, is perhaps the
most common thing that every data scientist must deal with multiple times in
their career, and the frustration is natural because, at a vast majority of
universities, this skill is not taught unless you have taken some hard-core
Computer Science theory course. Even at times, trying to solve a problem using
regular expression can cause multiple issues, which is summed beautifully in this
meme:

regular expression memes

Making sense and using them can be quite daunting for
beginners, but this RegEx 101 blog has got you covered. Regular expressions, in essence, are
used to match strings in the text that we want to find. Consider the following
scenario:

               You are
interested in counting the number of occurrences of Pandas in a journal related
to endangered species to see how much focus is on this species. You write an
algorithm that calculates the occurrences of the word' panda.' However, as you
might have noticed, your algorithm will miss the words 'Panda' and 'Pandas.' In
this case, you might argue that a simple if-else condition will also count
these words. But imagine, while converting the journal alphabets of every word
is randomly capitalized or converted into a lower-case letter. Now, there are the
following possibilities

  • PaNda
  • pAnDaS
  • PaNdA

And may more variations as well. Now you must write a lot of
if-else conditions and even must write nested if-else conditions as well. What
if I tell you that you can do this in one line of code using regular
expressions? First, we need to learn some basics before coming back to solve
the problem ourselves.

Square Brackets ([]):

The name might sound scary,
but it is nothing but the symbol: []. Some people also refer to square brackets
as character class – a regular expression jargon word that means that it will
match any character inside the bracket. For instance:

Pattern

Matches

[Pp]enguin

Penguin, penguin

[0123456789]

(This will match any
digit)

[0oO]

0, o, O


Disjunction (|):

The pipe symbol means nothing but either 'A' or 'B', and it is helpful in cases where you want to select multiple strings simultaneously. For instance:

Pattern

Matches

A|B|C

A, B, C

Black|White

Black, White

[Bb]lack|[Ww]hite

Black, black,
White, white


Question Mark (?):

The question mark symbol means that the character it comes after is optional. For instance:

Pattern

Matches

Ab?c

Ac, Abc

Colou?r

Color, Colour


Asterisk (*):

The asterisk symbol matches with 0 or more occurrences of the earlier character or group. For instance:

Pattern

Matches

Sh*

(0 or more of earlier
character h)

S, Sh, Shh, Shhh.

(banana)*

(0 or more of earlier
banana. This will also match with nothing, but most regex engines will ignore
it or give you a warning in that case)

banana, bananabanana,
bananabananabanana.


Plus (+):

The plus symbol means to match with one or more occurrences of the earlier character or group. For instance:

Pattern

Matches

Sh+

(1 or more of earlier
character h)

Sh, Shh, Shhh.

(banana)+

(1 or more of the earlier
banana)

banana, bananabanana,
bananabananabanana.


Difference between Asterisk (*) and Plus(+):

The difference between the
asterisk confuses many people; even the experts sometimes must look at the
internet for their differences. However, there is an effortless way to remember
the distinction between them.

Imagine
you have a number 1, and you multiply it with 0:

1*0
= 0 or more occurrences of earlier character or group.

Now
suppose that you have the same number 1, and you add it with 0:

1+0
= 1 or more occurrences of an earlier character or group.

It is that simple when you try to understand things
intuitively.

Negation (^):

Negation has two everyday
use cases:

1.     
Inside square brackets, it will search
for the negation of whatever is inside the brackets.
For instance:

Pattern

Matches

[^Aa]

It will match
with anything that is not A or a

[^0123456789]

It will match anything
that is not a digit

 2.     
It can also be used as an anchor to
search for expressions at the start of the line(s) only.
For instance:

Pattern

Matches

^Apple

It will match
with every Apple that is at the start of any line in the text

^(Apple|Banana)

It will match with every
Apple and Banana that is at the start of any line in the text


Dollar ($):

A dollar is used to
search for expressions at the end of the line.
For instance:     

Pattern

Matches

$[0123456789]

It will match
with any digit at the end of any line in the text.

$([Pp]anda)

It will match with
every Panda and panda at the end of any line in the text.


Conclusion:
This article covered some of the very basic types of regular expression by using a story-telling approach. Regular expressions are indeed very hard to understand but if one develops an intuitive understanding then they can be easily learned.