An Introduction To Tableau Regular Expressions (REGEX)
Kevin and I are incredibly excited to have Don Wise join us today for the first of a multi-part series on regular expressions. I spend a lot of time on the forums and I’ve personally run into many questions that require complex string parsing. I can typically recognize them as problems that can be solved using regular expressions, but I’m not very skilled with them, so I often find myself calling in Don. I’m always amazed at his ability to create these expressions and explain them in a way that is easy to understand. After my last encounter with one of Don’s brilliant solutions, I asked if he’d be willing to write a blog on the topic. And, lucky for all of us, that blog turned into this amazing series.
Of course, Don is more than just the resident regex expert. He is a Tableau Forums Ambassador, has an incredibly deep knowledge of Tableau as a whole, and is one of the top contributors on the Tableau Community Forums. If you’ve ever asked a question on the forums, it’s quite likely that you’ve been helped by Don. Recently retired from the Public-Safety arena, Don has extensive background in Fire/EMS and 9-1-1 operations and related data at local, State and Federal level. You can get in touch with Don either on the forums or through his website, www.dcwise.net.
Introduction
Have
you ever wondered what those Tableau functions starting with REGEXP are all
about? Ever wanted to do something that might
be handled with a SPLIT() function, but doesn’t exactly turn out how you
expected? Or perhaps believe those REGEXP functions are a bit intimidating
(sometimes they look like Egyptian hieroglyphics!!)?
Yeah,
I’ve been there on all those occasions. So, I’m with you. As a bit of
background, I’m not a programmer, never written a line of code (other than a
Tableau calculation or two), I do know and understand SQL syntax and
enterprise-level systems just enough to be dangerous.
Yes,
I’m mostly a layperson with a basic understanding of data structure, data
analysis, and related tools to clean and reshape data…so, this series is for
all of us who’d like to understand a bit more what those functions are all
about. And, when/how to use them!
I
think one of the reasons I leaned into learning REGEX expressions is that they
follow patterns. Really, that’s exactly what they do—they help pull
out (extract) or match patterns in strings of data. At the beginning of my
quest to learn REGEX, it all looked like a bunch of gibberish. Completely
non-understandable. Now, I can read and build those hieroglyphics with more
confidence and understanding and I’m going to use this blog series to help you
understand and write them as well.
Will
this series resolve all your REGEX data manipulation questions? Probably not.
But it’ll be a good starting point on your road to start using them in Tableau.
We’ll supply some related links and online tools along the way as well as some
relevant links to Tableau’s Help sections. Most importantly, many of the
examples—from the basic to
intermediate to more advanced—will use actual solutions as found in
the Tableau Community
Forums.
History of
Regular Expressions
Regular
Expressions,
often abbreviated as REGEX or REGEXP, have been around since the
1980’s. They were the outgrowth of Regular Language developed in the 1950’s,
which was used primarily in UNIX-based systems for processing text. The syntax
is slightly different from system to system, so what I’ll be sharing here is
focused on Tableau.
The
primary use case for regular expressions is extracting a particular data
element from a string of data. Regular expressions are commonly used by website
and application developers to validate whether the user input matches the
required structure of the data element. An example that we encounter on almost
a daily basis is an email address. Email addresses, for example, require a user
name, followed by the @ sign, followed by a domain (user-name@domain-name.com).
We could use REGEXP_MATCH to validate if the user input for that data
element/field was entered properly and display an error message if it didn’t
match the required format correctly.
How Do Regular
Expressions Work?
Ok,
so we know what regular expressions are, but how do they work? Well, like any
“language” or format, there are components that help build the syntax or
composition of an expression, much like building a calculation in Tableau using
IF-THIS-THEN-THAT type syntax. When built correctly, the underlying algorithm
in Tableau will translate the pattern and provide a match or extract the data.
Whenever there isn’t a match to the input pattern, Tableau returns Null.
That
means we need to understand what builds/supports the syntax. In other words,
what do those hieroglyphics mean?
A quick word of
warning: This section may feel a bit overwhelming as much of the terminology
and the definitions will be foreign to you. But that’s normal so please do not
worry too much about it. I just need to cover some of these basic building
blocks before showing you how regular expressions work in practice.
A
regular expression can contain 3 distinct pieces. Note: Not all expressions will require all three, it depends on the
requirement.
1)
Regular Expression Metacharacters
Classes
– A
Metacharacter is a character that has special meaning to a computer program,
such as a shell interpreter, or in our case, a regular expression (REGEX)
engine.
2)
Regular Expression Operators/Quantifiers – These are used to
refine the pattern. For example, how
many times should the pattern match be repeated.
3)
Set Expressions (Character Classes) – These are used to
help define specific characters that we want to match.
Regular Expression
Metacharacters Classes
In regular
expressions, using the backslash with certain letters
causes certain characters to do something. A capitalized letter will have
different meaning and action than that of a lower case:
Character |
Description |
\b |
Match
if the current position is a word boundary. Boundaries occur at the
transitions between word (\w) and non-word (\W) characters, with combining
marks ignored. |
\B |
Match
if the current position is not a word boundary. |
\d |
Match
any character with the Unicode General Category of Nd (Number, Decimal
Digit.) |
\D |
Match
any character that is not a decimal digit. |
\w |
Match
a word character. |
\W |
Match
a non-word character. |
. |
Match
any character. |
^ |
Match
at the beginning of a line. |
$ |
Match
at the end of a line. |
\ |
Quotes
the following character. Characters that must be quoted to be treated as
literals are * ? + [ ( ) { } ^ $ | \ . |
\ |
Quotes
the following character. Characters that must be quoted to be treated as
literals are [ ] \ Characters that may need to be quoted, depending on the
context are - & |
Regular
Expression Operators/Quantifiers
The function
of operators/quantifiers is different than their
literal meanings. For
example, the + sign would mean addition to us normally, but in regular expressions, it
can mean do something—in
this case, perform one or more repetitions of a match. Below is a non-exhaustive list:
Operator |
Description |
| |
Alternation. A|B matches either A or B. |
* |
Match 0 or more times. Match as many times as possible. |
+ |
Match 1 or more times. Match as many times as possible. |
? |
Match zero or one times. Prefer one. |
{n} |
Match exactly n times |
{n,} |
Match at least n times. Match as many times as possible. |
{n,m} |
Match between n and m times. Match as many times as possible,
but not more than m. |
*? |
Match 0 or more times. Match as few times as possible. |
+? |
Match 1 or more times. Match as few times as possible. |
?? |
Match zero or one times. Prefer zero. |
{n}? |
Match exactly n times. |
{n,}? |
Match at least n times, but no more than required for an overall
pattern match. |
{n,m}? |
Match between n and m times. Match as few times as possible, but
not less than n. |
Set Expressions
(Character Classes)
Example |
Description |
[abc] |
Match
any of the characters a, b or c. |
[^abc] |
Negation
- match any character except a, b, or c. |
[A-M] |
Range
- match any character from A to M. |
As
an example, the below graphic breaks out the syntax/schema of all three pieces
which can be used to validate or match on a Social Security Number:
OK,
that’s a bit to take in and understand, for now. But before going on to the
next section, please take a moment to really look at those symbols and what
they might possibly do with a set of data. Remember, regular expressions are
quite literal/explicit until we add in certain operators, metacharacters, or
set expressions.
Note: If you’d
like to see all three tables, they’re based on the International
Components for Unicode (ICU) Regular
Expression libraries.
Regular
Expressions in Tableau
Now
that you have an understanding of the history of regular expressions and their
basic building blocks, let’s talk about how to create them in Tableau. Tableau
has four different REGEXP functions, which either match, extract, or replace
strings within a larger string of data. If that doesn’t make sense, be patient
as I’ll be sharing some examples shortly.
REGEXP_EXTRACT
(string, pattern)
– Looks for a particular pattern within a string or substring and extracts that data element.
REGEXP_EXTRACT_NTH
(string, pattern, index) – Looks for a particular pattern
within a string or substring, starting at the nth position in the string, and extracts that data element.
REGEXP_MATCH
(string, pattern)
– Looks for a particular pattern within a string or substring and returns TRUE
if there’s an exact match.
REGEXP_REPLACE
(string, pattern, replacement) – Looks for a particular pattern
within a string or substring and replaces
that data element with a different data element.
Tableau
regular expressions can also be “nested”, meaning, you can use a
REGEXP_REPLACE() function within another REGEXP_REPLACE() function, much like
you can do currently with a regular REPLACE() function. Nesting allows you to
make multiple passes over the data so that you do not need to cram all your
logic into one expression. But let’s not get ahead of ourselves…or too
overwhelmed!
Tableau Examples
Let’s
now look at some regular expressions examples. We’re going to start off with a
few relatively straightforward examples, in order to ease into it.
We’re
going to start out with some simple examples that then get progressively more
complex as we move forward. I use the term “simple,” but for those who are new
to regular expressions, these may seem anything but simple. Please don’t get
overwhelmed as you work through these—they can be tricky to get your head
around, but with exposure, practice, and time, you’ll become a regular
expressions master!
Remember
the example tables of Metacharacters, Operators, and Sets? We’ll be working with those various symbols
from those tables to support our pattern matching and extracting efforts. To
keep things basic, much like the Social Security Number pattern matching, we’ll
use a simplified method which draws on all three of those tables for our first
example.
For
this example, let’s use the following simple set of student course data:
Text |
Smith,
Paul is a student of English, Student ID: ABC123 |
Jones,
Mary is a student of Math, Student ID: DEF345 |
Lee,
Sally is a student of Economics, Student ID: GHI678 |
Brown,
Sam is a student of Music, Student ID: ABC345 |
Black,
Kelley is a student of Business, Student ID: ABC123 |
Just
copy those lines into Excel or Google Sheets, then use either Tableau Desktop
or Tableau Prep to connect to your spreadsheet.
We
will establish just 2 goals for matching on this data:
1) Get the person’s last name.
2) Get the person’s first name.
We’re
going to use REGEXP_EXTRACT to extract each of these different components,
using the following syntax:
REGEXP_EXTRACT([Text], '( )')
After
the first comma we’re placing a beginning parenthesis and closing parenthesis
within quotes. Why? Well, the quotes are required syntax, but the parenthesis
is a different thing altogether. It took me a while to understand that, in
almost every circumstance, a set of enclosing parentheses are needed to create
what’s known as a “capture group”. In other words, what specific values are we
looking to get out of the pattern? Tableau really, really likes capture groups.
So, I’ll always start off with that initial syntax, then build my pattern—those
hieroglyphics— in and/or around
that set of parentheses.
Last
Name
Let’s
work on the first problem statement of getting the person’s last name. In our
data, the last name is always the first “word” in the text, so the simplest
syntax to capture group and extract that out would be:
REGEXP_EXTRACT([Text], '(\w+)')
So,
what’s going on here really? We know
that the parenthesis are our ‘capture group’, the data we’re seeking to extract
or match upon. That initial backslash is
considered our lead-in or starting point of what we’re trying to grab. The backslash by itself would mean start here
or rather start with this next character. A small ‘w’ = match on a word
character. It’ll also match on alphanumeric values. Using the backslash in REGEXP in combination
with a small ‘w’ forces that small w to no longer be a literal letter ‘w’ but
to become an action to do something. When followed by
the + sign, it causes a match of all ensuing characters, but just once. Meaning,
it’ll stop the match process when it encounters the next character that is not
a word (alphanumeric) character. In Tableau, that syntax will return the
following:
It’s
exactly what we want out of this set of data. Notice that it stops at the comma
as the REGEXP expression is looking for a whole ‘word’ in this case and is not
looking for any additional characters or spaces. If you were to leave off the +
sign, using just \w then you’d have the following match which is correct
because it is a singular “word character” match:
See,
that was easy!
First
Name
What
about the next problem statement/goal? To get their first name? It’s located
after a space and a comma. Those will be our “anchors” off which we’ll work.
We’ll use the following syntax:
REGEXP_EXTRACT([Text], ',\s+(\w+)')
For
this situation, we’re stepping outside of our “capture group”. Why? Because we
need to tell the regular expression to start at a specific point in the text—in our case, we
need to look forward from the comma and the space. Let’s break this syntax down
into its component parts.
,
– This is just a literal string. We’re searching for a comma.
\s –
This metacharacter means “space”. So we’re searching for a space. Combined with
the comma, we’re now looking for a comma followed by a space.
+
– This means we’re going to grab the word capture group right after the match
of the command and space.
Could
the \s+
be written as just \s? Yes, it could but I’m a
double-space person so the + will cause the expression to check for any
additional spaces between that comma and the capture group—just in case.
Coming Soon
I hope this provided a good, Tableau-focused introduction to regular expressions. But we’re not done yet! In the next blog in this series, I’ll share some more practice examples that I see regularly in my work and on the Tableau Community Forums.
Thanks for sharing. Looking forward to a blog on backreferencing and looping/walking greedily thru a list of strings just like this. Also non-capture groups are a bit confusing for me to use in Tableau. Keep up the good work
ReplyDeleteThis is a great post, thank you
ReplyDeleteGood Post.. Thank you
ReplyDeleteAwesome post, as usual. I might add in the next post, how to test a RegEx function in regexr.com, and how to past the expression in a Tableau RegEx function.
ReplyDeleteCheerio,
Franco
Thanks thanks thanks!
ReplyDeleteThis is awesome. thanks
ReplyDelete