In this article, we will look at how to write a regex that matches Base64 strings and how that regex works.
Base64 is a group of binary-to-text encoding schemes that represent binary data (more specifically, a sequence of 8-bit bytes) as ASCII text. The input is processed in groups of 24 bits (three bytes), and each group is written as four 6-bit Base64 digits. Because the result is plain text, it can be transmitted easily over the Internet, and it is commonly used to encode attachments in emails.
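As a rough illustration of the 24-bit grouping described above, the following sketch uses Python's standard `base64` module to encode three short inputs. The sample bytes are arbitrary and chosen only to show how a full group, a two-byte remainder, and a one-byte remainder are written.

```python
import base64

# Three bytes (24 bits) map to exactly four 6-bit Base64 digits.
full_group = base64.b64encode(b"Man").decode("ascii")

# Two bytes fill only three digits, so one '=' pads the group.
two_bytes = base64.b64encode(b"Ma").decode("ascii")

# One byte fills only two digits, so '==' pads the group.
one_byte = base64.b64encode(b"M").decode("ascii")

print(full_group, two_bytes, one_byte)  # TWFu TWE= TQ==
```

These three shapes (four digits, three digits plus `=`, two digits plus `==`) are the endings a Base64 regex has to allow for.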
Regex (short for regular expression) is a powerful tool used for searching and manipulating text. It is composed of a sequence of characters that define a search pattern. Regex can be used to find patterns in large amounts of text, validate user input, and manipulate strings. It is widely used in programming languages, text editors, and command line tools.
Let’s look at regular expressions for Base64 strings. We will start with a simple pattern and then move on to a pattern based on RFC 4648.
Regex to match Base64 string
Regular Expression-
/^(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}={2})$/gm
Test string examples for the above regex-
Input String | Match Output |
---|---|
ThisIsNotBase64Because/ItIsNotMod4 | does not match |
ThisIsBase64Because/ItIsMod4 | matches |
ThisIsAlso/Base64+EvenWithPadding+== | matches |
ThisIsNotBase64+Because-ThereIsADash | does not match |
YouGetTheIdea/== | matches |
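The table above can be reproduced with a small script. This is a minimal sketch in Python rather than JavaScript, so the `/.../gm` delimiters and flags are dropped; the helper name `is_base64` is hypothetical and not part of any library.

```python
import re

# The pattern from above, minus the /.../gm delimiters and flags.
# The slash needs no escaping inside a Python raw string.
BASE64_RE = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}={2})$"
)


def is_base64(text: str) -> bool:
    """Return True when the entire string matches the pattern."""
    return BASE64_RE.fullmatch(text) is not None


samples = [
    "ThisIsNotBase64Because/ItIsNotMod4",
    "ThisIsBase64Because/ItIsMod4",
    "ThisIsAlso/Base64+EvenWithPadding+==",
    "ThisIsNotBase64+Because-ThereIsADash",
    "YouGetTheIdea/==",
]

results = {s: is_base64(s) for s in samples}
for s, ok in results.items():
    print(f"{s}: {'matches' if ok else 'does not match'}")
```

Using `fullmatch` checks one candidate string at a time, which is usually what input validation needs.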
Here is a detailed explanation of the above regex-
/^(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}={2})$/gm
^ asserts position at start of a line
Non-capturing group (?:[A-Za-z0-9+\/]{4})*
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [A-Za-z0-9+\/]
{4} matches the previous token exactly 4 times
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
+ matches the character + with index 43 in decimal (2B in hexadecimal, 53 in octal) literally (case sensitive)
\/ matches the character / with index 47 in decimal (2F in hexadecimal, 57 in octal) literally (case sensitive)
Non-capturing group (?:[A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}={2})
1st Alternative [A-Za-z0-9+\/]{4}
Match a single character present in the list below [A-Za-z0-9+\/]
{4} matches the previous token exactly 4 times
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
+ matches the character + with index 43 in decimal (2B in hexadecimal, 53 in octal) literally (case sensitive)
\/ matches the character / with index 47 in decimal (2F in hexadecimal, 57 in octal) literally (case sensitive)
2nd Alternative [A-Za-z0-9+\/]{3}=
Match a single character present in the list below [A-Za-z0-9+\/]
{3} matches the previous token exactly 3 times
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
+ matches the character + with index 43 in decimal (2B in hexadecimal, 53 in octal) literally (case sensitive)
\/ matches the character / with index 47 in decimal (2F in hexadecimal, 57 in octal) literally (case sensitive)
= matches the character = with index 61 in decimal (3D in hexadecimal, 75 in octal) literally (case sensitive)
3rd Alternative [A-Za-z0-9+\/]{2}={2}
Match a single character present in the list below [A-Za-z0-9+\/]
{2} matches the previous token exactly 2 times
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
+ matches the character + with index 43 in decimal (2B in hexadecimal, 53 in octal) literally (case sensitive)
\/ matches the character / with index 47 in decimal (2F in hexadecimal, 57 in octal) literally (case sensitive)
= matches the character = with index 61 in decimal (3D in hexadecimal, 75 in octal) literally (case sensitive)
{2} matches the previous token exactly 2 times
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
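To see what the two flags change in practice, here is a small sketch in Python. Python has no `g` flag, so `re.findall` is used here to collect every match, and `re.MULTILINE` stands in for `m`. The three-line sample text is made up for illustration.

```python
import re

PATTERN = (
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}={2})$"
)

text = "TWFu\nnot base64!\nTQ=="

# With re.MULTILINE (the m flag), ^ and $ anchor at every line,
# and findall collects all matches (the role of the g flag).
per_line = re.findall(PATTERN, text, flags=re.MULTILINE)

# Without it, ^ and $ anchor only at the start and end of the whole
# text (or just before a final newline), so nothing matches here.
whole_text = re.findall(PATTERN, text)

print(per_line)    # ['TWFu', 'TQ==']
print(whole_text)  # []
```

When validating a single value rather than scanning a block of text, leaving out the multi-line flag avoids accepting input that merely contains one valid line.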
Regex for matching Base64 string as per RFC-4648
Regular Expression-
/^(?:[a-zA-Z0-9+\/]{4})*(?:|(?:[a-zA-Z0-9+\/]{3}=)|(?:[a-zA-Z0-9+\/]{2}==)|(?:[a-zA-Z0-9+\/]{1}===))$/gm
Test string examples for the above regex-
Input String | Match Output |
---|---|
ThisIsNotBase64Because/ItIsNotMod4 | does not match |
ThisIsBase64Because/ItIsMod4 | matches |
ThisIsAlso/Base64+EvenWithPadding+== | matches |
ThisIsNotBase64+Because-ThereIsADash | does not match |
YouGetTheIdea/== | matches |
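The same check can be run for this second pattern. The sketch below is in Python and the helper name `matches_second_pattern` is hypothetical. It also tries one extra input, `A===`, which is not in the table: the pattern's last alternative accepts it, even though a single Base64 digit carries only 6 bits and RFC 4648 padding never uses more than two `=` characters.

```python
import re

# Second pattern from above, without the /.../gm delimiters and flags.
RFC_STYLE_RE = re.compile(
    r"^(?:[a-zA-Z0-9+/]{4})*"
    r"(?:|(?:[a-zA-Z0-9+/]{3}=)|(?:[a-zA-Z0-9+/]{2}==)|(?:[a-zA-Z0-9+/]{1}===))$"
)


def matches_second_pattern(text: str) -> bool:
    """Return True when the entire string matches the second pattern."""
    return RFC_STYLE_RE.fullmatch(text) is not None


samples = [
    "ThisIsNotBase64Because/ItIsNotMod4",
    "ThisIsBase64Because/ItIsMod4",
    "ThisIsAlso/Base64+EvenWithPadding+==",
    "ThisIsNotBase64+Because-ThereIsADash",
    "YouGetTheIdea/==",
    "A===",  # extra input, not from the table above
]

outcome = {s: matches_second_pattern(s) for s in samples}
for s, ok in outcome.items():
    print(f"{s}: {'matches' if ok else 'does not match'}")
```

If inputs like `A===` should be rejected, dropping the last alternative from the pattern is one option.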
Here is a detailed explanation of the above regex-
/^(?:[a-zA-Z0-9+\/]{4})*(?:|(?:[a-zA-Z0-9+\/]{3}=)|(?:[a-zA-Z0-9+\/]{2}==)|(?:[a-zA-Z0-9+\/]{1}===))$/gm
^ asserts position at start of a line (because the m flag is set)
Non-capturing group (?:[a-zA-Z0-9+\/]{4})*
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [a-zA-Z0-9+\/]
{4} matches the previous token exactly 4 times
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
+ matches the character + with index 43 in decimal (2B in hexadecimal, 53 in octal) literally (case sensitive)
\/ matches the character / with index 47 in decimal (2F in hexadecimal, 57 in octal) literally (case sensitive)
Non-capturing group (?:|(?:[a-zA-Z0-9+\/]{3}=)|(?:[a-zA-Z0-9+\/]{2}==)|(?:[a-zA-Z0-9+\/]{1}===))
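1st Alternative is empty and always finds a zero-length match (this lets input made up only of complete 4-character groups match, and also the empty string)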
2nd Alternative (?:[a-zA-Z0-9+\/]{3}=)
Non-capturing group (?:[a-zA-Z0-9+\/]{3}=)
Match a single character present in the list below [a-zA-Z0-9+\/]
{3} matches the previous token exactly 3 times
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
+ matches the character + with index 43 in decimal (2B in hexadecimal, 53 in octal) literally (case sensitive)
\/ matches the character / with index 47 in decimal (2F in hexadecimal, 57 in octal) literally (case sensitive)
= matches the character = with index 61 in decimal (3D in hexadecimal, 75 in octal) literally (case sensitive)
3rd Alternative (?:[a-zA-Z0-9+\/]{2}==)
Non-capturing group (?:[a-zA-Z0-9+\/]{2}==)
Match a single character present in the list below [a-zA-Z0-9+\/]
{2} matches the previous token exactly 2 times
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
+ matches the character + with index 43 in decimal (2B in hexadecimal, 53 in octal) literally (case sensitive)
\/ matches the character / with index 47 in decimal (2F in hexadecimal, 57 in octal) literally (case sensitive)
== matches the characters == literally (case sensitive)
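4th Alternative (?:[a-zA-Z0-9+\/]{1}===) (note: RFC 4648 padding never uses more than two = characters, so strings accepted only by this alternative, such as A===, are not valid Base64)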
Non-capturing group (?:[a-zA-Z0-9+\/]{1}===)
Match a single character present in the list below [a-zA-Z0-9+\/]
{1} matches the previous token exactly one time (meaningless quantifier)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
0-9 matches a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
+ matches the character + with index 43 in decimal (2B in hexadecimal, 53 in octal) literally (case sensitive)
\/ matches the character / with index 47 in decimal (2F in hexadecimal, 57 in octal) literally (case sensitive)
=== matches the characters === literally (case sensitive)
$ asserts position at the end of a line (because the m flag is set)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
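One consequence of the empty first alternative combined with the multi-line flag is that blank lines count as matches. The sketch below, in Python with a made-up three-line text, compares the two patterns from this article on that point; `re.findall` plays the role of the `g` flag and `re.MULTILINE` the role of `m`.

```python
import re

FIRST = (
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}={2})$"
)
SECOND = (
    r"^(?:[a-zA-Z0-9+/]{4})*"
    r"(?:|(?:[a-zA-Z0-9+/]{3}=)|(?:[a-zA-Z0-9+/]{2}==)|(?:[a-zA-Z0-9+/]{1}===))$"
)

text = "TWFu\n\nTQ=="  # the middle line is empty

# The first pattern needs at least four characters, so it skips the blank line.
first_hits = re.findall(FIRST, text, flags=re.MULTILINE)

# The second pattern reports the blank line as a zero-length match.
second_hits = re.findall(SECOND, text, flags=re.MULTILINE)

# One way to ignore those zero-length matches after the fact.
non_empty = [hit for hit in second_hits if hit]

print(first_hits)   # ['TWFu', 'TQ==']
print(second_hits)  # ['TWFu', '', 'TQ==']
print(non_empty)    # ['TWFu', 'TQ==']
```

Whether an empty string should count as valid Base64 depends on the application, so filtering it out is a design choice rather than a requirement.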
Hopefully this article helps you match Base64 strings with a regex pattern.
In this article, we looked at how regular expressions (regex) can be used to match Base64 strings. Base64 encoding converts binary data to ASCII text and is commonly used for email attachments and data transmission over the Internet. Regex lets us define search patterns to validate and manipulate such text.
We examined two regular expressions for matching Base64 strings, starting with a simple pattern and moving on to one based on RFC 4648. Understanding these patterns can help developers validate and process Base64 data effectively in their projects.
By mastering regex for Base64 strings, you can enhance your text processing skills and handle data encoding and decoding tasks with confidence.