A URL, or Uniform Resource Locator, is a string of text that specifies the location of a resource on the internet. It is typically used to identify web pages and other resources such as images and videos.
A URL consists of a protocol, a domain name, and sometimes a path to a specific resource. For example, the URL “https://www.example.com/page1.html” specifies a resource on the internet using the HTTPS protocol, at the domain “www.example.com”, and the specific resource is the file “page1.html”.
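As a quick illustration of these components, here is a minimal sketch that splits the example URL with the built-in `URL` class available in browsers and Node.js. The variable name `parsed` is only illustrative.

```javascript
// Parse the example URL from the paragraph above into its parts.
const parsed = new URL("https://www.example.com/page1.html");

console.log(parsed.protocol); // "https:"
console.log(parsed.hostname); // "www.example.com"
console.log(parsed.pathname); // "/page1.html"
```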
URLs are used to access resources on the internet by entering them into a web browser’s address bar or by clicking on a link. When you click on a link or enter a URL into your web browser, your computer sends a request to the server where the resource is stored, and the server sends back the resource to your computer, which your web browser then displays. This article shows how to build a regex for URLs and how to match a string against it.
Regex (short for regular expression) is a powerful tool used for searching and manipulating text. It is composed of a sequence of characters that define a search pattern. Regex can be used to find patterns in large amounts of text, validate user input, and manipulate strings. It is widely used in programming languages, text editors, and command line tools.
Structure of a Website URL
A website URL should have the following structure:
- It should start with http or https (the regex below also accepts ftp, or no scheme at all before the //)
- then it has to be followed by ://
- then it may or may not contain www.
- then it must be followed by a domain name
- then it must be followed by a top-level domain (TLD) like .com, .net, .io, etc.
- then it can also have a path or query params in the URL
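The criteria above can be sketched as a much smaller pattern before looking at the full one. This is a simplified illustration, not the regex used in the rest of the article: the name `simpleUrlPattern` is hypothetical, and the pattern only covers http/https, an optional www., dot-separated ASCII labels, a TLD, and an optional path and query string.

```javascript
// Hypothetical simplified pattern that follows the checklist step by step.
const simpleUrlPattern = new RegExp(
  "^https?:\\/\\/" +        // starts with http:// or https://
  "(?:www\\.)?" +           // may or may not contain www.
  "(?:[a-z0-9-]+\\.)+" +    // one or more domain labels, each ending in a dot
  "[a-z]{2,}" +             // top-level domain such as com, net, io
  "(?:\\/[^\\s?#]*)?" +     // optional path
  "(?:\\?\\S*)?$",          // optional query params
  "i"
);

const checklistSamples = [
  "https://www.example.com/page1.html",
  "http://debugpointer.io?x=1",
  "debugpointer.com",
  "https://example",
];
const checklistResults = checklistSamples.map((s) => simpleUrlPattern.test(s));
console.log(checklistResults);
```

Only the first two samples pass: the third has no scheme and the fourth has no TLD.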
Regex for checking if URL is valid or not
Regular Expression-
/^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[\/?#]\S*)?$/igm
Test string examples for the above regex-
Input String | Match Output |
---|---|
.as10 | does not match |
http://www.google.com | matches |
#@$some .qwq.eras | does not match |
https://www.debugpointer.com?name=something | matches |
debugpointer.com | does not match |
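The table above can be reproduced with a short script. This is a sketch under two assumptions: the Unicode ranges are written with a single backslash (`\u00a1-\uffff`), which is what a JavaScript regex literal needs, and the `g` and `m` flags are dropped because each input is tested as a whole string. The names `urlPattern` and `tableInputs` are illustrative.

```javascript
// The article's pattern as a regex literal, with only the i flag
// (assumption: g and m are not needed when testing one string at a time).
const urlPattern = /^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[\/?#]\S*)?$/i;

// The same inputs as the table, in the same order.
const tableInputs = [
  ".as10",
  "http://www.google.com",
  "#@$some .qwq.eras",
  "https://www.debugpointer.com?name=something",
  "debugpointer.com",
];

const tableResults = tableInputs.map((input) => urlPattern.test(input));
tableInputs.forEach((input, i) => {
  console.log(input, "|", tableResults[i] ? "matches" : "does not match");
});
```

The last input fails because the pattern requires `//` before the host, so a bare domain is not accepted.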
Here is a detailed explanation of the above regex-
/^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[\/?#]\S*)?$/igm
^ asserts position at start of a line
Non-capturing group (?:(?:(?:https?|ftp):)?\/\/)
Non-capturing group (?:(?:https?|ftp):)?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:https?|ftp)
1st Alternative https?
http matches the characters http literally (case insensitive)
s matches the character s with index 115 in decimal (73 in hex or 163 in octal) literally (case insensitive)
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
2nd Alternative ftp
ftp matches the characters ftp literally (case insensitive)
: matches the character : with index 58 in decimal (3A in hex or 72 in octal) literally (case insensitive)
\/ matches the character / with index 47 in decimal (2F in hex or 57 in octal) literally (case insensitive)
\/ matches the character / with index 47 in decimal (2F in hex or 57 in octal) literally (case insensitive)
Non-capturing group (?:\S+(?::\S*)?@)?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\S matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?::\S*)?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
: matches the character : with index 58 in decimal (3A in hex or 72 in octal) literally (case insensitive)
\S matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
@ matches the character @ with index 64 in decimal (40 in hex or 100 in octal) literally (case insensitive)
Non-capturing group (?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))
1st Alternative (?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))
Negative Lookahead (?!(?:10|127)(?:\.\d{1,3}){3})
Assert that the Regex below does not match
Non-capturing group (?:10|127)
1st Alternative 10
10 matches the characters 10 literally (case insensitive)
2nd Alternative 127
127 matches the characters 127 literally (case insensitive)
Non-capturing group (?:\.\d{1,3}){3}
{3} matches the previous token exactly 3 times
\. matches the character . with index 46 in decimal (2E in hex or 56 in octal) literally (case insensitive)
\d matches a digit (equivalent to [0-9])
Negative Lookahead (?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})
Assert that the Regex below does not match
Non-capturing group (?:169\.254|192\.168)
Non-capturing group (?:\.\d{1,3}){2}
Negative Lookahead (?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})
Assert that the Regex below does not match
172 matches the characters 172 literally (case insensitive)
\. matches the character . with index 46 in decimal (2E in hex or 56 in octal) literally (case insensitive)
Non-capturing group (?:1[6-9]|2\d|3[0-1])
Non-capturing group (?:\.\d{1,3}){2}
Non-capturing group (?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])
1st Alternative [1-9]\d?
2nd Alternative 1\d\d
3rd Alternative 2[01]\d
4th Alternative 22[0-3]
Non-capturing group (?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}
{2} matches the previous token exactly 2 times
\. matches the character . with index 46 in decimal (2E in hex or 56 in octal) literally (case insensitive)
Non-capturing group (?:1?\d{1,2}|2[0-4]\d|25[0-5])
Non-capturing group (?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))
\. matches the character . with index 46 in decimal (2E in hex or 56 in octal) literally (case insensitive)
Non-capturing group (?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4])
2nd Alternative (?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?)
Non-capturing group (?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+
Non-capturing group (?:[a-z\u00a1-\uffff]{2,}\.?)
Non-capturing group (?::\d{2,5})?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
: matches the character : with index 58 in decimal (3A in hex or 72 in octal) literally (case insensitive)
\d matches a digit (equivalent to [0-9])
Non-capturing group (?:[\/?#]\S*)?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [\/?#]
\S matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
$ asserts position at the end of a line
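The three negative lookaheads in the breakdown above are what keep private, loopback, and link-local IPv4 addresses from matching. To see them in isolation, here is a small sketch that copies only those lookaheads and follows them with a loose dotted-quad match. The name `notPrivateIPv4` is hypothetical, and the `\d{1,3}` octets are deliberately looser than the octet ranges in the full pattern.

```javascript
// Only the three lookaheads from the full pattern, then a loose IPv4 shape.
// Each (?!...) rejects the string if the address starts with that range.
const notPrivateIPv4 = /^(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})\d{1,3}(?:\.\d{1,3}){3}$/;

const ipSamples = [
  "8.8.8.8",     // public
  "10.0.0.1",    // 10.x.x.x, rejected by the first lookahead
  "127.0.0.1",   // loopback, rejected by the first lookahead
  "169.254.1.1", // link-local, rejected by the second lookahead
  "192.168.1.1", // private, rejected by the second lookahead
  "172.16.0.1",  // 172.16-31.x.x, rejected by the third lookahead
  "172.15.0.1",  // just outside the 172.16-31 range, so it passes
];

const ipResults = ipSamples.map((ip) => notPrivateIPv4.test(ip));
ipSamples.forEach((ip, i) => console.log(ip, ipResults[i]));
```

A lookahead consumes no characters, so after all three assertions pass, matching resumes from the start of the address.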
Global pattern flags
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
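The effect of the three flags is easier to see on a short input. The sketch below uses a simplified stand-in pattern (`lineUrl`, a hypothetical name) rather than the full regex, purely to keep the example readable. It also shows a common pitfall in JavaScript: with the `g` flag, `test()` remembers where the last match ended through `lastIndex`, so repeated calls on the same regex object can alternate between true and false.

```javascript
// Simplified stand-in for the full URL pattern (an assumption made for
// brevity); it carries the same i, g, and m flags.
const lineUrl = /^https?:\/\/\S+$/gim;

const text = [
  "HTTP://www.google.com",
  "debugpointer.com",
  "https://www.debugpointer.com?name=something",
].join("\n");

// m lets ^ and $ match at each line, g returns every match,
// and i accepts the upper-case "HTTP".
const found = text.match(lineUrl);
console.log(found);

// With g, test() resumes from lastIndex, so the second call starts at the
// end of the string, fails, and resets lastIndex to 0.
lineUrl.lastIndex = 0;
const firstCall = lineUrl.test("http://www.google.com");
const secondCall = lineUrl.test("http://www.google.com");
console.log(firstCall, secondCall);
```

When validating one string at a time, dropping the `g` flag (or resetting `lastIndex` before each call) avoids the alternating result.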
I hope this article helps you check whether a string is a valid URL. We looked at the components that make up a URL, including protocols, domains, and paths, and at how a browser requests a resource from a server and receives it back. We then introduced regular expressions (regex) as a tool for pattern matching and text manipulation, walked through a detailed breakdown of a regex designed to validate URLs, and ran it against practical test cases. With these concepts, you should have a clearer picture of how URLs are structured and how a regex can be used to check their validity.