HTTP, or Hypertext Transfer Protocol, is a protocol for transmitting data on the internet. It is the foundation of the World Wide Web, and it is used to transfer data between a web server and a web client (usually a web browser).
An HTTP URL is a type of URL that specifies a resource on the internet using the HTTP protocol. HTTP URLs typically start with “http://” or “https://”, and they are used to access web pages and other resources on the internet.
In this article, let's understand how to create a regex for an HTTP URL and how to match HTTP URLs against it.
Regex (short for regular expression) is a powerful tool used for searching and manipulating text. It is composed of a sequence of characters that define a search pattern. Regex can be used to find patterns in large amounts of text, validate user input, and manipulate strings. It is widely used in programming languages, text editors, and command line tools.
Structure of an HTTP URL
The HTTP URL should meet the following criteria and follow this structure:
- It should start with http
- then it has to be followed by ://
- then it may or may not contain www.
- then it must be followed by a domain name
- then it will be followed by a top-level domain (TLD) like .com, .net, .io, etc.
- then it can also have query params in the URL
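The criteria above can be sketched as a deliberately simplified pattern. This is an illustration only, not the full regex used in this article: the names `SIMPLE_HTTP_URL` and `looks_like_http_url` are hypothetical, the pattern assumes ASCII host names, and it also lets a path or fragment follow the domain.

```python
import re

# Simplified, illustrative pattern that follows the listed criteria one by one.
# It is NOT the full regex from this article.
SIMPLE_HTTP_URL = re.compile(
    r"^http://"                      # starts with http, followed by ://
    r"(?:www\.)?"                    # may or may not contain www.
    r"[a-z0-9-]+(?:\.[a-z0-9-]+)*"   # domain name (one or more labels)
    r"\.[a-z]{2,}"                   # top-level domain such as .com, .net, .io
    r"(?:[/?#]\S*)?$",               # optional path, query params, or fragment
    re.IGNORECASE,
)


def looks_like_http_url(text: str) -> bool:
    """Return True if text satisfies the simplified criteria."""
    return SIMPLE_HTTP_URL.match(text) is not None


for candidate in ["http://www.google.com", "http://google", "debugpointer.com"]:
    print(candidate, looks_like_http_url(candidate))
```

Building the pattern from one fragment per criterion keeps each requirement visible, which makes it easier to compare against the much stricter regex in the next section.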
Regex for checking if an HTTP URL is valid or not
Regular Expression-
/^(?:(?:(?:http):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[\/?#]\S*)?$/igm
Test string examples for the above regex-
Input String | Match Output |
---|---|
.as10 | does not match |
http://www.google.com | matches |
#@$some .qwq.eras | does not match |
http://www.debugpointer.com | matches |
debugpointer.com | does not match |
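The table above can be reproduced with a short script. The sketch below transcribes the regex into Python's `re` module, on the assumption that the Unicode ranges are meant as single-backslash escapes (`\u00a1-\uffff`). The `i` and `m` flags become `re.IGNORECASE` and `re.MULTILINE`; the names `HTTP_URL` and `is_http_url` are hypothetical.

```python
import re

# The article's regex, split into pieces for readability.
# Assumption: the Unicode ranges use single-backslash escapes.
HTTP_URL = re.compile(
    r"^(?:(?:(?:http):)?\/\/)"                            # optional http:, then //
    r"(?:\S+(?::\S*)?@)?"                                 # optional user:password@
    r"(?:"
    r"(?!(?:10|127)(?:\.\d{1,3}){3})"                     # reject 10.x.x.x and 127.x.x.x
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"          # reject 169.254.x.x and 192.168.x.x
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"     # reject 172.16.x.x to 172.31.x.x
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"                 # first octet: 1 to 223
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"            # second and third octets
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"          # last octet: 1 to 254
    r"|"
    r"(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+"
    r"(?:[a-z\u00a1-\uffff]{2,}\.?)"                      # top-level domain
    r")"
    r"(?::\d{2,5})?"                                      # optional port
    r"(?:[\/?#]\S*)?$",                                   # optional path, query, or fragment
    re.IGNORECASE | re.MULTILINE,
)


def is_http_url(text: str) -> bool:
    """Return True if the regex matches text."""
    return HTTP_URL.search(text) is not None


# The test strings from the table above.
for sample in [
    ".as10",
    "http://www.google.com",
    "#@$some .qwq.eras",
    "http://www.debugpointer.com",
    "debugpointer.com",
]:
    print(repr(sample), "matches" if is_http_url(sample) else "does not match")
```

With this transcription, `https://www.google.com` is rejected because the scheme group only lists `http`, while a scheme-relative string such as `//www.google.com` is accepted because the `http:` prefix is optional.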
Here is a detailed explanation of the above regex-
/^(?:(?:(?:http):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))(?::\d{2,5})?(?:[\/?#]\S*)?$/igm
^ asserts position at start of a line
Non-capturing group (?:(?:(?:http):)?\/\/)
Non-capturing group (?:(?:http):)?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:http)
http matches the characters http literally (case insensitive)
: matches the character : with index 58₁₀ (3A₁₆ or 72₈) literally (case insensitive)
\/ matches the character / with index 47₁₀ (2F₁₆ or 57₈) literally (case insensitive)
\/ matches the character / with index 47₁₀ (2F₁₆ or 57₈) literally (case insensitive)
Non-capturing group (?:\S+(?::\S*)?@)?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
\S matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?::\S*)?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
: matches the character : with index 58₁₀ (3A₁₆ or 72₈) literally (case insensitive)
\S matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
@ matches the character @ with index 64₁₀ (40₁₆ or 100₈) literally (case insensitive)
Non-capturing group (?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?))
1st Alternative (?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))
Negative Lookahead (?!(?:10|127)(?:\.\d{1,3}){3})
Assert that the Regex below does not match
Non-capturing group (?:10|127)
1st Alternative 10
10 matches the characters 10 literally (case insensitive)
2nd Alternative 127
127 matches the characters 127 literally (case insensitive)
Non-capturing group (?:\.\d{1,3}){3}
{3} matches the previous token exactly 3 times
\. matches the character . with index 46₁₀ (2E₁₆ or 56₈) literally (case insensitive)
\d matches a digit (equivalent to [0-9])
{1,3} matches the previous token between 1 and 3 times, as many times as possible, giving back as needed (greedy)
Negative Lookahead (?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})
Assert that the Regex below does not match
Non-capturing group (?:169\.254|192\.168)
1st Alternative 169\.254
2nd Alternative 192\.168
Non-capturing group (?:\.\d{1,3}){2}
{2} matches the previous token exactly 2 times
\. matches the character . with index 46₁₀ (2E₁₆ or 56₈) literally (case insensitive)
\d matches a digit (equivalent to [0-9])
Negative Lookahead (?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})
Assert that the Regex below does not match
172 matches the characters 172 literally (case insensitive)
\. matches the character . with index 46₁₀ (2E₁₆ or 56₈) literally (case insensitive)
Non-capturing group (?:1[6-9]|2\d|3[0-1])
Non-capturing group (?:\.\d{1,3}){2}
Non-capturing group (?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])
1st Alternative [1-9]\d?
2nd Alternative 1\d\d
3rd Alternative 2[01]\d
4th Alternative 22[0-3]
Non-capturing group (?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}
{2} matches the previous token exactly 2 times
\. matches the character . with index 46₁₀ (2E₁₆ or 56₈) literally (case insensitive)
Non-capturing group (?:1?\d{1,2}|2[0-4]\d|25[0-5])
Non-capturing group (?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))
\. matches the character . with index 46₁₀ (2E₁₆ or 56₈) literally (case insensitive)
Non-capturing group (?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4])
2nd Alternative (?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+(?:[a-z\u00a1-\uffff]{2,}\.?)
Non-capturing group (?:(?:[a-z0-9\u00a1-\uffff][a-z0-9\u00a1-\uffff_-]{0,62})?[a-z0-9\u00a1-\uffff]\.)+
Non-capturing group (?:[a-z\u00a1-\uffff]{2,}\.?)
Non-capturing group (?::\d{2,5})?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
: matches the character : with index 58₁₀ (3A₁₆ or 72₈) literally (case insensitive)
\d matches a digit (equivalent to [0-9])
Non-capturing group (?:[\/?#]\S*)?
? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [\/?#]
\S matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])
$ asserts position at the end of a line
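The least obvious part of the walkthrough above is the first host alternative, where three negative lookaheads filter out private and loopback addresses before the octets are matched. As a rough sketch, that alternative can be lifted out and tested on its own in Python. The `^` and `$` anchors are added here only so the fragment can run in isolation, and the names `IPV4_HOST` and `accepts_ip` are hypothetical.

```python
import re

# The IPv4 alternative from the host group, anchored so it can be tested alone.
IPV4_HOST = re.compile(
    r"^(?!(?:10|127)(?:\.\d{1,3}){3})"                   # not 10.x.x.x or 127.x.x.x
    r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"         # not 169.254.x.x or 192.168.x.x
    r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"    # not 172.16.x.x to 172.31.x.x
    r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"                # first octet: 1 to 223
    r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"           # second and third octets: 0 to 255
    r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))$"        # last octet: 1 to 254
)


def accepts_ip(address: str) -> bool:
    """Return True if the IPv4 alternative accepts address."""
    return IPV4_HOST.match(address) is not None


for address in ["8.8.8.8", "10.0.0.1", "192.168.0.1", "172.16.0.1", "172.15.0.1", "224.0.0.1"]:
    print(address, accepts_ip(address))
```

Because each lookahead runs at the same starting position, an address has to pass all three checks before the octet groups are tried. The octet ranges then also exclude a first octet of 0 or above 223 and a last octet of 0 or 255.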
Global pattern flags
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
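To see what each flag contributes, here is a small Python sketch. It uses a short stand-in pattern (`^http://\S+$`, a hypothetical simplification rather than the article's regex) so the effect of the flags is easy to follow. In Python's `re` module, `i` corresponds to `re.IGNORECASE` and `m` to `re.MULTILINE`; there is no `g` flag, and `re.findall` plays that role by returning every match.

```python
import re

# Hypothetical short pattern used only to demonstrate the flags.
SHORT_PATTERN = r"^http://\S+$"

text = "HTTP://www.google.com\ndebugpointer.com\nhttp://www.debugpointer.com"

# No flags: ^ and $ only anchor the whole string, and matching is case sensitive.
no_flags = re.findall(SHORT_PATTERN, text)

# m only: each line is tried, but the upper-case HTTP line still fails.
multiline_only = re.findall(SHORT_PATTERN, text, re.MULTILINE)

# i only: case is ignored, but $ cannot match at the end of the first line.
ignorecase_only = re.findall(SHORT_PATTERN, text, re.IGNORECASE)

# i and m together, with findall standing in for g: every matching line is returned.
both_flags = re.findall(SHORT_PATTERN, text, re.IGNORECASE | re.MULTILINE)

# Without the "global" behaviour, re.search stops at the first match.
first_only = re.search(SHORT_PATTERN, text, re.IGNORECASE | re.MULTILINE)

print(no_flags)
print(multiline_only)
print(ignorecase_only)
print(both_flags)
print(first_only.group(0))
```

When validating a single input string rather than scanning a block of text, the multi-line and global behaviour is usually unnecessary, and only case insensitivity affects the result.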
Hopefully this article helped you check whether a string is a valid HTTP URL. In conclusion, understanding the structure of HTTP URLs makes it easier to validate them. Regex is a powerful text manipulation tool that can validate and match HTTP URLs effectively. By crafting a regex pattern that accounts for the various URL components, developers can reject malformed URLs before they reach the rest of an application, which improves both user experience and data integrity.