Free Regular Expression Tester

The RegEx Lab in Mergemill Pro is a free regex test tool to easily learn and test regular expressions

A regular expression, or regex, is a pattern of text consisting of ordinary characters and metacharacters, which together describe the strings to match when searching and replacing text. With regular expressions, you can quickly search for specific characters and search by position.

This introduction is specific to Xojo's implementation of regular expressions in Mergemill Pro. However, most of the following ideas and syntax should also apply to other implementations.

To help you test and debug your regexes, Mergemill Pro features a regular expression tester called the RegEx Lab. If you are not familiar with regular expressions, the best way to start is to follow this introduction and try out every regex pattern described here in the RegEx Lab. Since you may keep on using the RegEx Lab without any restriction on a pre-registered copy of Mergemill Pro, this essentially makes the software a FREE regex test tool for mastering regular expressions.


Basic Regex Patterns

Pattern

Description

abc

Matches all of a, b, and c in order. For example, regex "123" matches "01123345".

a|b|c

Matches one of a, b, or c. For example, regex "1|2|3" matches "01123345".

[a-z0-9]

Matches any single character of the set enclosed in square brackets. Examples: [aeiou] matches any one of the vowels. [a-zA-Z0-9] matches any alphanumeric character. [a-e] matches any character in the range a-e, inclusive. To match a "-", place it at the beginning or end of the set. For example, [a-c-] finds a character in the range a-c or the "-" sign. Other useful patterns are: "[[]" finds a "[". "[]]" finds a "]".

[^a-z0-9]

Matches any single character NOT in the set. For example, [^aeiou] matches any character EXCEPT a vowel. To find the caret character, place it anywhere except the first position after the opening bracket. For example, [a-e^] finds a character in the range a-e or the caret character.

\d

Matches a digit. Same as [0-9].

\D

Matches a non-digit. Same as [^0-9].

\w

Matches an alphanumeric (word) character. Same as [a-zA-Z0-9_].

\W

Matches a non-word character. Same as [^a-zA-Z0-9_].

\s

Matches a whitespace character (space, tab, return, line feed, form feed).

\S

Matches a non-whitespace character. Please note that [\D\S] is NOT the same as [^\d\s]. In fact, [\D\S] matches ANYTHING.

\n

Matches a newline (or line feed).

\r

Matches a return.

\t

Matches a tab.

\f

Matches a formfeed.

\0

Matches a null character.

\000

Also matches a null character. This is a specific case of \nnn.

\nnn

Matches an ASCII character of the octal value nnn. "\15" is the same as "\r".

\xnn

Matches an ASCII character of the hexadecimal value nn. So another way of searching for the return character is to use \xD.

\cX

Matches an ASCII control character. The letter after the backslash is always a lowercase c. The second letter is an uppercase letter A through Z, to indicate Control+A through Control+Z. These are equivalent to \x01 through \x1A.

\metachar

Matches the metacharacter, such as \., \\, and \|.

. (dot)

Matches any character except a line break. Used alone, the dot selects the first character in the target string; each repeated search finds the next character, until a line break is reached. For example, "5.." matches "567" in "0123456789". The dot means [^\n] in Unix, [^\r\n] in Windows, and [^\r] in Mac OS. Avoid the dot where you can; a regex is more efficient when it specifies more precisely the strings you want to match. Optimizing a regex is important if it is to be used repeatedly and on large chunks of data.
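A few rows of the table can be tried outside the RegEx Lab as well. The sketch below uses Python's re module, which is an assumption of this example and not the engine inside Mergemill Pro, so small differences are possible; the helper name first_match is made up for illustration.

```python
import re

def first_match(pattern, target):
    """Return the text of the first match of pattern in target, or None."""
    m = re.search(pattern, target)
    return m.group(0) if m else None

# Patterns and targets quoted in the table.
literal = first_match(r"123", "01123345")        # three literals in order
alternation = first_match(r"1|2|3", "01123345")  # first character that is 1, 2, or 3
dot = first_match(r"5..", "0123456789")          # "5" plus any two characters

# Character sets: a trailing "-" is literal, and a leading "^" negates the set.
hyphen = first_match(r"[a-c-]", "x-y")
non_vowel = first_match(r"[^aeiou]", "aeiouz")

# Python requires exactly two hex digits, so the return character is \x0D here.
carriage_return = first_match(r"\x0D", "line\rnext")

# [\D\S] accepts every character: each one is a non-digit or a non-whitespace.
matches_anything = all(re.fullmatch(r"[\D\S]", ch) for ch in "7 a\n\t_")
```

Note that Python rejects the one-digit form \xD, so the two-digit form \x0D is used instead.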


Metacharacters in Character Sets

The metacharacters remaining as such inside a character set are the closing bracket "]", the backslash "\", the caret "^" and the hyphen "-". Other metacharacters behave as ordinary characters, and do not need to be escaped by a backslash. To search for a star or plus for example, simply use [+*]. To include a backslash as a character without any special meaning inside a character set, you have to escape it with another backslash. So [\\x] matches a backslash or an x. The closing bracket "]", the caret "^" and the hyphen "-" can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning.
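The rules for sets can be checked with a short sketch. It uses Python's re module as a stand-in engine (an assumption; Mergemill Pro itself is not involved), and the sample strings are invented for the example.

```python
import re

# Ordinary inside a set: "+" and "*" need no backslash.
plus_or_star = re.findall(r"[+*]", "2+3*4")

# A backslash must be doubled to stand for itself inside a set.
# The target "x\\y" is the three characters x, backslash, y.
backslash_or_x = re.findall(r"[\\x]", "x\\y")

# "]", "^" and "-" can be escaped with a backslash ...
escaped = re.findall(r"[\]\^\-]", "a]b^c-d")

# ... or placed where they lose their special meaning:
# "]" first, "^" anywhere but first, "-" last.
positional = re.findall(r"[]^-]", "a]b^c-d")
```

Both the escaped and the positional form find the same three characters.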


Anchors (position matching)

Char

Description

^

Matches the beginning of a line or string. For example, "^Name" finds lines that begin with "Name".

$

Matches the end of a line or string. For example, ".$" finds the last character in a line.

\b

Matches a word boundary. For example, "\bword\b" does a whole-word search.

\B

Matches a non-word boundary. It matches where \b does not.
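The anchors can be tried in a short sketch using Python's re module (an assumption of this example, not the Mergemill Pro engine). In Python, "^" and "$" match at every line only when the re.MULTILINE flag is passed; the sample text is invented.

```python
import re

text = "Name: Ada\nAge: 36\nName: Bob"

# re.MULTILINE makes ^ and $ match at each line, not only at the ends of the string.
name_lines = re.findall(r"^Name", text, flags=re.MULTILINE)
last_chars = re.findall(r".$", text, flags=re.MULTILINE)

# \b ... \b performs a whole-word search: "sword" and "words" are skipped.
whole_words = re.findall(r"\bword\b", "word sword words word.")

# \B matches where \b does not, so only the "word" inside "sword" is found.
inside_words = [m.start() for m in re.finditer(r"\Bword", "word sword")]
```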


Repetition

Char

Description

x?

Repeats x zero or one time. That is, x is optional in the strings to be matched. For example, "12?3" matches both "0123456789" and "013456789".

x*

Repeats x zero or more times in the strings to be matched. For example, "12*" matches "01222223456789".

x+

Repeats x one or more times in the strings to be matched. For example, [0-9]+ finds a string of one or more consecutive digits, such as "32" in "Win32".

x{m,n}

Repeats x m to n times in the strings to be matched.

x{n}

Repeats x exactly n times in the strings to be matched.

x{n,}

Repeats x at least n times in the strings to be matched.
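The quantifiers in this table can be exercised with the sketch below. It relies on Python's re module (an assumption; quantifiers there are greedy unless followed by "?"), and the helper first_match and the string runs are made up for illustration.

```python
import re

def first_match(pattern, target):
    """Return the text of the first match of pattern in target, or None."""
    m = re.search(pattern, target)
    return m.group(0) if m else None

optional_present = first_match(r"12?3", "0123456789")  # the "2" is there
optional_absent = first_match(r"12?3", "013456789")    # the "2" is missing
zero_or_more = first_match(r"12*", "01222223456789")
one_or_more = first_match(r"[0-9]+", "Win32")

# Counted repetition, anchored with \b so that only whole runs of digits qualify.
runs = "1 22 333 4444"
two_to_three = re.findall(r"\b\d{2,3}\b", runs)
exactly_three = re.findall(r"\b\d{3}\b", runs)
at_least_three = re.findall(r"\b\d{3,}\b", runs)
```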


Greediness

The repetition operators (or quantifiers) are NOT greedy in Mergemill Pro unless the Greedy option is selected. A greedy quantifier repeats the preceding token as many times as it can while still letting the rest of the regex match. So a greedy plus in the regex "<.+>" starts with the leftmost "<" and includes everything up to the last ">" in the string. A greedy match therefore won't work if you want to find the first tag in an HTML document.

Mergemill Pro lets you control the Greedy property of the regex via a checkbox. You may also place a "?" directly after a "*" or "+" to reverse the "greediness" setting. So when applied to "aaaa" with the Greedy option selected, "a+?" returns "a" and "a+" returns "aaaa".
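The difference between the two behaviours can be seen in a small sketch. It uses Python's re module, which is an assumption here: Python has no Greedy checkbox, its quantifiers are always greedy, and a trailing "?" makes one lazy, which corresponds to the Greedy option being selected. The HTML string is invented.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: from the leftmost "<" to the last ">" in the string.
greedy_tag = re.search(r"<.+>", html).group(0)

# Lazy: the "?" after "+" stops at the first ">" that completes a match.
lazy_tag = re.search(r"<.+?>", html).group(0)

# The "aaaa" example from the text.
greedy_run = re.search(r"a+", "aaaa").group(0)
lazy_run = re.search(r"a+?", "aaaa").group(0)
```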


Grouping and Backreferences

You can group a part of a regex together by placing it inside parentheses. This allows you to apply a regex operator, such as a quantifier, to the entire group. For example, "Nov(ember)?" will match both "Nov" and "November".

Besides grouping part of a regex together, parentheses also create a "backreference". A backreference stores the part of the string matched by the part of the regex inside the parentheses. It can then be referenced later in the search pattern, or in the replacement pattern, by \1, \2, etc. for the first group, the second group, and so on. For example, "\b(\w+)\s+\1\b" finds double words such as "the the". To match a year followed by an era, write "(\d+)\s(B\.C\.|A\.D\.|BC|AD)"; \1 then refers only to the year number and \2 contains the letters. The dots are escaped so that they match literal periods rather than any character.

Please note:

  1. A backreference stores only the last text its group matched. Applied to "cab", "([abc]+)" captures "cab" while "([abc])+" keeps only "b".
  2. The round brackets and backreferences such as \1 have NO special meanings inside [].
  3. Backreferences in search patterns must use the backslash, like \1, \2, etc., whereas in replacement patterns you may use either \1 or $1, and so on.
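Grouping and backreferences can be illustrated with the sketch below. It uses Python's re module (an assumption; Python accepts \1 in replacement patterns but not $1), and the sample sentences are invented.

```python
import re

# An optional group: "Nov(ember)?" matches both spellings.
months = [m.group(0) for m in re.finditer(r"Nov(ember)?", "Nov 5, November 6")]

# \1 inside the search pattern repeats whatever group 1 matched.
doubled = re.search(r"\b(\w+)\s+\1\b", "Paris in the the spring").group(0)

# A group keeps only its last match: compare the two captures on "cab".
whole_capture = re.search(r"([abc]+)", "cab").group(1)
last_capture = re.search(r"([abc])+", "cab").group(1)

# Backreferences in the replacement swap the year and the era.
swapped = re.sub(r"(\d+)\s(B\.C\.|A\.D\.|BC|AD)", r"\2 \1", "founded 1541 B.C.")
```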


Replacement Patterns

Pattern

Description

$'

Replaced with the entire target string following the matched text.

$&

Same as \0 or $0, it contains the entire matched string. For example, if "\d\d\d\d\sB\.C\." finds "1541 B.C.", then the replacement pattern "the year $&" results in "the year 1541 B.C.".

$0-$50

Same as \0 to \50. They evaluate to nothing if the subexpression corresponding to the number doesn't exist, otherwise they contain the last-matched subpatterns, defined by the parentheses in the search pattern.

\xnn

Replaced with the character represented by nn in hexadecimal.

\nnn

Replaced with the character represented by nnn in octal.

\cX

Replaced with the character that is the control version of X.
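Replacement syntax varies between engines more than search syntax does. As a rough comparison only, the sketch below reproduces three of these rows in Python's re module, which is an assumption of this example: Python has no $& or $' tokens, so \g<0> and a replacement function stand in for them, and the character escape is written as an ordinary Python string literal.

```python
import re

# $& (the entire matched string) corresponds to \g<0> in Python.
whole_match = re.sub(r"\d\d\d\d\sB\.C\.", r"the year \g<0>", "1541 B.C.")

# $' (the text following the match) has no Python token; a replacement
# function can read it from the match object instead.
def text_after(m):
    return m.string[m.end():]

after_match = re.sub(r"b", text_after, "abc")

# \xnn: here the Python string literal "\x09" already is the tab character.
tabbed = re.sub(r",", "\x09", "a,b")
```

In the second call the "b" in "abc" is replaced by the text that follows it, "c", which gives "acc".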


Extension Mechanism

Pattern

Description

(?#text)

Use this to insert a comment.

(?:regex)

This groups part of a regex without creating a backreference, so no \1, \2, etc. refers to it.

(?=regex)

This is a zero-width positive look-ahead assertion. For example, \w+(?=\t) matches a word followed by a tab, without including the tab in $&.

(?!regex)

This is a zero-width negative look-ahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that isn't followed by "bar".

(?<=regex)

This is a zero-width positive look-behind assertion. For example, (?<=\t)\w+ matches a word that follows a tab, without including the tab in $&. It works only for a fixed-width look-behind regex.

(?<!regex)

This is a zero-width negative look-behind assertion. It works only for a fixed-width look-behind regex. For example, "\b\w+(?<!s)\b" finds a word that does not end with an "s". Without using lookbehind, the regex becomes \b\w*[^s\W]\b.

Please note:

  1. Lookaround is zero-width, i.e. as soon as the condition is satisfied, the regex engine forgets about everything inside the lookaround. It therefore does not create a backreference, and is not included in the count towards numbering the backreferences.
  2. Any valid regex can be used inside the lookahead, such as (?=regex) or (?=(regex)). If it contains capturing parentheses like the second one, the backreferences will be saved. Example: "(?=(\d+))\w+\1" will NOT match 123x12, but will match 56x56 in 456x56.
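The lookaround examples above can be checked with the following sketch. It uses Python's re module, an assumption of this example; Python also restricts look-behind to fixed-width patterns. The sample strings are invented, apart from 123x12 and 456x56.

```python
import re

# Look-ahead: the tab is required but left out of the match.
before_tab = re.findall(r"\w+(?=\t)", "name\tvalue")

# Negative look-ahead: only the "foo" in "foobaz" qualifies (it starts at index 7).
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Look-behind: the word that follows the tab.
after_tab = re.findall(r"(?<=\t)\w+", "name\tvalue")

# Words not ending in "s", with and without look-behind.
words = "cats dog birds fish"
with_lookbehind = re.findall(r"\b\w+(?<!s)\b", words)
without_lookbehind = re.findall(r"\b\w*[^s\W]\b", words)

# A capturing group inside a look-ahead keeps its backreference.
no_match = re.search(r"(?=(\d+))\w+\1", "123x12")
match = re.search(r"(?=(\d+))\w+\1", "456x56")
matched_text = match.group(0) if match else None
```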


Regex Options in Mergemill Pro

One Line ignores internal newlines for the purposes of matching against "^" and "$".
Case Sensitive specifies whether case is to be considered when matching a string.
Dot Matches All sets the dot to match everything, including newlines, which it normally doesn't match.
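For readers who also work in a scripting language, the three options have rough counterparts in Python's re module. The mapping below is an assumption made for illustration, not something documented by Mergemill Pro: One Line resembles leaving out re.MULTILINE, switching Case Sensitive off resembles re.IGNORECASE, and Dot Matches All resembles re.DOTALL.

```python
import re

text = "first line\nSecond line"

# Without re.MULTILINE (like One Line), ^ matches only at the start of the string.
one_line = re.findall(r"^\w+", text)
per_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Case is significant unless re.IGNORECASE is passed.
case_sensitive = re.findall(r"second", text)
case_insensitive = re.findall(r"second", text, flags=re.IGNORECASE)

# The dot skips newlines unless re.DOTALL is passed.
dot_default = re.search(r"line.Second", text)
dot_all = re.search(r"line.Second", text, flags=re.DOTALL)
```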


Copyright © 2001-2017 Cross Culture Ltd. All Rights Reserved.