-------------------------------------------------------------------------------
--- REGULAR EXPRESSIONS MANUAL -----------------------------------------------
-------------------------------------------------------------------------------
This file is based on the original API documentation of Jeffery Stuart's
Pattern class.
-------------------------------------------------------------------------------
You can use regular expressions in highlight language definitions.
Note that the expression has to be defined as regex(<RE> <, GROUP-NUM>), where
RE is the regex string, and GROUP-NUM is an optional parameter which defines
the group number whose match should be returned.
See README for the definition parameters that support regular expressions.
Content:
--------
- Regex rules
- Backslashes, escapes, and quoting
- Character Classes
- Groups and capturing
- Examples
-------------------------------------------------------------------------------
Regex rules:
------------
Construct       Matches
Characters
x               The character x
\\              The character \
\0nn            The character with octal ASCII value nn
\0nnn           The character with octal ASCII value nnn
\xhh            The character with hexadecimal ASCII value hh
\t              A tab character
\r              A carriage return character
\n              A new-line character
Character Classes
[abc]           Either a, b, or c
[^abc]          Any character but a, b, or c
[a-zA-Z]        Any character ranging from a thru z, or A thru Z
[^a-zA-Z]       Any character except those ranging from a thru z, or A thru Z
[a\-z]          Either a, -, or z
[a-z[A-Z]]      Same as [a-zA-Z]
[a-z&&[g-i]]    Any character in the intersection of a-z and g-i
[a-z&&[^g-i]]   Any character in a-z and not in g-i
Predefined character classes
.               Any character. Multiline matching must be compiled into the
                pattern for . to match a \r or a \n. Even if multiline matching
                is enabled, . will not match a \r\n, only a \r or a \n.
\d              [0-9]
\D              [^\d]
\s              [ \t\r\n\x0B]
\S              [^\s]
\w              [a-zA-Z0-9_]
\W              [^\w]
POSIX character classes
\p{Lower}       [a-z]
\p{Upper}       [A-Z]
\p{ASCII}       [\x00-\x7F]
\p{Alpha}       [a-zA-Z]
\p{Digit}       [0-9]
\p{Alnum}       [\w&&[^_]]
\p{Punct}       [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
\p{XDigit}      [a-fA-F0-9]
Boundary Matches
^               The beginning of a line. Also matches the beginning of input.
$               The end of a line. Also matches the end of input.
\b              A word boundary
\B              A non-word boundary
\A              The beginning of input
\G              The end of the previous match. Ensures that a "next" match will
                only happen if it begins with the character immediately
                following the end of the "current" match.
\Z              The end of input. Will also match if there is a single trailing
                \r\n, a single trailing \r, or a single trailing \n.
\z              The end of input
Greedy Quantifiers (match as much as possible, backtracking if needed)
x?              x, either zero times or one time
x*              x, zero or more times
x+              x, one or more times
x{n}            x, exactly n times
x{n,}           x, at least n times
x{,m}           x, at most m times
x{n,m}          x, at least n times and at most m times
Possessive Quantifiers (match as much as possible, never backtrack)
x?+             x, either zero times or one time
x*+             x, zero or more times
x++             x, one or more times
x{n}+           x, exactly n times
x{n,}+          x, at least n times
x{,m}+          x, at most m times
x{n,m}+         x, at least n times and at most m times
Reluctant Quantifiers (match as little as possible)
x??             x, either zero times or one time
x*?             x, zero or more times
x+?             x, one or more times
x{n}?           x, exactly n times
x{n,}?          x, at least n times
x{,m}?          x, at most m times
x{n,m}?         x, at least n times and at most m times
Operators
xy              x then y
x|y             x or y
(x)             x as a capturing group
Quoting
\Q              Nothing, but treat every character (including backslashes)
                literally until a matching \E
\E              Nothing, but ends its matching \Q
Special Constructs
(?:x)           x, but not as a capturing group
(?=x)           x, via positive lookahead. This means that the expression will
                match only if it is trailed by x. It will not "eat" any of the
                characters matched by x.
(?!x)           x, via negative lookahead. This means that the expression will
                match only if it is not trailed by x. It will not "eat" any of
                the characters matched by x.
(?<=x)          x, via positive lookbehind. x cannot contain any quantifiers.
(?<!x)          x, via negative lookbehind. x cannot contain any quantifiers.
(?>x)           x{1}+
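The three quantifier families and the lookaround constructs are easiest to tell
apart on a concrete input. The following is a minimal sketch that uses Java's
java.util.regex as a stand-in, on the assumption that it shares the syntax
listed above; highlight uses its own engine, so details may differ. The class
name QuantifierDemo and the helper firstMatch are invented for this
illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Returns the first match of regex in input, or null if there is none.
    // (Hypothetical helper, not part of highlight.)
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"a\" and \"b\"";
        // Greedy: .* takes everything, then backtracks to the last quote.
        System.out.println(firstMatch("\".*\"", input));   // "a" and "b"
        // Reluctant: .*? stops at the first closing quote.
        System.out.println(firstMatch("\".*?\"", input));  // "a"
        // Possessive: .*+ never gives characters back, so the closing
        // quote can never be matched.
        System.out.println(firstMatch("\".*+\"", input));  // null
        // Positive lookahead: the name must be followed by "(", which is
        // checked but not consumed.
        System.out.println(firstMatch("\\w+(?=\\()", "call foo(x)"));    // foo
        // Negative lookbehind: a number that is not preceded by "$".
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", "$10 or 20"));  // 20
    }
}
```

The possessive form is mainly useful when a failed match should be rejected
quickly instead of after exhaustive backtracking.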
Backslashes, escapes, and quoting:
----------------------------------
The backslash character ('\') serves to introduce escaped constructs, as defined
in the table above, as well as to quote characters that otherwise would be
interpreted as unescaped constructs. Thus the expression \\ matches a single
backslash and \{ matches a left brace.
It is an error to use a backslash prior to any alphabetic character that does
not denote an escaped construct; these are reserved for future extensions to the
regular-expression language. A backslash may be used prior to a non-alphabetic
character regardless of whether that character is part of an unescaped
construct.
It is necessary to double backslashes in string literals that represent regular
expressions to protect them from interpretation by a compiler. The string
literal "\b", for example, matches a single backspace character when interpreted
as a regular expression, while "\\b" matches a word boundary. The string literal
"\(hello\)" is illegal and leads to a compile-time error; in order to match the
string (hello) the string literal "\\(hello\\)" must be used.
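The doubling rule concerns string literals in program source code; the
definitions in the Examples section of this file are written with single
backslashes. The sketch below shows the effect in Java, assuming
java.util.regex behaves like the engine described here. EscapeDemo and found
are names made up for the example.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // True if regex matches somewhere in input. (Hypothetical helper.)
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\b" reaches the regex engine as \b, a word boundary.
        System.out.println(found("\\bcat\\b", "a cat sat"));     // true
        System.out.println(found("\\bcat\\b", "concatenate"));   // false
        // "\b" is turned into a backspace character by the compiler,
        // so the engine looks for a literal backspace.
        System.out.println(found("\b", "a cat sat"));            // false
        System.out.println(found("\b", "back\bspace"));          // true
        // "\\(hello\\)" reaches the engine as \(hello\) and matches the
        // literal text (hello).
        System.out.println(found("\\(hello\\)", "say (hello)")); // true
        // Pattern.quote wraps its argument in \Q...\E, so the dot is literal.
        System.out.println(found(Pattern.quote("a.b"), "axb"));  // false
        System.out.println(found(Pattern.quote("a.b"), "a.b"));  // true
    }
}
```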
Character Classes:
------------------
Character classes may appear within other character classes, and may be composed
by the union operator (implicit) and the intersection operator (&&). The union
operator denotes a class that contains every character that is in at least one
of its operand classes. The intersection operator denotes a class that contains
every character that is in both of its operand classes.
The precedence of character-class operators is as follows, from highest to
lowest:
1    Literal escape    \x
2    Range             a-z
3    Grouping          [...]
4    Intersection      [a-z&&[aeiou]]
5    Union             [a-e][i-u]
Note that a different set of metacharacters is in effect inside a character
class than outside a character class. For instance, the regular expression .
loses its special meaning inside a character class, while the expression -
becomes a range-forming metacharacter.
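Union, intersection and subtraction can be checked by filtering a test string
one character at a time. This is a small sketch in Java, again assuming that
java.util.regex follows the same class syntax as the engine documented here;
CharClassDemo and keep are hypothetical names.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // Returns the characters of input that match the character class cls.
    // (Hypothetical helper for illustration.)
    static String keep(String cls, String input) {
        Pattern p = Pattern.compile(cls);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "aeg-hiz.AZ";
        System.out.println(keep("[a-z[A-Z]]", input));     // union: aeghizAZ
        System.out.println(keep("[a-z&&[g-i]]", input));   // intersection: ghi
        System.out.println(keep("[a-z&&[^g-i]]", input));  // subtraction: aez
        System.out.println(keep("[a\\-z]", input));        // escaped hyphen: a-z
        System.out.println(keep("[.]", input));            // literal dot: .
    }
}
```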
Groups and capturing:
---------------------
Capturing groups are numbered by counting their opening parentheses from left to
right. In the expression ((A)(B(C))), for example, there are four such groups:
1    ((A)(B(C)))
2    (A)
3    (B(C))
4    (C)
Group zero always stands for the entire expression. Note that, unless a group
number is given, highlight returns only the match of the highest-numbered
group, which makes regular expressions more suitable for language definitions.
Use the (?:x) syntax for a group that should not be captured.
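The numbering rule can be verified by printing every group of a match. The
sketch below uses Java's java.util.regex, which is assumed to number groups the
same way; GroupNumberDemo and groups are names invented for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNumberDemo {
    // Returns groups 0..groupCount() of the first match, or an empty array
    // if regex does not match. (Hypothetical helper.)
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are counted by their opening parentheses, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // prints [ABC, ABC, A, BC, C]

        // (?:A) does not capture, so the later groups move up by one.
        System.out.println(Arrays.toString(groups("((?:A)(B(C)))", "ABC")));
        // prints [ABC, ABC, BC, C]
    }
}
```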
Examples:
---------
$KEYWORDS(kwa)=regex([A-Z]\w+)
Highlight identifiers beginning with a capital letter.
$KEYWORDS(kwb)=regex([$@%]\w+)
Highlight variables beginning with $, @ or %.
$KEYWORDS(kwc)=regex(\$\{(\w+)\})
or
$KEYWORDS(kwc)=regex(\$\{(\w+)\}, 1)
Highlight variable names like ${name}. Only the name is highlighted as a
keyword. The grouping feature is used to achieve this effect. If no capturing
group index is defined (as in the first definition above), the right-most
group's match (highest capturing index) is returned.
$KEYWORDS(kwd)=regex((\w+)\s*\()
Highlight method names. Grouping is used again, so that the opening parenthesis
is not part of the highlighted match.
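The effect of the optional group number can be tried outside of highlight. The
following sketch imitates the selection rule in Java: the method select is a
hypothetical stand-in for regex(RE, GROUP-NUM), not highlight's actual
implementation, and a negative group number is used here to mean "no group
number given". Backslashes are doubled only because the expressions appear in
Java string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordGroupDemo {
    // Collects the text of the requested group for every match in input.
    // A negative group selects the highest-numbered group, mirroring the
    // default described above. (Hypothetical helper, not highlight code.)
    static List<String> select(String regex, int group, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int g = group < 0 ? m.groupCount() : group;
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(g));
        }
        return out;
    }

    public static void main(String[] args) {
        String src = "print(${name}); Foo.bar ( $x )";
        // kwa: identifiers beginning with a capital letter.
        System.out.println(select("[A-Z]\\w+", -1, src));           // [Foo]
        // kwb: variables beginning with $, @ or %.
        System.out.println(select("[$@%]\\w+", -1, src));           // [$x]
        // kwc: default and explicit group 1 give the same result.
        System.out.println(select("\\$\\{(\\w+)\\}", -1, src));     // [name]
        System.out.println(select("\\$\\{(\\w+)\\}", 1, src));      // [name]
        // Group 0 would include the surrounding ${ and }.
        System.out.println(select("\\$\\{(\\w+)\\}", 0, src));      // [${name}]
        // kwd: method names without the opening parenthesis.
        System.out.println(select("(\\w+)\\s*\\(", -1, src));       // [print, bar]
    }
}
```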
---
Andre Simon
andre.simon1@gmx.de
http://www.andre-simon.de/
http://wiki.andre-simon.de/