Pattern Matching
Patterns
A pattern is a sequence of pattern items (see below). A ^ at the beginning of a pattern anchors the match at the beginning of the subject string. A $ at the end of a pattern anchors the match at the end of the subject string. At other positions, ^ and $ have no special meaning and represent themselves.
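The anchoring rules can be illustrated with a short sketch. It assumes a Lua 5.1 or later interpreter, where the strfind function used elsewhere in this section is spelled string.find; the subject strings are made up for the example.

```lua
-- Assumption: Lua 5.1+, where strfind is available as string.find.
local s = "hello world"

-- Unanchored: the pattern may match anywhere in the subject.
local a = string.find(s, "world")    -- start index of the match

-- ^ at the beginning: the match must start at position 1, so this fails.
local b = string.find(s, "^world")

-- $ at the end: the match must finish at the end of the subject.
local c = string.find(s, "world$")

-- Elsewhere in the pattern, ^ and $ are ordinary characters.
local d = string.find("cost: 5$ or 2^3", "5$ or 2^3")

print(a, b, c, d)  --> 7 nil 7 7
```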
Pattern Items
There are three main types of pattern items:
Character class patterns.
A pattern item may be a single character class, which matches any single character in the class. It can optionally be followed by a suffix:
* | 0 or more repetitions, matches the longest possible sequence
+ | 1 or more repetitions, matches the longest possible sequence
- | 0 or more repetitions, matches the shortest possible sequence
? | 0 or 1 occurrence
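The difference between the four suffixes is easiest to see side by side. The following is a minimal sketch, assuming Lua 5.1 or later (string.match is not part of the older strfind-era library); the sample strings are illustrative only.

```lua
-- Assumption: Lua 5.1+, which provides string.match.
local s = "<b>bold</b>"

-- * is greedy: it runs to the last '>' in the subject.
local greedy = string.match(s, "<.*>")

-- - is lazy: it stops at the first '>' that lets the match succeed.
local lazy = string.match(s, "<.->")

-- + needs at least one occurrence; * is satisfied by none at all.
local digits = string.match("abc123", "%d+")
local none   = string.match("abc", "%d+")   -- no digits, so no match
local empty  = string.match("abc", "%d*")   -- matches the empty string

-- ? makes the preceding item optional (here a literal leading '-').
local neg = string.match("-42", "^-?%d+")
local pos = string.match("42", "^-?%d+")

print(greedy, lazy)   --> <b>bold</b>   <b>
print(digits, none)   --> 123   nil
print(neg, pos)       --> -42   42
```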
Captured patterns.
A pattern item can also be in the form %n, for n between 1 and 9; such an item matches a sub-string equal to the n-th captured string.
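As an illustration, a %n item can require that a quoted string is closed by the same quote character that opened it. This sketch assumes Lua 5.1 or later for string.match; the subject string is invented for the example.

```lua
-- Assumption: Lua 5.1+, which provides string.match.
local s = [[say "it's fine" now]]

-- Capture 1 is the opening quote (either " or '); %1 then demands the
-- same character again, so the apostrophe inside does not end the match.
local quote, body = string.match(s, "([\"'])(.-)%1")

print(quote, body)  --> "   it's fine
```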
Balanced patterns.
The final form of a pattern item is %bxy, where x and y are two distinct characters; this matches a string that starts with x, ends with y, and in which the x and y are balanced. Pairs of string delimiters and parentheses in arithmetic expressions commonly exhibit this trait, and may be matched using such a pattern item. For example, %b<> matches a balanced pair of angle brackets together with everything between them.
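A short sketch of balanced matching follows. It assumes Lua 5.1 or later (string.match and string.gsub under those names); the expression and markup strings are hypothetical inputs.

```lua
-- Assumption: Lua 5.1+ library names.
local expr = "f(a, (b + c)) + g(d)"

-- %b() starts at the first '(' and stops at the ')' that balances it,
-- so the nested pair is included in the match.
local args = string.match(expr, "%b()")

-- %b<> can be used to strip simple angle-bracket tags.
local text, count = string.gsub("x <b>y</b> z", "%b<>", "")

print(args)         --> (a, (b + c))
print(text, count)  --> x y z   2
```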
Captures
If a pattern contains sub-patterns enclosed in parentheses, they describe captures.
local date = "17/7/1990"
_, _, d, m, y = strfind(date, "(%d+)/(%d+)/(%d+)")
print(d, m, y) --> 17 7 1990
When a match succeeds, the sub-strings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses, starting from 1. Captured strings can be used in further matches or in substitutions.
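The numbering rule and the use of captures in substitutions might look like this in practice. The sketch assumes Lua 5.1 or later, where the substitution function is string.gsub and %1 to %9 in the replacement string refer to the captures; the "key=value" input is made up.

```lua
-- Assumption: Lua 5.1+ library names.
local date = "17/7/1990"

-- In the replacement string, %1, %2 and %3 stand for the captures.
local reordered = string.gsub(date, "(%d+)/(%d+)/(%d+)", "%3-%2-%1")

-- Captures are numbered by the position of their left parenthesis,
-- so the outer capture is number 1 and the inner ones are 2 and 3.
local whole, key, value = string.match("key=value", "((%w+)=(%w+))")

print(reordered)          --> 1990-7-17
print(whole, key, value)  --> key=value   key   value
```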
Character Classes
A character class is used to represent a set of characters. The following combinations are allowed in describing a character class:
%a | letters
%c | control characters
%d | digits
%l | lower case letters
%p | punctuation characters
%s | space characters
%u | upper case letters
%w | alphanumeric characters
%x | hexadecimal digits
%z | the character with representation 0
A pattern cannot contain embedded zeros. Use %z instead.
- x
  Represents the character x itself, where x is any character other than the magic characters ^$()%.[]*+-?
- .
  A dot matches any single character.
- %x
  Represents the character x, where x is any non-alphanumeric character (here x is a placeholder, not the hexadecimal class listed above); used to escape the magic characters. Any punctuation character should be preceded by a % when used to represent itself in a pattern.
- [char-set]
  Represents the class which is the union of all characters in char-set. Ranges may be specified using a - (dash). The %x classes described above may also be used as components. All other characters represent themselves. The interaction between ranges and classes is not defined.
- [^char-set]
  Represents the complement of char-set, where char-set is interpreted as above.
For all classes represented by single letters (%a, %c, …), the corresponding upper-case letter represents the complement of the class. For instance, %S represents all non-space characters. The definitions of letter, space, etc. depend on the current locale. In particular, the class [a-z] may not be equivalent to %l. The second form (%l) should be preferred for portability.
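The class notations above can be combined as in the following sketch. It assumes Lua 5.1 or later for string.match and the default "C" locale, so the results may differ under other locales; the input strings are invented for the example.

```lua
-- Assumptions: Lua 5.1+ and the default "C" locale.
local s = "  id_42 = 0xFF; "

-- %S is the complement of %s: the first run of non-space characters.
local word = string.match(s, "%S+")

-- A char-set mixing a class with a literal character: an identifier
-- starts with a letter or '_' and continues with alphanumerics or '_'.
local ident = string.match(s, "[%a_][%w_]*")

-- %x matches hexadecimal digits; the capture omits the 0x prefix.
local hex = string.match(s, "0x(%x+)")

-- [^char-set] is the complement: the first character that is not a digit.
local sep = string.match("2024-01", "[^%d]")

-- A magic character such as '.' must be escaped with % to match itself.
local ext = string.match("file.txt", "%.(%a+)$")

print(word, ident, hex, sep, ext)  --> id_42 id_42 FF - txt
```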