|
Regular expressions are a convenient means of specifying certain simple (though possibly infinite) sets of strings. They are of practical interest because regular expressions can be used to specify the structure of the tokens used in a programming language. In particular, regular expressions can be used to program a scanner generator.
The sets of strings defined by regular expressions are termed regular sets. We start with a finite character set, or vocabulary (denoted V). This vocabulary is normally the character set used to form tokens. An empty or null string is allowed (denoted λ, "lambda"). Strings are built from characters in V via concatenation (for example, :=, begin, 123).The null string, when concatenated with any string S, yields S. That is, Sλ ≡ λS ≡ S.
Concatenation can be extended to sets of strings as follows: Let P and Q be sets of strings. Then string S ∈ (P Q) if and only if S can be broken into two pieces: S = S1 S2, such that S1 ∈ P and S2 ∈ Q. Small finite sets are veniently represented by listing their elements, which can be individual characters or strings of charaters. Parentheses are used to delimit expressions, and |, the alternation operator, is used to separate alternatives.
The characters (, ), , and | are meta-characters (punctuation and regular expression operators). Meta-characters can be quoted when used as ordinal characters to avoid ambiguity. (Any character or string may be quoted, but unnecessary quotation is avoided to enhance readability.) For example:
Delim = (( | ) | := | ; | , | + | - | * | / | = / $$$$)
Alternation can be extended to sets of strings. Let P and Q be sets of strings. Then string S ∈ (P | Q) if and only if S ∈ P or S ∈ Q. Large (or infinite) sets are conveniently represented by operations on finite sets of characters and strings. Concatenation and alternation may be used. A third operation, Kleene closure, is also allowed. The operator * will represent the postfix Kleene closure operator. Let P be a set of strings. String S ∈ P* if and only if S can be broken into zero or more pieces: S = S1 S2 S3
Sn such that each Si ∈ P. A string in P* is the concatenation of zero or more selections (possibly repeated) from P. (We explicitly allow n = 0, so that λ is always in P*.)
Regular expressions are defined as follows. Each regular expression denotes a set of strings ( a regular set).
- Ø is a regular expression denoting the empty set (the set containing no strings).
- λ is a regular expression denoting the set that contains only the empty string. Note that this set is not the same as the empty set, because it contains one element.
- A string S is a regular expression denoting a set containing only S. If S contains meta-characters, S can be quoted to avoid ambiguity.
- If A and B are regular expressions, denoting the alternation, concatenation, and Kleene closure of the corresponding regular sets.
Any finite set of strings can be represented by a regular expression of the form (S1 | S2 |
| Sn).
We often utilize the following operations as notational convenience. They are not strictly necessary, because their effect can be obtained (albeit somewhat clumsily) using the three standard regular operators (alternation, concatenation, kleene closure):
- P+ denotes the strings consisting of one or more strings in P concatenated together: P+ = (P+ | λ) and P+ = P P*.
- If A is a set of characters, Not(A) denotes (V A); that is, all characters in V not included in A. Since Not(A) is finite, it is trivially regular. It is possible to extend Not to strings, rather that just V. That is, if S is a set or strings, we can define Note(S) to be (V* S). Although it may be infinite, this set is also regular.
- If k is a constant, the set Ak represents all strings formed by concatenating k (not necessarily different) strings from A. That is, Ak = (A A A
) (k copies).
We now illustrate how regular expression can be used to specify tokens. Let
D = (0 |
| 9) L = (A |
| Z)
Then
- A comment that begins with and ends with Eol (end of line) can be defined as:
Comment = Not(Eol)* Eol
- A fixed decimal literal can be defined as:
Lit = D+ . D+
- An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores, can be defined as:
ID = L (L | D)* (_ (L | D)+ )*
- A more complicated example is a comment delimited by ## markers which allows single #s within the comment body:
Comment2 = ## ((# | λ) Not(#))* ##
All finite sets and many infinite sets, such as those just listed, are regular. But not all infinite sets are regular. For example, consider { [k]k | I ≥ 1}, which is the set of balanced brackets of the form [[[[[
]]]]]. This is a well-known set that is not regular. The problem is that any regular set either does not get all balanced nesting or it includes extra, unwanted strings.
It is easy to write a context-free grammar (described in the next section) that defines balanced brackets precisely. All regular sets can be defined by context-free grammars; thus, the bracket example shows that context-free grammars are more powerful descriptive mechanism that regular expressions. Regular expressions are, however, quite adequate for specifying token-level syntax.
If you have any queries, please do not hesitate to contact me.
|