This is regex.info, produced by makeinfo version 4.0 from regex.texi. INFO-DIR-SECTION C library code START-INFO-DIR-ENTRY * Regex: (regex). Regular expression library. END-INFO-DIR-ENTRY This file documents the GNU regular expression library. Copyright (C) 1992, 1993 Free Software Foundation, Inc. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided also that the section entitled "GNU General Public License" is included exactly as in the original, and provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that the section entitled "GNU General Public License" may be included in a translation approved by the Free Software Foundation instead of in the original English.  File: regex.info, Node: Top, Next: Overview, Prev: (dir), Up: (dir) Regular Expression Library ************************** This manual documents how to program with the GNU regular expression library. This is edition 0.12a of the manual, 19 September 1992. The first part of this master menu lists the major nodes in this Info document, including the index. The rest of the menu lists all the lower level nodes in the document. * Menu: * Overview:: * Regular Expression Syntax:: * Common Operators:: * GNU Operators:: * GNU Emacs Operators:: * What Gets Matched?:: * Programming with Regex:: * Copying:: Copying and sharing Regex. * Index:: General index. --- The Detailed Node Listing --- Regular Expression Syntax * Syntax Bits:: * Predefined Syntaxes:: * Collating Elements vs. 
Characters:: * The Backslash Character:: Common Operators * Match-self Operator:: Ordinary characters. * Match-any-character Operator:: . * Concatenation Operator:: Juxtaposition. * Repetition Operators:: * + ? {} * Alternation Operator:: | * List Operators:: [...] [^...] * Grouping Operators:: (...) * Back-reference Operator:: \digit * Anchoring Operators:: ^ $ Repetition Operators * Match-zero-or-more Operator:: * * Match-one-or-more Operator:: + * Match-zero-or-one Operator:: ? * Interval Operators:: {} List Operators (`[' ... `]' and `[^' ... `]') * Character Class Operators:: [:class:] * Range Operator:: start-end Anchoring Operators * Match-beginning-of-line Operator:: ^ * Match-end-of-line Operator:: $ GNU Operators * Word Operators:: * Buffer Operators:: Word Operators * Non-Emacs Syntax Tables:: * Match-word-boundary Operator:: \b * Match-within-word Operator:: \B * Match-beginning-of-word Operator:: \< * Match-end-of-word Operator:: \> * Match-word-constituent Operator:: \w * Match-non-word-constituent Operator:: \W Buffer Operators * Match-beginning-of-buffer Operator:: \` * Match-end-of-buffer Operator:: \' GNU Emacs Operators * Syntactic Class Operators:: Syntactic Class Operators * Emacs Syntax Tables:: * Match-syntactic-class Operator:: \sCLASS * Match-not-syntactic-class Operator:: \SCLASS Programming with Regex * GNU Regex Functions:: * POSIX Regex Functions:: * BSD Regex Functions:: GNU Regex Functions * GNU Pattern Buffers:: The re_pattern_buffer type. * GNU Regular Expression Compiling:: re_compile_pattern () * GNU Matching:: re_match () * GNU Searching:: re_search () * Matching/Searching with Split Data:: re_match_2 (), re_search_2 () * Searching with Fastmaps:: re_compile_fastmap () * GNU Translate Tables:: The `translate' field. * Using Registers:: The re_registers type and related fns. * Freeing GNU Pattern Buffers:: regfree () POSIX Regex Functions * POSIX Pattern Buffers:: The regex_t type. 
* POSIX Regular Expression Compiling:: regcomp () * POSIX Matching:: regexec () * Reporting Errors:: regerror () * Using Byte Offsets:: The regmatch_t type. * Freeing POSIX Pattern Buffers:: regfree () BSD Regex Functions * BSD Regular Expression Compiling:: re_comp () * BSD Searching:: re_exec ()  File: regex.info, Node: Overview, Next: Regular Expression Syntax, Prev: Top, Up: Top Overview ******** A "regular expression" (or "regexp", or "pattern") is a text string that describes some (mathematical) set of strings. A regexp R "matches" a string S if S is in the set of strings described by R. Using the Regex library, you can: * see if a string matches a specified pattern as a whole, and * search within a string for a substring matching a specified pattern. Some regular expressions match only one string, i.e., the set they describe has only one member. For example, the regular expression `foo' matches the string `foo' and no others. Other regular expressions match more than one string, i.e., the set they describe has more than one member. For example, the regular expression `f*' matches the set of strings made up of any number (including zero) of `f's. As you can see, some characters in regular expressions match themselves (such as `f') and some don't (such as `*'); the ones that don't match themselves instead let you specify patterns that describe many different strings. To either match or search for a regular expression with the Regex library functions, you must first compile it with a Regex pattern compiling function. A "compiled pattern" is a regular expression converted to the internal format used by the library functions. Once you've compiled a pattern, you can use it for matching or searching any number of times. The Regex library consists of two source files: `regex.h' and `regex.c'. Regex provides three groups of functions with which you can operate on regular expressions. 
One group--the GNU group--is more powerful but not completely compatible with the other two, namely the POSIX and Berkeley UNIX groups; its interface was designed specifically for GNU. The other groups have the same interfaces as do the regular expression functions in POSIX and Berkeley UNIX. We wrote this chapter with programmers in mind, not users of programs--such as Emacs--that use Regex. We describe the Regex library in its entirety, not how to write regular expressions that a particular program understands.  File: regex.info, Node: Regular Expression Syntax, Next: Common Operators, Prev: Overview, Up: Top Regular Expression Syntax ************************* "Characters" are things you can type. "Operators" are things in a regular expression that match one or more characters. You compose regular expressions from operators, which in turn you specify using one or more characters. Most characters represent what we call the match-self operator, i.e., they match themselves; we call these characters "ordinary". Other characters represent either all or parts of fancier operators; e.g., `.' represents what we call the match-any-character operator (which, no surprise, matches (almost) any character); we call these characters "special". Two different things determine what characters represent what operators: 1. the regular expression syntax your program has told the Regex library to recognize, and 2. the context of the character in the regular expression. In the following sections, we describe these things in more detail. * Menu: * Syntax Bits:: * Predefined Syntaxes:: * Collating Elements vs. Characters:: * The Backslash Character::  File: regex.info, Node: Syntax Bits, Next: Predefined Syntaxes, Up: Regular Expression Syntax Syntax Bits =========== In any particular syntax for regular expressions, some characters are always special, others are sometimes special, and others are never special. 
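As a small illustration of a character that is special in one syntax and ordinary in another, the following sketch compiles the same pattern twice through the POSIX interface (`regcomp', `regexec' and `regfree'; *note POSIX Regex Functions::). It assumes that omitting `REG_EXTENDED' selects a basic syntax in which `+' is ordinary, and that passing it selects an extended syntax in which `+' is the match-one-or-more operator. The helper name `matches' is ours for illustration; it is not part of Regex.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if PATTERN matches TEXT when compiled
   with the POSIX flags CFLAGS, 0 if it does not match, and -1 if the
   pattern fails to compile.  */
int
matches (const char *pattern, int cflags, const char *text)
{
  regex_t re;
  int status;

  /* REG_NOSUB: we only want a yes/no answer, not match offsets.  */
  if (regcomp (&re, pattern, cflags | REG_NOSUB) != 0)
    return -1;
  status = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return status == 0;
}
```

Under these assumptions, `matches ("^a+$", 0, "a+")' returns 1 because `+' is ordinary in the basic syntax, whereas `matches ("^a+$", REG_EXTENDED, "aa")' returns 1 because `+' is then a repetition operator.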
The particular syntax that Regex recognizes for a given regular
expression depends on the value in the `syntax' field of the pattern
buffer of that regular expression. You get a pattern buffer by
compiling a regular expression. *Note GNU Pattern Buffers::, and *Note
POSIX Pattern Buffers::, for more information on pattern buffers.
*Note GNU Regular Expression Compiling::, *Note POSIX Regular
Expression Compiling::, and *Note BSD Regular Expression Compiling::,
for more information on compiling.

Regex considers the value of the `syntax' field to be a collection of
bits; we refer to these bits as "syntax bits". In most cases, they
affect what characters represent what operators. We describe the
meanings of the operators to which we refer in *Note Common
Operators::, *Note GNU Operators::, and *Note GNU Emacs Operators::.

For reference, here is the complete list of syntax bits, in
alphabetical order:

`RE_BACKSLASH_ESCAPE_IN_LISTS'
     If this bit is set, then `\' inside a list (*note List
     Operators::) quotes (makes ordinary, if it's special) the
     following character; if this bit isn't set, then `\' is an
     ordinary character inside lists. (*Note The Backslash
     Character::, for what `\' does outside of lists.)

`RE_BK_PLUS_QM'
     If this bit is set, then `\+' represents the match-one-or-more
     operator and `\?' represents the match-zero-or-one operator; if
     this bit isn't set, then `+' represents the match-one-or-more
     operator and `?' represents the match-zero-or-one operator. This
     bit is irrelevant if `RE_LIMITED_OPS' is set.

`RE_CHAR_CLASSES'
     If this bit is set, then you can use character classes in lists;
     if this bit isn't set, then you can't.

`RE_CONTEXT_INDEP_ANCHORS'
     If this bit is set, then `^' and `$' are special anywhere outside
     a list; if this bit isn't set, then these characters are special
     only in certain contexts. *Note Match-beginning-of-line
     Operator::, and *Note Match-end-of-line Operator::.

`RE_CONTEXT_INDEP_OPS' If this bit is set, then certain characters are special anywhere outside a list; if this bit isn't set, then those characters are special only in some contexts and are ordinary elsewhere. Specifically, if this bit isn't set then `*', and (if the syntax bit `RE_LIMITED_OPS' isn't set) `+' and `?' (or `\+' and `\?', depending on the syntax bit `RE_BK_PLUS_QM') represent repetition operators only if they're not first in a regular expression or just after an open-group or alternation operator. The same holds for `{' (or `\{', depending on the syntax bit `RE_NO_BK_BRACES') if it is the beginning of a valid interval and the syntax bit `RE_INTERVALS' is set. `RE_CONTEXT_INVALID_OPS' If this bit is set, then repetition and alternation operators can't be in certain positions within a regular expression. Specifically, the regular expression is invalid if it has: * a repetition operator first in the regular expression or just after a match-beginning-of-line, open-group, or alternation operator; or * an alternation operator first or last in the regular expression, just before a match-end-of-line operator, or just after an alternation or open-group operator. If this bit isn't set, then you can put the characters representing the repetition and alternation characters anywhere in a regular expression. Whether or not they will in fact be operators in certain positions depends on other syntax bits. `RE_DOT_NEWLINE' If this bit is set, then the match-any-character operator matches a newline; if this bit isn't set, then it doesn't. `RE_DOT_NOT_NULL' If this bit is set, then the match-any-character operator doesn't match a null character; if this bit isn't set, then it does. `RE_INTERVALS' If this bit is set, then Regex recognizes interval operators; if this bit isn't set, then it doesn't. `RE_LIMITED_OPS' If this bit is set, then Regex doesn't recognize the match-one-or-more, match-zero-or-one or alternation operators; if this bit isn't set, then it does. 
`RE_NEWLINE_ALT' If this bit is set, then newline represents the alternation operator; if this bit isn't set, then newline is ordinary. `RE_NO_BK_BRACES' If this bit is set, then `{' represents the open-interval operator and `}' represents the close-interval operator; if this bit isn't set, then `\{' represents the open-interval operator and `\}' represents the close-interval operator. This bit is relevant only if `RE_INTERVALS' is set. `RE_NO_BK_PARENS' If this bit is set, then `(' represents the open-group operator and `)' represents the close-group operator; if this bit isn't set, then `\(' represents the open-group operator and `\)' represents the close-group operator. `RE_NO_BK_REFS' If this bit is set, then Regex doesn't recognize `\'DIGIT as the back reference operator; if this bit isn't set, then it does. `RE_NO_BK_VBAR' If this bit is set, then `|' represents the alternation operator; if this bit isn't set, then `\|' represents the alternation operator. This bit is irrelevant if `RE_LIMITED_OPS' is set. `RE_NO_EMPTY_RANGES' If this bit is set, then a regular expression with a range whose ending point collates lower than its starting point is invalid; if this bit isn't set, then Regex considers such a range to be empty. `RE_UNMATCHED_RIGHT_PAREN_ORD' If this bit is set and the regular expression has no matching open-group operator, then Regex considers what would otherwise be a close-group operator (based on how `RE_NO_BK_PARENS' is set) to match `)'.  File: regex.info, Node: Predefined Syntaxes, Next: Collating Elements vs. Characters, Prev: Syntax Bits, Up: Regular Expression Syntax Predefined Syntaxes =================== If you're programming with Regex, you can set a pattern buffer's (*note GNU Pattern Buffers::, and *Note POSIX Pattern Buffers::) `syntax' field either to an arbitrary combination of syntax bits (*note Syntax Bits::) or else to the configurations defined by Regex. 
These configurations define the syntaxes used by certain programs--GNU
Emacs, POSIX Awk, traditional Awk, Grep, Egrep--in addition to syntaxes
for POSIX basic and extended regular expressions.

The predefined syntaxes--taken directly from `regex.h'--are:

     #define RE_SYNTAX_EMACS 0

     #define RE_SYNTAX_AWK \
       (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL \
        | RE_NO_BK_PARENS | RE_NO_BK_REFS \
        | RE_NO_BK_VBAR | RE_NO_EMPTY_RANGES \
        | RE_UNMATCHED_RIGHT_PAREN_ORD)

     #define RE_SYNTAX_POSIX_AWK \
       (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)

     #define RE_SYNTAX_GREP \
       (RE_BK_PLUS_QM | RE_CHAR_CLASSES \
        | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS \
        | RE_NEWLINE_ALT)

     #define RE_SYNTAX_EGREP \
       (RE_CHAR_CLASSES | RE_CONTEXT_INDEP_ANCHORS \
        | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE \
        | RE_NEWLINE_ALT | RE_NO_BK_PARENS \
        | RE_NO_BK_VBAR)

     #define RE_SYNTAX_POSIX_EGREP \
       (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)

     /* P1003.2/D11.2, section 4.20.7.1, lines 5078ff. */
     #define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC

     #define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC

     /* Syntax bits common to both basic and extended POSIX regex syntax. */
     #define _RE_SYNTAX_POSIX_COMMON \
       (RE_CHAR_CLASSES | RE_DOT_NEWLINE | RE_DOT_NOT_NULL \
        | RE_INTERVALS | RE_NO_EMPTY_RANGES)

     #define RE_SYNTAX_POSIX_BASIC \
       (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)

     /* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
        RE_LIMITED_OPS, i.e., \? \+ \| are not recognized. Actually, this
        isn't minimal, since other operators, such as \`, aren't disabled. */
     #define RE_SYNTAX_POSIX_MINIMAL_BASIC \
       (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)

     #define RE_SYNTAX_POSIX_EXTENDED \
       (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
        | RE_CONTEXT_INDEP_OPS | RE_NO_BK_BRACES \
        | RE_NO_BK_PARENS | RE_NO_BK_VBAR \
        | RE_UNMATCHED_RIGHT_PAREN_ORD)

     /* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
        replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added.
        */
     #define RE_SYNTAX_POSIX_MINIMAL_EXTENDED \
       (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
        | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES \
        | RE_NO_BK_PARENS | RE_NO_BK_REFS \
        | RE_NO_BK_VBAR | RE_UNMATCHED_RIGHT_PAREN_ORD)


File: regex.info, Node: Collating Elements vs. Characters, Next: The Backslash Character, Prev: Predefined Syntaxes, Up: Regular Expression Syntax

Collating Elements vs. Characters
=================================

POSIX generalizes the notion of a character to that of a collating
element. It defines a "collating element" to be "a sequence of one or
more bytes defined in the current collating sequence as a unit of
collation."

This generalizes the notion of a character in two ways. First, a
single character can map into two or more collating elements. For
example, the German "es-zet" collates as the collating element `s'
followed by another collating element `s'. Second, two or more
characters can map into one collating element. For example, the
Spanish `ll' collates after `l' and before `m'.

Since POSIX's "collating element" preserves the essential idea of a
"character," we use the latter, more familiar, term in this document.


File: regex.info, Node: The Backslash Character, Prev: Collating Elements vs. Characters, Up: Regular Expression Syntax

The Backslash Character
=======================

The `\' character has one of four different meanings, depending on the
context in which you use it and what syntax bits are set (*note Syntax
Bits::). It can: 1) stand for itself, 2) quote the next character, 3)
introduce an operator, or 4) do nothing.

  1. It stands for itself inside a list (*note List Operators::) if the
     syntax bit `RE_BACKSLASH_ESCAPE_IN_LISTS' is not set. For example,
     `[\]' would match `\'.

  2. It quotes (makes ordinary, if it's special) the next character
     when you use it either:

        * outside a list,(1) or

        * inside a list and the syntax bit
          `RE_BACKSLASH_ESCAPE_IN_LISTS' is set.

  3. It introduces an operator when followed by certain ordinary
     characters--sometimes only when certain syntax bits are set. See
     the cases `RE_BK_PLUS_QM', `RE_NO_BK_BRACES', `RE_NO_BK_VBAR',
     `RE_NO_BK_PARENS', `RE_NO_BK_REFS' in *Note Syntax Bits::. Also:

        * `\b' represents the match-word-boundary operator (*note
          Match-word-boundary Operator::).

        * `\B' represents the match-within-word operator (*note
          Match-within-word Operator::).

        * `\<' represents the match-beginning-of-word operator (*note
          Match-beginning-of-word Operator::).

        * `\>' represents the match-end-of-word operator (*note
          Match-end-of-word Operator::).

        * `\w' represents the match-word-constituent operator (*note
          Match-word-constituent Operator::).

        * `\W' represents the match-non-word-constituent operator
          (*note Match-non-word-constituent Operator::).

        * `\`' represents the match-beginning-of-buffer operator and
          `\'' represents the match-end-of-buffer operator (*note
          Buffer Operators::).

        * If Regex was compiled with the C preprocessor symbol `emacs'
          defined, then `\sCLASS' represents the match-syntactic-class
          operator and `\SCLASS' represents the
          match-not-syntactic-class operator (*note Syntactic Class
          Operators::).

  4. In all other cases, Regex ignores `\'. For example, `\n' matches
     `n'.

   ---------- Footnotes ----------

   (1) Sometimes you don't have to explicitly quote special characters
to make them ordinary. For instance, most characters lose any special
meaning inside a list (*note List Operators::). In addition, if the
syntax bits `RE_CONTEXT_INVALID_OPS' and `RE_CONTEXT_INDEP_OPS' aren't
set, then (for historical reasons) the matcher considers special
characters ordinary if they are in contexts where the operations they
represent make no sense; for example, the match-zero-or-more operator
(represented by `*') then matches itself in the regular expression
`*foo' because there is no preceding expression on which it can operate.
It is poor practice, however, to depend on this behavior; if you want a special character to be ordinary outside a list, it's better to always quote it, regardless.  File: regex.info, Node: Common Operators, Next: GNU Operators, Prev: Regular Expression Syntax, Up: Top Common Operators **************** You compose regular expressions from operators. In the following sections, we describe the regular expression operators specified by POSIX; GNU also uses these. Most operators have more than one representation as characters. *Note Regular Expression Syntax::, for what characters represent what operators under what circumstances. For most operators that can be represented in two ways, one representation is a single character and the other is that character preceded by `\'. For example, either `(' or `\(' represents the open-group operator. Which one does depends on the setting of a syntax bit, in this case `RE_NO_BK_PARENS'. Why is this so? Historical reasons dictate some of the varying representations, while POSIX dictates others. Finally, almost all characters lose any special meaning inside a list (*note List Operators::). * Menu: * Match-self Operator:: Ordinary characters. * Match-any-character Operator:: . * Concatenation Operator:: Juxtaposition. * Repetition Operators:: * + ? {} * Alternation Operator:: | * List Operators:: [...] [^...] * Grouping Operators:: (...) * Back-reference Operator:: \digit * Anchoring Operators:: ^ $  File: regex.info, Node: Match-self Operator, Next: Match-any-character Operator, Up: Common Operators The Match-self Operator (ORDINARY CHARACTER) ============================================ This operator matches the character itself. All ordinary characters (*note Regular Expression Syntax::) represent this operator. For example, `f' is always an ordinary character, so the regular expression `f' matches only the string `f'. In particular, it does _not_ match the string `ff'.  
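The difference between matching a string as a whole and searching within it (*note Overview::) shows up even with a single match-self operator. The sketch below assumes the POSIX interface (`regcomp', `regexec', `regfree'), whose `regexec' searches for a match anywhere in the string. The helper `matches_whole' is a hypothetical name of ours: it reports a whole-string match only when the match found starts at offset zero and ends at the end of the string.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: returns 1 if PATTERN (POSIX extended syntax)
   matches all of TEXT, 0 if it matches nothing or only a part of TEXT,
   and -1 if the pattern fails to compile.  */
int
matches_whole (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m;
  int whole = 0;

  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  /* regexec searches; m receives the byte offsets of the match found.  */
  if (regexec (&re, text, 1, &m, 0) == 0)
    whole = (m.rm_so == 0 && (size_t) m.rm_eo == strlen (text));
  regfree (&re);
  return whole;
}
```

With this helper, `matches_whole ("f", "f")' returns 1, while `matches_whole ("f", "ff")' returns 0 even though a search does find an `f' inside `ff'.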
File: regex.info, Node: Match-any-character Operator, Next: Concatenation Operator, Prev: Match-self Operator, Up: Common Operators The Match-any-character Operator (`.') ====================================== This operator matches any single printing or nonprinting character except it won't match a: newline if the syntax bit `RE_DOT_NEWLINE' isn't set. null if the syntax bit `RE_DOT_NOT_NULL' is set. The `.' (period) character represents this operator. For example, `a.b' matches any three-character string beginning with `a' and ending with `b'.  File: regex.info, Node: Concatenation Operator, Next: Repetition Operators, Prev: Match-any-character Operator, Up: Common Operators The Concatenation Operator ========================== This operator concatenates two regular expressions A and B. No character represents this operator; you simply put B after A. The result is a regular expression that will match a string if A matches its first part and B matches the rest. For example, `xy' (two match-self operators) matches `xy'.  File: regex.info, Node: Repetition Operators, Next: Alternation Operator, Prev: Concatenation Operator, Up: Common Operators Repetition Operators ==================== Repetition operators repeat the preceding regular expression a specified number of times. * Menu: * Match-zero-or-more Operator:: * * Match-one-or-more Operator:: + * Match-zero-or-one Operator:: ? * Interval Operators:: {}  File: regex.info, Node: Match-zero-or-more Operator, Next: Match-one-or-more Operator, Up: Repetition Operators The Match-zero-or-more Operator (`*') ------------------------------------- This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o's. Since this operator operates on the smallest preceding regular expression, `fo*' has a repeating `o', not a repeating `fo'. 
So, `fo*' matches `f', `fo', `foo', and so on. Since the match-zero-or-more operator is a suffix operator, it may be useless as such when no regular expression precedes it. This is the case when it: * is first in a regular expression, or * follows a match-beginning-of-line, open-group, or alternation operator. Three different things can happen in these cases: 1. If the syntax bit `RE_CONTEXT_INVALID_OPS' is set, then the regular expression is invalid. 2. If `RE_CONTEXT_INVALID_OPS' isn't set, but `RE_CONTEXT_INDEP_OPS' is, then `*' represents the match-zero-or-more operator (which then operates on the empty string). 3. Otherwise, `*' is ordinary. The matcher processes a match-zero-or-more operator by first matching as many repetitions of the smallest preceding regular expression as it can. Then it continues to match the rest of the pattern. If it can't match the rest of the pattern, it backtracks (as many times as necessary), each time discarding one of the matches until it can either match the entire pattern or be certain that it cannot get a match. For example, when matching `ca*ar' against `caaar', the matcher first matches all three `a's of the string with the `a*' of the regular expression. However, it cannot then match the final `ar' of the regular expression against the final `r' of the string. So it backtracks, discarding the match of the last `a' in the string. It can then match the remaining `ar'.  File: regex.info, Node: Match-one-or-more Operator, Next: Match-zero-or-one Operator, Prev: Match-zero-or-more Operator, Up: Repetition Operators The Match-one-or-more Operator (`+' or `\+') -------------------------------------------- If the syntax bit `RE_LIMITED_OPS' is set, then Regex doesn't recognize this operator. Otherwise, if the syntax bit `RE_BK_PLUS_QM' isn't set, then `+' represents this operator; if it is, then `\+' does. 
This operator is similar to the match-zero-or-more operator except that it repeats the preceding regular expression at least once; *note Match-zero-or-more Operator::, for what it operates on, how some syntax bits affect it, and how Regex backtracks to match it. For example, supposing that `+' represents the match-one-or-more operator; then `ca+r' matches, e.g., `car' and `caaaar', but not `cr'.  File: regex.info, Node: Match-zero-or-one Operator, Next: Interval Operators, Prev: Match-one-or-more Operator, Up: Repetition Operators The Match-zero-or-one Operator (`?' or `\?') -------------------------------------------- If the syntax bit `RE_LIMITED_OPS' is set, then Regex doesn't recognize this operator. Otherwise, if the syntax bit `RE_BK_PLUS_QM' isn't set, then `?' represents this operator; if it is, then `\?' does. This operator is similar to the match-zero-or-more operator except that it repeats the preceding regular expression once or not at all; *note Match-zero-or-more Operator::, to see what it operates on, how some syntax bits affect it, and how Regex backtracks to match it. For example, supposing that `?' represents the match-zero-or-one operator; then `ca?r' matches both `car' and `cr', but nothing else.  File: regex.info, Node: Interval Operators, Prev: Match-zero-or-one Operator, Up: Repetition Operators Interval Operators (`{' ... `}' or `\{' ... `\}') ------------------------------------------------- If the syntax bit `RE_INTERVALS' is set, then Regex recognizes "interval expressions". They repeat the smallest possible preceding regular expression a specified number of times. If the syntax bit `RE_NO_BK_BRACES' is set, `{' represents the "open-interval operator" and `}' represents the "close-interval operator" ; otherwise, `\{' and `\}' do. Specifically, supposing that `{' and `}' represent the open-interval and close-interval operators; then: `{COUNT}' matches exactly COUNT occurrences of the preceding regular expression. 
`{MIN,}'
     matches MIN or more occurrences of the preceding regular
     expression.

`{MIN,MAX}'
     matches at least MIN but no more than MAX occurrences of the
     preceding regular expression.

The interval expression (but not necessarily the regular expression
that contains it) is invalid if:

   * MIN is greater than MAX, or

   * any of COUNT, MIN, or MAX is outside the range zero to
     `RE_DUP_MAX' (which symbol `regex.h' defines).

If the interval expression is invalid and the syntax bit
`RE_NO_BK_BRACES' is set, then Regex considers all the characters in
the would-be interval to be ordinary. If that bit isn't set, then the
regular expression is invalid.

If the interval expression is valid but there is no preceding regular
expression on which to operate, then if the syntax bit
`RE_CONTEXT_INVALID_OPS' is set, the regular expression is invalid. If
that bit isn't set, then Regex considers all the characters--other than
backslashes, which it ignores--in the would-be interval to be ordinary.


File: regex.info, Node: Alternation Operator, Next: List Operators, Prev: Repetition Operators, Up: Common Operators

The Alternation Operator (`|' or `\|')
======================================

If the syntax bit `RE_LIMITED_OPS' is set, then Regex doesn't recognize
this operator. Otherwise, if the syntax bit `RE_NO_BK_VBAR' is set,
then `|' represents this operator; otherwise, `\|' does.

Alternatives match one of a choice of regular expressions: if you put
the character(s) representing the alternation operator between any two
regular expressions A and B, the result matches the union of the
strings that A and B match. For example, supposing that `|' is the
alternation operator, then `foo|bar|quux' would match any of `foo',
`bar' or `quux'.

The alternation operator operates on the _largest_ possible surrounding
regular expressions. (Put another way, it has the lowest precedence of
any regular expression operator.) Thus, the only way you can delimit
its arguments is to use grouping.
For example, if `(' and `)' are the open and close-group operators,
then `fo(o|b)ar' would match either `fooar' or `fobar'. (`foo|bar'
would match `foo' or `bar'.)

The matcher usually tries all combinations of alternatives so as to
match the longest possible string. For example, when matching
`(fooq|foo)*(qbarquux|bar)' against `fooqbarquux', it cannot take, say,
the first ("depth-first") combination it could match, since then it
would be content to match just `fooqbar'.


File: regex.info, Node: List Operators, Next: Grouping Operators, Prev: Alternation Operator, Up: Common Operators

List Operators (`[' ... `]' and `[^' ... `]')
=============================================

"Lists", also called "bracket expressions", are a set of one or more
items. An "item" is a character, a character class expression, or a
range expression. The syntax bits affect which kinds of items you can
put in a list. We explain the last two items in subsections below.
Empty lists are invalid.

A "matching list" matches a single character represented by one of the
list items. You form a matching list by enclosing one or more items
within an "open-matching-list operator" (represented by `[') and a
"close-list operator" (represented by `]').

For example, `[ab]' matches either `a' or `b'. `[ad]*' matches the
empty string and any string composed of just `a's and `d's in any
order. Regex considers invalid a regular expression with a `[' but no
matching `]'.

"Nonmatching lists" are similar to matching lists except that they
match a single character _not_ represented by one of the list items.
You use an "open-nonmatching-list operator" (represented by `[^'(1))
instead of an open-matching-list operator to start a nonmatching list.

For example, `[^ab]' matches any character except `a' or `b'.

If the `posix_newline' field in the pattern buffer (*note GNU Pattern
Buffers::) is set, then nonmatching lists do not match a newline.

Most characters lose any special meaning inside a list.
The special characters inside a list follow.

`]'
     ends the list if it's not the first list item. So, if you want to
     make the `]' character a list item, you must put it first.

`\'
     quotes the next character if the syntax bit
     `RE_BACKSLASH_ESCAPE_IN_LISTS' is set.

`[:'
     represents the open-character-class operator (*note Character
     Class Operators::) if the syntax bit `RE_CHAR_CLASSES' is set and
     what follows is a valid character class expression.

`:]'
     represents the close-character-class operator if the syntax bit
     `RE_CHAR_CLASSES' is set and what precedes it is an
     open-character-class operator followed by a valid character class
     name.

`-'
     represents the range operator (*note Range Operator::) if it's not
     first or last in a list or the ending point of a range.

All other characters are ordinary. For example, `[.*]' matches `.' and
`*'.

* Menu:

* Character Class Operators::   [:class:]
* Range Operator::              start-end

   ---------- Footnotes ----------

   (1) Regex therefore doesn't consider the `^' to be the first
character in the list. If you put a `^' character first in (what you
think is) a matching list, you'll turn it into a nonmatching list.


File: regex.info, Node: Character Class Operators, Next: Range Operator, Up: List Operators

Character Class Operators (`[:' ... `:]')
-----------------------------------------

If the syntax bit `RE_CHAR_CLASSES' is set, then Regex recognizes
character class expressions inside lists. A "character class
expression" matches one character from a given class. You form a
character class expression by putting a character class name between an
"open-character-class operator" (represented by `[:') and a
"close-character-class operator" (represented by `:]').
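The following is a minimal sketch of lists and character class expressions in use, assuming the POSIX interface (`regcomp', `regexec', `regfree') with `REG_EXTENDED' and the default C locale. The helper `list_matches' is a hypothetical name of ours; it wraps a list between `^' and `$' so that the list has to account for the entire one-character string.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the one-character string CH is
   matched by the bracket expression LIST (for example "[[:alpha:]]"),
   0 if it is not, and -1 if the pattern is too long or invalid.  */
int
list_matches (const char *list, const char *ch)
{
  char pattern[128];
  regex_t re;
  int n, status;

  /* Anchor the list at both ends so it must match all of CH.  */
  n = snprintf (pattern, sizeof pattern, "^%s$", list);
  if (n < 0 || (size_t) n >= sizeof pattern)
    return -1;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  status = regexec (&re, ch, 0, NULL, 0);
  regfree (&re);
  return status == 0;
}
```

For instance, `list_matches ("[[:alpha:]]", "q")' returns 1, `list_matches ("[]a]", "]")' returns 1 because a `]' placed first is a list item, and `list_matches ("[.*]", "x")' returns 0 because `.' and `*' are ordinary inside a list.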
The character class names and their meanings are:

`alnum'
     letters and digits

`alpha'
     letters

`blank'
     system-dependent; for GNU, a space or tab

`cntrl'
     control characters (in the ASCII encoding, code 0177 and codes
     less than 040)

`digit'
     digits

`graph'
     same as `print' except omits space

`lower'
     lowercase letters

`print'
     printable characters (in the ASCII encoding, space through tilde,
     i.e., codes 040 through 0176)

`punct'
     neither control nor alphanumeric characters

`space'
     space, horizontal tab, carriage return, newline, vertical tab, and
     form feed

`upper'
     uppercase letters

`xdigit'
     hexadecimal digits: `0'-`9', `a'-`f', `A'-`F'

These correspond to the definitions in the C library's `<ctype.h>'
facility.  For example, `[:alpha:]' corresponds to the standard
facility `isalpha'.  Regex recognizes character class expressions only
inside of lists; so `[[:alpha:]]' matches any letter, but `[:alpha:]'
outside of a bracket expression and not followed by a repetition
operator matches just itself.

File: regex.info,  Node: Range Operator,  Prev: Character Class Operators,  Up: List Operators

The Range Operator (`-')
------------------------

   Regex recognizes "range expressions" inside a list.  They represent
those characters that fall between two elements in the current
collating sequence.  You form a range expression by putting a "range
operator" between two characters.(1) `-' represents the range operator.
For example, `a-f' within a list represents all the characters from `a'
through `f' inclusively.

   If the syntax bit `RE_NO_EMPTY_RANGES' is set, then if the range's
ending point collates less than its starting point, the range (and the
regular expression containing it) is invalid.  For example, the regular
expression `[z-a]' would be invalid.  If this bit isn't set, then Regex
considers such a range to be empty.

   Since `-' represents the range operator, if you want to make a `-'
character itself a list item, you must do one of the following:

   * Put the `-' either first or last in the list.
   * Include a range whose starting point collates strictly lower than
     `-' and whose ending point collates equal or higher.  Unless a
     range is the first item in a list, a `-' can't be its starting
     point, but _can_ be its ending point.  That is because Regex
     considers `-' to be the range operator unless it is preceded by
     another `-'.  For example, in the ASCII encoding, `)', `*', `+',
     `,', `-', `.', and `/' are contiguous characters in the collating
     sequence.  You might think that `[)-+--/]' has two ranges: `)-+'
     and `--/'.  Rather, it has the ranges `)-+' and `+--', plus the
     character `/', so it matches, e.g., `,', not `.'.

   * Put a range whose starting point is `-' first in the list.

   For example, `[-a-z]' matches a lowercase letter or a hyphen (in
English, in ASCII).

   ---------- Footnotes ----------

   (1) You can't use a character class for the starting or ending point
of a range, since a character class is not a single character.

File: regex.info,  Node: Grouping Operators,  Next: Back-reference Operator,  Prev: List Operators,  Up: Common Operators

Grouping Operators (`(' ... `)' or `\(' ... `\)')
=================================================

   A "group", also known as a "subexpression", consists of an
"open-group operator", any number of other operators, and a
"close-group operator".  Regex treats this sequence as a unit, just as
mathematics and programming languages treat a parenthesized expression
as a unit.

   Therefore, using "groups", you can:

   * delimit the argument(s) to an alternation operator (*note
     Alternation Operator::) or a repetition operator (*note Repetition
     Operators::).

   * keep track of the indices of the substring that matched a given
     group.  *Note Using Registers::, for a precise explanation.  This
     lets you:

        * use the back-reference operator (*note Back-reference
          Operator::).

        * use registers (*note Using Registers::).
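
   As a hedged illustration of the second point, the sketch below
reports the indices of the substring matched by one group.  It assumes
the POSIX interface in `<regex.h>' with `REG_EXTENDED', where `(' and
`)' are the group operators and the `regmatch_t' array plays the role
that registers play for the GNU functions.  The helper name
`group_span' and the limit of ten entries are assumptions made for this
example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Store in *START and *END the offsets in TEXT of the substring
   matched by group number GROUP of the extended regular expression
   PATTERN (group 0 is the whole match).  Return 0 on success and -1 if
   the pattern is invalid, nothing matches, or the group did not
   participate in the match.  `group_span' is a hypothetical helper.  */
int
group_span (const char *pattern, const char *text, size_t group,
            int *start, int *end)
{
  enum { MAX_GROUPS = 10 };   /* whole match plus groups 1 through 9 */
  regex_t compiled;
  regmatch_t regs[MAX_GROUPS];
  int result = -1;

  assert (pattern != NULL && text != NULL);
  assert (start != NULL && end != NULL);

  if (group >= MAX_GROUPS)
    return -1;
  if (regcomp (&compiled, pattern, REG_EXTENDED) != 0)
    return -1;

  /* regexec sets rm_so to -1 for groups that did not participate.  */
  if (regexec (&compiled, text, MAX_GROUPS, regs, 0) == 0
      && regs[group].rm_so != -1)
    {
      *start = (int) regs[group].rm_so;
      *end = (int) regs[group].rm_eo;
      result = 0;
    }

  regfree (&compiled);
  return result;
}
```

   For instance, matching `fo(o|b)ar' against `xfobar' should give
offsets 1 and 6 for the whole match and offsets 3 and 4 for group 1,
the `b' chosen by the alternation inside the group.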
   If the syntax bit `RE_NO_BK_PARENS' is set, then `(' represents the
open-group operator and `)' represents the close-group operator;
otherwise, `\(' and `\)' do.

   If the syntax bit `RE_UNMATCHED_RIGHT_PAREN_ORD' is set and a
close-group operator has no matching open-group operator, then Regex
considers it to match `)'.

File: regex.info,  Node: Back-reference Operator,  Next: Anchoring Operators,  Prev: Grouping Operators,  Up: Common Operators

The Back-reference Operator ("\"DIGIT)
======================================

   If the syntax bit `RE_NO_BK_REF' isn't set, then Regex recognizes
back references.  A back reference matches a specified preceding group.
The back reference operator is represented by `\DIGIT' anywhere after
the end of a regular expression's DIGIT-th group (*note Grouping
Operators::).

   DIGIT must be between `1' and `9'.  The matcher assigns numbers 1
through 9 to the first nine groups it encounters.  By using one of `\1'
through `\9' after the corresponding group's close-group operator, you
can match a substring identical to the one that the group does.

   Back references match according to the following (in all examples
below, `(' represents the open-group, `)' the close-group, `{' the
open-interval and `}' the close-interval operator):

   * If the group matches a substring, the back reference matches an
     identical substring.  For example, `(a)\1' matches `aa' and
     `(bana)na\1bo\1' matches `bananabanabobana'.  Likewise, `(.*)\1'
     matches any (newline-free if the syntax bit `RE_DOT_NEWLINE' isn't
     set) string that is composed of two identical halves; the `(.*)'
     matches the first half and the `\1' matches the second half.

   * If the group matches more than once (as it might if followed by,
     e.g., a repetition operator), then the back reference matches the
     substring the group _last_ matched.  For example, `((a*)b)*\1\2'
     matches `aabababa'; first group 1 (the outer one) matches `aab'
     and group 2 (the inner one) matches `aa'.  Then group 1 matches
     `ab' and group 2 matches `a'.
     So, `\1' matches `ab' and `\2' matches `a'.

   * If the group doesn't participate in a match, i.e., it is part of
     an alternative not taken or a repetition operator allows zero
     repetitions of it, then the back reference makes the whole match
     fail.  For example, `(one()|two())-and-(three\2|four\3)' matches
     `one-and-three' and `two-and-four', but not `one-and-four' or
     `two-and-three'.  For example, if the pattern matches `one-and-',
     then its group 2 matches the empty string and its group 3 doesn't
     participate in the match.  So, if it then matches `four', then
     when it tries to back reference group 3--which it will attempt to
     do because `\3' follows the `four'--the match will fail because
     group 3 didn't participate in the match.

   You can use a back reference as an argument to a repetition
operator.  For example, `(a(b))\2*' matches `a' followed by one or more
`b's: the group supplies the first `b' and `\2*' matches zero or more
further ones.  Similarly, `(a(b))\2{3}' matches `abbbb'.

   If there is no preceding DIGIT-th subexpression, the regular
expression is invalid.

File: regex.info,  Node: Anchoring Operators,  Prev: Back-reference Operator,  Up: Common Operators

Anchoring Operators
===================

   These operators can constrain a pattern to match only at the
beginning or end of the entire string or at the beginning or end of a
line.

* Menu:

* Match-beginning-of-line Operator::  ^
* Match-end-of-line Operator::        $

File: regex.info,  Node: Match-beginning-of-line Operator,  Next: Match-end-of-line Operator,  Up: Anchoring Operators

The Match-beginning-of-line Operator (`^')
------------------------------------------

   This operator can match the empty string either at the beginning of
the string or after a newline character.  Thus, it is said to "anchor"
the pattern to the beginning of a line.

   In the cases following, `^' represents this operator.  (Otherwise,
`^' is ordinary.)

   * It (the `^') is first in the pattern, as in `^foo'.

   * The syntax bit `RE_CONTEXT_INDEP_ANCHORS' is set, and it is outside
     a bracket expression.
   * It follows an open-group or alternation operator, as in `a\(^b\)'
     and `a\|^b'.  *Note Grouping Operators::, and *Note Alternation
     Operator::.

   These rules imply that some valid patterns containing `^' cannot be
matched; for example, `foo^bar' if `RE_CONTEXT_INDEP_ANCHORS' is set.

   If the `not_bol' field is set in the pattern buffer (*note GNU
Pattern Buffers::), then `^' fails to match at the beginning of the
string.  *Note POSIX Matching::, for when you might find this useful.

   If the `newline_anchor' field is not set in the pattern buffer, then
`^' fails to match after a newline.  This is useful when you do not
regard the string to be matched as broken into lines.

File: regex.info,  Node: Match-end-of-line Operator,  Prev: Match-beginning-of-line Operator,  Up: Anchoring Operators

The Match-end-of-line Operator (`$')
------------------------------------

   This operator can match the empty string either at the end of the
string or before a newline character in the string.  Thus, it is said
to "anchor" the pattern to the end of a line.

   It is always represented by `$'.  For example, `foo$' usually
matches, e.g., `foo' and, e.g., the first three characters of
`foo\nbar'.

   Its interaction with the syntax bits and pattern buffer fields is
exactly the dual of `^''s; see the previous section.  (That is,
"beginning" becomes "end", "next" becomes "previous", and "after"
becomes "before".)

File: regex.info,  Node: GNU Operators,  Next: GNU Emacs Operators,  Prev: Common Operators,  Up: Top

GNU Operators
*************

   Following are operators that GNU defines (and POSIX doesn't).

* Menu:

* Word Operators::
* Buffer Operators::

File: regex.info,  Node: Word Operators,  Next: Buffer Operators,  Up: GNU Operators

Word Operators
==============

   The operators in this section require Regex to recognize parts of
words.  Regex uses a syntax table to determine whether or not a
character is part of a word, i.e., whether or not it is
"word-constituent".
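
   To make "word-constituent" concrete, here is a minimal sketch of a
table of the default kind, in which letters, digits and `_' count as
word-constituent and every other character does not (*note Non-Emacs
Syntax Tables::).  It is not the library's code: `EXAMPLE_SWORD',
`EXAMPLE_TABLE_SIZE' and both function names are hypothetical stand-ins
for illustration, and `isalnum' is assumed to run in the C locale.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins: Regex defines its own `Sword' value, and the
   table has one element per possible `unsigned char'.  */
enum { EXAMPLE_SWORD = 1, EXAMPLE_TABLE_SIZE = UCHAR_MAX + 1 };

/* Fill TABLE so that element I is EXAMPLE_SWORD if I is a letter, a
   digit or `_', and zero otherwise.  */
void
example_init_syntax_table (char *table)
{
  int c;

  assert (table != NULL);
  for (c = 0; c < EXAMPLE_TABLE_SIZE; c++)
    table[c] = (isalnum (c) || c == '_') ? EXAMPLE_SWORD : 0;
}

/* Return 1 if C is word-constituent according to TABLE, else 0.  */
int
example_is_word_constituent (const char *table, unsigned char c)
{
  assert (table != NULL);
  return table[c] == EXAMPLE_SWORD;
}
```

   After `example_init_syntax_table' has filled a table of
`EXAMPLE_TABLE_SIZE' elements, `example_is_word_constituent' should
answer 1 for `a', `9' and `_', and 0 for a space or `-'.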
* Menu:

* Non-Emacs Syntax Tables::
* Match-word-boundary Operator::        \b
* Match-within-word Operator::          \B
* Match-beginning-of-word Operator::    \<
* Match-end-of-word Operator::          \>
* Match-word-constituent Operator::     \w
* Match-non-word-constituent Operator:: \W

File: regex.info,  Node: Non-Emacs Syntax Tables,  Next: Match-word-boundary Operator,  Up: Word Operators

Non-Emacs Syntax Tables
-----------------------

   A "syntax table" is an array indexed by the characters in your
character set.  With 8-bit characters, therefore, a syntax table has
256 elements.  Regex always uses a `char *' variable `re_syntax_table'
as its syntax table.  In some cases, it initializes this variable and
in others it expects you to initialize it.

   * If Regex is compiled with the preprocessor symbols `emacs' and
     `SYNTAX_TABLE' both undefined, then Regex allocates
     `re_syntax_table' and initializes an element I either to `Sword'
     (which it defines) if I is a letter, number, or `_', or to zero if
     it's not.

   * If Regex is compiled with `emacs' undefined but `SYNTAX_TABLE'
     defined, then Regex expects you to define a `char *' variable
     `re_syntax_table' to be a valid syntax table.

   * *Note Emacs Syntax Tables::, for what happens when Regex is
     compiled with the preprocessor symbol `emacs' defined.

File: regex.info,  Node: Match-word-boundary Operator,  Next: Match-within-word Operator,  Prev: Non-Emacs Syntax Tables,  Up: Word Operators

The Match-word-boundary Operator (`\b')
---------------------------------------

   This operator (represented by `\b') matches the empty string at
either the beginning or the end of a word.  For example, `\brat\b'
matches the separate word `rat'.

File: regex.info,  Node: Match-within-word Operator,  Next: Match-beginning-of-word Operator,  Prev: Match-word-boundary Operator,  Up: Word Operators

The Match-within-word Operator (`\B')
-------------------------------------

   This operator (represented by `\B') matches the empty string within
a word.
For example, `c\Brat\Be' matches `crate', but `dirty \Brat' doesn't
match `dirty rat'.

File: regex.info,  Node: Match-beginning-of-word Operator,  Next: Match-end-of-word Operator,  Prev: Match-within-word Operator,  Up: Word Operators

The Match-beginning-of-word Operator (`\<')
-------------------------------------------

   This operator (represented by `\<') matches the empty string at the
beginning of a word.

File: regex.info,  Node: Match-end-of-word Operator,  Next: Match-word-constituent Operator,  Prev: Match-beginning-of-word Operator,  Up: Word Operators

The Match-end-of-word Operator (`\>')
-------------------------------------

   This operator (represented by `\>') matches the empty string at the
end of a word.

File: regex.info,  Node: Match-word-constituent Operator,  Next: Match-non-word-constituent Operator,  Prev: Match-end-of-word Operator,  Up: Word Operators

The Match-word-constituent Operator (`\w')
------------------------------------------

   This operator (represented by `\w') matches any word-constituent
character.

File: regex.info,  Node: Match-non-word-constituent Operator,  Prev: Match-word-constituent Operator,  Up: Word Operators

The Match-non-word-constituent Operator (`\W')
----------------------------------------------

   This operator (represented by `\W') matches any character that is
not word-constituent.

File: regex.info,  Node: Buffer Operators,  Prev: Word Operators,  Up: GNU Operators

Buffer Operators
================

   Following are operators which work on buffers.  In Emacs, a "buffer"
is, naturally, an Emacs buffer.  For other programs, Regex considers
the entire string to be matched as the buffer.
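
   The difference between anchoring to a line and anchoring to the
buffer can be seen in the following sketch, which uses the two buffer
operators listed in the menu below.  It rests on an assumption that
this manual does not state: that the C library's POSIX interface
accepts the GNU operators `\`' and `\'', as the GNU C Library's
`regcomp' is understood to do.  Other implementations of `<regex.h>'
may reject or reinterpret them.  The helper name `anchor_matches' is
hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if PATTERN matches somewhere in TEXT, 0 if it does not, and
   -1 if PATTERN is invalid.  REG_NEWLINE makes `^' and `$' match at
   newlines as well, so that they differ visibly from the buffer
   operators.  Assumes a regcomp that accepts the GNU operators \` and
   \' (an assumption; this is not guaranteed by POSIX).  */
int
anchor_matches (const char *pattern, const char *text)
{
  regex_t compiled;
  int status;

  assert (pattern != NULL && text != NULL);

  if (regcomp (&compiled, pattern,
               REG_EXTENDED | REG_NEWLINE | REG_NOSUB) != 0)
    return -1;

  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```

   Under that assumption, with the two-line string `foo\nbar' as the
buffer, `^bar' and `foo$' should match because each line has its own
beginning and end, whereas `\`bar' and `foo\'' should not, since `bar'
is not at the beginning of the buffer and `foo' is not at its end.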
* Menu:

* Match-beginning-of-buffer Operator::  \`
* Match-end-of-buffer Operator::        \'

File: regex.info,  Node: Match-beginning-of-buffer Operator,  Next: Match-end-of-buffer Operator,  Up: Buffer Operators

The Match-beginning-of-buffer Operator
--------------------------------------

   This operator (represented by `\`') matches the empty string at the
beginning of the buffer.

File: regex.info,  Node: Match-end-of-buffer Operator,  Prev: Match-beginning-of-buffer Operator,  Up: Buffer Operators

The Match-end-of-buffer Operator
--------------------------------

   This operator (represented by `\'') matches the empty string at the
end of the buffer.

File: regex.info,  Node: GNU Emacs Operators,  Next: What Gets Matched?,  Prev: GNU Operators,  Up: Top

GNU Emacs Operators
*******************

   Following are operators that GNU defines (and POSIX doesn't) that
you can use only when Regex is compiled with the preprocessor symbol
`emacs' defined.

* Menu:

* Syntactic Class Operators::