# STC [cregex](../include/stc/cregex.h): Regular Expressions

## Description

**cregex** is a small and fast Unicode UTF8 regular expression parser. It is based on Rob Pike's non-backtracking NFA-based regular expression implementation for the Plan 9 project. See Russ Cox's articles [Implementing Regular Expressions](https://swtch.com/~rsc/regexp/) on why NFA-based regular expression engines are often superior to the common backtracking implementations (hint: NFAs have no "bad/slow" RE patterns).

The API is simple and includes powerful string pattern matching and replace functions. See the examples below and in the examples folder.

## Methods

```c
enum {
    // compile-flags
    cre_c_dotall = 1<<0,    // dot matches newline too
    cre_c_caseless = 1<<1,  // ignore case
    // match-flags
    cre_m_fullmatch = 1<<2, // as if start- and end-of-line anchors were in pattern: "^ ... $"
    cre_m_next = 1<<3,      // use end of previous match[0] as start of input
    cre_m_startend = 1<<4,  // use match[0] as start+end of input
    // replace-flags
    cre_r_strip = 1<<5,     // only keep the replaced matches, strip the rest
};

cregex      cregex_init(void);
cregex      cregex_from(const char* pattern, int cflags);

            // return 1 = success, negative = error.
int         cregex_compile(cregex *self, const char* pattern, int cflags);

            // num. of capture groups in regex. 0 if RE is invalid. First group is the full match.
int         cregex_captures(const cregex* self);

            // return 1=match, 0=nomatch, -1=error. match array size: at least num groups in RE (1+).
int         cregex_find(const char* input, const cregex* re, csview match[], int mflags);
int         cregex_find_sv(csview input, const cregex* re, csview match[]);

            // takes string pattern instead of re. (for one-time matches)
int         cregex_find_p(const char* input, const char* pattern, csview match[], int cmflags);

bool        cregex_is_match(const char* input, const cregex* re);

cstr        cregex_replace(const char* input, const cregex* re, const char* replace, unsigned count);
cstr        cregex_replace_ex(const char* input, const cregex* re, const char* replace, unsigned count,
                              int rflags, bool (*mfun)(int grp, csview match, cstr* mstr));

            // takes string pattern instead of re
cstr        cregex_replace_p(const char* input, const char* pattern, const char* replace, unsigned count);
cstr        cregex_replace_pe(const char* input, const char* pattern, const char* replace, unsigned count,
                              int crflags, bool (*mfun)(int grp, csview match, cstr* mstr));

void        cregex_drop(cregex* self); // destroy
```

### Error codes

- cre_success = 1
- cre_nomatch = 0
- cre_matcherror = -1
- cre_outofmemory = -2
- cre_unmatchedleftparenthesis = -3
- cre_unmatchedrightparenthesis = -4
- cre_toomanysubexpressions = -5
- cre_toomanycharacterclasses = -6
- cre_malformedcharacterclass = -7
- cre_missingoperand = -8
- cre_unknownoperator = -9
- cre_operandstackoverflow = -10
- cre_operatorstackoverflow = -11
- cre_operatorstackunderflow = -12

### Limits

- cre_MAXCLASSES
- cre_MAXCAPTURES

## Usage

### Compiling a regular expression

```c
cregex re1 = cregex_init();
int result = cregex_compile(&re1, "[0-9]+", 0);
if (result < 0) return result;

const char* url = "(https?://|ftp://|www\\.)([0-9A-Za-z@:%_+~#=-]+\\.)+([a-z][a-z][a-z]?)(/[/0-9A-Za-z\\.@:%_+~#=\\?&-]*)?";
cregex re2 = cregex_from(url, 0);
if (re2.error) return re2.error;
...
cregex_drop(&re2);
cregex_drop(&re1);
```

If an error occurs, `cregex_compile` returns a negative value; see the error codes above.

### Getting the first match

```c
#define i_implement
#include <stc/cstr.h>
#include <stc/cregex.h>

int main() {
    const char* input = "start date is 2023-03-01, and end date is 2025-12-31.";
    const char* pattern = "\\b(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d)\\b";

    cregex re = cregex_from(pattern, 0);

    // Let's find the first date in the string:
    csview match[4]; // full-match, year, month, date.
    if (cregex_find(input, &re, match, 0) == cre_success)
        printf("Found date: %.*s\n", c_ARGsv(match[0]));
    else
        printf("Could not find any date\n");

    // Let's change all dates into US date format MM/DD/YYYY.
    // The count argument is required; 0 is assumed here to mean "replace all".
    cstr us_input = cregex_replace(input, &re, "$2/$3/$1", 0);
    printf("US input: %s\n", cstr_str(&us_input));

    // Free allocated data
    cstr_drop(&us_input);
    cregex_drop(&re);
}
```

For a single match you may use the all-in-one function. Compare against `cre_success`, because a negative error code would otherwise also test as true:

```c
if (cregex_find_p(input, pattern, match, 0) == cre_success)
    printf("Found date: %.*s\n", c_ARGsv(match[0]));
```

To compile, use: `gcc first_match.c src/cregex.c src/utf8code.c`.
In order to use a callback function in the replace call, see `examples/regex_replace.c`.

### Iterate through matches, c_foreach_match

To iterate over multiple matches in an input string, you may use:

```c
csview match[5] = {0};
while (cregex_find(input, &re, match, cre_m_next) == cre_success) {
    c_forrange (int, i, cregex_captures(&re))
        printf("submatch %d: %.*s\n", i, c_ARGsv(match[i]));
    puts("");
}
```

There is also a safe macro that simplifies it a bit:

```c
c_foreach_match (m, &re, input) {
    c_forrange (int, i, cregex_captures(&re))
        printf("submatch %d: %.*s\n", i, c_ARGsv(m.ref[i]));
    puts("");
}
```

## Using cregex in a project

**cregex** uses the following files:

- `stc/cregex.h`, `stc/utf8.h`, `stc/csview.h`, `stc/cstr.h`, `stc/ccommon.h`, `stc/forward.h`
- `src/cregex.c`, `src/utf8code.c`.

## Regex Cheatsheet

| Metacharacter | Description | STC addition |
|:--:|:--:|:--:|
| c | Most characters (like c) match themselves literally | |
| \c | Some characters are used as metacharacters. To use them literally, escape them | |
| . | Match any character, except newline unless in (?s) mode | |
| ? | Match the preceding token zero or one time | |
| * | Match the preceding token as often as possible | |
| + | Match the preceding token at least once and as often as possible | |
| \| | Match either the expression before the \| or the expression after it | |
| (c) | Match the expression inside the parentheses. This adds a capture group | |
| [c] | Match all characters inside the brackets. Ranges like a-z may also be used | |
| [^c] | Do not match the characters inside the bracket. | |
| \x{***hex***} | Match UTF8 character/codepoint given as a hex number | * |
| ^ | Start of line anchor | |
| $ | End of line anchor | |
| \A | Start of input anchor | * |
| \Z | End of input anchor | * |
| \z | End of input including optional newline | * |
| \b | UTF8 word boundary anchor | * |
| \B | Not UTF8 word boundary | * |
| \Q | Start literal input mode | * |
| \E | End literal input mode | * |
| (?i) (?-i) | Ignore case on/off (override global) | * |
| (?s) (?-s) | Dot matches newline on/off (override global) | * |
| \n \t \r | Match UTF8 newline, tab, carriage return | |
| \d \s \w | Match UTF8 digit, whitespace, alphanumeric character | |
| \D \S \W | Do not match the groups described above | |
| \p{Space} or \p{Sz} | Match UTF8 whitespace | * |
| \p{Digit} or \p{Nd} | Match UTF8 numeric | * |
| \p{XDigit} | Match UTF8 hex number | * |
| \p{Lower} or \p{Ll} | Match UTF8 lower case | * |
| \p{Upper} or \p{Lu} | Match UTF8 upper case | * |
| \p{Alpha} or \p{LC} | Match UTF8 cased letter | * |
| \p{Alnum} | Match UTF8 alpha numeric | * |
| \P{***class***} | Do not match the classes described above | * |
| [[:alnum:]] [[:alpha:]] [[:ascii:]] | Match ASCII character class | * |
| [[:blank:]] [[:cntrl:]] [[:digit:]] | Match ASCII character class | * |
| [[:graph:]] [[:lower:]] [[:print:]] | Match ASCII character class | * |
| [[:punct:]] [[:space:]] [[:upper:]] | Match ASCII character class | * |
| [[:xdigit:]] [[:word:]] | Match ASCII character class | * |
| [[:^***class***:]] | Do not match ASCII character class | * |
| $***n*** | *n*-th substitution backreference to capture group. ***n*** in 0-9. $0 is the entire match. | * |
| $***nn***; | As above, but can handle ***nn*** < cre_MAXCAPTURES. | * |

## Limitations

The main goal of **cregex** is to be small and fast with limited but useful Unicode support. In order to reach these goals, **cregex** currently does not support the following features (non-exhaustive list):

- In order to limit table sizes, most general UTF8 character classes are missing, like \p{L}, \p{S}, and all specific scripts like \p{Greek} etc. Some/all of these may be added in the future as an alternative source file with Unicode tables to link with.
- {n, m} syntax for repeating the previous token min-max times.
- Non-capturing groups
- Lookaround and backreferences

If you need a more feature-complete but bigger library, use [RE2 with C-wrapper](https://github.com/google/re2), which uses the same type of regex engine as **cregex**, or use [PCRE2](https://www.pcre.org/).