# STC [cregex](../include/stc/cregex.h): Regular Expressions

## Description

**cregex** is a small and fast unicode UTF8 regular expression parser. It is based on Rob Pike's non-backtracking NFA-based regular expression implementation for the Plan 9 project. See Russ Cox's articles [Implementing Regular Expressions](https://swtch.com/~rsc/regexp/) on why NFA-based regular expression engines are often superior to the common backtracking implementations (hint: NFAs have no "bad/slow" RE patterns).

The API is simple and includes powerful string pattern matching and replace functions. See the example below and those in the examples folder.

## Methods

```c
enum {
    /* compile-flags */
    CREG_C_DOTALL = 1<<0,    /* dot matches newline too: can be set/overridden by (?s) and (?-s) in RE */
    CREG_C_ICASE = 1<<1,     /* ignore case mode: can be set/overridden by (?i) and (?-i) in RE */
    /* match-flags */
    CREG_M_FULLMATCH = 1<<2, /* like start-, end-of-line anchors were in pattern: "^ ... $" */
    CREG_M_NEXT = 1<<3,      /* use end of previous match[0] as start of input */
    CREG_M_STARTEND = 1<<4,  /* use match[0] as start+end of input */
    /* replace-flags */
    CREG_R_STRIP = 1<<5,     /* only keep the replaced matches, strip the rest */
};

cregex cregex_init(void);
cregex cregex_from(const char* pattern, int cflags);

/* return CREG_OK, or negative error code on failure */
int cregex_compile(cregex *self, const char* pattern, int cflags);

/* num. of capture groups in regex. 0 if RE is invalid.
   First group is the full match */
int cregex_captures(const cregex* self);

/* return CREG_OK, CREG_NOMATCH, or CREG_MATCHERROR */
int cregex_find(const cregex* re, const char* input, csview match[], int mflags);

/* Search inside input string-view only */
int cregex_find_sv(const cregex* re, csview input, csview match[]);

/* All-in-one search (compile + find + drop) */
int cregex_find_pattern(const char* pattern, const char* input, csview match[], int cmflags);

/* Check if there are matches in input */
bool cregex_is_match(const cregex* re, const char* input);

/* Replace all matches in input */
cstr cregex_replace(const cregex* re, const char* input, const char* replace);

/* Replace count matches in input string-view. Optionally transform replacement with mfun. */
cstr cregex_replace_sv(const cregex* re, csview input, const char* replace, unsigned count,
                       bool(*mfun)(int capgrp, csview match, cstr* mstr), int rflags);

/* All-in-one replacement (compile + find/replace + drop) */
cstr cregex_replace_pattern(const char* pattern, const char* input, const char* replace);
cstr cregex_replace_pattern_ex(const char* pattern, const char* input, const char* replace, unsigned count,
                               bool(*mfun)(int capgrp, csview match, cstr* mstr), int rflags);

void cregex_drop(cregex* self); /* destroy */
```

### Error codes
- CREG_OK = 0
- CREG_NOMATCH = -1
- CREG_MATCHERROR = -2
- CREG_OUTOFMEMORY = -3
- CREG_UNMATCHEDLEFTPARENTHESIS = -4
- CREG_UNMATCHEDRIGHTPARENTHESIS = -5
- CREG_TOOMANYSUBEXPRESSIONS = -6
- CREG_TOOMANYCHARACTERCLASSES = -7
- CREG_MALFORMEDCHARACTERCLASS = -8
- CREG_MISSINGOPERAND = -9
- CREG_UNKNOWNOPERATOR = -10
- CREG_OPERANDSTACKOVERFLOW = -11
- CREG_OPERATORSTACKOVERFLOW = -12
- CREG_OPERATORSTACKUNDERFLOW = -13

### Limits
- CREG_MAX_CLASSES
- CREG_MAX_CAPTURES

## Usage

### Compiling a regular expression
```c
cregex re1 = cregex_init();
int result = cregex_compile(&re1, "[0-9]+", CREG_DEFAULT);
if (result < 0) return result;

const char* url =
    "(https?://|ftp://|www\\.)([0-9A-Za-z@:%_+~#=-]+\\.)+([a-z][a-z][a-z]?)(/[/0-9A-Za-z\\.@:%_+~#=\\?&-]*)?";
cregex re2 = cregex_from(url, CREG_DEFAULT);
if (re2.error != CREG_OK)
    return re2.error;
...
cregex_drop(&re2);
cregex_drop(&re1);
```
If an error occurs, `cregex_compile` returns a negative error code. With `cregex_from`, the code is available in the `error` member of the returned object (`re2.error` above).

### Getting the first match and making text replacements
```c
#define i_extern // include external utf8 and cregex functions implementation.
#include <stc/cregex.h>
#include <stdio.h>

int main(void) {
    const char* input = "start date is 2023-03-01, and end date is 2025-12-31.";
    const char* pattern = "\\b(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d)\\b";

    cregex re = cregex_from(pattern, CREG_DEFAULT);

    // Let's find the first date in the string:
    csview match[4]; // full-match, year, month, day.
    if (cregex_find(&re, input, match, CREG_DEFAULT) == CREG_OK)
        printf("Found date: %.*s\n", c_ARGSV(match[0]));
    else
        printf("Could not find any date\n");

    // Let's change all dates into US date format MM/DD/YYYY:
    cstr us_input = cregex_replace(&re, input, "$2/$3/$1");
    printf("US input: %s\n", cstr_str(&us_input));

    // Free allocated data
    cstr_drop(&us_input);
    cregex_drop(&re);
}
```
For a single match you may use the all-in-one function. Compare the result with `CREG_OK`, since `CREG_OK` is 0:
```c
if (cregex_find_pattern(pattern, input, match, CREG_DEFAULT) == CREG_OK)
    printf("Found date: %.*s\n", c_ARGSV(match[0]));
```

To compile, use: `gcc first_match.c src/cregex.c src/utf8code.c`.
In order to use a callback function in the replace call, see `examples/regex_replace.c`.
### Iterate through regex matches, *c_FORMATCH*

To iterate multiple matches in an input string, you may use
```c
csview match[5] = {0};
while (cregex_find(&re, input, match, CREG_M_NEXT) == CREG_OK)
    c_FORRANGE (k, cregex_captures(&re))
        printf("submatch %lld: %.*s\n", k, c_ARGSV(match[k]));
```
There is also a safe macro which simplifies this:
```c
c_FORMATCH (it, &re, input)
    c_FORRANGE (k, cregex_captures(&re))
        printf("submatch %lld: %.*s\n", k, c_ARGSV(it.match[k]));
```

## Using cregex in a project

The easiest way is to `#define i_extern` before `#include <stc/cregex.h>`. Make sure to do that in one translation unit only.

For reference, **cregex** uses the following files:
- `stc/cregex.h`, `stc/utf8.h`, `stc/csview.h`, `stc/cstr.h`, `stc/ccommon.h`, `stc/forward.h`
- `src/cregex.c`, `src/utf8code.c`.

## Regex Cheatsheet

| Metacharacter | Description | STC addition |
|:--:|:--:|:--:|
| ***c*** | Most characters (like c) match themselves literally | |
| \\***c*** | Some characters are used as metacharacters. To use them literally, escape them | |
| . | Match any character, except newline unless in (?s) mode | |
| ? | Match the preceding token zero or one time | |
| * | Match the preceding token as often as possible | |
| + | Match the preceding token at least once and as often as possible | |
| \| | Match either the expression before the \| or the expression after it | |
| (***expr***) | Match the expression inside the parentheses. ***This adds a capture group*** | |
| [***chars***] | Match any character inside the brackets. Ranges like a-z may also be used | |
| \[^***chars***\] | Match any character not inside the bracket. | |
| \x{***hex***} | Match UTF8 character/codepoint given as a hex number | * |
| ^ | Start of line anchor | |
| $ | End of line anchor | |
| \A | Start of input anchor | * |
| \Z | End of input anchor | * |
| \z | End of input including optional newline | * |
| \b | UTF8 word boundary anchor | * |
| \B | Not UTF8 word boundary | * |
| \Q | Start literal input mode | * |
| \E | End literal input mode | * |
| (?i) (?-i) | Ignore case on/off (override global) | * |
| (?s) (?-s) | Dot matches newline on/off (override global) | * |
| \n \t \r | Match UTF8 newline, tab, carriage return | |
| \d \s \w | Match UTF8 digit, whitespace, alphanumeric character | |
| \D \S \W | Do not match the groups described above | |
| \p{Alnum} | Match UTF8 alpha numeric | * |
| \p{XDigit} | Match UTF8 hex number | * |
| \p{Alpha} or \p{LC} | Match UTF8 cased letter | * |
| \p{Digit} or \p{Nd} | Match UTF8 numeric | * |
| \p{Lower} or \p{Ll} | Match UTF8 lower case | * |
| \p{Upper} or \p{Lu} | Match UTF8 upper case | * |
| \p{Space} or \p{Sz} | Match UTF8 whitespace | * |
| \P{***Class***} | Do not match the classes described above | * |
| [:alnum:] [:alpha:] [:ascii:] | Match ASCII character class. NB: only to be used inside [] brackets | * |
| [:blank:] [:cntrl:] [:digit:] | " | * |
| [:graph:] [:lower:] [:print:] | " | * |
| [:punct:] [:space:] [:upper:] | " | * |
| [:xdigit:] [:word:] | " | * |
| [:^***class***:] | Match character not in the ASCII class | * |
| $***n*** | *n*-th substitution backreference to capture group. ***n*** in 0-9. $0 is the entire match. | * |
| $***nn;*** | As above, but can handle ***nn*** < CREG_MAX_CAPTURES. | * |

## Limitations

The main goal of **cregex** is to be small and fast with limited but useful unicode support.
In order to reach these goals, **cregex** currently does not support the following features (non-exhaustive list):
- In order to limit table sizes, most general UTF8 character classes are missing, like \p{L}, \p{S}, and all specific scripts like \p{Greek} etc. Some/all of these may be added in the future as an alternative source file with unicode tables to link with.
- {n, m} syntax for repeating previous token min-max times.
- Non-capturing groups
- Lookaround and backreferences

If you need a more feature complete, but bigger library, use [RE2 with C-wrapper](https://github.com/google/re2) which uses the same type of regex engine as **cregex**, or use [PCRE2](https://www.pcre.org/).