# STC [cregex](../include/stc/cregex.h): Regular Expressions


## Description

**cregex** is a small and fast Unicode UTF8 regular expression parser. It is based on Rob Pike's non-backtracking NFA-based regular expression implementation for the Plan 9 project. See Russ Cox's articles [Implementing Regular Expressions](https://swtch.com/~rsc/regexp/) on why NFA-based regular expression engines are often superior to the common backtracking implementations (hint: NFAs have no "bad/slow" RE patterns).
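
To illustrate why a non-backtracking engine has no pathological patterns, here is a minimal sketch of lockstep NFA simulation. It is not the cregex implementation: it handles only literal bytes, `.`, `?`, `*` and `+` on single characters, always matches the whole input, and all names (`nfa_item`, `nfa_fullmatch`, ...) are hypothetical. The set of active states is kept in a bitmask and advanced once per input byte, so the work is bounded by pattern length times input length.

```c
#include <stdbool.h>
#include <stdint.h>

/* One pattern item: a literal byte (or '.' for any byte) plus a quantifier. */
typedef struct { char ch; char quant; } nfa_item;   /* quant: 1, '?' or '*' */

enum { NFA_MAX_ITEMS = 62 };   /* states 0..62 fit in a 64-bit mask */

/* Split the pattern into items; "x+" is stored as "x" followed by "x*". */
static int nfa_parse(const char* pat, nfa_item items[NFA_MAX_ITEMS]) {
    int n = 0;
    while (*pat) {
        char c = *pat++;
        char q = (*pat == '?' || *pat == '*' || *pat == '+') ? *pat++ : 1;
        if (n + 2 > NFA_MAX_ITEMS) return -1;        /* pattern too long */
        if (q == '+') {
            items[n++] = (nfa_item){c, 1};
            items[n++] = (nfa_item){c, '*'};
        } else {
            items[n++] = (nfa_item){c, q};
        }
    }
    return n;
}

/* Follow empty transitions: an optional item ('?' or '*') may be skipped. */
static uint64_t nfa_closure(uint64_t set, const nfa_item* items, int n) {
    for (int i = 0; i < n; ++i)
        if (((set >> i) & 1) && items[i].quant != 1)
            set |= (uint64_t)1 << (i + 1);
    return set;
}

/* Return true if the whole input matches the pattern. Bit i of `set` means
   "item i is the next one to match"; bit n is the accepting state. */
bool nfa_fullmatch(const char* pat, const char* input) {
    nfa_item items[NFA_MAX_ITEMS];
    int n = nfa_parse(pat, items);
    if (n < 0) return false;

    uint64_t set = nfa_closure(1, items, n);
    for (; *input && set; ++input) {
        uint64_t next = 0;
        for (int i = 0; i < n; ++i) {
            if (!((set >> i) & 1)) continue;
            if (items[i].ch != '.' && items[i].ch != *input) continue;
            /* '*' stays on the same item; everything else advances. */
            next |= (uint64_t)1 << (items[i].quant == '*' ? i : i + 1);
        }
        set = nfa_closure(next, items, n);
    }
    return ((set >> n) & 1) != 0;
}
```

A pattern such as `a*a*a*a*b` against a long run of `a` characters, which can be expensive for a backtracking matcher, is rejected here after a single pass over the input.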

The API is simple and includes powerful functions for string pattern matching and replacement. See the examples below and in the examples folder.

## Methods

```c
enum {
    CREG_DEFAULT = 0,        /* no flags set */
    /* compile-flags */
    CREG_C_DOTALL = 1<<0,    /* dot matches newline too: can be set/overridden by (?s) and (?-s) in RE */
    CREG_C_ICASE = 1<<1,     /* ignore case mode: can be set/overridden by (?i) and (?-i) in RE */
    /* match-flags */
    CREG_M_FULLMATCH = 1<<2, /* as if start- and end-of-line anchors were in the pattern: "^ ... $" */
    CREG_M_NEXT = 1<<3,      /* use end of previous match[0] as start of input */
    CREG_M_STARTEND = 1<<4,  /* use match[0] as start+end of input */
    /* replace-flags */
    CREG_R_STRIP = 1<<5,     /* only keep the replaced matches, strip the rest */
};

cregex      cregex_init(void);
cregex      cregex_from(const char* pattern, int cflags);
            /* return CREG_OK, or negative error code on failure */
int         cregex_compile(cregex *self, const char* pattern, int cflags);

            /* number of capture groups in regex, 0 if RE is invalid. The first group is the full match */
int         cregex_captures(const cregex* self);

            /* return CREG_OK, CREG_NOMATCH, or CREG_MATCHERROR */
int         cregex_find(const cregex* re, const char* input, csview match[], int mflags);
            /* Search inside input string-view only */
int         cregex_find_sv(const cregex* re, csview input, csview match[]);
            /* All-in-one search (compile + find + drop) */
int         cregex_find_pattern(const char* pattern, const char* input, csview match[], int cmflags);

            /* Check if there are matches in input */
bool        cregex_is_match(const cregex* re, const char* input);

            /* Replace all matches in input */
cstr        cregex_replace(const cregex* re, const char* input, const char* replace);
            /* Replace count matches in input string-view. Optionally transform replacement with mfun. */
cstr        cregex_replace_sv(const cregex* re, csview input, const char* replace, unsigned count,
                              bool(*mfun)(int capgrp, csview match, cstr* mstr), int rflags);

            /* All-in-one replacement (compile + find/replace + drop) */
cstr        cregex_replace_pattern(const char* pattern, const char* input, const char* replace);
cstr        cregex_replace_pattern_ex(const char* pattern, const char* input, const char* replace, unsigned count,
                                      bool(*mfun)(int capgrp, csview match, cstr* mstr), int rflags);

void        cregex_drop(cregex* self); /* destroy */
```
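
The flags above occupy distinct bits, so compile flags and match flags can be OR'ed into a single `int`, which is presumably what the `cmflags` parameter of `cregex_find_pattern` expects. The sketch below copies the flag values from the listing so that it compiles on its own; the `EX_*` masks and `ex_*` helpers are hypothetical and not part of cregex.

```c
/* Flag values copied from the listing above so the sketch is self-contained.
   Do not combine this with <stc/cregex.h>, which defines the same names. */
enum {
    CREG_C_DOTALL = 1<<0, CREG_C_ICASE = 1<<1,
    CREG_M_FULLMATCH = 1<<2, CREG_M_NEXT = 1<<3, CREG_M_STARTEND = 1<<4,
    CREG_R_STRIP = 1<<5
};

/* Hypothetical masks that group the flags by purpose. */
enum {
    EX_COMPILE_MASK = CREG_C_DOTALL | CREG_C_ICASE,
    EX_MATCH_MASK   = CREG_M_FULLMATCH | CREG_M_NEXT | CREG_M_STARTEND
};

/* Extract the compile-time part of a combined flags value. */
int ex_compile_part(int cmflags) { return cmflags & EX_COMPILE_MASK; }

/* Extract the match-time part of a combined flags value. */
int ex_match_part(int cmflags) { return cmflags & EX_MATCH_MASK; }
```

For example, `CREG_C_ICASE | CREG_M_FULLMATCH` requests a case-insensitive match that must cover the whole input.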

### Error codes
- CREG_OK = 0
- CREG_NOMATCH = -1
- CREG_MATCHERROR = -2
- CREG_OUTOFMEMORY = -3
- CREG_UNMATCHEDLEFTPARENTHESIS = -4
- CREG_UNMATCHEDRIGHTPARENTHESIS = -5
- CREG_TOOMANYSUBEXPRESSIONS = -6
- CREG_TOOMANYCHARACTERCLASSES = -7
- CREG_MALFORMEDCHARACTERCLASS = -8
- CREG_MISSINGOPERAND = -9
- CREG_UNKNOWNOPERATOR = -10
- CREG_OPERANDSTACKOVERFLOW = -11
- CREG_OPERATORSTACKOVERFLOW = -12
- CREG_OPERATORSTACKUNDERFLOW = -13

### Limits
- CREG_MAX_CLASSES
- CREG_MAX_CAPTURES

## Usage

### Compiling a regular expression
```c
cregex re1 = cregex_init();
int result = cregex_compile(&re1, "[0-9]+", CREG_DEFAULT);
if (result < 0) return result;

const char* url = "(https?://|ftp://|www\\.)([0-9A-Za-z@:%_+~#=-]+\\.)+([a-z][a-z][a-z]?)(/[/0-9A-Za-z\\.@:%_+~#=\\?&-]*)?";
cregex re2 = cregex_from(url, CREG_DEFAULT);
if (re2.error != CREG_OK)
    return re2.error;
...
cregex_drop(&re2);
cregex_drop(&re1);
```
If an error occurs, `cregex_compile` returns a negative error code, while `cregex_from` stores the code in the `error` member of the returned object (`re2.error` above).

### Getting the first match and making text replacements
```c
#define i_extern // include external utf8 and cregex functions implementation.
#include <stdio.h>
#include <stc/cregex.h>
#include <stc/cstr.h>

int main() {
    const char* input = "start date is 2023-03-01, and end date is 2025-12-31.";
    const char* pattern = "\\b(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d)\\b";

    cregex re = cregex_from(pattern, CREG_DEFAULT);

    // Let's find the first date in the string:
    csview match[4]; // full-match, year, month, day.
    if (cregex_find(&re, input, match, CREG_DEFAULT) == CREG_OK)
        printf("Found date: %.*s\n", c_ARGSV(match[0]));
    else
        printf("Could not find any date\n");

    // Let's change all dates into US date format MM/DD/YYYY:
    cstr us_input = cregex_replace(&re, input, "$2/$3/$1");
    printf("US input: %s\n", cstr_str(&us_input));

    // Free allocated data
    cstr_drop(&us_input);
    cregex_drop(&re);
}
```
For a single match you may use the all-in-one function:
```c
if (cregex_find_pattern(pattern, input, match, CREG_DEFAULT) == CREG_OK)
    printf("Found date: %.*s\n", c_ARGSV(match[0]));
```

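To compile without `#define i_extern`, add the implementation files: `gcc first_match.c src/cregex.c src/utf8code.c`. With `i_extern` defined as in the example above, the implementation is already included, so the extra source files should not be needed.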
In order to use a callback function in the replace call, see `examples/regex_replace.c`.
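
The replacement string `"$2/$3/$1"` refers to capture groups by number. As a rough illustration of that substitution step, the sketch below expands `$n` and `$nn;` references from an array of already captured strings. It is a simplified stand-in, not the cregex implementation: `ex_expand_template` is a hypothetical name, captures are plain C strings rather than `csview`, and escaping of a literal `$` is not handled.

```c
#include <stddef.h>
#include <string.h>

/* Expand "$n" (one digit) and "$nn;" (two digits plus ';') in tmpl using
   caps[0..ncap-1]; references to missing groups expand to nothing.
   Returns the length written, or -1 if out is too small. */
int ex_expand_template(const char* tmpl, const char* const caps[], int ncap,
                       char* out, size_t outsz) {
    size_t len = 0;
    while (*tmpl) {
        const char* piece = tmpl;   /* default: copy one literal byte */
        size_t n = 1;
        if (tmpl[0] == '$' && tmpl[1] >= '0' && tmpl[1] <= '9') {
            int g = tmpl[1] - '0';
            size_t used = 2;
            if (tmpl[2] >= '0' && tmpl[2] <= '9' && tmpl[3] == ';') {
                g = g * 10 + (tmpl[2] - '0');
                used = 4;
            }
            piece = g < ncap ? caps[g] : "";
            n = strlen(piece);
            tmpl += used;
        } else {
            tmpl += 1;
        }
        if (len + n >= outsz) return -1;   /* keep room for the terminator */
        memcpy(out + len, piece, n);
        len += n;
    }
    if (outsz == 0) return -1;
    out[len] = '\0';
    return (int)len;
}
```

With the captures `{"2023-03-01", "2023", "03", "01"}`, the template `"$2/$3/$1"` expands to `03/01/2023`.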

### Iterate through regex matches, *c_FORMATCH*

To iterate over multiple matches in an input string, you may use:
```c
csview match[5] = {0};
while (cregex_find(&re, input, match, CREG_M_NEXT) == CREG_OK)
    c_FORRANGE (k, cregex_captures(&re))
        printf("submatch %lld: %.*s\n", k, c_ARGSV(match[k]));
```
There is also a safe macro which simplifies this:
```c
c_FORMATCH (it, &re, input)
    c_FORRANGE (k, cregex_captures(&re))
        printf("submatch %lld: %.*s\n", k, c_ARGSV(it.match[k]));
```
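
The `CREG_M_NEXT` loop works because the previous `match[0]` tells the next call where to resume, which is why `match` is zero-initialized before the first call. The standalone sketch below mimics that convention with a hand-written digit scanner instead of a regex. `ex_view` is a hypothetical stand-in for `csview`, and `ex_next_number` is not a cregex function.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for csview: a pointer plus a length. */
typedef struct { const char* str; size_t size; } ex_view;

/* Find the next run of ASCII digits in input. A zeroed *match starts the
   search at the beginning; otherwise it resumes at the end of the
   previous match, similar in spirit to CREG_M_NEXT. */
bool ex_next_number(const char* input, ex_view* match) {
    const char* p = match->str ? match->str + match->size : input;
    while (*p && !isdigit((unsigned char)*p))
        ++p;
    if (!*p)
        return false;            /* no more matches; *match is left unchanged */
    const char* end = p;
    while (isdigit((unsigned char)*end))
        ++end;
    match->str = p;
    match->size = (size_t)(end - p);
    return true;
}
```

A loop such as `ex_view m = {0}; while (ex_next_number(text, &m)) { ... }` then visits every number in `text` once, from left to right.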

## Using cregex in a project

The easiest way is to `#define i_extern` before `#include <stc/cregex.h>`. Make sure to do that in one translation unit only.

For reference, **cregex** uses the following files:
- `stc/cregex.h`, `stc/utf8.h`, `stc/csview.h`, `stc/cstr.h`, `stc/ccommon.h`, `stc/forward.h`
- `src/cregex.c`, `src/utf8code.c`.

## Regex Cheatsheet

| Metacharacter | Description | STC addition |
|:--:|:--:|:--:|
| ***c*** | Most characters (like c) match themselves literally | |
| \\***c*** | Some characters are used as metacharacters. To use them literally, escape them | |
| . | Match any character, except newline unless in (?s) mode | |
| ? | Match the preceding token zero or one time | |
| * | Match the preceding token zero or more times, as often as possible | |
| + | Match the preceding token at least once and as often as possible | |
| \| | Match either the expression before the \| or the expression after it | |
| (***expr***) | Match the expression inside the parentheses. ***This adds a capture group*** | |
| [***chars***] | Match any character inside the brackets. Ranges like a-z may also be used | |
| \[^***chars***\] | Match any character not inside the bracket. | |
| \x{***hex***} | Match UTF8 character/codepoint given as a hex number | * |
| ^ | Start of line anchor | |
| $ | End of line anchor | |
| \A | Start of input anchor | * |
| \Z | End of input anchor | * |
| \z | End of input including optional newline | * |
| \b | UTF8 word boundary anchor | * |
| \B | Not UTF8 word boundary | * |
| \Q | Start literal input mode | * |
| \E | End literal input mode | * |
| (?i) (?-i)  | Ignore case on/off (override global) | * |
| (?s) (?-s)  | Dot matches newline on/off (override global) | * |
| \n \t \r | Match UTF8 newline, tab, carriage return | |
| \d \s \w | Match UTF8 digit, whitespace, alphanumeric character | |
| \D \S \W | Do not match the groups described above | |
| \p{Alnum} | Match UTF8 alphanumeric | * |
| \p{XDigit} | Match UTF8 hex digit | * |
| \p{Alpha} or \p{LC} | Match UTF8 cased letter | * |
| \p{Digit} or \p{Nd} | Match UTF8 numeric | * |
| \p{Lower} or \p{Ll} | Match UTF8 lower case | * |
| \p{Upper} or \p{Lu} | Match UTF8 upper case | * |
| \p{Space} or \p{Sz} | Match UTF8 whitespace | * |
| \P{***Class***} | Do not match the classes described above | * |
| [:alnum:] [:alpha:] [:ascii:] | Match ASCII character class. NB: only to be used inside [] brackets | * |
| [:blank:] [:cntrl:] [:digit:] | " | * |
| [:graph:] [:lower:] [:print:] | " | * |
| [:punct:] [:space:] [:upper:] | " | * |
| [:xdigit:] [:word:] | " | * |
| [:^***class***:] | Match character not in the ASCII class | * |
| $***n*** | *n*-th substitution backreference to capture group. ***n*** in 0-9. $0 is the entire match. | * |
| $***nn;*** | As above, but can handle ***nn*** < CREG_MAX_CAPTURES. | * |

## Limitations

The main goal of **cregex** is to be small and fast with limited but useful Unicode support. In order to reach these goals, **cregex** currently does not support the following features (non-exhaustive list):
- In order to limit table sizes, most general UTF8 character classes are missing, like \p{L}, \p{S}, and all specific scripts like \p{Greek} etc. Some/all of these may be added in the future as an alternative source file with Unicode tables to link with.
- {n, m} syntax for repeating the previous token between n and m times.
- Non-capturing groups
- Lookaround and backreferences
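
Since `{n, m}` is not available, bounded repetition has to be spelled out, as the examples above already do with `\d\d\d\d` and `[a-z][a-z][a-z]?`. The sketch below generates such a pattern. `ex_expand_repeat` is a hypothetical helper, not part of cregex, and it assumes the repeated token is a single atom (a character, an escape such as `\d`, or a bracket class), because wrapping a longer expression in parentheses would add a capture group.

```c
#include <stddef.h>
#include <string.h>

/* Emulate atom{min,max}: write atom `min` times, then "atom?" for each of
   the remaining (max - min) optional repetitions.
   Returns the pattern length, or -1 on bad arguments or a too-small buffer. */
int ex_expand_repeat(const char* atom, int min, int max, char* out, size_t outsz) {
    size_t alen = strlen(atom), len = 0;
    if (min < 0 || max < min || alen == 0)
        return -1;
    for (int i = 0; i < max; ++i) {
        size_t need = alen + (i >= min ? 1 : 0);
        if (len + need >= outsz)
            return -1;               /* keep room for the terminator */
        memcpy(out + len, atom, alen);
        len += alen;
        if (i >= min)
            out[len++] = '?';        /* optional repetition */
    }
    if (outsz == 0)
        return -1;
    out[len] = '\0';
    return (int)len;
}
```

For instance, `ex_expand_repeat("[a-z]", 2, 3, buf, sizeof buf)` produces `[a-z][a-z][a-z]?`, the top-level-domain part of the URL pattern shown earlier.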

If you need a more feature-complete but bigger library, use [RE2 with a C wrapper](https://github.com/google/re2), which uses the same type of regex engine as **cregex**, or use [PCRE2](https://www.pcre.org/).