ICU Regular Expression Syntax

For your convenience, the regular expression syntax from the ICU documentation is included below. When in doubt, you should refer to the official ICU User Guide - Regular Expressions documentation page.

Metacharacters
Character                   Description
\a                          Match a BELL, \u0007.
\A                          Match at the beginning of the input. Differs from ^ in that \A will not match after a new-line within the input.
\b, outside of a [Set]      Match if the current position is a word boundary. Boundaries occur at the transitions between word \w and non-word \W characters, with combining marks ignored.
\b, within a [Set]          Match a BACKSPACE, \u0008.
\B                          Match if the current position is not a word boundary.
\cx                         Match a Control-x character.
\d                          Match any character with the Unicode General Category of Nd (Number, Decimal Digit).
\D                          Match any character that is not a decimal digit.
\e                          Match an ESCAPE, \u001B.
\E                          Terminates a \Q\E quoted sequence.
\f                          Match a FORM FEED, \u000C.
\G                          Match if the current position is at the end of the previous match.
\n                          Match a LINE FEED, \u000A.
\N{Unicode Character Name}  Match the named Unicode Character.
\p{Unicode Property Name}   Match any character with the specified Unicode Property.
\P{Unicode Property Name}   Match any character not having the specified Unicode Property.
\Q                          Quotes all following characters until \E.
\r                          Match a CARRIAGE RETURN, \u000D.
\s                          Match a white space character. White space is defined as [\t\n\f\r\p{Z}].
\S                          Match a non-white space character.
\t                          Match a HORIZONTAL TABULATION, \u0009.
\uhhhh                      Match the character with the hex value hhhh.
\Uhhhhhhhh                  Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\w                          Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
\W                          Match a non-word character.
\x{hhhh}                    Match the character with hex value hhhh. From one to six hex digits may be supplied.
\xhh                        Match the character with two digit hex value hh.
\X                          Match a Grapheme Cluster.
\Z                          Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z                          Match if the current position is at the end of input.
\n (n is a number)          Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ total number of capture groups in the pattern.
                            Note: Octal escapes, such as \012, are not supported.
[pattern]                   Match any one character from the set. See ICU Regular Expression Character Classes for a full description of what may appear in the pattern.
.                           Match any character.
^                           Match at the beginning of a line.
$                           Match at the end of a line.
\                           Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /
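
A few of the metacharacters above can be tried out with a short sketch. It uses Python's standard re module rather than ICU itself, on the assumption that the two engines agree on these particular constructs; note that in Python ^ only matches after a new-line when re.MULTILINE is set. The sample strings are made up for illustration.

```python
import re

text = "cat concat\ncat"

# \b matches only at word boundaries, so the "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", text)

# With multi-line matching on, ^ matches after each new-line; \A never does.
line_starts = [m.start() for m in re.finditer(r"^cat", text, re.MULTILINE)]
input_starts = [m.start() for m in re.finditer(r"\Acat", text, re.MULTILINE)]

# \1 must match the same text that capture group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it is is here")

print(whole_words)       # ['cat', 'cat']
print(line_starts)       # [0, 11]
print(input_starts)      # [0]
print(doubled.group(1))  # is
```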
Operators
Operator          Description
|                 Alternation. A|B matches either A or B.
*                 Match zero or more times. Match as many times as possible.
+                 Match one or more times. Match as many times as possible.
?                 Match zero or one times. Prefer one.
{n}               Match exactly n times.
{n,}              Match at least n times. Match as many times as possible.
{n,m}             Match between n and m times. Match as many times as possible, but not more than m.
*?                Match zero or more times. Match as few times as possible.
+?                Match one or more times. Match as few times as possible.
??                Match zero or one times. Prefer zero.
{n}?              Match exactly n times.
{n,}?             Match at least n times, but no more than required for an overall pattern match.
{n,m}?            Match between n and m times. Match as few times as possible, but not less than n.
*+                Match zero or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails. Possessive match.
++                Match one or more times. Possessive match.
?+                Match zero or one times. Possessive match.
{n}+              Match exactly n times. Possessive match.
{n,}+             Match at least n times. Possessive match.
{n,m}+            Match between n and m times. Possessive match.
()                Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
(?:)              Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
(?>)              Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the (?>.
(?#)              Free-format comment (?#comment).
(?=)              Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?!)              Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<=)             Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?<!)             Negative Look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?ismwx-ismwx:)   Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
(?ismwx-ismwx)    Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.
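
The difference between "as many as possible" and "as few as possible", and the way look-around assertions leave the input position untouched, may be easier to see in a small sketch. It uses Python's re module as a stand-in for ICU, which is an assumption: the operators shown here are spelled the same way in both, but other rows of the table (for example the w flag) have no Python equivalent. The input strings are invented for the example.

```python
import re

markup = "<b>one</b><b>two</b>"

# Greedy * takes as much as it can: a single match spanning both elements.
greedy = re.findall(r"<b>.*</b>", markup)

# Reluctant *? takes as little as it can: one match per element.
lazy = re.findall(r"<b>.*?</b>", markup)

# Look-ahead checks for "px" without consuming it.
ahead = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Look-behind checks that a "$" comes just before the digits.
behind = re.findall(r"(?<=\$)\d+", "cost $15, qty 3")

# (?i) switches to case insensitive matching for the rest of the pattern.
flagged = re.search(r"(?i)icu", "ICU User Guide").group()

print(greedy)   # ['<b>one</b><b>two</b>']
print(lazy)     # ['<b>one</b>', '<b>two</b>']
print(ahead)    # ['10', '30']
print(behind)   # ['15']
print(flagged)  # ICU
```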

ICU Regular Expression Character Classes

The following is adapted from the ICU User Guide - UnicodeSet to fit the needs of this documentation. Specifically, the ICU UnicodeSet documentation describes an ICU C++ object, UnicodeSet. Here the term UnicodeSet has been replaced with Character Class, which is more appropriate in the context of regular expressions. As always, you should refer to the original, official documentation when in doubt.

Overview

A character class is a regular expression pattern that represents a set of Unicode characters or character strings. The following table contains some example character class patterns:

Pattern         Description
[a-z]           The lower case letters a through z
[abc123]        The six characters a, b, c, 1, 2, and 3
[\p{Letter}]    All characters with the Unicode General Category of Letter.
String Values

In addition to being a set of Unicode code point characters, a character class may also contain string values. Conceptually, a character class is always a set of strings, not a set of characters. Historically, regular expressions have treated [] character classes as being composed of single characters only, which is equivalent to a string that contains only a single character.

Character Class Patterns

Patterns are a series of characters bounded by square brackets that contain lists of characters and Unicode property sets. Lists are a sequence of characters that may have ranges indicated by a - between two characters, as in a-z. The sequence specifies the range of all characters from the left to the right, in Unicode order. For example, [a c d-f m] is equivalent to [a c d e f m]. Whitespace can be freely used for clarity as [a c d-f m] means the same as [acd-fm].

Unicode property sets are specified by a Unicode property, such as [:Letter:]. ICU version 2.0 supports General Category, Script, and Numeric Value properties (ICU will support additional properties in the future). For a list of the property names, see the end of this section. The syntax for specifying the property names is an extension of either POSIX or Perl syntax with the addition of =value. For example, you can match letters by using the POSIX syntax [:Letter:], or by using the Perl syntax \p{Letter}. The type can be omitted for the Category and Script properties, but is required for other properties.

The following table lists the standard and negated forms for specifying Unicode properties in both POSIX and Perl syntax. The negated form specifies a character class that includes everything but the specified property. For example, [:^Letter:] matches all characters that are not [:Letter:].

Syntax Style    Standard          Negated
POSIX           [:type=value:]    [:^type=value:]
Perl            \p{type=value}    \P{type=value}

Character classes can be modified using the standard set operations: Union, Inverse, Difference, and Intersection.

The binary operators & and - have equal precedence and bind left-to-right. Thus [[:letter:]-[a-z]-[\u0100-\u01FF]] is equivalent to [[[:letter:]-[a-z]]-[\u0100-\u01FF]]. As another example, the set [[ace][bdf] - [abc][def]] is not the empty set but the set [def]: the union of [ace] and [bdf] is formed first, [abc] is then subtracted from it, and [def] is finally added to what remains. The order only really matters for the difference operation, as the intersection operation is commutative.

Another caveat with the & and - operators is that they operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of uppercase letters except for A, enclose the A in a set: [[:Lu:]-[A]].

Pattern           Description
[a]               The set containing a.
[a-z]             The set containing a through z and all letters in between, in Unicode order.
[^a-z]            The set containing all characters but a through z, that is, U+0000 through a-1 and z+1 through U+10FFFF.
[[pat1][pat2]]    The union of sets specified by pat1 and pat2.
[[pat1]&[pat2]]   The intersection of sets specified by pat1 and pat2.
[[pat1]-[pat2]]   The asymmetric difference of sets specified by pat1 and pat2.
[:Lu:]            The set of characters belonging to the given Unicode category. In this case, Unicode uppercase letters. The long form for this is [:UppercaseLetter:].
[:L:]             The set of characters belonging to all Unicode categories starting with L, that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:Letter:].
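
The left-to-right evaluation described above can be checked with ordinary set arithmetic. The sketch below models character classes as Python sets of single characters; it does not call ICU, and the restriction of [:Lu:] to the ASCII range is an assumption made only to keep the example small.

```python
import unicodedata

# [[ace][bdf] - [abc][def]] read left to right: union, then difference, then union.
left_to_right = ((set("ace") | set("bdf")) - set("abc")) | set("def")

# Subtracting the whole of [abc][def] at once would give the empty set instead.
grouped = (set("ace") | set("bdf")) - (set("abc") | set("def"))

# [[:Lu:]-[A]]: uppercase letters except A (ASCII only in this sketch).
ascii_upper = {chr(c) for c in range(128) if unicodedata.category(chr(c)) == "Lu"}
upper_except_a = ascii_upper - {"A"}

print(sorted(left_to_right))  # ['d', 'e', 'f']
print(sorted(grouped))        # []
print(len(upper_except_a))    # 25
```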
String Values in Character Classes

String values are enclosed in {curly brackets}. For example:

Pattern               Description
[abc{def}]            A set containing four members, the single characters a, b, and c and the string def
[{abc}{def}]          A set containing two members, the string abc and the string def.
[{a}{b}{c}], [abc]    These two sets are equivalent. Each contains three items, the three individual characters a, b, and c. A {string} containing a single character is equivalent to that same character specified in any other way.

Character Quoting and Escaping in ICU Character Class Patterns

Single Quote

Two single quotes represent a single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way, except for two adjacent single quotes. It is taken as literal text, so special characters lose their special meaning. These quoting conventions for ICU character classes differ from those of Perl or Java. In those environments, single quotes have no special meaning, and are treated like any other literal character.

Backslash Escapes

Outside of single quotes, certain backslashed characters have special meaning:

Pattern       Description
\uhhhh        Exactly 4 hex digits; h in [0-9A-Fa-f]
\Uhhhhhhhh    Exactly 8 hex digits
\xhh          1-2 hex digits
\ooo          1-3 octal digits; o in [0-7]
\a            U+0007 BELL
\b            U+0008 BACKSPACE
\t            U+0009 HORIZONTAL TAB
\n            U+000A LINE FEED
\v            U+000B VERTICAL TAB
\f            U+000C FORM FEED
\r            U+000D CARRIAGE RETURN
\\            U+005C BACKSLASH

Anything else following a backslash is mapped to itself, except in an environment where it is defined to have some special meaning. For example, \p{Lu} is the set of uppercase letters. Any character formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \u and \U escapes create literal characters.

Whitespace

Whitespace, as defined by the ICU API, is ignored unless it is quoted or backslashed.

Property Values

The following property value styles are recognized:

Style     Description
Short     Omits the type. Allowed only with the Category and Script properties, where leaving the type out is unambiguous.
Medium    Uses an abbreviated type and value.
Long      Uses a full type and value.

If the type or value is omitted, then the = equals sign is also omitted. The short style is only used for Category and Script properties because these properties are very common and their omission is unambiguous.

In actual practice, you can mix type names and values that are omitted, abbreviated, or full. For example, the Unassigned General Category can be written as \p{gc=Unassigned}, \p{Category=Cn}, or \p{Unassigned}.

When these are processed, case and whitespace are ignored so you may use them for clarity, if desired. For example, \p{Category = Uppercase Letter} or \p{Category = uppercase letter}.
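
One way to picture this loose matching is as a normalization step applied before the names are looked up. The sketch below is only an illustration of that idea, not ICU's implementation: the helpers normalize and parse_property are hypothetical names, and they apply just the two rules stated above (ignore case, ignore whitespace).

```python
def normalize(name):
    """Drop all whitespace and fold case, as the text above describes."""
    return "".join(name.split()).lower()


def parse_property(spec):
    """Split 'type = value' into a normalized (type, value) pair.

    The type is None when it is omitted (the short style).
    """
    if "=" in spec:
        prop_type, value = spec.split("=", 1)
        return normalize(prop_type), normalize(value)
    return None, normalize(spec)


print(parse_property("Category = Uppercase Letter"))  # ('category', 'uppercaseletter')
print(parse_property("category=uppercase letter"))    # ('category', 'uppercaseletter')
print(parse_property("Unassigned"))                   # (None, 'unassigned')
```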

For a list of properties supported by ICU, see ICU User Guide - Unicode Properties.

Unicode Properties

The following tables list some of the commonly used Unicode Properties, which can be matched in a regular expression with \p{Property}. The tables were created from the Unicode 5.2 Unicode Character Database, which is the version used by ICU that ships with Mac OS X 10.6.

Category
L     Letter
LC    CasedLetter
Lu    UppercaseLetter
Ll    LowercaseLetter
Lt    TitlecaseLetter
Lm    ModifierLetter
Lo    OtherLetter

P     Punctuation
Pc    ConnectorPunctuation
Pd    DashPunctuation
Ps    OpenPunctuation
Pe    ClosePunctuation
Pi    InitialPunctuation
Pf    FinalPunctuation
Po    OtherPunctuation

N     Number
Nd    DecimalNumber
Nl    LetterNumber
No    OtherNumber

M     Mark
Mn    NonspacingMark
Mc    SpacingMark
Me    EnclosingMark

S     Symbol
Sm    MathSymbol
Sc    CurrencySymbol
Sk    ModifierSymbol
So    OtherSymbol

Z     Separator
Zs    SpaceSeparator
Zl    LineSeparator
Zp    ParagraphSeparator

C     Other
Cc    Control
Cf    Format
Cs    Surrogate
Co    PrivateUse
Cn    Unassigned
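
The two-letter codes in the Category table are the same General Category values that Python's unicodedata module reports, which makes it easy to look up where a given character falls. The helper has_category below is a hypothetical name used for illustration; it treats a one-letter code as "any category starting with that letter" and LC as Lu, Ll, or Lt. Python ships its own copy of the Unicode Character Database, so results for recently added characters may differ from a particular ICU build.

```python
import unicodedata


def has_category(ch, code):
    """Return True if ch is in the General Category named by code ('Lu', 'L', 'LC', ...)."""
    actual = unicodedata.category(ch)
    if code == "LC":
        # CasedLetter groups the three cased letter categories.
        return actual in ("Lu", "Ll", "Lt")
    # 'L' matches Lu, Ll, Lt, Lm, and Lo; 'Lu' matches only itself.
    return actual.startswith(code)


print(unicodedata.category("A"))  # Lu
print(unicodedata.category("5"))  # Nd
print(unicodedata.category("$"))  # Sc
print(has_category("a", "L"))     # True
print(has_category("5", "L"))     # False
```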
Script
Arabic          Armenian        Balinese
Bengali         Bopomofo        Braille
Buginese        Buhid           Canadian_Aboriginal
Carian          Cham            Cherokee
Common          Coptic          Cuneiform
Cypriot         Cyrillic        Deseret
Devanagari      Ethiopic        Georgian
Glagolitic      Gothic          Greek
Gujarati        Gurmukhi        Han
Hangul          Hanunoo         Hebrew
Hiragana        Inherited       Kannada
Katakana        Kayah_Li        Kharoshthi
Khmer           Lao             Latin
Lepcha          Limbu           Linear_B
Lycian          Lydian          Malayalam
Mongolian       Myanmar         New_Tai_Lue
Nko             Ogham           Ol_Chiki
Old_Italic      Old_Persian     Oriya
Osmanya         Phags_Pa        Phoenician
Rejang          Runic           Saurashtra
Shavian         Sinhala         Sundanese
Syloti_Nagri    Syriac          Tagalog
Tagbanwa        Tai_Le          Tamil
Telugu          Thaana          Thai
Tibetan         Tifinagh        Ugaritic
Unknown         Vai             Yi
Extended Property Class
ASCII_Hex_Digit                       Alphabetic
Bidi_Control                          Dash
Default_Ignorable_Code_Point          Deprecated
Diacritic                             Extender
Grapheme_Base                         Grapheme_Extend
Grapheme_Link                         Hex_Digit
Hyphen                                IDS_Binary_Operator
IDS_Trinary_Operator                  ID_Continue
ID_Start                              Ideographic
Join_Control                          Logical_Order_Exception
Lowercase                             Math
Noncharacter_Code_Point               Other_Alphabetic
Other_Default_Ignorable_Code_Point    Other_Grapheme_Extend
Other_ID_Continue                     Other_ID_Start
Other_Lowercase                       Other_Math
Other_Uppercase                       Pattern_Syntax
Pattern_White_Space                   Quotation_Mark
Radical                               STerm
Soft_Dotted                           Terminal_Punctuation
Unified_Ideograph                     Uppercase
Variation_Selector                    White_Space
XID_Continue                          XID_Start

Unicode Character Database

Unicode properties are defined in the Unicode Character Database, or UCD. From time to time the UCD is revised and updated. The properties available, and the definition of the characters they match, depend on the UCD that ICU was built with.

Note:

In general, the ICU and UCD versions change with each major operating system release.

ICU Replacement Text Syntax

Replacement Text Syntax
Character   Description
$n          The text of capture group n will be substituted for $n. n must be ≥ 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, a $.
\           Treat the character following the backslash as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for $ and \, but may precede any character. The backslash itself will not be copied to the substitution text.
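
The two rules in the table can be sketched as a small template expander. This is a hedged illustration written against Python's re match objects, not ICU's own code: expand_template is a hypothetical helper, Python's built-in re.sub uses \1 rather than $n, and the handling of multi-digit group numbers (take further digits only while the number stays within the group count) is an assumption the table does not spell out.

```python
import re

DIGITS = "0123456789"


def expand_template(match, template):
    """Expand $n group references and backslash escapes in template."""
    out = []
    group_count = match.re.groups
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            # Copy the escaped character literally; the backslash is dropped.
            out.append(template[i + 1])
            i += 2
        elif ch == "$" and i + 1 < len(template) and template[i + 1] in DIGITS:
            n = int(template[i + 1])
            if n > group_count:
                raise IndexError("no such capture group: %d" % n)
            j = i + 2
            # Assumption: extend the number only while it names a real group.
            while (j < len(template) and template[j] in DIGITS
                   and n * 10 + int(template[j]) <= group_count):
                n = n * 10 + int(template[j])
                j += 1
            out.append(match.group(n) or "")
            i = j
        else:
            # Any other character, including a $ not followed by a digit.
            out.append(ch)
            i += 1
    return "".join(out)


m = re.search(r"(\w+)@(\w+)", "user@example")
swapped = expand_template(m, r"$2 at $1 (\$5)")
print(swapped)  # example at user ($5)
```

The same helper can be passed to re.sub as a callable, for example re.sub(pattern, lambda m: expand_template(m, template), text), to apply a $n style template to every match.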