regExp ::= branch ( '|' branch )*

branch ::= piece*

piece ::= atom quantifier?

quantifier ::= [?*+] | ( '{' quantity '}' )

quantity ::= quantRange | quantMin | QuantExact

quantRange ::= QuantExact ',' QuantExact

quantMin ::= QuantExact ','

QuantExact ::= [0-9]+

atom ::= Char | charClass | ( '(' regExp ')' )

Char ::= [^.\?*+()|#x5B#x5D]

charClass ::= charClassEsc | charClassExpr

charClassExpr ::= '[' charGroup ']'

charGroup ::= posCharGroup | negCharGroup | charClassSub

posCharGroup ::= ( charRange | charClassEsc ) +

negCharGroup ::= '^' posCharGroup

charClassSub ::= ( posCharGroup | negCharGroup ) '-' charClassExpr

charRange ::= seRange | XmlCharRef | XmlCharIncDash

seRange ::= charOrEsc '-' charOrEsc

XmlCharRef ::= ( '&#' [0-9]+ ';' ) | ('&#x' [0-9a-fA-F]+ ';' )

charOrEsc ::= XmlChar | SingleCharEsc

XmlChar ::= [^\#x2D#x5B#x5D]

XmlCharIncDash ::= [^\#x5B#x5D]

 

charClassEsc ::= ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]

catEsc ::= '\p{' charProp '}'

complEsc ::= '\P{' charProp '}'

charProp ::= IsCategory | IsBlock

IsCategory ::= Letters | Marks | Numbers | Punctuation | Separators | Symbols | Others

Letters ::= 'L' [ultmo]?

Marks ::= 'M' [nce]?

Numbers ::= 'N' [dlo]?

Punctuation ::= 'P' [cdseifo]?

Separators ::= 'Z' [slp]?

Symbols ::= 'S' [mcko]?

Others ::= 'C' [cfon]?

IsBlock ::= 'Is' [a-zA-Z0-9#x2D]+

MultiCharEsc ::= '.' | ( '\' [sSiIcCdDwW] )
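
As a rough illustration of the quantifier productions above ('?', '*', '+', and the braced forms QuantExact, quantMin, and quantRange), the following Python sketch uses the standard 're' module, whose quantifier syntax agrees with this part of the grammar. It assumes that 're.fullmatch' is an acceptable stand-in for schema validation, because an XML Schema pattern must match the entire value. The helper name 'matches_whole' is hypothetical.

```python
import re

def matches_whole(pattern: str, value: str) -> bool:
    # XML Schema patterns are implicitly anchored; fullmatch emulates that.
    return re.fullmatch(pattern, value) is not None

# QuantExact: exactly three digits.
print(matches_whole(r"[0-9]{3}", "123"))      # True
print(matches_whole(r"[0-9]{3}", "12"))       # False
# quantMin: two or more digits.
print(matches_whole(r"[0-9]{2,}", "12345"))   # True
# quantRange: between two and four digits.
print(matches_whole(r"[0-9]{2,4}", "12345"))  # False
```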


XML Schema - Regular Expressions

A regular expression is a pattern for identifying a range of string values. This pattern conforms to a specific grammar. The Schema Recommendation suggests that an XML validator should implement "Level 1" regular expressions as defined in the Unicode Regular Expression Guidelines.

In this text, the term expression (without "regular") indicates a regular expression snippet, or a subset of a regular expression. An expression may match one or many characters. An expression may comprise an entire regular expression.

XML Schema regular expressions are similar to other well-known regular expression dialects, such as those found in UNIX tools or Perl.
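
Because the basic metacharacters are shared with other dialects, simple patterns can be tried out in a general-purpose regex engine. The sketch below uses Python's 're' module as a stand-in, which is an assumption: Python is not an XML Schema validator, and the two dialects differ in other areas (for example, Python's 're' has no '\p{...}' escapes). The names 'SAMPLES' and 'check_samples' are hypothetical.

```python
import re

# Pattern -> sample values that the pattern is expected to match in full.
SAMPLES = {
    r"a.c": ["aXc", "a9c"],
    r"\*\d*\*": ["*1234*"],
    r"ab?c": ["ac", "abc"],
    r"ab*c": ["ac", "abc", "abbbbbc"],
    r"ab+c": ["abc", "abbbbbc"],
    r"ab|cd": ["ab", "cd"],
    r"a(b|c)d": ["abd", "acd"],
    r"xx[A-Z]*xx": ["xxABCDxx"],
}

def check_samples(samples):
    """Return {(pattern, value): matched} using whole-string matching."""
    return {
        (pattern, value): re.fullmatch(pattern, value) is not None
        for pattern, values in samples.items()
        for value in values
    }

results = check_samples(SAMPLES)
for (pattern, value), matched in results.items():
    print(f"{pattern!r} vs {value!r}: {matched}")
```

Whole-string matching is used here because an XML Schema pattern has to match the complete value, not just a substring of it.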


Examples:

  • An XML schema
  • A corresponding XML instance

 

XML Schema - Regular Expression - Meta Characters

 

Metacharacter  Regular Expression  Sample Match            Description
.              a.c                 "aXc", "a9c"            Match any character as defined by The Unicode Standard.
\              \*\d*\*             "*1234*"                Precedes a metacharacter (to specify that character) or specifies a single- or multiple-character escape sequence.
?              ab?c                "ac", "abc"             Zero or one occurrences.
*              ab*c                "ac", "abc", "abbbbbc"  Zero or more occurrences.
+              ab+c                "abc", "abbbbbc"        One or more occurrences.
|              ab|cd               "ab", "cd"              The "or" operator.
(              a(b|c)d             "abd", "acd"            Start grouping.
)              a(b|c)d             "abd", "acd"            End grouping.
[              xx[A-Z]*xx          "xxABCDxx"              Start range.
]              xx[A-Z]*xx          "xxABCDxx"              End range.

 

XML Schema - Regular Expressions - Individual Characters

XML Schema - Regular Expressions - Normal Characters

A regular expression that contains ordinary characters, such as those one normally types on a keyboard (e.g., 'qwerty'), matches exactly those characters. The rest of this page describes special escape sequences and more. Be aware that negating a character class of "normal" characters might produce surprising results. For example, the regular expression '[^A-Z]' matches, among other things, a Greek or Japanese letter.
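
The surprise with negated classes can be reproduced in a few lines. This is a minimal sketch using Python's 're' module as a stand-in for a schema validator (an assumption); the sample characters and the names 'candidates' and 'results' are illustrative only.

```python
import re

pattern = r"[^A-Z]"

# Characters are written as escapes so the snippet does not depend on file encoding.
candidates = {
    "Q": "Latin capital letter Q",
    "q": "Latin small letter q",
    "\u03bb": "Greek small letter lambda",
    "\u65e5": "CJK ideograph used in Japanese",
}

# True when the single character is matched by the negated class.
results = {ch: re.fullmatch(pattern, ch) is not None for ch in candidates}

for ch, label in candidates.items():
    print(f"{label}: {results[ch]}")
```

Only the uppercase Latin letter is rejected; everything outside 'A' through 'Z', including letters from other scripts, is accepted.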

XML Schema - Regular Expressions - Single Character Escape Sequence

 

Single Character Escape Sequence  Description
\n                                New line character (&#xA;): line feed
\r                                Return character (&#xD;): carriage return
\t                                Tab character (&#x9;)
\\                                \
\|                                |
\.                                .
\-                                -
\^                                ^
\?                                ?
\*                                *
\+                                +
\{                                {
\}                                }
\(                                (
\)                                )
\[                                [
\]                                ]
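
When a literal string has to be embedded in a pattern, each character from the escape list above needs a leading backslash. The following sketch shows one way to do that; the helper 'escape_xsd_literal' is hypothetical and is not part of any XML library. Python's own 're.escape' is deliberately not used, because it also escapes characters (such as the space) that have no single character escape in this list.

```python
import re

# Characters that have a single character escape (besides \n, \r, \t).
XSD_ESCAPED = set("\\|.-^?*+{}()[]")

def escape_xsd_literal(text: str) -> str:
    """Prefix every metacharacter in text with a backslash."""
    return "".join("\\" + ch if ch in XSD_ESCAPED else ch for ch in text)

literal = "1.5*(a|b)"
escaped = escape_xsd_literal(literal)
print(escaped)

# The escaped pattern matches the original text and nothing looser.
print(re.fullmatch(escaped, literal) is not None)
print(re.fullmatch(escaped, "1x5*(a|b)") is not None)
```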

 

XML Schema - Regular Expressions - Multiple Character Escape Sequences

Multiple Character Escape Sequences  Description
.                                    Any character except '\n' (newline) and '\r' (return).
\s                                   Whitespace, specifically ' ' (space), '\t' (tab), '\n' (newline), and '\r' (return).
\S                                   Any character except those matched by '\s'.
\i                                   The first character in an XML identifier. Specifically, any letter, the character '_', or the character ':'. See the XML Recommendation for the complex specification of a letter. This escape matches a subset of the characters matched by '\c'.
\I                                   Any character except those matched by '\i'.
\c                                   Any character that might appear in the built-in NMTOKEN datatype. See the XML Recommendation for the complex specification of a NameChar.
\C                                   Any character except those matched by '\c'.
\d                                   Any decimal digit. A shortcut for '\p{Nd}'.
\D                                   Any character except those matched by '\d'.
\w                                   Any character that might appear in a word. A shortcut for '[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]' (all characters except the set of "punctuation", "separator", and "other" characters).
\W                                   Any character except those matched by '\w'.
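
The definition of '\w' given above differs from the one used by some other regex engines. The sketch below approximates the schema definition with Python's 'unicodedata' module and contrasts it with the '\w' of Python's 're' module. The function name 'is_xsd_word_char' is hypothetical, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import re
import unicodedata

def is_xsd_word_char(ch: str) -> bool:
    # Approximates the schema word escape: everything except the
    # punctuation (P), separator (Z), and other (C) categories.
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")

for ch in ["a", "9", "_", "+", " "]:
    python_w = re.fullmatch(r"\w", ch) is not None
    print(f"{ch!r}: schema-style={is_xsd_word_char(ch)} python-re={python_w}")
```

Under this definition the underscore (category Pc) is not a word character, while a math symbol such as '+' (category Sm) is; Python's 're' treats both the other way around.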

 

XML Schema - Regular Expressions - Character Categories

A regular expression can match a character by using a character category. The expression can be inclusive or exclusive of the character category. A character category must be written as an escape sequence. An inclusive character category that represents any uppercase letter looks like the following:

   \p{Lu}

An exclusive category that represents any character except an uppercase letter looks like the following:

   \P{Lu}

Note that inclusive requires a lowercase 'p', whereas exclusive requires an uppercase 'P'.

 

Character Category  Description                                                           Notes
L                   Letter, Any
Lu                  Letter, Uppercase
Ll                  Letter, Lowercase
Lt                  Letter, Titlecase
Lm                  Letter, Modifier
Lo                  Letter, Other
L&                  Letter, uppercase, lowercase, and titlecase letters (Lu, Ll, and Lt)  Optional in The Unicode Standard; not supported by the Schema Recommendation.
M                   Mark, Any
Mn                  Mark, Nonspacing
Mc                  Mark, Spacing Combining
Me                  Mark, Enclosing
N                   Number, Any
Nd                  Number, Decimal Digit
Nl                  Number, Letter
No                  Number, Other
P                   Punctuation, Any
Pc                  Punctuation, Connector
Pd                  Punctuation, Dash
Ps                  Punctuation, Open
Pe                  Punctuation, Close
Pi                  Punctuation, Initial quote (may behave like Ps or Pe, depending on usage)
Pf                  Punctuation, Final quote (may behave like Ps or Pe, depending on usage)
Po                  Punctuation, Other
S                   Symbol, Any
Sm                  Symbol, Math
Sc                  Symbol, Currency
Sk                  Symbol, Modifier
So                  Symbol, Other
Z                   Separator, Any
Zs                  Separator, Space
Zl                  Separator, Line
Zp                  Separator, Paragraph
C                   Other, Any
Cc                  Other, Control
Cf                  Other, Format
Cs                  Other, Surrogate (not supported by Schema Recommendation).            Explicitly not supported by Schema Recommendation.
Co                  Other, Private Use
Cn                  Other, Not Assigned (no characters in the file have this property).
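
Python's 're' module does not understand '\p{...}', so the inclusive and exclusive category escapes can only be approximated there. The sketch below does so with 'unicodedata.category', which returns two-letter codes of the kind used for character categories. The function names are hypothetical, and the exact category of a given character depends on the Unicode version shipped with the interpreter.

```python
import unicodedata

def matches_category(ch: str, category: str) -> bool:
    # Emulates the inclusive form: a one-letter category such as "L"
    # covers all of its two-letter subcategories (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith(category)

def matches_excluded_category(ch: str, category: str) -> bool:
    # Emulates the exclusive form (uppercase 'P'): everything else.
    return not matches_category(ch, category)

print(matches_category("A", "Lu"))           # True
print(matches_category("a", "Lu"))           # False
print(matches_category("a", "L"))            # True
print(matches_excluded_category("a", "Lu"))  # True
```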

 

 

XML Schema - Regular Expressions - Character Blocks

The Unicode Standard supports character blocks. A block is a range of characters set aside for a specific purpose. Some examples of these blocks are the characters for a language (such as Greek), the Braille character set, and various drawing symbols. The XML Schema Recommendation provides a regular expression mechanism for identifying characters that belong to a specific block of interest. The syntax for identifying a block is '\p{IsBlockName}', where 'BlockName' is a name from Table 14.13. Like the character categories, an uppercase 'P' (as in '\P{IsBlockName}') excludes the characters in that block.

 

Block Name                            Start Code  End Code
BasicLatin                            #x0000      #x007F
Latin-1Supplement                     #x0080      #x00FF
LatinExtended-A                       #x0100      #x017F
LatinExtended-B                       #x0180      #x024F
IPAExtensions                         #x0250      #x02AF
SpacingModifierLetters                #x02B0      #x02FF
CombiningDiacriticalMarks             #x0300      #x036F
Greek                                 #x0370      #x03FF
Cyrillic                              #x0400      #x04FF
Armenian                              #x0530      #x058F
Hebrew                                #x0590      #x05FF
Arabic                                #x0600      #x06FF
Syriac                                #x0700      #x074F
Thaana                                #x0780      #x07BF
Devanagari                            #x0900      #x097F
Bengali                               #x0980      #x09FF
Gurmukhi                              #x0A00      #x0A7F
Gujarati                              #x0A80      #x0AFF
Oriya                                 #x0B00      #x0B7F
Tamil                                 #x0B80      #x0BFF
Telugu                                #x0C00      #x0C7F
Kannada                               #x0C80      #x0CFF
Malayalam                             #x0D00      #x0D7F
Sinhala                               #x0D80      #x0DFF
Thai                                  #x0E00      #x0E7F
Lao                                   #x0E80      #x0EFF
Tibetan                               #x0F00      #x0FFF
Myanmar                               #x1000      #x109F
Georgian                              #x10A0      #x10FF
HangulJamo                            #x1100      #x11FF
Ethiopic                              #x1200      #x137F
Cherokee                              #x13A0      #x13FF
UnifiedCanadianAboriginalSyllabics    #x1400      #x167F
Ogham                                 #x1680      #x169F
Runic                                 #x16A0      #x16FF
Khmer                                 #x1780      #x17FF
Mongolian                             #x1800      #x18AF
LatinExtendedAdditional               #x1E00      #x1EFF
GreekExtended                         #x1F00      #x1FFF
GeneralPunctuation                    #x2000      #x206F
SuperscriptsandSubscripts             #x2070      #x209F
CurrencySymbols                       #x20A0      #x20CF
CombiningMarksforSymbols              #x20D0      #x20FF
LetterlikeSymbols                     #x2100      #x214F
NumberForms                           #x2150      #x218F
Arrows                                #x2190      #x21FF
MathematicalOperators                 #x2200      #x22FF
MiscellaneousTechnical                #x2300      #x23FF
ControlPictures                       #x2400      #x243F
OpticalCharacterRecognition           #x2440      #x245F
EnclosedAlphanumerics                 #x2460      #x24FF
BoxDrawing                            #x2500      #x257F
BlockElements                         #x2580      #x259F
GeometricShapes                       #x25A0      #x25FF
MiscellaneousSymbols                  #x2600      #x26FF
Dingbats                              #x2700      #x27BF
BraillePatterns                       #x2800      #x28FF
CJKRadicalsSupplement                 #x2E80      #x2EFF
KangxiRadicals                        #x2F00      #x2FDF
IdeographicDescriptionCharacters      #x2FF0      #x2FFF
CJKSymbolsandPunctuation              #x3000      #x303F
Hiragana                              #x3040      #x309F
Katakana                              #x30A0      #x30FF
Bopomofo                              #x3100      #x312F
HangulCompatibilityJamo               #x3130      #x318F
Kanbun                                #x3190      #x319F
BopomofoExtended                      #x31A0      #x31BF
EnclosedCJKLettersandMonths           #x3200      #x32FF
CJKCompatibility                      #x3300      #x33FF
CJKUnifiedIdeographsExtensionA        #x3400      #x4DB5
CJKUnifiedIdeographs                  #x4E00      #x9FFF
YiSyllables                           #xA000      #xA48F
YiRadicals                            #xA490      #xA4CF
HangulSyllables                       #xAC00      #xD7A3
HighSurrogates                        #xD800      #xDB7F
HighPrivateUseSurrogates              #xDB80      #xDBFF
LowSurrogates                         #xDC00      #xDFFF
PrivateUse                            #xE000      #xF8FF
CJKCompatibilityIdeographs            #xF900      #xFAFF
AlphabeticPresentationForms           #xFB00      #xFB4F
ArabicPresentationForms-A             #xFB50      #xFDFF
CombiningHalfMarks                    #xFE20      #xFE2F
CJKCompatibilityForms                 #xFE30      #xFE4F
SmallFormVariants                     #xFE50      #xFE6F
ArabicPresentationForms-B             #xFE70      #xFEFE
Specials                              #xFEFF      #xFEFF
HalfwidthandFullwidthForms            #xFF00      #xFFEF
Specials                              #xFFF0      #xFFFD
OldItalic                             #x10300     #x1032F
Gothic                                #x10330     #x1034F
Deseret                               #x10400     #x1044F
ByzantineMusicalSymbols               #x1D000     #x1D0FF
MusicalSymbols                        #x1D100     #x1D1FF
MathematicalAlphanumericSymbols       #x1D400     #x1D7FF
CJKUnifiedIdeographsExtensionB        #x20000     #x2A6D6
CJKCompatibilityIdeographsSupplement  #x2F800     #x2FA1F
Tags                                  #xE0000     #xE007F
PrivateUse                            #xF0000     #x10FFFD
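
Since a block is just a contiguous range of code points, block membership can be sketched as a range check. The snippet below copies four of the ranges listed above into a hypothetical 'BLOCKS' dictionary; the function names are likewise illustrative, and a real validator would carry the full list.

```python
# A few block ranges copied from the list above: name -> (start, end).
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "BraillePatterns": (0x2800, 0x28FF),
}

def in_block(ch: str, block_name: str) -> bool:
    # Emulates the inclusive block escape: the code point lies in the range.
    start, end = BLOCKS[block_name]
    return start <= ord(ch) <= end

def not_in_block(ch: str, block_name: str) -> bool:
    # Emulates the exclusive form (uppercase 'P').
    return not in_block(ch, block_name)

print(in_block("\u03bb", "Greek"))   # Greek small letter lambda
print(in_block("a", "Greek"))
print(not_in_block("a", "Greek"))
```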

 

XML Schema - Regular Expressions - XML Character References

An expression may match a character by using the common XML character reference, which is a decimal number delimited by '&#' and ';', or a hex number delimited by '&#x' and ';'. For example, the uppercase letter 'Z' is referenced by the decimal representation '&#90;' and the hex representation '&#x5A;'. These numbers correspond directly to the characters documented in The Unicode Standard.
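
The correspondence between a character reference and its code point can be checked with Python's standard 'html' module, which decodes numeric character references. This is only an illustration of the two notations, not a schema processor; the variable names are hypothetical.

```python
import html

# Decimal and hexadecimal references for the same code point (90 == 0x5A).
decoded_decimal = html.unescape("&#90;")
decoded_hex = html.unescape("&#x5A;")

print(decoded_decimal, decoded_hex)
print(ord("Z") == 90 == 0x5A)
```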