sanitize

sanitize(source: text) 🡒 text, pure

The text that a user sees can differ from the text they expect when processing it. The text might contain so-called invisible characters, which either look like regular space characters on screen or do not appear at all. Examples include tabs, no-break spaces and Hangul Fillers. The invisible characters supported by the sanitize function are listed below under Supported Invisible Characters. For further study, an exhaustive list of invisible Unicode characters can be found at https://invisible-characters.com.

Envision’s escape function makes such characters visible by replacing each one with its Unicode notation. Cleaning up the text for safer processing is more cumbersome, however, because removing or replacing these characters can require a separate replace operation for each invisible character contained in the source text.

The sanitize function removes all supported invisible characters from the source argument and returns a cleaned-up version of the source text.

Example

This example illustrates using the function with accepted characters:

t = "Sanitizing a no-break-space\u00A0, a Hangul Filler\u3164, and a tab\t."

l = strlen(t)
show scalar "The untouched text source. \{l} Characters." a1e1 with t

show scalar "Revealing the invisible characters. \{l} Characters." a2e2 with escape(t)

t = sanitize(t)
l = strlen(t)

{ textBold: true }
show scalar "The sanitized text. \{l} Characters." a4e4 with t

Supported Invisible Characters

The sanitize function removes every occurrence of the invisible characters in the following list:

Code Point Unicode Character Name
U+0009 CHARACTER TABULATION
U+00A0 NO-BREAK SPACE
U+00AD SOFT HYPHEN
U+034F COMBINING GRAPHEME JOINER
U+061C ARABIC LETTER MARK
U+115F HANGUL CHOSEONG FILLER
U+1160 HANGUL JUNGSEONG FILLER
U+17B4 KHMER VOWEL INHERENT AQ
U+17B5 KHMER VOWEL INHERENT AA
U+180E MONGOLIAN VOWEL SEPARATOR
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+200E LEFT-TO-RIGHT MARK
U+200F RIGHT-TO-LEFT MARK
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+2060 WORD JOINER
U+2061 FUNCTION APPLICATION
U+2062 INVISIBLE TIMES
U+2063 INVISIBLE SEPARATOR
U+2064 INVISIBLE PLUS
U+206A INHIBIT SYMMETRIC SWAPPING
U+206B ACTIVATE SYMMETRIC SWAPPING
U+206C INHIBIT ARABIC FORM SHAPING
U+206D ACTIVATE ARABIC FORM SHAPING
U+206F NOMINAL DIGIT SHAPES
U+2800 BRAILLE PATTERN BLANK
U+3000 IDEOGRAPHIC SPACE
U+3164 HANGUL FILLER
U+FEFF ZERO WIDTH NO-BREAK SPACE
U+FFA0 HALFWIDTH HANGUL FILLER
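
As an illustration only, the removal step can be modeled outside Envision. The following Python sketch is not Envision's implementation: the names INVISIBLE and strip_invisible are hypothetical, and the code points are transcribed from the table above. It covers only the removal of invisible characters and does not model the source validation described in the sections below.

```python
# Hypothetical model of the removal step; code points copied from the table above.
INVISIBLE = frozenset(
    [0x0009, 0x00A0, 0x00AD, 0x034F, 0x061C, 0x115F, 0x1160, 0x17B4, 0x17B5, 0x180E]
    + list(range(0x2000, 0x2010))   # U+2000 .. U+200F
    + [0x202F, 0x205F]
    + list(range(0x2060, 0x2065))   # U+2060 .. U+2064
    + list(range(0x206A, 0x206E))   # U+206A .. U+206D
    + [0x206F, 0x2800, 0x3000, 0x3164, 0xFEFF, 0xFFA0]
)


def strip_invisible(source: str) -> str:
    """Return source without any character listed in INVISIBLE."""
    return "".join(ch for ch in source if ord(ch) not in INVISIBLE)


# Same text as in the Envision example above.
t = "Sanitizing a no-break-space\u00A0, a Hangul Filler\u3164, and a tab\t."
cleaned = strip_invisible(t)
print(len(t), len(cleaned))
print(cleaned)
```

A single pass over the text replaces the chain of per-character replace operations that would otherwise be needed.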

Valid source text

The function only accepts text containing characters from the following Unicode blocks:

Character Range Unicode Block
U+0020 – U+007F Basic Latin (without C0 control codes)
U+0080 – U+009F C1 control codes
U+00A0 – U+00FF Latin-1 Supplement
U+0100 – U+017F Latin Extended-A
U+0180 – U+024F Latin Extended-B
U+0250 – U+02AF IPA Extensions
U+02B0 – U+02FF Spacing Modifier Letters
U+0300 – U+036F Combining Diacritical Marks
U+0370 – U+03FF Greek/Coptic
U+0400 – U+04FF Cyrillic
U+2010 – U+2027 General Punctuation
U+2030 – U+205E General Punctuation
U+2061 – U+2064 General Punctuation
U+20A0 – U+20C0 Currency Symbols
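
The table above can be read as a simple range lookup. The Python sketch below is a model for illustration, not Envision code: VALID_RANGES and accepted_block are hypothetical names, and the ranges are transcribed from the table. How sanitize treats the supported invisible characters that fall outside these ranges (such as the tab, U+0009) is not modeled here.

```python
from typing import Optional

# (first code point, last code point, block name), transcribed from the table above.
VALID_RANGES = [
    (0x0020, 0x007F, "Basic Latin (without C0 control codes)"),
    (0x0080, 0x009F, "C1 control codes"),
    (0x00A0, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0370, 0x03FF, "Greek/Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x2010, 0x2027, "General Punctuation"),
    (0x2030, 0x205E, "General Punctuation"),
    (0x2061, 0x2064, "General Punctuation"),
    (0x20A0, 0x20C0, "Currency Symbols"),
]


def accepted_block(ch: str) -> Optional[str]:
    """Return the name of the accepted range containing ch, or None."""
    code_point = ord(ch)
    for first, last, name in VALID_RANGES:
        if first <= code_point <= last:
            return name
    return None


# A Latin letter, an accented letter, the euro sign, and a dingbat.
for ch in "a\u00E9\u20AC\u2702":
    print(f"U+{ord(ch):04X}", accepted_block(ch))
```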

Invalid source text

The sanitize function restricts the text that it processes to supported languages and symbols that can be expected in supply chain forms.

Characters from the following Unicode ranges are not accepted by the function:

Unicode Range Unaccepted Unicode Block
U+0000 – U+001F C0 Control Codes
U+0500 – U+1FFF Unsupported Languages
U+2070 – U+209F Superscripts and Subscripts
U+20D0 and above Unsupported Languages, Emojis, Dingbats and Symbols

If any such characters are contained in the source argument, Envision will respond with a message using this format:

sanitize(): "<source>" has invalid character \u<character value> '<Unicode character>'.

For example, when evaluating

t = sanitize("Text with Dingbats ✂")

Envision will report the following:

sanitize(): "Text with Dingbats ✂" has invalid character \u2702 '✂'.

Passing a source text with emojis will produce a similar response:

t = sanitize("Text with emoji 🙂")

Envision will then report:

sanitize(): "Text with emoji 🙂" has invalid character \uD83D\uDE42 '🙂'.
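
The two \u groups in the emoji message are consistent with UTF-16 code units: U+1F642 lies outside the Basic Multilingual Plane, so UTF-16 encodes it as the surrogate pair D83D DE42. This reading is inferred from the two messages shown above rather than stated by the documentation. The Python sketch below reproduces the message format under that assumption; utf16_escape and invalid_character_message are hypothetical helper names, not Envision functions.

```python
def utf16_escape(ch: str) -> str:
    """Render ch as one \\uXXXX group per UTF-16 code unit."""
    data = ch.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join("\\u" + format(unit, "04X") for unit in units)


def invalid_character_message(source: str, ch: str) -> str:
    """Build a message in the format documented above (assumed, not official)."""
    template = "sanitize(): \"{source}\" has invalid character {escape} '{char}'."
    return template.format(source=source, escape=utf16_escape(ch), char=ch)


print(invalid_character_message("Text with Dingbats \u2702", "\u2702"))
print(invalid_character_message("Text with emoji \U0001F642", "\U0001F642"))
```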
