Mojo struct

Codepoint

struct Codepoint

A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values.

This type is restricted to store a single Unicode scalar value, typically encoding a single user-recognizable character.

All valid Unicode scalar values are in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive. This type guarantees that the stored integer value falls in these ranges.

Codepoints versus Scalar Values

Formally, Unicode defines a codespace of values in the range 0 to 0x10FFFF inclusive, and a Unicode codepoint is any integer falling within that range. However, for historical reasons, it became necessary to "carve out" a subset of the codespace, excluding codepoints in the range 0xD800 to 0xDFFF. The codepoints outside that excluded range are known as Unicode scalar values. The codepoints in the range 0xD800 to 0xDFFF are known as "surrogate" codepoints. The surrogate codepoints will never be assigned a semantic meaning, and can only validly appear in UTF-16 encoded text.

The difference between codepoints and scalar values is a technical distinction related to the backwards-compatible workaround chosen to enable UTF-16 to encode the full range of the Unicode codespace. For simplicity's sake, and to avoid a confusing clash with the Mojo Scalar type, this type is pragmatically named Codepoint, even though it is restricted to valid scalar values.

Implemented traits

AnyType, Copyable, EqualityComparable, Intable, Movable, Stringable, UnknownDestructibility

Methods

`init`

__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])

Construct a Codepoint from a code point value without checking that it falls in the valid range.

Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type.

Args:

unsafe_unchecked_codepoint (SIMD): A valid Unicode scalar value code point.

__init__(out self, codepoint: SIMD[uint8, 1])

Construct a Codepoint from a single byte value.

This constructor cannot fail because every unsigned 8-bit integer (0 to 255) is a valid Unicode scalar value.

Args:

codepoint (SIMD): The 8-bit codepoint value to convert to a Codepoint.
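As a minimal sketch of this constructor (assuming, as in the `is_python_space` example in this reference, that `Codepoint` is in scope without an explicit import), each byte value maps to the codepoint with the same number. Bytes above 127 therefore become non-ASCII characters rather than raw UTF-8 bytes:

```mojo
from testing import assert_equal, assert_true


def main():
    # 0x41 is the ASCII code for "A".
    var a = Codepoint(UInt8(0x41))
    assert_equal(String(a), "A")

    # 0xE9 maps to U+00E9 ("é"), which is outside ASCII and
    # needs two bytes when encoded as UTF-8.
    var e_acute = Codepoint(UInt8(0xE9))
    assert_equal(Int(e_acute), 0xE9)
    assert_true(e_acute.utf8_byte_length() == 2)
```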

`eq`

__eq__(self, other: Self) -> Bool

Return True if this character has the same codepoint value as other.

Args:

other (Self): The codepoint value to compare against.

Returns:

Bool: True if this character and other have the same codepoint value; False otherwise.

`ne`

__ne__(self, other: Self) -> Bool

Return True if this character has a different codepoint value from other.

Args:

other (Self): The codepoint value to compare against.

Returns:

Bool: True if this character and other have different codepoint values; False otherwise.
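The two comparison operators can be exercised as follows. This is an illustrative sketch that assumes `Codepoint` is in scope without an import, as in the `is_python_space` example in this reference:

```mojo
from testing import assert_true, assert_false


def main():
    var a = Codepoint.ord("a")
    # from_u32 returns an Optional; 97 is a valid scalar value,
    # so calling value() on the result is safe here.
    var also_a = Codepoint.from_u32(97).value()
    var b = Codepoint.ord("b")

    assert_true(a == also_a)
    assert_true(a != b)
    assert_false(a == b)
```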

`from_u32`

static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Codepoint]

Construct a Codepoint from a code point value. Returns None if the provided codepoint is not in the valid range.

Args:

codepoint (SIMD): An integer representing a Unicode scalar value.

Returns:

Optional: A Codepoint if codepoint falls in the valid range for Unicode scalar values, otherwise None.
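The sketch below shows how the returned Optional distinguishes scalar values from surrogates and out-of-range integers. It assumes `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference) and that the Optional converts to `Bool`:

```mojo
from testing import assert_equal, assert_true, assert_false


def main():
    # U+20AC (euro sign) is a valid scalar value.
    var euro = Codepoint.from_u32(0x20AC)
    assert_true(Bool(euro))
    assert_equal(String(euro.value()), "€")

    # 0xD800 is a surrogate codepoint, not a scalar value.
    assert_false(Bool(Codepoint.from_u32(0xD800)))

    # 0x110000 lies beyond the end of the Unicode codespace.
    assert_false(Bool(Codepoint.from_u32(0x110000)))
```

Checking the Optional before calling `value()` keeps invalid input from reaching code that relies on the validity guarantee of this type.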

`ord`

static ord(string: StringSlice[origin]) -> Self

Returns the Codepoint that represents the given single-character string.

Given a string containing one character, return a Codepoint representing the codepoint of that character. For example, Codepoint.ord("a") returns the codepoint 97. This is the inverse of the chr() function.

This function is similar to the ord() free function, except that it returns a Codepoint instead of an Int.

Args:

string (StringSlice): The input string, which must contain only a single character.

Returns:

Self: A Codepoint representing the codepoint of the given character.
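A short sketch comparing `Codepoint.ord()` with the `ord()` free function mentioned above, assuming `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference):

```mojo
from testing import assert_equal


def main():
    var a = Codepoint.ord("a")

    # Converting the Codepoint to Int gives the same number that
    # the ord() free function returns directly.
    assert_equal(Int(a), 97)
    assert_equal(Int(a), ord("a"))

    # Non-ASCII single-character strings work the same way.
    assert_equal(Int(Codepoint.ord("€")), 0x20AC)
```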

`unsafe_decode_utf8_codepoint`

static unsafe_decode_utf8_codepoint(s: Span[SIMD[uint8, 1], origin]) -> Tuple[Codepoint, Int]

Decodes a single Codepoint from the start of a span of UTF-8 bytes, returning it together with the number of bytes read.

Safety: s MUST begin with the first byte of a known-valid UTF-8 character sequence. This function MUST NOT be used on unvalidated input.

Args:

s (Span): Span to UTF-8 encoded data containing at least one valid encoded codepoint.

Returns:

Tuple: The decoded Codepoint, as well as the number of bytes read.
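A hedged sketch of decoding the first character of a string. It assumes that `String.as_bytes()` yields a Span accepted by this function, that the returned Tuple can be indexed with `[0]` and `[1]`, and that `Codepoint` is in scope without an import. The input is a String, so its bytes are already valid UTF-8, which satisfies the safety requirement:

```mojo
from testing import assert_equal, assert_true


def main():
    # "€" is encoded in UTF-8 as the three bytes E2 82 AC.
    var text = String("€uro")
    var result = Codepoint.unsafe_decode_utf8_codepoint(text.as_bytes())

    # First element: the decoded Codepoint.
    assert_true(result[0] == Codepoint.ord("€"))
    # Second element: the number of bytes consumed.
    assert_equal(result[1], 3)
```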

`int`

__int__(self) -> Int

Returns the numeric value of this scalar value as an integer.

Returns:

Int: The numeric value of this scalar value as an integer.

`str`

__str__(self) -> String

Formats this Codepoint as a single-character string.

Returns:

String: A string containing this single character.
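The `__int__` and `__str__` conversions are normally reached through the `Int()` and `String()` constructors. A minimal sketch, assuming `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference):

```mojo
from testing import assert_equal


def main():
    var c = Codepoint.ord("é")

    # Int(c) calls __int__ and yields the scalar value, U+00E9.
    assert_equal(Int(c), 0xE9)

    # String(c) calls __str__ and yields a one-character string.
    assert_equal(String(c), "é")
```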

`is_ascii`

is_ascii(self) -> Bool

Returns True if this Codepoint is an ASCII character.

All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8.

Returns:

Bool: A boolean indicating if this Codepoint is an ASCII character.
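The boundary at 127 can be checked directly. This sketch assumes `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference):

```mojo
from testing import assert_true, assert_false


def main():
    assert_true(Codepoint.ord("a").is_ascii())

    # 127 (DEL) is the highest ASCII codepoint; 128 is the first
    # codepoint outside ASCII.
    assert_true(Codepoint(UInt8(127)).is_ascii())
    assert_false(Codepoint(UInt8(128)).is_ascii())

    assert_false(Codepoint.ord("é").is_ascii())
```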

`is_ascii_digit`

is_ascii_digit(self) -> Bool

Determines whether the given character is a digit [0-9].

Returns:

Bool: True if the character is a digit.

`is_ascii_upper`

is_ascii_upper(self) -> Bool

Determines whether the given character is an uppercase character.

This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ".

Returns:

Bool: True if the character is uppercase.

`is_ascii_lower`

is_ascii_lower(self) -> Bool

Determines whether the given character is a lowercase character.

This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz".

Returns:

Bool: True if the character is lowercase.
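The three ASCII classification methods `is_ascii_digit`, `is_ascii_upper`, and `is_ascii_lower` can be compared side by side. This sketch assumes `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference); the last assertion follows from the documented rule that only the letters A to Z count as uppercase:

```mojo
from testing import assert_true, assert_false


def main():
    var digit = Codepoint.ord("7")
    var upper = Codepoint.ord("Q")
    var lower = Codepoint.ord("q")

    assert_true(digit.is_ascii_digit())
    assert_true(upper.is_ascii_upper())
    assert_true(lower.is_ascii_lower())

    # Each character belongs to exactly one of the three classes.
    assert_false(upper.is_ascii_lower())
    assert_false(lower.is_ascii_upper())
    assert_false(digit.is_ascii_upper())

    # Non-ASCII uppercase letters are not recognized.
    assert_false(Codepoint.ord("É").is_ascii_upper())
```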

`is_ascii_printable`

is_ascii_printable(self) -> Bool

Determines whether the given character is a printable character.

Returns:

Bool: True if the character is a printable character, otherwise False.
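A brief sketch contrasting printable characters with a control character, assuming `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference):

```mojo
from testing import assert_true, assert_false


def main():
    # Letters and punctuation are printable.
    assert_true(Codepoint.ord("a").is_ascii_printable())
    assert_true(Codepoint.ord("~").is_ascii_printable())

    # A newline is a control character, not a printable one.
    assert_false(Codepoint.ord("\n").is_ascii_printable())
```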

`is_python_space`

is_python_space(self) -> Bool

Determines whether this character is a Python whitespace character.

This corresponds to Python's universal separators: " \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029".

Examples

Check whether individual characters are whitespace:

from testing import assert_true, assert_false

# ASCII space characters
assert_true(Codepoint.ord(" ").is_python_space())
assert_true(Codepoint.ord("\t").is_python_space())

# Unicode paragraph separator:
assert_true(Codepoint.from_u32(0x2029).value().is_python_space())

# Letters are not space characters
assert_false(Codepoint.ord("a").is_python_space())

Returns:

Bool: True if this character is one of the whitespace characters listed above, otherwise False.

`is_posix_space`

is_posix_space(self) -> Bool

Returns True if this Codepoint is a space character according to the POSIX locale.

The POSIX locale is also known as the C locale.

This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use String.isspace().

Returns:

Bool: True iff the character is one of the whitespace characters listed above.
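The difference between the POSIX and Python definitions shows up on Unicode separators. A minimal sketch, assuming `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference):

```mojo
from testing import assert_true, assert_false


def main():
    # A tab is whitespace under both definitions.
    var tab = Codepoint.ord("\t")
    assert_true(tab.is_posix_space())
    assert_true(tab.is_python_space())

    # U+2029 (paragraph separator) is in Python's list of
    # separators but not in the POSIX set " \t\n\v\f\r".
    var para_sep = Codepoint.from_u32(0x2029).value()
    assert_false(para_sep.is_posix_space())
    assert_true(para_sep.is_python_space())
```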

`to_u32`

to_u32(self) -> SIMD[uint32, 1]

Returns the numeric value of this scalar value as an unsigned 32-bit integer.

Returns:

SIMD: The numeric value of this scalar value as an unsigned 32-bit integer.
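`to_u32` is the counterpart of `from_u32`, so a value should survive a round trip. A small sketch, assuming `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference):

```mojo
from testing import assert_equal, assert_true


def main():
    # U+1F600 is a valid scalar value outside the 16-bit range.
    var cp = Codepoint.from_u32(0x1F600).value()
    assert_true(cp.to_u32() == 0x1F600)

    # The same number is available as an Int through __int__.
    assert_equal(Int(cp), Int(cp.to_u32()))
```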

`unsafe_write_utf8`

unsafe_write_utf8[optimize_ascii: Bool = True, branchless: Bool = False](self, ptr: UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, origin=origin]) -> UInt

Encodes this Codepoint as UTF-8 and writes the resulting bytes to ptr.

Safety: ptr MUST point to at least self.utf8_byte_length() allocated bytes or else an out-of-bounds write will occur, which is undefined behavior.

Unicode (represented as UInt32 BE) to UTF-8 conversion:

- 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa
  - a
- 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb
  - (a >> 6) | 0b11000000, b | 0b10000000
- 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc
  - (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000
- 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd
  - (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000
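To make the three-byte row of the table concrete, the sketch below works through U+20AC (the euro sign) by hand using plain integer arithmetic. It is an illustration of the bit layout, not the library's implementation; the explicit `& 0b00111111` masks, which isolate each six-bit group, are an addition that the table leaves implicit:

```mojo
from testing import assert_true


def main():
    # 0x20AC in binary: 0010 000010 101100 (aaaa bbbbbb cccccc).
    var cp: UInt32 = 0x20AC

    # Three-byte form: 1110aaaa 10bbbbbb 10cccccc
    var byte0 = (cp >> 12) | 0b11100000
    var byte1 = ((cp >> 6) & 0b00111111) | 0b10000000
    var byte2 = (cp & 0b00111111) | 0b10000000

    # The UTF-8 encoding of U+20AC is E2 82 AC.
    assert_true(byte0 == 0xE2)
    assert_true(byte1 == 0x82)
    assert_true(byte2 == 0xAC)
```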

Parameters:

optimize_ascii (Bool): Optimize for languages with mostly ASCII characters.
branchless (Bool): Use a branchless algorithm.

Args:

ptr (UnsafePointer): Pointer to the location where the encoded UTF-8 bytes are written. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data.

Returns:

UInt: Returns the number of bytes written.

`utf8_byte_length`

utf8_byte_length(self) -> UInt

Returns the number of UTF-8 bytes required to encode this character.

Notes: The returned value is always between 1 and 4 bytes.

Returns:

UInt: The number of UTF-8 bytes required to encode this character.
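One character from each encoded length illustrates the range of results. This sketch assumes `Codepoint` is in scope without an import (as in the `is_python_space` example in this reference):

```mojo
from testing import assert_true


def main():
    # U+0061: ASCII, 1 byte.
    assert_true(Codepoint.ord("a").utf8_byte_length() == 1)
    # U+00E9: 2 bytes.
    assert_true(Codepoint.ord("é").utf8_byte_length() == 2)
    # U+20AC: 3 bytes.
    assert_true(Codepoint.ord("€").utf8_byte_length() == 3)
    # U+1F600: outside the 16-bit range, 4 bytes.
    assert_true(Codepoint.ord("😀").utf8_byte_length() == 4)
```

The result is also the minimum buffer size that `unsafe_write_utf8` requires for the same Codepoint.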
