Mojo struct
Codepoint
struct Codepoint
A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values.
This type is restricted to store a single Unicode scalar value, typically encoding a single user-recognizable character.
All valid Unicode scalar values fall in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive. This type guarantees that the stored integer value falls in one of these ranges.
Codepoints versus Scalar Values
Formally, Unicode defines a codespace of values in the range 0 to 0x10FFFF inclusive, and a Unicode codepoint is any integer falling within that range. However, for historical reasons, it became necessary to "carve out" a subset of the codespace, excluding the codepoints in the range 0xD800 to 0xDFFF, inclusive. The codepoints outside that excluded range are known as Unicode scalar values. The codepoints in the range 0xD800 to 0xDFFF are known as "surrogate" codepoints. The surrogate codepoints will never be assigned a semantic meaning, and can only validly appear in UTF-16 encoded text.
The difference between codepoints and scalar values is a technical
distinction related to the backwards-compatible workaround chosen to enable
UTF-16 to encode the full range of the Unicode codespace. For simplicity's
sake, and to avoid a confusing clash with the Mojo Scalar type, this type
is pragmatically named Codepoint, even though it is restricted to valid
scalar values.
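The scalar value ranges stated above (0 to 0xD7FF and 0xE000 to 0x10FFFF) can be sketched as a plain range check. The `is_scalar_value` helper below is hypothetical and written only for illustration; it is not part of the `Codepoint` API and does not claim to match how the library validates values internally.

```mojo
from testing import assert_true, assert_false


fn is_scalar_value(value: UInt32) -> Bool:
    """Hypothetical helper: True if `value` is a Unicode scalar value."""
    # Everything from 0 up to and including 0xD7FF is a scalar value.
    if value <= 0xD7FF:
        return True
    # The surrogate gap is skipped; scalar values resume at 0xE000 and
    # run to the end of the codespace at 0x10FFFF.
    return value >= 0xE000 and value <= 0x10FFFF


def main():
    assert_true(is_scalar_value(0x41))
    assert_true(is_scalar_value(0xD7FF))
    assert_false(is_scalar_value(0xD800))  # first surrogate
    assert_false(is_scalar_value(0xDFFF))  # last surrogate
    assert_true(is_scalar_value(0xE000))
    assert_true(is_scalar_value(0x10FFFF))
    assert_false(is_scalar_value(0x110000))  # beyond the codespace
```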
Implemented traits
AnyType,
Copyable,
EqualityComparable,
GreaterThanComparable,
GreaterThanOrEqualComparable,
ImplicitlyCopyable,
Intable,
LessThanComparable,
LessThanOrEqualComparable,
Movable,
Stringable,
UnknownDestructibility
Aliases
__copyinit__is_trivial
alias __copyinit__is_trivial = True
__del__is_trivial
alias __del__is_trivial = True
__moveinit__is_trivial
alias __moveinit__is_trivial = True
Methods
__init__
__init__(out self, *, unsafe_unchecked_codepoint: UInt32)
Construct a Codepoint from a code point value without checking that it falls in the valid range.
Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type.
Args:
- unsafe_unchecked_codepoint (UInt32): A valid Unicode scalar value code point.
__init__(out self, codepoint: UInt8)
Construct a Codepoint from a single byte value.
This constructor cannot fail because every unsigned 8-bit integer value (0 to 255) is a valid Unicode scalar value.
Args:
- codepoint (UInt8): The 8-bit codepoint value to convert to a Codepoint.
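A minimal sketch of the two constructors documented above, assuming `Codepoint` is available without an explicit import (as in the example elsewhere on this page). The values 0x41 and 0x20AC are illustrative choices.

```mojo
from testing import assert_true


def main():
    # Every UInt8 value is a valid scalar value, so no check is needed.
    var a = Codepoint(UInt8(0x41))
    assert_true(a.to_u32() == 0x41)

    # The unchecked constructor skips validation. The caller vouches that
    # 0x20AC (the euro sign) is a valid Unicode scalar value.
    var euro = Codepoint(unsafe_unchecked_codepoint=0x20AC)
    assert_true(euro.to_u32() == 0x20AC)
```

When the value comes from untrusted input, the checked `from_u32` constructor is the safer choice, since it reports invalid values instead of trusting the caller.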
__lt__
__lt__(self, other: Self) -> Bool
Return True if this character's codepoint value is less than that of other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
Bool: True if this character's value is less than the other codepoint value; False otherwise.
__le__
__le__(self, other: Self) -> Bool
Return True if this character's codepoint value is less than or equal to that of other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
Bool: True if this character's value is less than or equal to the other codepoint value; False otherwise.
__eq__
__eq__(self, other: Self) -> Bool
Return True if this character has the same codepoint value as other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
Bool: True if this character and other have the same codepoint value; False otherwise.
__ne__
__ne__(self, other: Self) -> Bool
Return True if this character has a different codepoint value from other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
Bool: True if this character and other have different codepoint values; False otherwise.
__gt__
__gt__(self, other: Self) -> Bool
Return True if this character's codepoint value is greater than that of other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
Bool: True if this character's value is greater than the other codepoint value; False otherwise.
__ge__
__ge__(self, other: Self) -> Bool
Return True if this character's codepoint value is greater than or equal to that of other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
Bool: True if this character's value is greater than or equal to the other codepoint value; False otherwise.
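The comparison operators above order codepoints by their numeric value. A short sketch, using only `Codepoint.ord` and the operators documented on this page:

```mojo
from testing import assert_true


def main():
    var a = Codepoint.ord("a")  # U+0061
    var b = Codepoint.ord("b")  # U+0062

    assert_true(a < b)
    assert_true(a <= b)
    assert_true(b > a)
    assert_true(b >= a)
    assert_true(a != b)
    assert_true(a == Codepoint.ord("a"))

    # Ordering is purely numeric: "Z" (U+005A) sorts before "a" (U+0061).
    assert_true(Codepoint.ord("Z") < a)
```

Because the ordering is numeric rather than linguistic, it is not a substitute for locale-aware collation.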
from_u32
static from_u32(codepoint: UInt32) -> Optional[Codepoint]
Construct a Codepoint from a code point value. Returns None if the provided codepoint is not in the valid range.
Args:
- codepoint (UInt32): An integer representing a Unicode scalar value.
Returns:
Optional: A Codepoint if codepoint falls in the valid range for Unicode scalar values, otherwise None.
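One way to consume the Optional returned by `from_u32` is to substitute a fallback for invalid input. The `scalar_or_replacement` helper below is hypothetical, and the choice of U+FFFD (the Unicode replacement character) as the fallback is an assumption made for this sketch.

```mojo
from testing import assert_true


fn scalar_or_replacement(value: UInt32) -> Codepoint:
    """Hypothetical helper: fall back to U+FFFD for invalid input."""
    var maybe = Codepoint.from_u32(value)
    if maybe:
        return maybe.value()
    # U+FFFD is itself a valid scalar value, so the unchecked
    # constructor's safety requirement is met.
    return Codepoint(unsafe_unchecked_codepoint=0xFFFD)


def main():
    assert_true(scalar_or_replacement(0x61).to_u32() == 0x61)
    # 0xD800 is a surrogate; 0x110000 is past the end of the codespace.
    assert_true(scalar_or_replacement(0xD800).to_u32() == 0xFFFD)
    assert_true(scalar_or_replacement(0x110000).to_u32() == 0xFFFD)
```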
ord
static ord(string: StringSlice[origin]) -> Self
Returns the Codepoint that represents the given single-character string.
Given a string containing one character, return a Codepoint
representing the codepoint of that character. For example,
Codepoint.ord("a") returns the codepoint 97. This is the inverse of
the chr() function.
This function is similar to the ord() free function, except that it
returns a Codepoint instead of an Int.
Args:
- string (StringSlice): The input string, which must contain only a single character.
Returns:
Self: A Codepoint representing the codepoint of the given character.
unsafe_decode_utf8_codepoint
static unsafe_decode_utf8_codepoint(s: Span[UInt8, origin]) -> Tuple[Codepoint, Int]
Decodes a single Codepoint from the start of a UTF-8 encoded span, and returns it together with the number of bytes read.
Safety:
s MUST begin with the first byte of a known-valid UTF-8 character sequence. This function MUST NOT be used on unvalidated input.
Args:
- s (Span): Span to UTF-8 encoded data containing at least one valid encoded codepoint.
Returns:
Tuple: The decoded Codepoint, as well as the number of bytes read.
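A hedged sketch of decoding the first codepoint of a string. It assumes that `String.as_bytes()` yields a span of the string's UTF-8 bytes; that method is not documented on this page, so treat it as an assumption. The input is a string literal, which is taken to be valid UTF-8 and so satisfies the safety requirement.

```mojo
from testing import assert_true


def main():
    # Known-valid UTF-8 input: "€" is U+20AC, encoded as three bytes.
    var text = String("€uro")

    # Assumption: as_bytes() returns a Span over the UTF-8 bytes of `text`.
    var decoded = Codepoint.unsafe_decode_utf8_codepoint(text.as_bytes())

    # The tuple holds the decoded Codepoint and the number of bytes read.
    assert_true(decoded[0].to_u32() == 0x20AC)
    assert_true(decoded[1] == 3)
```

The byte count tells the caller how far to advance before decoding the next codepoint.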
__int__
__int__(self) -> Int
Returns the numeric value of this scalar value as an integer.
Returns:
Int: The numeric value of this scalar value as an integer.
__str__
__str__(self) -> String
Formats this Codepoint as a single-character string.
Returns:
String: A string containing this single character.
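The conversions above can be combined into a round trip between a character, its numeric value, and its string form. This sketch assumes that `Int(...)` and `String(...)` dispatch to `__int__` and `__str__` through the Intable and Stringable traits listed for this type.

```mojo
from testing import assert_true


def main():
    var cp = Codepoint.ord("a")

    # Intable: the numeric scalar value.
    assert_true(Int(cp) == 97)

    # Stringable: a single-character string.
    assert_true(String(cp) == "a")

    # Round trip through the unsigned 32-bit value.
    var again = Codepoint.from_u32(cp.to_u32()).value()
    assert_true(again == cp)
```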
is_ascii
is_ascii(self) -> Bool
Returns True if this Codepoint is an ASCII character.
All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8.
Returns:
Bool: A boolean indicating if this Codepoint is an ASCII character.
is_ascii_digit
is_ascii_digit(self) -> Bool
Determines whether the given character is a digit [0-9].
Returns:
Bool: True if the character is a digit.
is_ascii_upper
is_ascii_upper(self) -> Bool
Determines whether the given character is an uppercase character.
This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
Returns:
Bool: True if the character is uppercase.
is_ascii_lower
is_ascii_lower(self) -> Bool
Determines whether the given character is a lowercase character.
This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz".
Returns:
Bool: True if the character is lowercase.
is_ascii_printable
is_ascii_printable(self) -> Bool
Determines whether the given character is a printable character.
Returns:
Bool: True if the character is a printable character, otherwise False.
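The ASCII predicates above compose naturally. The `is_ascii_alnum` helper below is hypothetical and shown only as a sketch of combining them; it is not part of the `Codepoint` API.

```mojo
from testing import assert_true, assert_false


fn is_ascii_alnum(cp: Codepoint) -> Bool:
    """Hypothetical helper: True for characters in [0-9A-Za-z]."""
    return cp.is_ascii_digit() or cp.is_ascii_upper() or cp.is_ascii_lower()


def main():
    assert_true(is_ascii_alnum(Codepoint.ord("7")))
    assert_true(is_ascii_alnum(Codepoint.ord("Q")))
    assert_true(is_ascii_alnum(Codepoint.ord("q")))
    assert_false(is_ascii_alnum(Codepoint.ord("_")))

    # "€" (U+20AC) is above 127, so it is not ASCII at all.
    assert_false(Codepoint.ord("€").is_ascii())
    assert_false(is_ascii_alnum(Codepoint.ord("€")))
```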
is_python_space
is_python_space(self) -> Bool
Determines whether this character is a Python whitespace character.
This corresponds to Python's universal separators:
" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029".
Examples
Check whether individual characters are Python whitespace:
from testing import assert_true, assert_false
# ASCII space characters
assert_true(Codepoint.ord(" ").is_python_space())
assert_true(Codepoint.ord("\t").is_python_space())
# Unicode paragraph separator:
assert_true(Codepoint.from_u32(0x2029).value().is_python_space())
# Letters are not space characters
assert_false(Codepoint.ord("a").is_python_space())
Returns:
Bool: True if this character is one of the whitespace characters listed above, otherwise False.
is_posix_space
is_posix_space(self) -> Bool
Returns True if this Codepoint is a space character according to the POSIX locale.
The POSIX locale is also known as the C locale.
This only respects the default "C" locale, i.e. returns True only if the
character specified is one of " \t\n\v\f\r". For semantics similar
to Python, use String.isspace().
Returns:
Bool: True iff the character is one of the whitespace characters listed
above.
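The two whitespace predicates differ for characters outside the POSIX set. A short sketch based on the character lists given for `is_python_space` and `is_posix_space`:

```mojo
from testing import assert_true, assert_false


def main():
    # A tab appears in both lists.
    var tab = Codepoint.ord("\t")
    assert_true(tab.is_posix_space())
    assert_true(tab.is_python_space())

    # U+2029 PARAGRAPH SEPARATOR is in the Python list only.
    var para_sep = Codepoint.from_u32(0x2029).value()
    assert_true(para_sep.is_python_space())
    assert_false(para_sep.is_posix_space())
```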
to_u32
to_u32(self) -> UInt32
Returns the numeric value of this scalar value as an unsigned 32-bit integer.
Returns:
UInt32: The numeric value of this scalar value as an unsigned 32-bit
integer.
unsafe_write_utf8
unsafe_write_utf8[optimize_ascii: Bool = True, branchless: Bool = False](self, ptr: UnsafePointer[UInt8, address_space=address_space, origin=origin]) -> Int
Encode this codepoint as UTF-8 and write the resulting bytes to ptr.
Safety:
ptr MUST point to at least self.utf8_byte_length() allocated
bytes or else an out-of-bounds write will occur, which is undefined
behavior.
Unicode (represented as UInt32 BE) to UTF-8 conversion:
- 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa
  - a
- 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb
  - (a >> 6) | 0b11000000, b | 0b10000000
- 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc
  - (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000
- 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd
  - (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000
Parameters:
- optimize_ascii (Bool): Optimize for languages with mostly ASCII characters.
- branchless (Bool): Use a branchless algorithm.
Args:
- ptr (UnsafePointer): Pointer value to write the encoded UTF-8 bytes. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data.
Returns:
Int: Returns the number of bytes written.
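The bit layout in the conversion table can be sketched in plain arithmetic. The `encode_utf8` helper below is hypothetical: it returns a List instead of writing through a pointer, always branches, and is not the library's implementation. It assumes its input is already a valid scalar value.

```mojo
from testing import assert_true


fn encode_utf8(value: UInt32) -> List[UInt8]:
    """Hypothetical helper mirroring the table; expects a scalar value."""
    var out = List[UInt8]()
    if value < 0x80:
        # 1 byte: 0aaaaaaa
        out.append(value.cast[DType.uint8]())
    elif value < 0x800:
        # 2 bytes: 110aaaaa 10bbbbbb
        out.append(((value >> 6) | 0b11000000).cast[DType.uint8]())
        out.append(((value & 0b00111111) | 0b10000000).cast[DType.uint8]())
    elif value < 0x10000:
        # 3 bytes: 1110aaaa 10bbbbbb 10cccccc
        out.append(((value >> 12) | 0b11100000).cast[DType.uint8]())
        out.append((((value >> 6) & 0b00111111) | 0b10000000).cast[DType.uint8]())
        out.append(((value & 0b00111111) | 0b10000000).cast[DType.uint8]())
    else:
        # 4 bytes: 11110aaa 10bbbbbb 10cccccc 10dddddd
        out.append(((value >> 18) | 0b11110000).cast[DType.uint8]())
        out.append((((value >> 12) & 0b00111111) | 0b10000000).cast[DType.uint8]())
        out.append((((value >> 6) & 0b00111111) | 0b10000000).cast[DType.uint8]())
        out.append(((value & 0b00111111) | 0b10000000).cast[DType.uint8]())
    return out^


def main():
    # U+20AC (euro sign) encodes to the three bytes E2 82 AC.
    var euro = encode_utf8(0x20AC)
    assert_true(len(euro) == 3)
    assert_true(euro[0] == 0xE2)
    assert_true(euro[1] == 0x82)
    assert_true(euro[2] == 0xAC)
```

Each continuation byte carries six payload bits behind the 10 prefix, which is why the shifts step down by 6 and the mask is 0b00111111.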
utf8_byte_length
utf8_byte_length(self) -> Int
Returns the number of UTF-8 bytes required to encode this character.
Notes: The returned value is always between 1 and 4 bytes.
Returns:
Int: Byte count of UTF-8 bytes required to encode this character.
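A brief sketch showing one codepoint from each UTF-8 length class. The specific characters are illustrative choices.

```mojo
from testing import assert_true


def main():
    # U+0061 "a": ASCII, one byte.
    assert_true(Codepoint.ord("a").utf8_byte_length() == 1)
    # U+00E9 (e with acute accent): two bytes.
    assert_true(Codepoint.from_u32(0xE9).value().utf8_byte_length() == 2)
    # U+20AC (euro sign): three bytes.
    assert_true(Codepoint.ord("€").utf8_byte_length() == 3)
    # U+1F525 (fire emoji): four bytes.
    assert_true(Codepoint.from_u32(0x1F525).value().utf8_byte_length() == 4)
```

This is the value to use when sizing the buffer passed to `unsafe_write_utf8`.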