Mojo struct
Char
struct Char
A single textual character.
This type represents a single textual character. Specifically, this type stores a single Unicode scalar value, typically encoding a single user-recognizable character.
All valid Unicode scalar values are in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive; the gap between them (0xD800 to 0xDFFF) is reserved for UTF-16 surrogates. This type guarantees that the stored integer value falls in these ranges.
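The range guarantee can be expressed as a small predicate. The following is a Python sketch of the check only, not Mojo code and not the library's implementation; the helper name `is_unicode_scalar_value` is hypothetical.

```python
def is_unicode_scalar_value(value: int) -> bool:
    """Return True if value lies in 0..0xD7FF or 0xE000..0x10FFFF (inclusive)."""
    return 0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


# The surrogate block 0xD800..0xDFFF and anything above 0x10FFFF are rejected.
print(is_unicode_scalar_value(0x41))      # an ordinary letter
print(is_unicode_scalar_value(0xD800))    # first surrogate
print(is_unicode_scalar_value(0x110000))  # one past the last code point
```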
Implemented traits
AnyType, CollectionElement, Copyable, EqualityComparable, EqualityComparableCollectionElement, ExplicitlyCopyable, Intable, Movable, Stringable, UnknownDestructibility
Methods
__init__
__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])
Construct a Char from a code point value without checking that it falls in the valid range.
Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type.
Args:
- unsafe_unchecked_codepoint (SIMD[uint32, 1]): A valid Unicode scalar value code point.
__init__(out self, codepoint: SIMD[uint8, 1])
Construct a Char from a single byte value.
This constructor cannot fail because every unsigned 8-bit integer (0 to 255) is a valid Unicode scalar value.
Args:
- codepoint (SIMD[uint8, 1]): The 8-bit codepoint value to convert to a Char.
__eq__
__eq__(self, other: Self) -> Bool
Return True if this character has the same codepoint value as other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
True if this character and other have the same codepoint value; False otherwise.
__ne__
__ne__(self, other: Self) -> Bool
Return True if this character has a different codepoint value from other.
Args:
- other (Self): The codepoint value to compare against.
Returns:
True if this character and other have different codepoint values; False otherwise.
from_u32
static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Char]
Construct a Char from a code point value. Returns None if the provided codepoint is not in the valid range.
Args:
- codepoint (SIMD[uint32, 1]): An integer representing a Unicode scalar value.
Returns:
A Char if codepoint falls in the valid range for Unicode scalar values, otherwise None.
ord
static ord(string: StringSlice[origin]) -> Self
Returns the Char that represents the given one-character string.
Given a string representing one character, return a Char representing the codepoint of that character. For example, Char.ord("a") returns the codepoint 97. This is the inverse of the chr() function.
This function is similar to the ord() free function, except that it returns a Char instead of an Int.
Args:
- string (StringSlice[origin]): The input string, which must contain only a single character.
Returns:
A Char representing the codepoint of the given character.
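The ord()/chr() relationship described above mirrors Python's built-ins of the same names. As a quick illustration in Python (not Mojo code), the round trip looks like this:

```python
# ord() maps a one-character string to its code point; chr() is the inverse.
code = ord("a")
back = chr(code)
print(code, back)

# The same holds outside ASCII: the euro sign is U+20AC.
euro = ord("€")
print(hex(euro), chr(euro))
```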
unsafe_decode_utf8_char
static unsafe_decode_utf8_char(_ptr: UnsafePointer[SIMD[uint8, 1]]) -> Tuple[Char, Int]
Decodes a single Char from a given UTF-8 string pointer and returns it together with the number of bytes read.
Safety: _ptr MUST point to the first byte in a known-valid UTF-8 character sequence. This function MUST NOT be used on unvalidated input.
Args:
- _ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer to UTF-8 encoded data containing at least one valid encoded codepoint.
Returns:
The decoded codepoint Char, as well as the number of bytes read.
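To make the "decoded value plus bytes read" contract concrete, here is a Python sketch of decoding one code point from already-validated UTF-8. It is an illustration of the general technique, not the Mojo implementation; the name `decode_utf8_char` and the `offset` parameter are hypothetical. Like the method above, it performs no validation and assumes well-formed input.

```python
from typing import Tuple


def decode_utf8_char(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one code point starting at data[offset].

    Returns (codepoint, number_of_bytes_read). Assumes valid UTF-8.
    """
    first = data[offset]
    if first < 0x80:        # 0xxxxxxx: single-byte (ASCII) sequence
        return first, 1
    if first < 0xE0:        # 110xxxxx: two-byte sequence
        length, value = 2, first & 0x1F
    elif first < 0xF0:      # 1110xxxx: three-byte sequence
        length, value = 3, first & 0x0F
    else:                   # 11110xxx: four-byte sequence
        length, value = 4, first & 0x07
    # Each continuation byte (10xxxxxx) contributes its low six bits.
    for byte in data[offset + 1 : offset + length]:
        value = (value << 6) | (byte & 0x3F)
    return value, length


print(decode_utf8_char("€".encode("utf-8")))
```

The returned length lets a caller advance through a buffer one character at a time.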
__int__
__int__(self) -> Int
Returns the numeric value of this scalar value as an integer.
Returns:
The numeric value of this scalar value as an integer.
__str__
__str__(self) -> String
Formats this Char as a single-character string.
Returns:
A string containing this single character.
is_ascii
is_ascii(self) -> Bool
Returns True if this Char is an ASCII character.
All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8.
Returns:
A boolean indicating if this Char is an ASCII character.
is_ascii_digit
is_ascii_digit(self) -> Bool
Determines whether the given character is a digit [0-9].
Returns:
True if the character is a digit.
is_ascii_upper
is_ascii_upper(self) -> Bool
Determines whether the given character is an uppercase character.
This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
Returns:
True if the character is uppercase.
is_ascii_lower
is_ascii_lower(self) -> Bool
Determines whether the given character is a lowercase character.
This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz".
Returns:
True if the character is lowercase.
is_ascii_printable
is_ascii_printable(self) -> Bool
Determines whether the given character is a printable character.
Returns:
True if the character is a printable character, otherwise False.
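The ASCII predicates above reduce to simple range checks on the codepoint value. The Python sketch below (hypothetical helper names, operating on plain integers) illustrates them. The printable range is an assumption: the page does not define it, so the conventional ASCII range of space (0x20) through tilde (0x7E) is used here.

```python
def is_ascii(cp: int) -> bool:
    return cp <= 0x7F


def is_ascii_digit(cp: int) -> bool:
    return ord("0") <= cp <= ord("9")


def is_ascii_upper(cp: int) -> bool:
    return ord("A") <= cp <= ord("Z")


def is_ascii_lower(cp: int) -> bool:
    return ord("a") <= cp <= ord("z")


def is_ascii_printable(cp: int) -> bool:
    # Assumption: printable means space (0x20) through tilde (0x7E).
    return 0x20 <= cp <= 0x7E


# Non-ASCII letters and digits are rejected by the ASCII-only checks.
print(is_ascii_upper(ord("Q")), is_ascii_upper(ord("É")))
print(is_ascii_digit(ord("7")), is_ascii_digit(ord("٣")))
```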
is_python_space
is_python_space(self) -> Bool
Determines whether this character is a Python whitespace character.
This corresponds to Python's universal separators: " \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029".
Examples
Check whether individual characters are whitespace:
from testing import assert_true, assert_false
# ASCII space characters
assert_true(Char.ord(" ").is_python_space())
assert_true(Char.ord("\t").is_python_space())
# Unicode paragraph separator:
assert_true(Char.from_u32(0x2029).value().is_python_space())
# Letters are not space characters
assert_false(Char.ord("a").is_python_space())
Returns:
True if this character is one of the whitespace characters listed above, otherwise False.
is_posix_space
is_posix_space(self) -> Bool
Returns True if this Char is a space character according to the POSIX locale.
The POSIX locale is also known as the C locale.
This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use String.isspace().
Returns:
True iff the character is one of the whitespace characters listed above.
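The difference between the POSIX set and the larger Python-style set can be seen by writing both out. This Python sketch uses only the two character lists quoted on this page; the helper names are hypothetical and it is not the Mojo implementation.

```python
# The six POSIX ("C" locale) whitespace characters listed above.
POSIX_SPACE = frozenset(" \t\n\v\f\r")

# The Python-style set adds separators beyond the POSIX set.
PYTHON_SPACE = POSIX_SPACE | frozenset("\x1c\x1d\x1e\x85\u2028\u2029")


def is_posix_space(ch: str) -> bool:
    return ch in POSIX_SPACE


def is_python_space(ch: str) -> bool:
    return ch in PYTHON_SPACE


# U+2029 (paragraph separator) is whitespace only under the Python-style rule.
print(is_posix_space("\u2029"), is_python_space("\u2029"))
```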
to_u32
to_u32(self) -> SIMD[uint32, 1]
Returns the numeric value of this scalar value as an unsigned 32-bit integer.
Returns:
The numeric value of this scalar value as an unsigned 32-bit integer.
unsafe_write_utf8
unsafe_write_utf8(self, ptr: UnsafePointer[SIMD[uint8, 1]]) -> UInt
Encode this character as UTF-8 and write the resulting bytes to ptr.
Safety: ptr MUST point to at least self.utf8_byte_length() allocated bytes or else an out-of-bounds write will occur, which is undefined behavior.
Unicode (represented as UInt32 BE) to UTF-8 conversion:
- 1: 00000000 00000000 00000000 0aaaaaaa -> 0aaaaaaa
  - a
- 2: 00000000 00000000 00000aaa aabbbbbb -> 110aaaaa 10bbbbbb
  - (a >> 6) | 0b11000000, b | 0b10000000
- 3: 00000000 00000000 aaaabbbb bbcccccc -> 1110aaaa 10bbbbbb 10cccccc
  - (a >> 12) | 0b11100000, (b >> 6) | 0b10000000, c | 0b10000000
- 4: 00000000 000aaabb bbbbcccc ccdddddd -> 11110aaa 10bbbbbb 10cccccc 10dddddd
  - (a >> 18) | 0b11110000, (b >> 12) | 0b10000000, (c >> 6) | 0b10000000, d | 0b10000000
Args:
- ptr (UnsafePointer[SIMD[uint8, 1]]): Pointer to the location where the encoded UTF-8 bytes are written. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data.
Returns:
The number of bytes written.
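The bit layout above can be written out explicitly. The following Python sketch (hypothetical helper `encode_utf8`, returning a bytes object rather than writing through a pointer) applies the shifts and masks to the whole codepoint value; it illustrates the encoding scheme and is not the Mojo implementation.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode a Unicode scalar value as 1 to 4 UTF-8 bytes."""
    if cp < 0x80:
        # 0aaaaaaa
        return bytes([cp])
    if cp < 0x800:
        # 110aaaaa 10bbbbbb
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110aaaa 10bbbbbb 10cccccc
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110aaa 10bbbbbb 10cccccc 10dddddd
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check against Python's own encoder.
for ch in "aé€😀":
    print(ch, encode_utf8(ord(ch)) == ch.encode("utf-8"))
```

The length of the returned bytes corresponds to the byte count that the method above reports.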
utf8_byte_length
utf8_byte_length(self) -> UInt
Returns the number of UTF-8 bytes required to encode this character.
The returned value is always between 1 and 4 bytes.
Returns:
The number of UTF-8 bytes required to encode this character.
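The 1 to 4 byte result follows directly from the codepoint's magnitude. A Python sketch of the thresholds (hypothetical helper operating on a plain integer, assuming a valid scalar value):

```python
def utf8_byte_length(cp: int) -> int:
    """Number of UTF-8 bytes needed for a valid Unicode scalar value."""
    if cp < 0x80:       # up to 7 bits of payload
        return 1
    if cp < 0x800:      # up to 11 bits
        return 2
    if cp < 0x10000:    # up to 16 bits
        return 3
    return 4            # up to 21 bits (maximum 0x10FFFF)


# Values on either side of each threshold.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    print(hex(cp), utf8_byte_length(cp))
```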